Comparing demographic data, clinical data, and outcomes across clusters.

Published

November 24, 2023

Introduction

Code
set.seed(123)
library(nnet) # Multinomial logistic regression
library(tidymodels) # Modelling framework
library(ranger) # Random forest
library(survival) # Kaplan-Meier
library(survminer) # Plot Kaplan-Meier curves
library(patchwork)
library(vip)
library(KONPsurv)

if (!require(plyr)) {
  install.packages("plyr")
}

########################
#-- Custom Functions --#
########################

#' @title Calculate descriptive statistics for non-normal continuous variables
#' @param vars Character vector of continuous variable names
#' @param data Dataframe object
#' @description
#' Finds the median, first quartile, third quartile, minimum and maximum
#'   statistics for non-normal continuous variables
#' @return A dataframe object with length(vars) rows and 7 columns
#' corresponding to the descriptive statistics calculated
describe_cont <- function(vars, data){
  if (tibble::is_tibble(data)) {
    data <- as.data.frame(data)
  }
  out_desc <- data.frame(Variable = character(),
                         n = double(),
                         Median = double(),
                         Q1 = double(),
                         Q3 = double(),
                         Min = double(),
                         Max = double()
  )
  for (i in 1:length(vars)) {
    var <- vars[i]
    med <- summary(data[, var])[c(3, 2, 5, 1, 6)]
    n <- sum(!is.na(data[, var]))
    out_desc[i, ] <- c(var, n, signif(med, 3))
  }
  out_desc
}
#' @title Calculate descriptive statistics for categorical variables
#' @param vars Character vector of categorical variable names
#' @param data Dataframe object
#' @description
#' Prints frequency and proportional tables for given categorical variables
#' @return A dataframe object with length(vars) rows and 3 columns
#' corresponding to the descriptive statistics calculated
describe_cat <- function(vars, data){
  if (tibble::is_tibble(data)) {
    data <- as.data.frame(data)
  }
  for (i in 1:length(vars)) {
    var <- vars[i]
    print(table(data[, var], useNA = "always"))
    print(prop.table(table(data[, var], useNA = "always")))
  }
}

kaplan_plot <- function(model, title, subtitle = TRUE, legend = TRUE){
  if (legend) {
    p <- ggsurvplot(model,
                    conf.int = TRUE,
                    surv.median.line = "none",
                    data = demographic,
                    pval = TRUE,
                    pval.method = TRUE,
                    ggtheme = theme_minimal() +
                      theme(plot.title = element_text(face = "bold",
                                                      size = 20)
                            ),
                    pval.size = 4,
                    legend = "bottom")
  } else {
        p <- ggsurvplot(model,
                    conf.int = TRUE,
                    surv.median.line = "none",
                    data = demographic,
                    pval = TRUE,
                    pval.method = TRUE,
                    ggtheme = theme_minimal() + 
                      theme(plot.title = element_text(face = "bold",
                                                      size = 20)
                            ),
                    pval.size = 4,
                    legend = "none")
  }
  if (subtitle) {
    p <- p + ggtitle(title, "Stratified by cluster") + xlab("Time (years)") 
  } else {
    p <- p + ggtitle(title) + xlab("Time (years)")
  }
  return(p)
} 

After deeming the four-cluster LCMM to be the most suitable, we will now explore potential associations with cluster membership. We will consider variables typically available at diagnosis, outcomes usually indicative of a poor disease course, and treatments prescribed within one year of diagnosis. For all categorical variables, frequency tables for the study population and clusters have been generated in addition to either chi-squared test or (if a cell has < 5 observations) Fisher’s exact test results. For continuous variables, the median, first quartile, and third quartile have been reported for the study population and clusters in addition to ANOVA results.

To streamline the analysis, descriptive functions have been created, describe_cat() and describe_cont() which generate the aforementioned results

We also attempt to predict cluster membership based on these variables to determine if cluster membership can be accurately predicted ahead of time using these variables. We consider both multinomial logistic regression and random forest predictive models for this purpose.

Characteristics available at diagnosis

We have found two variables available at diagnosis to be associated with cluster membership: smoking at diagnosis (p = 0.015) and upper gastrointestinal inflammation (Montreal L4; p < 0.001).

Code
FCcumulative <- readRDS(paste0("/Volumes/igmm/cvallejo-predicct/cdi/processed/",
                               "FCcumulativeLongInc.RDS"))
model <- readRDS("cache/cubicbf.fits.RDS")[[4]]
model$pprob$class <- plyr::mapvalues(model$pprob$class,
                                     from = c(4, 3, 2, 1),
                                     to = c(2, 1, 3, 4))
demographic <- read.csv(paste0("/Volumes/igmm/cvallejo-predicct/cdi/data/",
                               "20220328/",
                               "Nathan_FCprogressioncohort.csv")
                         )
demographic <- subset(demographic, Number %in% model$pprob$id)
colnames(demographic)[1] <- "id"
classes <- model$pprob[, 1:2]
demographic <- merge(demographic, classes, by = "id")
demographic$cluster <- demographic$class

Population

Code
describe_cat("SEX", demographic)

   F    M <NA> 
 173  183    0 

        F         M      <NA> 
0.4859551 0.5140449 0.0000000 

Cluster = 1

Code
describe_cat("SEX", subset(demographic, class == 1))

   F    M <NA> 
  49   43    0 

        F         M      <NA> 
0.5326087 0.4673913 0.0000000 

Cluster = 2

Code
describe_cat("SEX", subset(demographic, class == 2))

   F    M <NA> 
  84  107    0 

        F         M      <NA> 
0.4397906 0.5602094 0.0000000 

Cluster = 3

Code
describe_cat("SEX", subset(demographic, class == 3))

   F    M <NA> 
  34   24    0 

        F         M      <NA> 
0.5862069 0.4137931 0.0000000 

Cluster = 4

Code
describe_cat("SEX", subset(demographic, class == 4))

   F    M <NA> 
   6    9    0 

   F    M <NA> 
 0.4  0.6  0.0 

Chi-squared test

Code
chisq.test(demographic$SEX, demographic$cluster)

    Pearson's Chi-squared test

data:  demographic$SEX and demographic$cluster
X-squared = 5.2083, df = 3, p-value = 0.1572

Population

Code
describe_cont("age", demographic)
  Variable   n Median Q1   Q3  Min  Max
1      age 356   27.3 16 48.7 3.75 87.1

Cluster = 1

Code
describe_cont("age", subset(demographic, class == 1))
  Variable  n Median   Q1   Q3  Min  Max
1      age 92   29.6 20.4 45.3 6.87 87.1

Cluster = 2

Code
describe_cont("age", subset(demographic, class == 2))
  Variable   n Median   Q1   Q3  Min  Max
1      age 191   26.4 15.3 49.9 3.75 81.3

Cluster = 3

Code
describe_cont("age", subset(demographic, class == 3))
  Variable  n Median   Q1   Q3  Min  Max
1      age 58   26.8 15.1 50.9 4.98 78.4

Cluster = 4

Code
describe_cont("age", subset(demographic, class == 4))
  Variable  n Median   Q1   Q3  Min  Max
1      age 15   21.7 14.5 51.2 10.5 71.3

ANOVA

Code
summary(aov(age ~ class, data = demographic))
             Df Sum Sq Mean Sq F value Pr(>F)
class         1     14    14.5   0.036  0.851
Residuals   354 144163   407.2               

If a participant has reported to have not smoked in the past, then it is assumed the participant did not smoke at IBD diagnosis. Otherwise smoking status is taken from the participant’s response to being asked if they were smoking when diagnosed with IBD.

Code
colnames(demographic)[colnames(demographic) == "SMOKER..Y.N."] <- "smoking"
demographic$smoking <- factor(demographic$smoking, labels = c("no", "yes"))

Population

Code
describe_cat("smoking", demographic)

  no  yes <NA> 
 286   70    0 

       no       yes      <NA> 
0.8033708 0.1966292 0.0000000 

Cluster = 1

Code
describe_cat("smoking", subset(demographic, class == 1))

  no  yes <NA> 
  70   22    0 

       no       yes      <NA> 
0.7608696 0.2391304 0.0000000 

Cluster = 2

Code
describe_cat("smoking", subset(demographic, class == 2))

  no  yes <NA> 
 148   43    0 

       no       yes      <NA> 
0.7748691 0.2251309 0.0000000 

Cluster = 3

Code
describe_cat("smoking", subset(demographic, class == 3))

  no  yes <NA> 
  54    4    0 

        no        yes       <NA> 
0.93103448 0.06896552 0.00000000 

Cluster = 4

Code
describe_cat("smoking", subset(demographic, class == 4))

  no  yes <NA> 
  14    1    0 

        no        yes       <NA> 
0.93333333 0.06666667 0.00000000 

Fisher’s exact test

Code
fisher.test(demographic$cluster, demographic$smoking)

    Fisher's Exact Test for Count Data

data:  demographic$cluster and demographic$smoking
p-value = 0.01455
alternative hypothesis: two.sided

Population

Code
FCcumulative <- readRDS(paste0("/Volumes/igmm/cvallejo-predicct/processed-data/cdi/",
                               "FCcumulativeLongInc.RDS"))
FCcumulative <- FCcumulative[order(FCcumulative$time), ]
DiagFC <- rep(NA, nrow(demographic))
for (i in 1:nrow(demographic)) {
  subject <- demographic[i, "id"]
  subject.data <- subset(FCcumulative, id == subject)
  DiagFC[i] <- subject.data[1, "value"] # Already sorted by time
}
demographic$DiagFC <- DiagFC

describe_cont("DiagFC", demographic)
  Variable   n Median  Q1   Q3 Min  Max
1   DiagFC 356    820 590 1140 250 2500

Cluster = 1

Code
describe_cont("DiagFC", subset(demographic, class == 1))
  Variable  n Median  Q1  Q3 Min  Max
1   DiagFC 92    725 500 986 253 2500

Cluster = 2

Code
describe_cont("DiagFC", subset(demographic, class == 2))
  Variable   n Median  Q1   Q3 Min  Max
1   DiagFC 191    900 610 1270 250 2500

Cluster = 3

Code
describe_cont("DiagFC", subset(demographic, class == 3))
  Variable  n Median  Q1   Q3 Min  Max
1   DiagFC 58    825 592 1180 310 2500

Cluster = 4

Code
describe_cont("DiagFC", subset(demographic, class == 4))
  Variable  n Median  Q1   Q3 Min  Max
1   DiagFC 15    660 630 1160 470 2040

ANOVA

Code
summary(aov(DiagFC ~ class, data = demographic))
             Df    Sum Sq Mean Sq F value Pr(>F)
class         1    676441  676441   2.295  0.131
Residuals   354 104332274  294724               
Code
colnames(demographic)[colnames(demographic) == "BEHAVIOUR.AT.DIAGNOSIS"] <- "behaviour"
demographic$behaviour <- as.factor(demographic$behaviour)

Population

Code
describe_cat("behaviour", demographic)

   1    2    3 <NA> 
 323   30    3    0 

          1           2           3        <NA> 
0.907303371 0.084269663 0.008426966 0.000000000 

Cluster = 1

Code
describe_cat("behaviour", subset(demographic, class == 1))

   1    2    3 <NA> 
  80   10    2    0 

         1          2          3       <NA> 
0.86956522 0.10869565 0.02173913 0.00000000 

Cluster = 2

Code
describe_cat("behaviour", subset(demographic, class == 2))

   1    2    3 <NA> 
 175   15    1    0 

          1           2           3        <NA> 
0.916230366 0.078534031 0.005235602 0.000000000 

Cluster = 3

Code
describe_cat("behaviour", subset(demographic, class == 3))

   1    2    3 <NA> 
  55    3    0    0 

         1          2          3       <NA> 
0.94827586 0.05172414 0.00000000 0.00000000 

Cluster = 4

Code
describe_cat("behaviour", subset(demographic, class == 4))

   1    2    3 <NA> 
  13    2    0    0 

        1         2         3      <NA> 
0.8666667 0.1333333 0.0000000 0.0000000 

Fisher’s exact test

Code
fisher.test(demographic$cluster, demographic$behaviour)

    Fisher's Exact Test for Count Data

data:  demographic$cluster and demographic$behaviour
p-value = 0.4937
alternative hypothesis: two.sided
Code
colnames(demographic)[colnames(demographic) == "PERIANAL.DISEASE..Y.N."] <- "peri"
demographic$peri <- factor(demographic$peri, labels = c("no", "yes"))

Population

Code
describe_cat("peri", demographic)

  no  yes <NA> 
 299   57    0 

       no       yes      <NA> 
0.8398876 0.1601124 0.0000000 

Cluster = 1

Code
describe_cat("peri", subset(demographic, class == 1))

  no  yes <NA> 
  75   17    0 

       no       yes      <NA> 
0.8152174 0.1847826 0.0000000 

Cluster = 2

Code
describe_cat("peri", subset(demographic, class == 2))

  no  yes <NA> 
 163   28    0 

       no       yes      <NA> 
0.8534031 0.1465969 0.0000000 

Cluster = 3

Code
describe_cat("peri", subset(demographic, class == 3))

  no  yes <NA> 
  49    9    0 

       no       yes      <NA> 
0.8448276 0.1551724 0.0000000 

Cluster = 4

Code
describe_cat("peri", subset(demographic, class == 4))

  no  yes <NA> 
  12    3    0 

  no  yes <NA> 
 0.8  0.2  0.0 

Fisher’s exact test

Code
fisher.test(demographic$cluster, demographic$peri)

    Fisher's Exact Test for Count Data

data:  demographic$cluster and demographic$peri
p-value = 0.7757
alternative hypothesis: two.sided
Code
demographic$LOCATION <- factor(demographic$LOCATION, labels = c("L1", "L2", "L3"))

Population

Code
describe_cat("LOCATION", demographic)

  L1   L2   L3 <NA> 
  95  140  121    0 

       L1        L2        L3      <NA> 
0.2668539 0.3932584 0.3398876 0.0000000 

Cluster = 1

Code
describe_cat("LOCATION", subset(demographic, class == 1))

  L1   L2   L3 <NA> 
  22   39   31    0 

       L1        L2        L3      <NA> 
0.2391304 0.4239130 0.3369565 0.0000000 

Cluster = 2

Code
describe_cat("LOCATION", subset(demographic, class == 2))

  L1   L2   L3 <NA> 
  58   68   65    0 

       L1        L2        L3      <NA> 
0.3036649 0.3560209 0.3403141 0.0000000 

Cluster = 3

Code
describe_cat("LOCATION", subset(demographic, class == 3))

  L1   L2   L3 <NA> 
   9   30   19    0 

       L1        L2        L3      <NA> 
0.1551724 0.5172414 0.3275862 0.0000000 
Cluster = 4
Code
describe_cat("LOCATION", subset(demographic, class == 4))

  L1   L2   L3 <NA> 
   6    3    6    0 

  L1   L2   L3 <NA> 
 0.4  0.2  0.4  0.0 

Fisher’s exact test

Code
fisher.test(demographic$cluster, demographic$LOCATION, workspace = 800000)

    Fisher's Exact Test for Count Data

data:  demographic$cluster and demographic$LOCATION
p-value = 0.125
alternative hypothesis: two.sided
Code
colnames(demographic)[colnames(demographic) == "L4.Modifier"] <- "L4"
demographic$L4 <- factor(demographic$L4, labels = c("no", "yes"))

Population

Code
describe_cat("L4", demographic)

  no  yes <NA> 
 272   84    0 

       no       yes      <NA> 
0.7640449 0.2359551 0.0000000 

Cluster = 1

Code
describe_cat("L4", subset(demographic, class == 1))

  no  yes <NA> 
  84    8    0 

        no        yes       <NA> 
0.91304348 0.08695652 0.00000000 

Cluster = 2

Code
describe_cat("L4", subset(demographic, class == 2))

  no  yes <NA> 
 140   51    0 

       no       yes      <NA> 
0.7329843 0.2670157 0.0000000 

Cluster = 3

Code
describe_cat("L4", subset(demographic, class == 3))

  no  yes <NA> 
  38   20    0 

       no       yes      <NA> 
0.6551724 0.3448276 0.0000000 

Cluster = 4

Code
describe_cat("L4", subset(demographic, class == 4))

  no  yes <NA> 
  10    5    0 

       no       yes      <NA> 
0.6666667 0.3333333 0.0000000 

Fisher’s exact test

Code
fisher.test(demographic$cluster, demographic$L4)

    Fisher's Exact Test for Count Data

data:  demographic$cluster and demographic$L4
p-value = 0.0002175
alternative hypothesis: two.sided

Associations with treatment information

To explore the potential presence of treatment effects, we have tested for associations between cluster membership and common IBD prescriptions. All treatments were prescribed within one year of diagnosis unless otherwise stated. Thioprine use (p = 0.02), biologic use within three months of diagnosis (p < 0.001), and biologic use within one year of diagnosis (p < 0.001) were found to be significantly associated with cluster membership.

Population

Code
describe_cat("X5ASA", demographic)

   N    Y <NA> 
 280   76    0 

        N         Y      <NA> 
0.7865169 0.2134831 0.0000000 

Cluster = 1

Code
describe_cat("X5ASA", subset(demographic, class == 1))

   N    Y <NA> 
  77   15    0 

        N         Y      <NA> 
0.8369565 0.1630435 0.0000000 

Cluster = 2

Code
describe_cat("X5ASA", subset(demographic, class == 2))

   N    Y <NA> 
 143   48    0 

        N         Y      <NA> 
0.7486911 0.2513089 0.0000000 

Cluster = 3

Code
describe_cat("X5ASA", subset(demographic, class == 3))

   N    Y <NA> 
  47   11    0 

        N         Y      <NA> 
0.8103448 0.1896552 0.0000000 

Cluster = 4

Code
describe_cat("X5ASA", subset(demographic, class == 4))

   N    Y <NA> 
  13    2    0 

        N         Y      <NA> 
0.8666667 0.1333333 0.0000000 

Fisher’s exact test

Code
fisher.test(demographic$cluster, demographic$X5ASA)

    Fisher's Exact Test for Count Data

data:  demographic$cluster and demographic$X5ASA
p-value = 0.3261
alternative hypothesis: two.sided
Code
colnames(demographic)[colnames(demographic) == "THIOPURINE.WITHIN.1st.YEAR..Y.N."] <- "use_thio"

Population

Code
describe_cat("use_thio", demographic)

   N    Y <NA> 
 106  250    0 

        N         Y      <NA> 
0.2977528 0.7022472 0.0000000 

Cluster = 1

Code
describe_cat("use_thio", subset(demographic, class == 1))

   N    Y <NA> 
  30   62    0 

       N        Y     <NA> 
0.326087 0.673913 0.000000 

Cluster = 2

Code
describe_cat("use_thio", subset(demographic, class == 2))

   N    Y <NA> 
  64  127    0 

        N         Y      <NA> 
0.3350785 0.6649215 0.0000000 

Cluster = 3

Code
describe_cat("use_thio", subset(demographic, class == 3))

   N    Y <NA> 
   8   50    0 

       N        Y     <NA> 
0.137931 0.862069 0.000000 

Cluster = 4

Code
describe_cat("use_thio", subset(demographic, class == 4))

   N    Y <NA> 
   4   11    0 

        N         Y      <NA> 
0.2666667 0.7333333 0.0000000 

Fisher’s exact test

Code
fisher.test(demographic$cluster, demographic$use_thio)

    Fisher's Exact Test for Count Data

data:  demographic$cluster and demographic$use_thio
p-value = 0.02305
alternative hypothesis: two.sided
Code
colnames(demographic)[colnames(demographic) == "CORTICOSTEROIDS.WITHIN.1st.YEAR...Y.N."] <- "use_cortico"
demographic$use_cortico <- plyr::mapvalues(demographic$use_cortico, "y", "Y")

Population

Code
describe_cat("use_cortico", demographic)

   N    Y <NA> 
  58  298    0 

        N         Y      <NA> 
0.1629213 0.8370787 0.0000000 

Cluster = 1

Code
describe_cat("use_cortico", subset(demographic, class == 1))

   N    Y <NA> 
  14   78    0 

        N         Y      <NA> 
0.1521739 0.8478261 0.0000000 

Cluster = 2

Code
describe_cat("use_cortico", subset(demographic, class == 2))

   N    Y <NA> 
  32  159    0 

        N         Y      <NA> 
0.1675393 0.8324607 0.0000000 

Cluster = 3

Code
describe_cat("use_cortico", subset(demographic, class == 3))

   N    Y <NA> 
  10   48    0 

        N         Y      <NA> 
0.1724138 0.8275862 0.0000000 

Cluster = 4

Code
describe_cat("use_cortico", subset(demographic, class == 4))

   N    Y <NA> 
   2   13    0 

        N         Y      <NA> 
0.1333333 0.8666667 0.0000000 

Fisher’s exact test

Code
fisher.test(demographic$cluster, demographic$use_cortico)

    Fisher's Exact Test for Count Data

data:  demographic$cluster and demographic$use_cortico
p-value = 0.9829
alternative hypothesis: two.sided
Code
colnames(demographic)[colnames(demographic) == "MTX.WITHIN.1st.YEAR..Y.N."] <- "use_mtx"

Population

Code
describe_cat("use_mtx", demographic)

   N    Y <NA> 
 341   15    0 

         N          Y       <NA> 
0.95786517 0.04213483 0.00000000 

Cluster = 1

Code
describe_cat("use_mtx", subset(demographic, class == 1))

   N    Y <NA> 
  84    8    0 

         N          Y       <NA> 
0.91304348 0.08695652 0.00000000 

Cluster = 2

Code
describe_cat("use_mtx", subset(demographic, class == 2))

   N    Y <NA> 
 185    6    0 

         N          Y       <NA> 
0.96858639 0.03141361 0.00000000 

Cluster = 3

Code
describe_cat("use_mtx", subset(demographic, class == 3))

   N    Y <NA> 
  57    1    0 

         N          Y       <NA> 
0.98275862 0.01724138 0.00000000 

Cluster = 4

Code
describe_cat("use_mtx", subset(demographic, class == 4))

   N <NA> 
  15    0 

   N <NA> 
   1    0 

Fisher’s exact test

Code
fisher.test(demographic$cluster, demographic$use_mtx)

    Fisher's Exact Test for Count Data

data:  demographic$cluster and demographic$use_mtx
p-value = 0.1393
alternative hypothesis: two.sided
Code
colnames(demographic)[colnames(demographic) == "EEN..Y.N."] <- "EEN"
demographic$EEN <- na_if(demographic$EEN, "")

Population

Code
describe_cat("EEN", demographic)

   N    Y <NA> 
 213   80   63 

        N         Y      <NA> 
0.5983146 0.2247191 0.1769663 

Cluster = 1

Code
describe_cat("EEN", subset(demographic, class == 1))

   N    Y <NA> 
  54   22   16 

        N         Y      <NA> 
0.5869565 0.2391304 0.1739130 

Cluster = 2

Code
describe_cat("EEN", subset(demographic, class == 2))

   N    Y <NA> 
 115   39   37 

        N         Y      <NA> 
0.6020942 0.2041885 0.1937173 

Cluster = 3

Code
describe_cat("EEN", subset(demographic, class == 3))

   N    Y <NA> 
  35   14    9 

        N         Y      <NA> 
0.6034483 0.2413793 0.1551724 

Cluster = 4

Code
describe_cat("EEN", subset(demographic, class == 4))

   N    Y <NA> 
   9    5    1 

         N          Y       <NA> 
0.60000000 0.33333333 0.06666667 

Fisher’s exact test

Code
fisher.test(demographic$cluster, demographic$EEN)

    Fisher's Exact Test for Count Data

data:  demographic$cluster and demographic$EEN
p-value = 0.7785
alternative hypothesis: two.sided
Code
colnames(demographic)[colnames(demographic) == "BIOLOGIC.MONO...COMBO.WITHIN.3M"] <- "use_bio3"

Population

Code
describe_cat("use_bio3", demographic)

   0    1 <NA> 
 312   44    0 

        0         1      <NA> 
0.8764045 0.1235955 0.0000000 

Cluster = 1

Code
describe_cat("use_bio3", subset(demographic, class == 1))

   0    1 <NA> 
  69   23    0 

   0    1 <NA> 
0.75 0.25 0.00 

Cluster = 2

Code
describe_cat("use_bio3", subset(demographic, class == 2))

   0    1 <NA> 
 177   14    0 

         0          1       <NA> 
0.92670157 0.07329843 0.00000000 

Cluster = 3

Code
describe_cat("use_bio3", subset(demographic, class == 3))

   0    1 <NA> 
  53    5    0 

        0         1      <NA> 
0.9137931 0.0862069 0.0000000 

Cluster = 4

Code
describe_cat("use_bio3", subset(demographic, class == 4))

   0    1 <NA> 
  13    2    0 

        0         1      <NA> 
0.8666667 0.1333333 0.0000000 

Fisher’s exact test

Code
fisher.test(demographic$cluster, demographic$use_bio3)

    Fisher's Exact Test for Count Data

data:  demographic$cluster and demographic$use_bio3
p-value = 0.0005305
alternative hypothesis: two.sided
Code
colnames(demographic)[colnames(demographic) == "BIOLOGIC.WITHIN.1st.YEAR..Y.N."] <- "use_bio"

Population

Code
describe_cat("use_bio", demographic)

   N    Y <NA> 
 262   94    0 

        N         Y      <NA> 
0.7359551 0.2640449 0.0000000 

Cluster = 1

Code
describe_cat("use_bio", subset(demographic, class == 1))

   N    Y <NA> 
  50   42    0 

        N         Y      <NA> 
0.5434783 0.4565217 0.0000000 

Cluster = 2

Code
describe_cat("use_bio", subset(demographic, class == 2))

   N    Y <NA> 
 156   35    0 

        N         Y      <NA> 
0.8167539 0.1832461 0.0000000 

Cluster = 3

Code
describe_cat("use_bio", subset(demographic, class == 3))

   N    Y <NA> 
  46   12    0 

        N         Y      <NA> 
0.7931034 0.2068966 0.0000000 

Cluster = 4

Code
describe_cat("use_bio", subset(demographic, class == 4))

   N    Y <NA> 
  10    5    0 

        N         Y      <NA> 
0.6666667 0.3333333 0.0000000 

Fisher’s exact test

Code
fisher.test(demographic$cluster, demographic$use_bio)

    Fisher's Exact Test for Count Data

data:  demographic$cluster and demographic$use_bio
p-value = 1.823e-05
alternative hypothesis: two.sided

Here, we consider the number of days from the start of the study period (2005-01-1) to when a subject was diagnosed to determine if when a subject was diagnosed has a statistically significant impact on which cluster a subject was assigned to.

Code
demographic <- datefixR::fix_dates(demographic, "DATE.OF.DIAGNOSIS")
Warning: `fix_dates()` was deprecated in datefixR 1.0.0.
ℹ Please use `fix_date_df()` instead.
Code
demographic$Days.From.Start <- as.numeric(demographic$DATE.OF.DIAGNOSIS) -
  as.numeric(as.Date("2005-01-01"))

describe_cont("Days.From.Start", demographic)
         Variable   n Median   Q1   Q3 Min  Max
1 Days.From.Start 356   2980 2280 3840   0 4710

Cluster = 1

Code
describe_cont("Days.From.Start", subset(demographic, class == 1))
         Variable  n Median   Q1   Q3 Min  Max
1 Days.From.Start 92   3060 2480 4000 150 4700

Cluster = 2

Code
describe_cont("Days.From.Start", subset(demographic, class == 2))
         Variable   n Median   Q1   Q3 Min  Max
1 Days.From.Start 191   3050 2110 3850   0 4710

Cluster = 3

Code
describe_cont("Days.From.Start", subset(demographic, class == 3))
         Variable  n Median   Q1   Q3 Min  Max
1 Days.From.Start 58   2840 2260 3790 424 4410

Cluster = 4

Code
describe_cont("Days.From.Start", subset(demographic, class == 4))
         Variable  n Median   Q1   Q3  Min  Max
1 Days.From.Start 15   2820 2360 3430 1370 4300

ANOVA

Code
summary(aov(Days.From.Start ~ class, data = demographic))
             Df    Sum Sq Mean Sq F value Pr(>F)
class         1   2713998 2713998   2.419  0.121
Residuals   354 397129240 1121834               

Predictive models

Here, we consider whether we can accurately predict cluster membership using age, sex, and the variables we found to be significantly associated with cluster membership: smoking at diagnosis, upper gastrointestinal inflammation (Montreal L4), and being prescribed a biologic therapeutic within one year of diagnosis. We consider two types of models: multinomial logistic regression models and a random forest classifiers using the nnet and ranger packages. We use the tidymodels framework for fitting these models.

The data are split into a 75:25 train:test split and k-fold cross validation is used. Reported metrics and the confusion matrix are based on the test data split.

We attempt to predict cluster membership for all four clusters.

Code
demographic$cluster <- as.factor(demographic$cluster)
demographic$cluster = relevel(demographic$cluster, ref = 2) # 

data_split <- initial_split(demographic, prop = 0.75, strata = cluster)
train_data <- training(data_split)
test_data  <- testing(data_split)

folds <- vfold_cv(train_data, v = 4, strata = cluster)

class_rec <- recipe(cluster ~ age +
                      SEX  +
                      smoking +
                      LOCATION +
                      L4 +
                      behaviour +
                      peri +
                      DiagFC +
                      use_bio +
                      use_bio3,
                    data = test_data)

Multinomial logistic regression

Code
mlr_mod <- multinom_reg(penalty = tune(),
                        mode = "classification") %>%
  set_engine("nnet")

class_wflow <- 
  workflow() %>% 
  add_model(mlr_mod) %>% 
  add_recipe(class_rec)

mlr_grid <- grid_regular(
                         penalty(),
                         levels = 5)

class_fit <- class_wflow %>% 
    tune_grid(
    resamples = folds,
    grid = mlr_grid
    )

best_mlr <- class_fit %>%
  select_best("accuracy")

final_wf <- 
  class_wflow %>% 
  finalize_workflow(best_mlr)


final_fit <- 
  final_wf %>%
  last_fit(data_split) 

temp <- final_fit %>% extract_fit_engine()

temp$call <- quote(nnet::multinom(formula = cluster ~ age +
                                      SEX  +
                                      smoking +
                                      LOCATION +
                                      L4 +
                                      behaviour +
                                      peri +
                                      DiagFC +
                                      use_bio +
                                      use_bio3,
                                  data = train_data,
                                  decay = ~0.00316227766016838, 
                                  trace = FALSE))

hmm <- summary(temp)

Odd ratios

Point estimates:

Code
knitr::kable(round(exp(hmm$coefficients), 2)) 
(Intercept) age SEXM smokingyes LOCATIONL2 LOCATIONL3 L4yes behaviour2 behaviour3 periyes DiagFC use_bioY use_bio3
1 0.98 0.99 0.89 1.01 2.09 2.45 0.17 1.44 544306.31 1.22 1 2.50 2.82
3 0.44 1.00 0.54 0.16 2.05 1.36 1.46 0.87 0.58 0.72 1 1.25 1.68
4 0.05 1.02 1.22 0.21 0.33 0.90 1.38 1.04 0.72 2.24 1 2.31 1.05

Lower 95% confidence interval:

Code
knitr::kable(round(exp(hmm$coefficients - qnorm(0.975) * hmm$standard.errors), 2))
(Intercept) age SEXM smokingyes LOCATIONL2 LOCATIONL3 L4yes behaviour2 behaviour3 periyes DiagFC use_bioY use_bio3
1 0.27 0.97 0.46 0.46 0.87 0.98 0.06 0.46 544306.03 0.49 1 1.13 0.94
3 0.10 0.98 0.26 0.04 0.76 0.46 0.57 0.17 0.58 0.25 1 0.46 0.42
4 0.03 1.00 0.37 0.05 0.09 0.24 0.31 0.14 0.72 0.49 1 0.67 0.52

Upper 95% confidence interval:

Code
knitr::kable(round(exp(hmm$coefficients + qnorm(0.975) * hmm$standard.errors), 2))
(Intercept) age SEXM smokingyes LOCATIONL2 LOCATIONL3 L4yes behaviour2 behaviour3 periyes DiagFC use_bioY use_bio3
1 3.60 1.01 1.71 2.23 5.06 6.14 0.49 4.49 544306.58 3.05 1 5.54 8.44
3 1.86 1.02 1.14 0.73 5.53 4.03 3.74 4.42 0.58 2.09 1 3.39 6.69
4 0.07 1.05 3.98 0.94 1.30 3.37 6.24 7.87 0.72 10.35 1 7.93 2.14

Metrics

Code
knitr::kable(
  final_fit %>%
    collect_metrics(), 
  col.names = c("Metric", "Estimator", "Estimate", "Config"),
  align = "cccc") 
Metric Estimator Estimate Config
accuracy multiclass 0.5111111 Preprocessor1_Model1
roc_auc hand_till 0.6591539 Preprocessor1_Model1

Confusion matrix

Code
results <- final_fit %>%
  collect_predictions() 

knitr::kable(table(results$cluster, results$.pred_class), row.names = TRUE)
2 1 3 4
2 39 9 0 0
1 17 7 0 0
3 13 1 0 0
4 3 1 0 0

Random forest

Metrics

Code
rf_mod <- rand_forest(mtry = tune(), 
                      trees = tune(),
                      mode = "classification") %>%
  set_engine("ranger",
             num.threads = parallel::detectCores(),
             importance = "permutation")

class_wflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(class_rec)

tree_grid <- grid_regular(mtry(range(c(1,8))),
                          trees(),
                          levels = 5)

class_fit <- class_wflow %>% 
    tune_grid(
    resamples = folds,
    grid = tree_grid
    )

best_tree <- class_fit %>%
  select_best("accuracy")

final_wf <- 
  class_wflow %>% 
  finalize_workflow(best_tree)

final_fit <- 
  final_wf %>%
  last_fit(data_split) 

knitr::kable(
  final_fit %>% 
    extract_fit_parsnip() %>% 
    vi()
)
Variable Importance
use_bio 0.0086542
use_bio3 0.0084335
L4 0.0051902
smoking 0.0031569
LOCATION 0.0021937
DiagFC 0.0016417
behaviour 0.0003184
age 0.0000993
peri -0.0006497
SEX -0.0010218
Code
knitr::kable(
  final_fit %>%
    collect_metrics(), 
  col.names = c("Metric", "Estimator", "Estimate", "Config"),
  align = "cccc") 
Metric Estimator Estimate Config
accuracy multiclass 0.5666667 Preprocessor1_Model1
roc_auc hand_till 0.6458333 Preprocessor1_Model1

Confusion matrix

Code
results <- final_fit %>%
  collect_predictions() 
knitr::kable(table(results$cluster, results$.pred_class), row.names = TRUE)
2 1 3 4
2 45 3 0 0
1 18 6 0 0
3 14 0 0 0
4 3 1 0 0

Model performance for both the multinomial logistic regression model and the random forest classifier is very poor.

Session information

R version 4.3.2 (2023-10-31)

Platform: aarch64-apple-darwin20 (64-bit)

locale: en_GB.UTF-8||en_GB.UTF-8||en_GB.UTF-8||C||en_GB.UTF-8||en_GB.UTF-8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: plyr(v.1.8.9), KONPsurv(v.1.0.4), vip(v.0.4.1), patchwork(v.1.1.3), survminer(v.0.4.9), ggpubr(v.0.6.0), survival(v.3.5-7), ranger(v.0.16.0), yardstick(v.1.2.0), workflowsets(v.1.0.1), workflows(v.1.1.3), tune(v.1.1.2), tidyr(v.1.3.0), tibble(v.3.2.1), rsample(v.1.2.0), recipes(v.1.0.8), purrr(v.1.0.2), parsnip(v.1.1.1), modeldata(v.1.2.0), infer(v.1.0.5), ggplot2(v.3.4.4), dplyr(v.1.1.4), dials(v.1.2.0), scales(v.1.2.1), broom(v.1.0.5), tidymodels(v.1.1.1) and nnet(v.7.3-19)

loaded via a namespace (and not attached): gridExtra(v.2.3), httr2(v.1.0.0), rlang(v.1.1.2), magrittr(v.2.0.3), furrr(v.0.3.1), compiler(v.4.3.2), vctrs(v.0.6.4), stringr(v.1.5.1), lhs(v.1.1.6), rvest(v.1.0.3), pkgconfig(v.2.0.3), fastmap(v.1.1.1), ellipsis(v.0.3.2), backports(v.1.4.1), pander(v.0.6.5), KMsurv(v.0.1-5), utf8(v.1.2.4), datefixR(v.1.6.0.9000), rmarkdown(v.2.25), prodlim(v.2023.08.28), xfun(v.0.41), jsonlite(v.1.8.7), parallel(v.4.3.2), R6(v.2.5.1), stringi(v.1.8.1), parallelly(v.1.36.0), car(v.3.1-2), rpart(v.4.1.21), lubridate(v.1.9.3), Rcpp(v.1.0.11), iterators(v.1.0.14), knitr(v.1.45), future.apply(v.1.11.0), zoo(v.1.8-12), clisymbols(v.1.2.0), Matrix(v.1.6-3), splines(v.4.3.2), timechange(v.0.2.0), tidyselect(v.1.2.0), rstudioapi(v.0.15.0), abind(v.1.4-5), yaml(v.2.3.7), timeDate(v.4022.108), codetools(v.0.2-19), listenv(v.0.9.0), lattice(v.0.21-9), withr(v.2.5.2), evaluate(v.0.23), future(v.1.33.0), xml2(v.1.3.5), survMisc(v.0.5.6), pillar(v.1.9.0), carData(v.3.0-5), foreach(v.1.5.2), generics(v.0.1.3), munsell(v.0.5.0), globals(v.0.16.2), xtable(v.1.8-4), class(v.7.3-22), glue(v.1.6.2), tools(v.4.3.2), data.table(v.1.14.8), gower(v.1.0.1), ggsignif(v.0.6.4), grid(v.4.3.2), ipred(v.0.9-14), colorspace(v.2.1-0), cli(v.3.6.1), DiceDesign(v.1.9), rappdirs(v.0.3.3), km.ci(v.0.5-6), fansi(v.1.0.5), lava(v.1.7.3), gtable(v.0.3.4), GPfit(v.1.0-8), rstatix(v.0.7.2), digest(v.0.6.33), htmlwidgets(v.1.6.3), htmltools(v.0.5.7), lifecycle(v.1.0.4), hardhat(v.1.3.0), httr(v.1.4.7) and MASS(v.7.3-60)

MRC Human Genetics Unit logoCentre for Genomic & Experimental Medicine logo

Acknowledgments

This work is funded by the Medical Research Council & University of Edinburgh via a Precision Medicine PhD studentship (MR/N013166/1, to NC-C)

Author contributions

NC-C wrote the analysis. KM and CAV performed code review and contributed suggestions. RM contributed functions for calculating descriptive statistics. KM, RM and CAV provided feedback.

Reuse

Licensed by CC BY except for the MRC Human Genetics Unit, The University of Edinburgh, and Centre for Genomic & Experimental Medicine logos or unless otherwise stated.

Citation

BibTeX citation:
@article{constantine-cooke2023,
  author = {Nathan Constantine-Cooke and Karla Monterrubio-Gómez and
    Nikolas Plevris and Lauranne A.A.P. Derikx and Beatriz Gros and
    Gareth-Rhys Jones and Riccardo E. Marioni and Charlie W. Lees and
    Catalina A. Vallejos},
  publisher = {Elsevier},
  title = {Longitudinal {Fecal} {Calprotectin} {Profiles} {Characterize}
    {Disease} {Course} {Heterogeneity} in {Crohn’s} {Disease}},
  journal = {Clinical Gastroenterology and Hepatology},
  date = {2023-11-24},
  url = {https://www.cghjournal.org/article/S1542-3565(23)00234-3/},
  doi = {10.1016/j.cgh.2023.03.026},
  issn = {1542-3565},
  langid = {en}
}
For attribution, please cite this work as:
Nathan Constantine-Cooke, Karla Monterrubio-Gómez, Nikolas Plevris, Lauranne A.A.P. Derikx, Beatriz Gros, Gareth-Rhys Jones, Riccardo E. Marioni, Charlie W. Lees, and Catalina A. Vallejos. 2023. “Longitudinal Fecal Calprotectin Profiles Characterize Disease Course Heterogeneity in Crohn’s Disease.” Clinical Gastroenterology and Hepatology, November. https://doi.org/10.1016/j.cgh.2023.03.026.