Faecal calprotectin

Published

December 13, 2024

Introduction

Code
set.seed(123)
if (file.exists("/.dockerenv")) { # Check if running in Docker
  # Assume igmm/cvallejo-predicct/libdr/ is passed to the data volume
  prefix <- "data/"
} else {
  # Assume running outside of a Docker container and the IGC(/IGMM) datastore is
  # mounted at /Volumes
  prefix <- "/Volumes/igmm/cvallejo-predicct/libdr/"
}


##########################
#--     Packages       --#
##########################

library(tidyverse)
# Support package (source found in libdr/)
library(libdr)
## Modelling ##
library(lcmm)
library(kml) # K-means
## Presentation ##
library(patchwork)
library(ggdist)
library(ggalluvial)
library(pander)
library(qqplotr)

##########################
#--     Data read      --#
##########################

dict <- readRDS(paste0(prefix, "processed/dict.RDS"))
fcal <- readRDS(paste0(prefix, "processed/fcal.RDS"))

dk.fcal <- read.csv(paste0(prefix, "Denmark/2024-11-29/Fcal_8_models.csv"),
                    sep = ";")

This page describes how the latent class mixed models (LCMMs) were fitted to faecal calprotectin (FC). A separate page is dedicated to C-reactive protein.

In order to not present subject-level data, subjects within each cluster have been collated into groups of six.

Model specification

LCMMs are an extension of linear mixed effects models with an added cluster-specific fixed effect component. We use LCMMs with a natural cubic spline formulation for the fixed effects component. Random effects are specified as intercepts and the multinomial logistic regression model which assigns cluster membership uses IBD type (CD, UC, or IBDU) as a covariate.

For full formal definitions of the models and statistics we have used in the work, please refer to the supplementary material for our paper.

Model statistics

Model loading

Fitting the LCMMs discussed in this report takes multiple days in some cases and uses 25 CPU cores per model. We have fitted these models using Eddie, the University of Edinburgh’s high performance computing solution. The code in the dropdown box below was used to fit the models.

## Modelling ##
library(lcmm)
library(splines)


fcal <- readRDS("fcal.RDS")
fcal$calpro_result <- log(fcal$calpro_result)

if(!dir.exists("cache")) dir.create("cache")

if(!file.exists("m1.RDS")) {
  m1 <-  hlme(fixed =  calpro_result ~ ns(calpro_time,
                                          df = 4,
                                          Boundary.knots = c(0, 7)),
              random = ~  1,
              subject = "ids",
              data = fcal,
              verbose = TRUE,
              var.time = "calpro_time",
              maxiter = 8000)
  if (m1$conv != 1) stop("LME did not converge \n")
  saveRDS(m1, "m1.RDS")
  message("Finished fitting m1")
}
## Modelling ##
library(lcmm)
library(splines)

args <- commandArgs(trailingOnly = TRUE)
fcal <- readRDS("fcal.RDS")
fcal$calpro_result <- log(fcal$calpro_result)
if(!dir.exists("cache")) dir.create("cache")

rep <- 50
# set the maximum number of iterations
maxiter <- 10

if(file.exists("m1.RDS")) {
  m1 <- readRDS("m1.RDS")
  ng <- as.numeric(args[1])
  # create a cluster
  cl <- parallel::makeForkCluster(parallel::detectCores())
  # export the number of groups to the cluster
  parallel::clusterExport(cl, "ng")
  # run gridsearch
  hlme.fit <- gridsearch(
    rep = rep,
    maxiter = maxiter,
    minit = m1,
    cl = cl,
    hlme(calpro_result ~ ns(calpro_time,
                            df = 4,
                            Boundary.knots = c(0, 7)),
         mixture = ~ ns(calpro_time,
                        df = 4,
                        Boundary.knots = c(0, 7)),
         random = ~  1,
         subject = "ids",
     classmb = ~ 1 + diagnosis,
         ng = ng,
         data = fcal,
         maxiter = 24000,
         verbose = TRUE,
         partialH = FALSE)
  )
  # stop the cluster
  parallel::stopCluster(cl)

  if (hlme.fit$conv == 1) {
    message("Convergence achieved for ", ng, " subgroups ✅ \n")
  } else {
    stop("Convergence NOT achieved for ", ng, " subgroups ⚠️ \n")
  }
  saveRDS(hlme.fit, paste0("cache/fcal-", ng, ".RDS"))
} else {
  stop("m1 not found!")
}

As recommended by Proust-Lima, Philipps, and Liquet (2017), a linear mixed effects model is first fitted to generate initial starting values, (the “Initial LME” tab). A grid search approach is then used to converge the LCMMs towards a global maximum for each assumed number of clusters based on maximum likelihood (the “LCMM” tab).

After the above code is run, the resultant models are saved in cvallejo-predicct/libdr/cache/ for use in this report.

When an LCMM is fitted, the assumed number of clusters (or “classes”) must be specified a priori. Here we consider 4-8 clusters as 4 clusters were already found in our previous work (Constantine-Cooke et al. 2023), and we expect to find at least as many clusters given our inclusion criteria is more relaxed.

A posteriori model statistics and visual investigations must be used to decide upon the optimum number of classes.

Fixed effects specified via Natural Cubic Splines

Code
# set the number of groups
G.fcal <- numeric()
models.fcal <- list()
G.cands <- seq(2, 11)
for (G.cand in G.cands) {
  file.name <- paste0(prefix, "/cache/fcal/ncs/fcal-", G.cand, ".RDS")
  if (file.exists(file.name)) {
    G.fcal <- c(G.fcal, G.cand)
    models.fcal[[G.cand]] <- readRDS(file.name)
  }
}
rm(G.cand)

We firstly consider specifying fixed effects using NCS which have a few notable advantages (Elhakeem et al. 2022):

  1. Less parameters need to be estimated than either a GRBF function regression model or a polynomial regression model with the same flexibility. This reduces the time complexity when fitting the model and in the future may also make extensions more practically feasible.
  2. NCS enforce linearity between t_0 and the first knot and between the last knot and t_\text{max} which ensures the model does not behave erratically in these sometimes problematic areas.
  3. NCS are not highly sensitive to a continuous parameter and instead requires only K, the number of knots to be tuned. NCS are typically robust to where the knots themselves are placed.

Knot choice

To determine the most appropriate number of knots for the natural cubic splines, we considered two, three, and four knots. We only use six-cluster models for this analysis as this reasonably captures heterogeneity without being too expensive computationally.

Our previous work used three knots across a five-year period. As we are now modelling across a seven-year period, we may need to increase the number of knots to four to ensure the model remains flexible. We also considered two knots] in case a more smoooth model is required.

We place the knots at their default location which is at quantiles. We note that the number of knots is reported by Harrell (2015) to be much more important than knot location.

Model trajectories

One knot
Code
knots.1 <- readRDS(paste0(prefix, "/cache/fcal/ncs/1-knots.RDS"))
knots.1.list <- list()
knots.1.list[[6]] <- knots.1
cairo_pdf("paper/1-knots.pdf",
  width = 9,
  height = 9
)
spaghettiPlot(fcal,
  knots.1.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 1,
  var.time = "calpro_time"
)
invisible(dev.off())
spaghettiPlot(fcal,
  knots.1.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 1,
  var.time = "calpro_time"
)
Figure 1: Cluster profiles for the six-cluster model assuming one knot. The vertical teal lines indicate knot location. Mean cluster profiles are denoted by the red curves
Two knots

From Figure 2, we can see specifying only two knots results in very smooth curves for the mean trajectory of each cluster. However, this model appears to be too smooth which results in a relatively poor fit.

Code
knots.2 <- readRDS(paste0(prefix, "/cache/fcal/ncs/2-knots.RDS"))
knots.2.list <- list()
knots.2.list[[6]] <- knots.2
cairo_pdf("paper/2-knots.pdf",
  width = 9,
  height = 9
)
spaghettiPlot(fcal,
  knots.2.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 2,
  var.time = "calpro_time"
)
invisible(dev.off())
spaghettiPlot(fcal,
  knots.2.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 2,
  var.time = "calpro_time"
)
Figure 2: Cluster profiles for the six-cluster model assuming two knots. The vertical teal lines indicate knot location. Mean cluster profiles are denoted by the red curves
Three knots

Three knots appears to be the most appropriate specification (Figure 3). Whilst there is perhaps some evidence of overfitting, this appears to be minimal.

Code
knots.3 <- readRDS(paste0(prefix, "cache/fcal/ncs/fcal-6.RDS"))
knots.3.list <- list()
knots.3.list[[6]] <- knots.3
cairo_pdf("paper/3-knots.pdf",
  width = 9,
  height = 9
)
spaghettiPlot(fcal,
  knots.3.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 3,
  var.time = "calpro_time"
)
invisible(dev.off())
spaghettiPlot(fcal,
  knots.3.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 3,
  var.time = "calpro_time"
)
Figure 3: Cluster profiles for the six-cluster model assuming three knots. The vertical teal lines indicate knot location. Mean cluster profiles are denoted by the red curves
Four knots

Four knots appears to be too many. Figure 4 presents a minor improvement to model fit over the three knot specification, and there are signs of overfitting which is likely driven by the small distances between knots.

Code
knots.4 <- readRDS(paste0(prefix, "/cache/fcal/ncs/4-knots.RDS"))
knots.4.list <- list()
knots.4.list[[6]] <- knots.4
cairo_pdf("paper/4-knots.pdf",
  width = 9,
  height = 9
)
spaghettiPlot(fcal,
  knots.4.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 4,
  var.time = "calpro_time"
)
Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_line()`).
Code
invisible(dev.off())
spaghettiPlot(fcal,
  knots.4.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 4,
  var.time = "calpro_time"
)
Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_line()`).
Figure 4: Cluster profiles for the six-cluster model assuming four knots. The vertical teal lines indicate knot location. Mean cluster profiles are denoted by the red curves

Model statistics

AIC and BIC both suggest a four-knot specification to be optimal (Table 1). However, visual inspection of the mean cluster trajectories (see above) suggests the four-knot approach results in overfitting, especially around knot locations. This will also become more of an issue as the number of clusters increase and additional sparsity is exhibited for some clusters.

As such, we have elected to use natural cubic splines with three knots placed at quantiles.

Code
knots.df <- tibble(
  Knots = c("One knot", "Two knots", "Three knots", "Four knots"),
  loglik = c(
    knots.1$loglik,
    knots.2$loglik,
    knots.3$loglik,
    knots.4$loglik
  ),
  AIC = c(
    knots.1$AIC,
    knots.2$AIC,
    knots.3$AIC,
    knots.4$AIC
  ),
  BIC = c(
    knots.1$BIC,
    knots.2$BIC,
    knots.3$BIC,
    knots.4$BIC
  )
)
knitr::kable(knots.df,
  col.names = c(
    "Knots",
    "Maxmum log likelihood",
    "AIC",
    "BIC"
  )
)
rm(
  knots.1,
  knots.1.list,
  knots.3,
  knots.3.list,
  knots.4,
  knots.4.list
)
Table 1: Model statistics for differing numbers of knots for natural cubic splines
Knots Maxmum log likelihood AIC BIC
One knot -16448.75 32967.49 33140.50
Two knots -16230.02 32542.04 32744.71
Three knots -16141.82 32377.63 32609.96
Four knots -16125.38 32356.75 32618.74

Number of clusters

After determining the most appriopate number of knots, we must now decide upon the most appropiate number of clusters.

From the alluvial plot comparing FC models (Figure 5), we can see newly formed clusters are quite robust- remaining consistent as the number of clusters increases. However, new clusters are formed from multiple clusters which implies these clusters do not have a high degree of separation from one another. From this figure, six clusters appear to be a reasonable choice.

Code
alluvial.df <- matrix(nrow = 0, ncol = 3)
colnames(alluvial.df) <- c("ids", "class", "G")
for (G in G.fcal) {
  alluvial.df <- rbind(alluvial.df, cbind(models.fcal[[G]]$pprob[, 1:2], G = G))
}
alluvial.df <- as.data.frame(alluvial.df)

alluvial.df$ids <- as.character(alluvial.df$ids)
alluvial.df$class <- as.factor(alluvial.df$class)

# eliminate label switching

alluvial.df[alluvial.df[, "G"] == 3, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 3, "class"],
    from = c(seq(1, 3)),
    to = c(3, 1, 2)
  )

alluvial.df[alluvial.df[, "G"] == 4, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 4, "class"],
    from = c(seq(1, 4)),
    to = c(3, 4, 1, 2)
  )

alluvial.df[alluvial.df[, "G"] == 5, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 5, "class"],
    from = c(seq(1, 5)),
    to = c(2, 1, 5, 3, 4)
  )

alluvial.df[alluvial.df[, "G"] == 6, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 6, "class"],
    from = c(seq(1, 6)),
    to = c(6, 4, 5, 2, 3, 1)
  )

alluvial.df[alluvial.df[, "G"] == 7, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 7, "class"],
    from = c(seq(1, 7)),
    to = c(4, 1, 5, 3, 2, 7, 6)
  )

alluvial.df[alluvial.df[, "G"] == 8, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 8, "class"],
    from = c(seq(1, 8)),
    to = c(7, 5, 1, 3, 6, 4, 8, 2)
  )


alluvial.df[alluvial.df[, "G"] == 9, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 9, "class"],
    from = c(seq(1, 9)),
    to = c(1, 5, 8, 9, 6, 4, 7, 2, 3)
  )


alluvial.df[alluvial.df[, "G"] == 10, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 10, "class"],
    from = c(seq(1, 10)),
    to = c(2, 4, 9, 1, 7, 10, 6, 8, 5, 3)
  )


alluvial.df[alluvial.df[, "G"] == 11, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 11, "class"],
    from = c(seq(1, 11)),
    to = c(6, 10, 7, 2, 11, 3, 9, 4, 5, 1, 8)
  )


p <- ggplot(
  alluvial.df,
  aes(
    x = G,
    stratum = class,
    alluvium = ids,
    fill = class,
    label = class
  )
) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = 0.5) +
  geom_text(stat = "stratum", size = 3) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = rainbow(11)) +
  xlab("Assumed number of clusters") +
  ylab("Frequency")
print(p)

p <- p + ggtitle("", "")
ggsave("paper/alluvial-FCAL-all.png",
  p,
  width = 12,
  height = 6.75,
  units = "in"
)
ggsave("paper/alluvial-FCAL-all.pdf",
  p,
  width = 12,
  height = 6.75,
  units = "in"
)
Figure 5: Alluvial plot of cluster membership across G for FC
Code
alluvial.df <- matrix(nrow = 0, ncol = 3)
colnames(alluvial.df) <- c("ids", "class", "G")
for (G in G.fcal) {
  alluvial.df <- rbind(alluvial.df, cbind(models.fcal[[G]]$pprob[, 1:2], G = G))
}
alluvial.df <- as.data.frame(alluvial.df)

alluvial.df$ids <- as.character(alluvial.df$ids)
alluvial.df$class <- as.factor(alluvial.df$class)

p <- ggplot(
  alluvial.df,
  aes(
    x = G,
    stratum = class,
    alluvium = ids,
    fill = class,
    label = class
  )
) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = 0.5) +
  geom_text(stat = "stratum", size = 3) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = rainbow(11)) +
  xlab("Assumed number of clusters") +
  ylab("Frequency")
print(p)
Figure 6: Alluvial plot of cluster membership across G for FC

Posterior classifications

From the below data, we can see how posterior probabilities change as the number of assumed clusters increase.

  class1 class2
N 505 531
% 48.75 51.25
  prob1 prob2
class1 0.805 0.195
class2 0.1986 0.8014
  class1 class2
prob>0.7 68.71 69.49
prob>0.8 54.26 50.47
prob>0.9 37.23 37.1
  class1 class2 class3
N 501 298 237
% 48.36 28.76 22.88
  prob1 prob2 prob3
class1 0.8072 0.0856 0.1072
class2 0.1203 0.787 0.0927
class3 0.1259 0.1043 0.7698
  class1 class2 class3
prob>0.7 70.66 66.78 63.71
prob>0.8 57.68 54.7 50.63
prob>0.9 43.51 39.6 35.86
  class1 class2 class3 class4
N 324 200 274 238
% 31.27 19.31 26.45 22.97
  prob1 prob2 prob3 prob4
class1 0.7547 0.1307 0.0465 0.0681
class2 0.1486 0.6844 0.0645 0.1025
class3 0.0651 0.0609 0.7858 0.0881
class4 0.0701 0.1005 0.0904 0.7389
  class1 class2 class3 class4
prob>0.7 60.19 43 68.61 58.4
prob>0.8 45.99 32 56.2 47.06
prob>0.9 36.42 21.5 41.24 34.87
  class1 class2 class3 class4 class5
N 255 258 42 337 144
% 24.61 24.9 4.05 32.53 13.9
  prob1 prob2 prob3 prob4 prob5
class1 0.7292 0.0724 0.0374 0.0776 0.0834
class2 0.0901 0.7656 0.0432 0.0502 0.0509
class3 0.0344 0.0776 0.8022 0.0782 0.0075
class4 0.0811 0.0315 0.037 0.747 0.1035
class5 0.0967 0.0679 0.0059 0.1416 0.6878
  class1 class2 class3 class4 class5
prob>0.7 54.9 63.95 71.43 59.05 43.75
prob>0.8 45.1 51.55 59.52 46.88 32.64
prob>0.9 31.76 37.98 42.86 34.42 22.92
  class1 class2 class3 class4 class5 class6
N 141 144 48 234 312 157
% 13.61 13.9 4.63 22.59 30.12 15.15
  prob1 prob2 prob3 prob4 prob5 prob6
class1 0.7231 0.0302 0.028 0.097 0.0124 0.1093
class2 0.0374 0.6684 0.009 0.0911 0.126 0.0681
class3 0.061 0.0094 0.7606 0.0477 0.0726 0.0487
class4 0.0727 0.1064 0.034 0.6842 0.0851 0.0175
class5 0.0068 0.094 0.0393 0.083 0.7422 0.0346
class6 0.1158 0.0547 0.0412 0.04 0.0646 0.6836
  class1 class2 class3 class4 class5 class6
prob>0.7 55.32 42.36 66.67 48.29 58.65 48.41
prob>0.8 44.68 31.94 54.17 37.61 47.76 33.76
prob>0.9 30.5 22.22 39.58 21.79 33.33 21.66
  class1 class2 class3 class4 class5 class6 class7
N 49 177 59 205 191 229 126
% 4.73 17.08 5.69 19.79 18.44 22.1 12.16
  prob1 prob2 prob3 prob4 prob5 prob6 prob7
class1 0.6775 0.0669 0.0022 0.0379 0.0546 0.1242 0.0367
class2 0.0346 0.6893 0.0388 0.034 0.0378 0.0546 0.1108
class3 0.0027 0.0598 0.7169 0.0618 0.0617 0.0363 0.0607
class4 0.0076 0.0305 0.0458 0.7318 0.0358 0.1458 0.0029
class5 0.0256 0.0222 0.0383 0.0342 0.6878 0.1125 0.0794
class6 0.0428 0.0383 0.0226 0.144 0.1246 0.6117 0.016
class7 0.016 0.1025 0.0297 0.0068 0.0838 0.0202 0.741
  class1 class2 class3 class4 class5 class6 class7
prob>0.7 46.94 48.59 54.24 58.54 50.26 31.44 58.73
prob>0.8 40.82 36.16 49.15 44.39 38.22 21.83 50
prob>0.9 30.61 20.9 35.59 27.32 26.18 10.92 34.13
  class1 class2 class3 class4 class5 class6 class7 class8
N 244 64 103 194 140 67 67 157
% 23.55 6.18 9.94 18.73 13.51 6.47 6.47 15.15
  prob1 prob2 prob3 prob4 prob5 prob6 prob7 prob8
class1 0.6255 0.0238 0.0339 0.127 0.0156 0.0661 0.0167 0.0914
class2 0.0312 0.7158 0.0553 0.0692 0.0406 0.006 0.0169 0.0649
class3 0.0548 0.0357 0.7109 0.0362 0.0734 0.0232 0.0573 0.0085
class4 0.1397 0.037 0.021 0.7495 0.0021 0.0189 0.0097 0.0222
class5 0.0175 0.0232 0.1004 0.0055 0.7182 0.0176 0.0432 0.0744
class6 0.1193 0.0137 0.0437 0.0543 0.0271 0.6537 0.0239 0.0643
class7 0.0373 0.0164 0.1082 0.0123 0.0775 0.0298 0.6788 0.0397
class8 0.0929 0.043 0.0102 0.0232 0.0752 0.0436 0.0275 0.6844
  class1 class2 class3 class4 class5 class6 class7 class8
prob>0.7 34.84 54.69 55.34 59.28 53.57 47.76 44.78 49.68
prob>0.8 24.18 48.44 38.83 46.39 47.86 37.31 37.31 39.49
prob>0.9 15.16 28.12 25.24 32.47 31.43 28.36 25.37 29.94
Table continues below
  class1 class2 class3 class4 class5 class6 class7 class8
N 102 60 75 33 129 45 235 164
% 9.85 5.79 7.24 3.19 12.45 4.34 22.68 15.83
  class9
N 193
% 18.63
Table continues below
  prob1 prob2 prob3 prob4 prob5 prob6 prob7
class1 0.7052 0.0337 0.0567 0.002 0.0747 0.0322 0.0527
class2 0.055 0.7411 0.0163 0.0077 0.0229 0.0015 0.0377
class3 0.1003 0.0155 0.6609 0.0053 0.0687 0.0225 0.055
class4 0.0023 0.0124 0.0023 0.7246 3e-04 0.0191 0.1115
class5 0.0985 0.0183 0.0463 0.0011 0.7203 0.0174 0.0145
class6 0.0562 0.0022 0.0205 0.0276 0.029 0.6901 0.1125
class7 0.0333 0.0248 0.0177 0.0688 0.0122 0.0332 0.6131
class8 0.0131 0.051 0.031 0.0309 0.0821 0.0252 0.0812
class9 0.0185 0.0375 0.0046 0.0174 0.0018 0.009 0.1455
  prob8 prob9
class1 0.0116 0.0312
class2 0.0536 0.0642
class3 0.0524 0.0194
class4 0.0864 0.0412
class5 0.0783 0.0052
class6 0.0363 0.0256
class7 0.0824 0.1145
class8 0.6679 0.0177
class9 0.0192 0.7466
Table continues below
  class1 class2 class3 class4 class5 class6 class7
prob>0.7 51.96 56.67 44 57.58 56.59 46.67 37.02
prob>0.8 40.2 53.33 33.33 45.45 47.29 42.22 23.83
prob>0.9 22.55 31.67 26.67 30.3 32.56 33.33 14.89
  class8 class9
prob>0.7 45.12 59.07
prob>0.8 37.2 47.15
prob>0.9 28.66 31.09
Table continues below
  class1 class2 class3 class4 class5 class6 class7 class8
N 102 43 36 94 118 188 133 59
% 9.85 4.15 3.47 9.07 11.39 18.15 12.84 5.69
  class9 class10
N 54 209
% 5.21 20.17
Table continues below
  prob1 prob2 prob3 prob4 prob5 prob6 prob7
class1 0.73 0.0125 0.0207 0.0026 0.0149 0.099 0.0706
class2 0.0206 0.6965 0.0302 0.0467 0.046 0.0962 0.0253
class3 0.0818 0.0227 0.7056 0.0027 0.0599 0.0816 0.006
class4 0.005 0.0265 8e-04 0.7029 0.0226 0.0629 0.0653
class5 0.0291 0.0232 0.0871 0.0133 0.5886 0.0965 0.0037
class6 0.0975 0.0353 0.0398 0.0463 0.09 0.5336 0.0356
class7 0.0636 0.0162 0.0015 0.0791 0.004 0.0549 0.7225
class8 0.0314 0.0205 0.0023 0.084 0.0063 0.0772 0.0692
class9 0.0505 0.0023 0.0088 0.046 0.0046 0.0922 0.03
class10 0.0191 0.0108 0.0163 0.0158 0.1163 0.0539 0.0018
  prob8 prob9 prob10
class1 0.016 0.0267 0.0069
class2 0.0102 3e-04 0.028
class3 8e-04 0.0074 0.0316
class4 0.0501 0.0306 0.0333
class5 0.0032 0.0107 0.1445
class6 0.0309 0.037 0.0541
class7 0.0385 0.0149 0.0048
class8 0.6971 0.007 0.0049
class9 0.004 0.7179 0.0437
class10 0.0041 0.0389 0.7229
Table continues below
  class1 class2 class3 class4 class5 class6 class7
prob>0.7 53.92 48.84 52.78 52.13 30.51 21.28 54.89
prob>0.8 46.08 41.86 47.22 38.3 25.42 13.83 47.37
prob>0.9 36.27 32.56 33.33 25.53 16.95 4.79 33.83
  class8 class9 class10
prob>0.7 49.15 59.26 54.55
prob>0.8 42.37 53.7 43.06
prob>0.9 28.81 31.48 32.54
Table continues below
  class1 class2 class3 class4 class5 class6 class7 class8
N 133 189 118 103 10 199 36 42
% 12.84 18.24 11.39 9.94 0.97 19.21 3.47 4.05
  class9 class10 class11
N 51 97 58
% 4.92 9.36 5.6
Table continues below
  prob1 prob2 prob3 prob4 prob5 prob6 prob7
class1 0.7214 0.0564 0.004 0.0625 5e-04 0.0047 0.0015
class2 0.036 0.5301 0.0906 0.0972 0.012 0.0519 0.0387
class3 0.0037 0.0953 0.5825 0.0285 0.019 0.1379 0.0849
class4 0.0695 0.102 0.016 0.7222 0.005 0.0084 0.0204
class5 0 0.0287 0.087 0 0.7155 0.0924 0
class6 0.0019 0.051 0.1141 0.0173 0.0313 0.6994 0.0161
class7 0.006 0.0824 0.0613 0.0805 0.003 0.0308 0.7028
class8 0.0255 0.0972 0.0482 0.0204 4e-04 0.0194 0.0297
class9 0.0264 0.0875 0.0044 0.048 0.0246 0.043 0.0086
class10 0.0647 0.0672 0.0219 0.0048 0.0057 0.0322 8e-04
class11 0.0655 0.0793 0.0065 0.0309 7e-04 0.0049 0.0023
  prob8 prob9 prob10 prob11
class1 0.0161 0.015 0.0809 0.037
class2 0.0345 0.0346 0.0442 0.0303
class3 0.0227 0.009 0.0133 0.0032
class4 0.0122 0.0258 0.0025 0.0158
class5 0 0.0395 0.0368 0
class6 0.0132 0.0362 0.0151 0.0045
class7 0.0225 0.0072 0.0028 7e-04
class8 0.6994 3e-04 0.0493 0.0101
class9 0.0024 0.7202 0.0308 0.0041
class10 0.0259 0.0263 0.7003 0.0504
class11 0.0203 0.0065 0.0835 0.6997
Table continues below
  class1 class2 class3 class4 class5 class6 class7
prob>0.7 54.89 21.69 29.66 52.43 50 50.75 52.78
prob>0.8 47.37 13.76 23.73 45.63 40 41.21 44.44
prob>0.9 33.83 4.23 16.95 34.95 30 29.15 33.33
  class8 class9 class10 class11
prob>0.7 47.62 56.86 52.58 48.28
prob>0.8 42.86 50.98 40.21 43.1
prob>0.9 33.33 33.33 25.77 29.31
Code
rm(output)
Code
pprobs.2 <- c()
groups.2 <- models.fcal[[2]]
pprobs.3 <- c()
groups.3 <- models.fcal[[3]]
pprobs.4 <- c()
groups.4 <- models.fcal[[4]]
pprobs.5 <- c()
groups.5 <- models.fcal[[5]]
pprobs.6 <- c()
groups.6 <- models.fcal[[6]]
pprobs.7 <- c()
groups.7 <- models.fcal[[7]]
pprobs.8 <- c()
groups.8 <- models.fcal[[8]]
pprobs.9 <- c()
groups.9 <- models.fcal[[9]]
pprobs.10 <- c()
groups.10 <- models.fcal[[10]]
pprobs.11 <- c()
groups.11 <- models.fcal[[11]]


for (i in 1:nrow(models.fcal[[2]]$pprob)) {
  class.2 <- groups.2$pprob[i, 2]
  pprobs.2 <- c(pprobs.2, groups.2$pprob[i, class.2 + 2])
  class.3 <- groups.3$pprob[i, 2]
  pprobs.3 <- c(pprobs.3, groups.3$pprob[i, class.3 + 2])
  class.4 <- groups.4$pprob[i, 2]
  pprobs.4 <- c(pprobs.4, groups.4$pprob[i, class.4 + 2])
  class.5 <- groups.5$pprob[i, 2]
  pprobs.5 <- c(pprobs.5, groups.5$pprob[i, class.5 + 2])
  class.6 <- groups.6$pprob[i, 2]
  pprobs.6 <- c(pprobs.6, groups.6$pprob[i, class.6 + 2])
  class.7 <- groups.7$pprob[i, 2]
  pprobs.7 <- c(pprobs.7, groups.7$pprob[i, class.7 + 2])
  class.8 <- groups.8$pprob[i, 2]
  pprobs.8 <- c(pprobs.8, groups.8$pprob[i, class.8 + 2])
  class.9 <- groups.9$pprob[i, 2]
  pprobs.9 <- c(pprobs.9, groups.9$pprob[i, class.9 + 2])
  class.10 <- groups.10$pprob[i, 2]
  pprobs.10 <- c(pprobs.10, groups.10$pprob[i, class.10 + 2])
  class.11 <- groups.11$pprob[i, 2]
  pprobs.11 <- c(pprobs.11, groups.11$pprob[i, class.11 + 2])
}
pprobs.2 <- tibble(prob = pprobs.2)
pprobs.3 <- tibble(prob = pprobs.3)
pprobs.4 <- tibble(prob = pprobs.4)
pprobs.5 <- tibble(prob = pprobs.5)
pprobs.6 <- tibble(prob = pprobs.6)
pprobs.7 <- tibble(prob = pprobs.7)
pprobs.8 <- tibble(prob = pprobs.8)
pprobs.9 <- tibble(prob = pprobs.9)
pprobs.10 <- tibble(prob = pprobs.10)
pprobs.11 <- tibble(prob = pprobs.11)

pprobs.2$Model <- as.factor(rep("Two clusters", nrow(pprobs.2)))
pprobs.3$Model <- as.factor(rep("Three clusters", nrow(pprobs.3)))
pprobs.4$Model <- as.factor(rep("Four clusters", nrow(pprobs.4)))
pprobs.5$Model <- as.factor(rep("Five clusters", nrow(pprobs.5)))
pprobs.6$Model <- as.factor(rep("Six clusters", nrow(pprobs.6)))
pprobs.7$Model <- as.factor(rep("Seven clusters", nrow(pprobs.7)))
pprobs.8$Model <- as.factor(rep("Eight clusters", nrow(pprobs.8)))
pprobs.9$Model <- as.factor(rep("Nine clusters", nrow(pprobs.9)))
pprobs.10$Model <- as.factor(rep("Ten clusters", nrow(pprobs.10)))
pprobs.11$Model <- as.factor(rep("Eleven clusters", nrow(pprobs.11)))
pprobs <- rbind(
  pprobs.2,
  pprobs.3,
  pprobs.4,
  pprobs.5,
  pprobs.6,
  pprobs.7,
  pprobs.8,
  pprobs.9,
  pprobs.10,
  pprobs.11
)

p <- pprobs %>%
  ggplot(aes(x = prob, y = Model)) +
  stat_slab(
    fill = "#5D5179", color = "gray",
    size = 0.8,
    alpha = 0.7
  ) +
  geom_dots(color = "#4F759B", dotsize = 1) +
  xlab("Posterior probability for cluster membership") +
  ylab("") +
  ggtitle(
    "Distribution of Posterior Probabilities Across Models",
    "Subject-specific posterior probabilities for assigned cluster"
  ) +
  theme_minimal() +
  scale_y_discrete(limits = rev)
ggsave("plots/pprob-distribution.png",
  p,
  width = 8.5,
  height = 7.5,
  units = "in"
)
ggsave("plots/pprob-distribution.pdf",
  p,
  width = 8.5,
  height = 7.5,
  units = "in"
)
print(p)

Residual plots

##### G = 3 ##### G = 4 ##### G = 5 ##### G = 6 ##### G = 7 ##### G = 8 ##### G = 9 ##### G = 10 ##### G = 11

Chosen model
Code
p1 <- data.frame(residuals = resid(models.fcal[[8]])) %>%
  ggplot(aes(x = residuals)) +
  geom_histogram(aes(y = after_stat(density)),
    fill = "#D8829D",
    color = "#AF6A80",
    bins = 30
  ) +
  geom_density(color = "#023777", linewidth = 1.2) +
  theme_classic() +
  theme(axis.line = element_line(colour = "gray")) +
  ylab("Density") +
  xlab("Residuals") +
  ggtitle("A")

p2 <- data.frame(residuals = resid(models.fcal[[8]])) %>%
  ggplot(aes(sample = residuals)) +
  stat_qq_band() +
  stat_qq_line(color = "#D8829D") +
  stat_qq_point(color = "#023777") +
  theme_classic() +
  theme(axis.line = element_line(colour = "gray")) +
  ylab("Theoretical Quantiles") +
  xlab("Sample Quantiles") +
  ggtitle("B")
ggsave("paper/Residual-plot-fcal-ncs.pdf",
  plot = p1 / p2,
  width = 8,
  height = 8,
  units = "in"
)
print(p1 / p2)
Figure 7: Plots assessing normality for the distribution of residuals for the chosen NCS model for FCAL. (A) Histrogram; (B) Q-Q plot.
Code
predClass <- models.fcal[[8]]$pprob[, c("ids", "class")]
temp <- merge(models.fcal[[8]]$pred, predClass, by = "ids")
labels <- c("A", "B", "C", "D", "E", "F", "G", "H")
p <- list()

for (i in 1:8) {
  p[[i]] <- subset(temp, class == i) %>%
    ggplot(aes(sample = resid_ss)) +
    stat_qq_band() +
    stat_qq_line(color = "#D8829D") +
    stat_qq_point(color = "#023777") +
    theme_classic() +
    theme(axis.line = element_line(colour = "gray")) +
    ylab("Theoretical Quantiles") +
    xlab("Sample Quantiles") +
    ggtitle(labels[i])
}

ArrangedPlot <- (p[[1]] + p[[2]]) /
  (p[[3]] + p[[4]]) /
  (p[[5]] + p[[6]]) /
  (p[[7]] + p[[8]])

ggsave("paper/cluster-resids-ncs.pdf",
  ArrangedPlot,
  width = 8,
  height = 8,
  units = "in"
)
print(ArrangedPlot)
Figure 8: Q-Q plots assessing normality for the distribution of residuals stratifed by cluster assignment.

Model metrics

Code
summaryplot(models.fcal[[4]],
  models.fcal[[5]],
  models.fcal[[6]],
  models.fcal[[7]],
  models.fcal[[8]],
  models.fcal[[9]],
  models.fcal[[10]],
  models.fcal[[11]],
  which = c("loglik", "AIC", "BIC", "entropy", "ICL"),
  mfrow = c(2, 3),
  axis = "Class"
)
Figure 9: Model metrics for FCAL for G = 4-9
Code
fcal.metrics <- makeMetrics(G.fcal, models.fcal)
buildDT(fcal.metrics)
Clusters Maximum log-likelihood AIC BIC
2 -16532.10 33094.20 33168.35
3 -16354.39 32754.78 32868.47
4 -16260.21 32582.42 32735.66
5 -16201.86 32481.71 32674.49
6 -16141.82 32377.63 32609.96
7 -16109.77 32329.54 32601.42
8 -16076.89 32279.77 32591.19
9 -16039.66 32221.33 32572.29
10 -16015.29 32188.59 32579.09
11 -16005.55 32185.10 32615.16

The optimum AIC is 32185.1 with cluster 13. The optimum BIC is 32572.29 with cluster 11

Spaghetti plots per cluster

Code
for (G in G.fcal) {
  # Data frame to hold processed data
  new.fc <- data.frame(
    ids = numeric(),
    calpro_result = numeric(),
    calpro_time = numeric(),
    class = numeric()
  )

  for (clust in 1:G) {
    ids.clust <- subset(models.fcal[[G]]$pprob, class == clust)$ids
    n.clust <- length(ids.clust)
    rand <- sample(n.clust, n.clust) # Randomise the order of the ids
    iters <- floor(n.clust / 6) # How many groups of six are there?

    # Matrix to hold the smoothed data
    fcal.ma <- matrix(NA, nrow = iters, ncol = 7)
    fcal.time <- matrix(NA, nrow = iters, ncol = 7)

    for (i in 0:(iters - 1)) {
      # Find ids for group of five
      ids.select <- ids.clust[rand[((i * 6) + 1):((i * 6) + 6)]]
      fcal.subset <- subset(fcal, ids %in% ids.select)
      # Median process as per CRP preprocessing
      for (j in seq(0, 6)) {
        if (j == 6) {
          sub.obs <- subset(
            fcal.subset,
            calpro_time >= j - 0.5 & calpro_time <= j + 1
          )
        } else {
          sub.obs <- subset(
            fcal.subset,
            calpro_time >= j - 0.5 & calpro_time < j + 0.5
          )
        }
        if (nrow(sub.obs) > 0) {
          fcal.ma[i + 1, j + 1] <- median(sub.obs$calpro_result)
          fcal.time[i + 1, j + 1] <- median(sub.obs$calpro_time)
        }
      }
    }

    rownames(fcal.ma) <- 1:iters
    fcal.ma <- reshape2::melt(t(fcal.ma),
      id.vars = row.names(fcal.ma),
      na.rm = TRUE
    )
    colnames(fcal.ma) <- c("calpro_time", "ids", "calpro_result")
    fcal.ma <- fcal.ma[, c(2, 3, 1)] # Make ids first column

    rownames(fcal.time) <- 1:iters
    fcal.time <- reshape2::melt(t(fcal.time),
      id.vars = row.names(fcal.time),
      na.rm = TRUE
    )
    colnames(fcal.time) <- c("calpro_time", "ids", "pred_time")
    fcal.time <- fcal.time[, c(2, 3, 1)] # Make ids first column

    fcal.ma <- merge(fcal.ma, fcal.time, by = c("ids", "calpro_time"))
    fcal.ma <- fcal.ma %>%
      mutate(calpro_time = pred_time) %>%
      select(-pred_time)

    fcal.ma$class <- clust # Identify cluster assignment
    new.fc <- rbind(new.fc, fcal.ma)
  }

  if (!dir.exists("plots/spaghetti")) dir.create("plots/spaghetti")
  png(paste0("plots/spaghetti/fcal-ncs-", G, ".png"),
    width = 10,
    height = 17.5,
    units = "in",
    res = 300
  )
  grid::grid.newpage()
  spaghettiPlot(new.fc,
    models.fcal,
    G = G,
    log = TRUE,
    tmax = 7,
    sizes = TRUE,
    knots = FALSE,
    var.time = "calpro_time",
    clusters = TRUE
  )
  invisible(dev.off())

  cairo_pdf(paste0("plots/spaghetti/fcal-ncs-", G, ".pdf"),
    width = 10,
    height = 17.5
  )
  grid::grid.newpage()
  spaghettiPlot(new.fc,
    models.fcal,
    G = G,
    log = TRUE,
    tmax = 7,
    sizes = TRUE,
    knots = FALSE,
    var.time = "calpro_time",
    clusters = TRUE
  )
  invisible(dev.off())

  grid::grid.newpage()
  print(spaghettiPlot(new.fc,
    models.fcal,
    G = G,
    log = TRUE,
    tmax = 7,
    sizes = TRUE,
    knots = FALSE,
    var.time = "calpro_time",
    clusters = TRUE
  ))
}
NULL

NULL

NULL

NULL

NULL

NULL

NULL

NULL

NULL

NULL

Chosen model

Code
new.fc <- data.frame(
  ids = numeric(),
  calpro_result = numeric(),
  calpro_time = numeric(),
  class = numeric()
)

for (clust in 1:8) {
  ids.clust <- subset(models.fcal[[8]]$pprob, class == clust)$ids
  n.clust <- length(ids.clust)
  rand <- sample(n.clust, n.clust) # Randomise the order of the ids
  iters <- floor(n.clust / 6) # How many groups of six are there?

  # Matrix to hold the smoothed data
  fcal.ma <- matrix(NA, nrow = iters, ncol = 7)
  fcal.time <- matrix(NA, nrow = iters, ncol = 7)


  for (i in 0:(iters - 1)) {
    # Find ids for group of five
    ids.select <- ids.clust[rand[((i * 6) + 1):((i * 6) + 6)]]
    fcal.subset <- subset(fcal, ids %in% ids.select)
    # Median process as per CRP preprocessing
    for (j in seq(0, 6)) {
      if (j == 6) {
        sub.obs <- subset(
          fcal.subset,
          calpro_time >= j - 0.5 & calpro_time <= j + 1
        )
      } else {
        sub.obs <- subset(
          fcal.subset,
          calpro_time >= j - 0.5 & calpro_time < j + 0.5
        )
      }
      if (nrow(sub.obs) > 0) {
        fcal.ma[i + 1, j + 1] <- median(sub.obs$calpro_result)
        fcal.time[i + 1, j + 1] <- median(sub.obs$calpro_time)
      }
    }
  }

  rownames(fcal.ma) <- 1:iters
  fcal.ma <- reshape2::melt(t(fcal.ma),
    id.vars = row.names(fcal.ma),
    na.rm = TRUE
  )
  colnames(fcal.ma) <- c("calpro_time", "ids", "calpro_result")
  fcal.ma <- fcal.ma[, c(2, 3, 1)] # Make ids first column


  rownames(fcal.time) <- 1:iters
  fcal.time <- reshape2::melt(t(fcal.time),
    id.vars = row.names(fcal.time),
    na.rm = TRUE
  )
  colnames(fcal.time) <- c("calpro_time", "ids", "pred_time")
  fcal.time <- fcal.time[, c(2, 3, 1)] # Make ids first column

  fcal.ma <- merge(fcal.ma, fcal.time, by = c("ids", "calpro_time"))
  fcal.ma <- fcal.ma %>%
    mutate(calpro_time = pred_time) %>%
    select(-pred_time)

  fcal.ma$class <- clust # Identify cluster assignment
  new.fc <- rbind(new.fc, fcal.ma)
}

As. a parsimonious choice, we have elected to focus on the 8-cluster model as we believe this strikes a reasonable balance between characterising the LIBDR cohort whilst still presenting trends which likely reflect the IBD population in Western Europe.

Labelling

The labels/order of the clusters generated above are random. To improve the readability and assist with interpretation of associations, it would be beneficial to order clusters by some analogue for disease severity. This section explores possible solutions for this task.

Cumulative inflammation across follow-up

One possible solution is to order clusters by cumulative inflammation which is given by the area under the curve for each mean cluster trajectory.

Code
rank.full <- rankCumulative(models.fcal[[8]], tmax = 7)

rank.full %>%
  ggplot(aes(x = paste0("FC", New), y = Area)) +
  geom_bar(stat = "identity", fill = "#745C97", color = "#39375B") +
  ylab("Cumulative inflammation (Area)") +
  xlab("Cluster") +
  theme_minimal()

Code
png(paste0("plots/spaghetti/fcal-ncs-reordered.png"),
  width = 10,
  height = 16,
  units = "in",
  res = 300
)
grid::grid.newpage()
spaghettiPlot(new.fc,
  models.fcal,
  G = 8,
  log = TRUE,
  tmax = 7,
  sizes = TRUE,
  knots = FALSE,
  var.time = "calpro_time",
  clusters = TRUE,
  mapping = rank.full$Original
)
invisible(dev.off())

cairo_pdf(paste0("plots/spaghetti/fcal-ncs-reordered.pdf"),
  width = 10,
  height = 16
)
grid::grid.newpage()
spaghettiPlot(new.fc,
  models.fcal,
  G = 8,
  log = TRUE,
  tmax = 7,
  sizes = TRUE,
  knots = FALSE,
  var.time = "calpro_time",
  clusters = TRUE,
  mapping = rank.full$Original
)
invisible(dev.off())

spaghettiPlot(new.fc,
  models.fcal,
  G = 8,
  log = TRUE,
  tmax = 7,
  sizes = TRUE,
  knots = FALSE,
  var.time = "calpro_time",
  clusters = TRUE,
  mapping = rank.full$Original
)

With Danish predictions

Code
dk.fcal <- dk.fcal %>%
  mutate(cluster = str_replace_all(cluster, "Ypred_class", ""))
Code
dk.fcal$cluster <- plyr::mapvalues(dk.fcal$cluster,
                                   from = seq(1, 8),
                                   to = c(3, 1, 7, 5, 8, 2, 4, 6))
Code
cairo_pdf(paste0("plots/spaghetti/fcal-scot-dk.pdf"),
  width = 10,
  height = 16
)
grid::grid.newpage()
spaghettiPlot(new.fc,
  models.fcal,
  G = 8,
  log = TRUE,
  tmax = 7,
  sizes = TRUE,
  knots = FALSE,
  var.time = "calpro_time",
  clusters = TRUE,
  mapping = rank.full$Original,
  external = dk.fcal
)
invisible(dev.off())

png(paste0("plots/spaghetti/fcal-scot-dk.png"),
  width = 10,
  height = 16,
  units = "in",
  res = 300
)
grid::grid.newpage()
spaghettiPlot(new.fc,
  models.fcal,
  G = 8,
  log = TRUE,
  tmax = 7,
  sizes = TRUE,
  knots = FALSE,
  var.time = "calpro_time",
  clusters = TRUE,
  mapping = rank.full$Original,
  external = dk.fcal
)
invisible(dev.off())

grid::grid.newpage()
spaghettiPlot(new.fc,
  models.fcal,
  G = 8,
  log = TRUE,
  tmax = 7,
  sizes = TRUE,
  knots = FALSE,
  var.time = "calpro_time",
  clusters = TRUE,
  mapping = rank.full$Original,
  external = dk.fcal
)

Code
for (G in (9:10)) {
  new.fc <- data.frame(
    ids = numeric(),
    calpro_result = numeric(), 
    calpro_time = numeric(),
    class = numeric()
  )

  for (clust in 1:G) {
    ids.clust <- subset(models.fcal[[G]]$pprob, class == clust)$ids
    n.clust <- length(ids.clust)
    rand <- sample(n.clust, n.clust) # Randomise the order of the ids
    iters <- floor(n.clust / 6) # How many groups of six are there?

    # Matrix to hold the smoothed data
    fcal.ma <- matrix(NA, nrow = iters, ncol = 7)
    fcal.time <- matrix(NA, nrow = iters, ncol = 7)

    for (i in 0:(iters - 1)) {
      # Find ids for group of five
      ids.select <- ids.clust[rand[((i * 6) + 1):((i * 6) + 6)]]
      fcal.subset <- subset(fcal, ids %in% ids.select)
      # Median process as per CRP preprocessing
      for (j in seq(0, 6)) {
        if (j == 6) {
          sub.obs <- subset(
            fcal.subset,
            calpro_time >= j - 0.5 & calpro_time <= j + 1
          )
        } else {
          sub.obs <- subset(
            fcal.subset,
            calpro_time >= j - 0.5 & calpro_time < j + 0.5
          )
        }
        if (nrow(sub.obs) > 0) {
          fcal.ma[i + 1, j + 1] <- median(sub.obs$calpro_result)
          fcal.time[i + 1, j + 1] <- median(sub.obs$calpro_time)
        }
      }
    }

    rownames(fcal.ma) <- 1:iters
    fcal.ma <- reshape2::melt(t(fcal.ma),
      id.vars = row.names(fcal.ma),
      na.rm = TRUE
    )
    colnames(fcal.ma) <- c("calpro_time", "ids", "calpro_result")
    fcal.ma <- fcal.ma[, c(2, 3, 1)] # Make ids first column

    rownames(fcal.time) <- 1:iters
    fcal.time <- reshape2::melt(t(fcal.time),
      id.vars = row.names(fcal.time),
      na.rm = TRUE
    )
    colnames(fcal.time) <- c("calpro_time", "ids", "pred_time")
    fcal.time <- fcal.time[, c(2, 3, 1)] # Make ids first column

    fcal.ma <- merge(fcal.ma, fcal.time, by = c("ids", "calpro_time"))
    fcal.ma <- fcal.ma %>%
      mutate(calpro_time = pred_time) %>%
      select(-pred_time)

    fcal.ma$class <- clust # Identify cluster assignment
    new.fc <- rbind(new.fc, fcal.ma)
  }


  rank.full <- rankCumulative(models.fcal[[G]], tmax = 7)
  png(paste0("plots/spaghetti/fcal-ncs-reordered-", G, ".png"),
    width = 10,
    height = 16,
    units = "in",
    res = 300
  )
  grid::grid.newpage()
  spaghettiPlot(new.fc,
    models.fcal,
    G = G,
    log = TRUE,
    tmax = 7,
    sizes = TRUE,
    knots = FALSE,
    var.time = "calpro_time",
    clusters = TRUE,
    mapping = rank.full$Original
  )
  invisible(dev.off())

  cairo_pdf(paste0("plots/spaghetti/fcal-ncs-reordered-", G, ".pdf"),
    width = 10,
    height = 16
  )
  grid::grid.newpage()
  spaghettiPlot(new.fc,
    models.fcal,
    G = G,
    log = TRUE,
    tmax = 7,
    sizes = TRUE,
    knots = FALSE,
    var.time = "calpro_time",
    clusters = TRUE,
    mapping = rank.full$Original
  )
  invisible(dev.off())
}

Alluvial anchored plot

Having chosen the 8-cluster model to be the most appropriate, we have produced a version of Figure 6 using the labels assigned for the 8-cluster model.

Code
alluvial.df <- matrix(nrow = 0, ncol = 3)
colnames(alluvial.df) <- c("ids", "class", "G")
for (G in 2:10) {
  alluvial.df <- rbind(alluvial.df, cbind(models.fcal[[G]]$pprob[, 1:2], G = G))
}
alluvial.df <- as.data.frame(alluvial.df)

alluvial.df$ids <- as.character(alluvial.df$ids)
alluvial.df$class <- as.factor(alluvial.df$class)

# eliminate label switching

alluvial.df[alluvial.df[, "G"] == 2, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 2, "class"],
    from = c(1, 2),
    to = c(4, 8)
  )

alluvial.df[alluvial.df[, "G"] == 3, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 3, "class"],
    from = c(seq(1, 3)),
    to = c(8, 4, 3)
  )

alluvial.df[alluvial.df[, "G"] == 4, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 4, "class"],
    from = c(seq(1, 4)),
    to = c(8, 7, 4, 3)
  )

alluvial.df[alluvial.df[, "G"] == 5, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 5, "class"],
    from = c(seq(1, 5)),
    to = c(3, 4, 6, 8, 7)
  )

alluvial.df[alluvial.df[, "G"] == 6, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 6, "class"],
    from = c(seq(1, 6)),
    to = c(1, 7, 6, 3, 8, 4)
  )

alluvial.df[alluvial.df[, "G"] == 7, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 7, "class"],
    from = c(seq(1, 7)),
    to = c(5, 4, 6, 8, 3, 7, 1)
  )

alluvial.df[alluvial.df[, "G"] == 8, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 8, "class"],
    from = rank.full$Original,
    to = rank.full$New
  )


alluvial.df[alluvial.df[, "G"] == 9, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 9, "class"],
    from = c(seq(1, 9)),
    to = c(4, 6, 2, 9, 1, 5, 7, 3, 8)
  )


alluvial.df[alluvial.df[, "G"] == 10, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 10, "class"],
    from = c(seq(1, 10)),
    to = c(3, 5, 9, 4, 7, 10, 1, 2, 6, 8)
  )

p <- ggplot(
  alluvial.df,
  aes(
    x = G,
    stratum = class,
    alluvium = ids,
    fill = class,
    label = class
  )
) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = 0.5) +
  geom_text(stat = "stratum", size = 3) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c(viridis::inferno(8), "#F2F3C9", "white")) +
  xlab("Assumed number of clusters") +
  ylab("Frequency")
print(p)

Code
p <- p + ggtitle("", "")
ggsave("paper/alluvial-FCAL-anchored.png",
  p,
  width = 12,
  height = 6.75,
  units = "in"
)
ggsave("paper/alluvial-FCAL-anchored.pdf",
  p,
  width = 12,
  height = 6.75,
  units = "in"
)
saveRDS(p, paste0(prefix, "processed/plots/fc-alluvial.RDS"))

Fixed effects specified via Gaussian radial basis functions

Code
G.grbf <- numeric()
models.grbf <- list()
for (G.cand in 1:8) {
  file.name <- paste0(prefix, "/cache/fcal/grbf/final/fcal-", G.cand, ".RDS")
  if (file.exists(file.name)) {
    G.grbf <- c(G.grbf, G.cand)
    models.grbf[[G.cand]] <- readRDS(file.name)
  }
}
rm(G.cand)

Number of Gaussian radial basis functions

Two GRBFs

Code
knots.grbf.2 <- readRDS(paste0(prefix, "cache/fcal/grbf/k=2/grbf-1.2.RDS"))
knots.grbf.2.list <- list()
knots.grbf.2.list[[6]] <- knots.grbf.2
png("paper/grbf-plots-2.png", width = 8, height = 7, units = "in", res = 300)
spaghettiPlot(fcal,
  knots.grbf.2.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  knot.type = "equal",
  n.knots = 2,
  grbf = TRUE,
  l = 1.2,
  var.time = "calpro_time"
)
invisible(dev.off())
spaghettiPlot(fcal,
  knots.grbf.2.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  knot.type = "equal",
  n.knots = 2,
  grbf = TRUE,
  l = 1.2,
  var.time = "calpro_time"
)
Figure 10: Cluster profiles for the six-cluster model assuming two knots. The vertical teal lines indicate knot location. Mean cluster profiles are denoted by the red curves.

Three GRBFs

Code
knots.grbf.3 <- readRDS(paste0(prefix, "/cache/fcal/grbf/k=3/grbf-1.2.RDS"))
knots.grbf.3.list <- list()
knots.grbf.3.list[[6]] <- knots.grbf.3
png("paper/grbf-plots-3.png", width = 8, height = 7, units = "in", res = 300)
spaghettiPlot(fcal,
  knots.grbf.3.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  knot.type = "equal",
  n.knots = 3,
  grbf = TRUE,
  l = 1.2,
  var.time = "calpro_time"
)
invisible(dev.off())
spaghettiPlot(fcal,
  knots.grbf.3.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  knot.type = "equal",
  n.knots = 3,
  grbf = TRUE,
  l = 1.2,
  var.time = "calpro_time"
)
Figure 11: Cluster profiles for the six-cluster model assuming three knots. The vertical teal lines indicate knot location. Mean cluster profiles are denoted by the red curves.
Code
alluvial.df <- matrix(nrow = 0, ncol = 3)
colnames(alluvial.df) <- c("ids", "class", "G")
for (G in G.grbf) {
  alluvial.df <- rbind(alluvial.df, cbind(models.grbf[[G]]$pprob[, 1:2], G = G))
}
alluvial.df <- as.data.frame(alluvial.df)

alluvial.df$ids <- as.character(alluvial.df$ids)
alluvial.df$class <- as.factor(alluvial.df$class)

# eliminate label switching

alluvial.df[alluvial.df[, "G"] == 3, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 3, "class"],
    from = c(seq(1, 3)),
    to = c(1, 3, 2)
  )
#
alluvial.df[alluvial.df[, "G"] == 4, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 4, "class"],
    from = c(seq(1, 4)),
    to = c(3, 2, 1, 4)
  )

alluvial.df[alluvial.df[, "G"] == 5, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 5, "class"],
    from = c(seq(1, 5)),
    to = c(2, 4, 5, 1, 3)
  )

alluvial.df[alluvial.df[, "G"] == 6, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 6, "class"],
    from = c(seq(1, 6)),
    to = c(5, 3, 2, 4, 1, 6)
  )

alluvial.df[alluvial.df[, "G"] == 7, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 7, "class"],
    from = c(seq(1, 7)),
    to = c(5, 1, 4, 7, 3, 6, 2)
  )
#
alluvial.df[alluvial.df[, "G"] == 8, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 8, "class"],
    from = c(seq(1, 8)),
    to = c(6, 2, 4, 5, 1, 3, 7, 8)
  )


p <- ggplot(
  alluvial.df,
  aes(
    x = G,
    stratum = class,
    alluvium = ids,
    fill = class,
    label = class
  )
) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = 0.5) +
  geom_text(stat = "stratum", size = 3) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = rainbow(8)) +
  xlab("Assumed number of clusters") +
  ylab("Frequency")
print(p)

p <- p + ggtitle("", "")
ggsave("paper/alluvial-FCAL-grbf.png",
  p,
  width = 12,
  height = 6.75,
  units = "in"
)
ggsave("paper/alluvial-FCAL-grbf.pdf",
  p,
  width = 12,
  height = 6.75,
  units = "in"
)
Figure 12: Alluvial plot of cluster membership across G for FCAL
Code
knots.df <- tibble(
  Knots = c("2 GRBFs", "3 GRBFs"),
  loglik = c(
    knots.grbf.2$loglik,
    knots.grbf.3$loglik
  ),
  AIC = c(
    knots.grbf.2$AIC,
    knots.grbf.3$AIC
  ),
  BIC = c(
    knots.grbf.2$BIC,
    knots.grbf.3$BIC
  )
)
knitr::kable(knots.df,
  col.names = c(
    "Knots",
    "Maxmum log likelihood",
    "AIC",
    "BIC"
  )
)
rm(
  knots.grbf.2.list,
  knots.grbf.3.list
)
Table 2: Model statistics for differing numbers of knots for natural cubic splines
Knots Maxmum log likelihood AIC BIC
2 GRBFs -16421.02 32924.03 33126.70
3 GRBFs -16237.73 32569.45 32801.78

Number of clusters

Posterior classifications

From the below data, we can see how posterior probabilities change as the number of assumed clusters increase

  class1 class2
N 496 540
% 47.88 52.12
  prob1 prob2
class1 0.8024 0.1976
class2 0.2024 0.7976
  class1 class2
prob>0.7 68.35 67.59
prob>0.8 55.24 51.48
prob>0.9 36.09 34.81
  class1 class2 class3
N 310 497 229
% 29.92 47.97 22.1
  prob1 prob2 prob3
class1 0.7767 0.1262 0.0971
class2 0.0985 0.7971 0.1044
class3 0.1083 0.13 0.7617
  class1 class2 class3
prob>0.7 66.13 68.21 60.26
prob>0.8 52.58 55.73 48.03
prob>0.9 34.19 39.64 35.81
  class1 class2 class3 class4
N 277 195 311 253
% 26.74 18.82 30.02 24.42
  prob1 prob2 prob3 prob4
class1 0.7524 0.0494 0.0663 0.1319
class2 0.0553 0.7415 0.0991 0.1041
class3 0.0648 0.0963 0.7746 0.0643
class4 0.1593 0.1106 0.0696 0.6606
  class1 class2 class3 class4
prob>0.7 59.57 57.44 67.2 39.13
prob>0.8 48.01 47.69 55.31 27.27
prob>0.9 32.85 34.36 35.69 19.37
  class1 class2 class3 class4 class5
N 114 268 204 166 284
% 11 25.87 19.69 16.02 27.41
  prob1 prob2 prob3 prob4 prob5
class1 0.7472 0.0951 0.0911 0.0093 0.0572
class2 0.0985 0.6457 0.0624 0.0462 0.1472
class3 0.0937 0.0659 0.7086 0.1028 0.029
class4 0.0181 0.0503 0.1333 0.7194 0.0788
class5 0.0527 0.13 0.0236 0.0526 0.7411
  class1 class2 class3 class4 class5
prob>0.7 57.02 35.82 55.39 53.01 59.51
prob>0.8 49.12 25 40.69 39.76 47.89
prob>0.9 33.33 19.78 27.45 27.71 32.04
  class1 class2 class3 class4 class5 class6
N 186 256 117 281 150 46
% 17.95 24.71 11.29 27.12 14.48 4.44
  prob1 prob2 prob3 prob4 prob5 prob6
class1 0.6971 0.0182 0.0826 0.0555 0.0902 0.0564
class2 0.0114 0.7392 0.0467 0.1273 0.0372 0.0382
class3 0.0834 0.0391 0.7249 0.1022 0.0093 0.0411
class4 0.0578 0.1309 0.1022 0.6399 0.0436 0.0256
class5 0.1135 0.054 0.0201 0.053 0.713 0.0464
class6 0.0639 0.0941 0.0575 0.0176 0.063 0.7039
  class1 class2 class3 class4 class5 class6
prob>0.7 52.15 58.2 53.85 36.3 52 50
prob>0.8 38.17 46.09 47.86 24.56 42.67 41.3
prob>0.9 26.88 33.59 34.19 18.86 28 19.57
  class1 class2 class3 class4 class5 class6 class7
N 139 149 274 33 255 60 126
% 13.42 14.38 26.45 3.19 24.61 5.79 12.16
  prob1 prob2 prob3 prob4 prob5 prob6 prob7
class1 0.7097 0.0839 0.0394 0.0193 0.0117 0.0553 0.0807
class2 0.0956 0.6875 0.059 0.0554 0.0344 0.0357 0.0325
class3 0.0391 0.0405 0.6494 0.0112 0.1266 0.027 0.1062
class4 0.0458 0.1192 0.0446 0.7189 0.0409 0.0301 5e-04
class5 0.0067 0.025 0.1272 0.012 0.7453 0.0377 0.046
class6 0.0685 0.059 0.031 0.0247 0.088 0.6833 0.0455
class7 0.0703 0.0331 0.093 4e-04 0.033 0.0447 0.7255
  class1 class2 class3 class4 class5 class6 class7
prob>0.7 56.12 48.32 40.51 51.52 61.18 46.67 52.38
prob>0.8 41.01 32.21 26.64 45.45 47.45 40 46.83
prob>0.9 27.34 18.79 18.98 36.36 33.73 26.67 33.33
  class1 class2 class3 class4 class5 class6 class7 class8
N 70 131 140 166 114 208 66 141
% 6.76 12.64 13.51 16.02 11 20.08 6.37 13.61
  prob1 prob2 prob3 prob4 prob5 prob6 prob7 prob8
class1 0.644 0.067 0.0342 0.0562 0.0262 0.0728 0.0586 0.041
class2 0.0451 0.6683 0.0792 0.0712 0.0344 0.0258 0.0059 0.0702
class3 0.0283 0.0741 0.5651 0.0419 0.0306 0.0622 0.0347 0.1632
class4 0.0474 0.0777 0.033 0.6829 0.0674 0.0119 0.056 0.0237
class5 0.0297 0.0314 0.038 0.1 0.6569 0.0177 0.1012 0.0251
class6 0.0385 0.0292 0.0608 0.0029 0.0087 0.7125 0.021 0.1264
class7 0.0443 0.0015 0.0264 0.036 0.1159 0.0453 0.697 0.0337
class8 0.0291 0.0683 0.1686 0.0174 0.0117 0.1331 0.0165 0.5553
  class1 class2 class3 class4 class5 class6 class7 class8
prob>0.7 40 47.33 30 51.2 45.61 52.4 51.52 24.11
prob>0.8 32.86 41.22 22.86 42.77 29.82 42.79 34.85 19.86
prob>0.9 21.43 31.3 12.14 28.31 13.16 29.33 28.79 10.64
Code
rm(output)
Code
pprobs.2 <- c()
groups.2 <- models.grbf[[2]]
pprobs.3 <- c()
groups.3 <- models.grbf[[3]]
pprobs.4 <- c()
groups.4 <- models.grbf[[4]]
pprobs.5 <- c()
groups.5 <- models.grbf[[5]]
pprobs.6 <- c()
groups.6 <- models.grbf[[6]]
pprobs.7 <- c()
groups.7 <- models.grbf[[7]]
pprobs.8 <- c()
groups.8 <- models.grbf[[8]]


for (i in 1:nrow(models.grbf[[2]]$pprob)) {
  class.2 <- groups.2$pprob[i, 2]
  pprobs.2 <- c(pprobs.2, groups.2$pprob[i, class.2 + 2])
  class.3 <- groups.3$pprob[i, 2]
  pprobs.3 <- c(pprobs.3, groups.3$pprob[i, class.3 + 2])
  class.4 <- groups.4$pprob[i, 2]
  pprobs.4 <- c(pprobs.4, groups.4$pprob[i, class.4 + 2])
  class.5 <- groups.5$pprob[i, 2]
  pprobs.5 <- c(pprobs.5, groups.5$pprob[i, class.5 + 2])
  class.6 <- groups.6$pprob[i, 2]
  pprobs.6 <- c(pprobs.6, groups.6$pprob[i, class.6 + 2])
  class.7 <- groups.7$pprob[i, 2]
  pprobs.7 <- c(pprobs.7, groups.7$pprob[i, class.7 + 2])
  class.8 <- groups.8$pprob[i, 2]
  pprobs.8 <- c(pprobs.8, groups.8$pprob[i, class.8 + 2])
}
pprobs.2 <- tibble(prob = pprobs.2)
pprobs.3 <- tibble(prob = pprobs.3)
pprobs.4 <- tibble(prob = pprobs.4)
pprobs.5 <- tibble(prob = pprobs.5)
pprobs.6 <- tibble(prob = pprobs.6)
pprobs.7 <- tibble(prob = pprobs.7)
pprobs.8 <- tibble(prob = pprobs.8)

pprobs.2$Model <- as.factor(rep("Two clusters", nrow(pprobs.2)))
pprobs.3$Model <- as.factor(rep("Three clusters", nrow(pprobs.3)))
pprobs.4$Model <- as.factor(rep("Four clusters", nrow(pprobs.4)))
pprobs.5$Model <- as.factor(rep("Five clusters", nrow(pprobs.5)))
pprobs.6$Model <- as.factor(rep("Six clusters", nrow(pprobs.6)))
pprobs.7$Model <- as.factor(rep("Seven clusters", nrow(pprobs.7)))
pprobs.8$Model <- as.factor(rep("Eight clusters", nrow(pprobs.8)))
pprobs <- rbind(
  pprobs.2,
  pprobs.3,
  pprobs.4,
  pprobs.5,
  pprobs.6,
  pprobs.7,
  pprobs.8
)

p <- pprobs %>%
  ggplot(aes(x = prob, y = Model)) +
  # geom_histogram(bins = 40, fill = NA, position="identity")
  stat_slab(
    fill = "#5D5179", color = "gray",
    size = 0.8,
    alpha = 0.7
  ) +
  geom_dots(color = "#4F759B", dotsize = 1) +
  xlab("Posterior probability for cluster membership") +
  ylab("") +
  ggtitle(
    "Distribution of Posterior Probabilities Across Models",
    "Subject-specific posterior probabilities for assigned cluster"
  ) +
  theme_minimal() +
  scale_y_discrete(limits = rev)
ggsave("plots/pprob-distribution-grbf.png",
  p,
  width = 8.5,
  height = 7.5,
  units = "in"
)
ggsave("plots/pprob-distribution-grbf.pdf",
  p,
  width = 8.5,
  height = 7.5,
  units = "in"
)
print(p)

Residual plots

##### G = 3 ##### G = 4 ##### G = 5 ##### G = 6 ##### G = 7 ##### G = 8

Chosen model
Code
p1 <- data.frame(residuals = resid(models.grbf[[6]])) %>%
  ggplot(aes(x = residuals)) +
  geom_histogram(aes(y = after_stat(density)),
    fill = "#D8829D",
    color = "#AF6A80",
    bins = 30
  ) +
  geom_density(color = "#023777", size = 1.2) +
  theme_classic() +
  theme(axis.line = element_line(colour = "gray")) +
  ylab("Density") +
  xlab("Residuals") +
  ggtitle("A")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Code
p2 <- data.frame(residuals = resid(models.grbf[[6]])) %>%
  ggplot(aes(sample = residuals)) +
  stat_qq_band() +
  stat_qq_line(color = "#D8829D") +
  stat_qq_point(color = "#023777") +
  theme_classic() +
  theme(axis.line = element_line(colour = "gray")) +
  ylab("Theoretical Quantiles") +
  xlab("Sample Quantiles") +
  ggtitle("B")
ggsave("paper/Residual-plot-fcal-grbf.pdf",
  plot = p1 / p2,
  width = 8,
  height = 8,
  units = "in"
)
print(p1 / p2)
Figure 13: Plots assessing normality for the distribution of residuals for the chosen GRBF model for FCAL. (A) Histrogram; (B) Q-Q plot.
Code
predClass <- models.grbf[[6]]$pprob[, c("ids", "class")]
temp <- merge(models.grbf[[6]]$pred, predClass, by = "ids")
labels <- c("A", "B", "C", "D", "E", "F")
p <- list()

for (i in 1:6) {
  p[[i]] <- subset(temp, class == i) %>%
    ggplot(aes(sample = resid_ss)) +
    stat_qq_band() +
    stat_qq_line(color = "#D8829D") +
    stat_qq_point(color = "#023777") +
    theme_classic() +
    theme(axis.line = element_line(colour = "gray")) +
    ylab("Theoretical Quantiles") +
    xlab("Sample Quantiles") +
    ggtitle(labels[i])
}

ArrangedPlot <- (p[[1]] + p[[2]]) / (p[[3]] + p[[4]]) / (p[[5]] + p[[6]])

ggsave("paper/cluster-resids-grbf.pdf",
  ArrangedPlot,
  width = 8,
  height = 8,
  units = "in"
)
print(ArrangedPlot)
Figure 14: Q-Q plots assessing normality for the distribution of residuals stratifed by cluster assignment.

Model metrics

Code
summaryplot(models.grbf[[2]],
  models.grbf[[3]],
  models.grbf[[4]],
  models.grbf[[5]],
  models.grbf[[6]],
  models.grbf[[7]],
  #  models.grbf[[8]],
  which = c("loglik", "AIC", "BIC", "entropy", "ICL"),
  mfrow = c(2, 3),
  axis = "Class"
)
Figure 15: Model metrics for FCAL for G = 2-8
Code
grbf.metrics <- makeMetrics(G.grbf, models.grbf)
buildDT(grbf.metrics)
Clusters Maximum log-likelihood AIC BIC
2 -16590.47 33210.94 33285.09
3 -16422.90 32891.80 33005.49
4 -16340.79 32743.58 32896.81
5 -16272.49 32622.98 32815.76
6 -16237.73 32569.45 32801.78
7 -16205.04 32520.07 32791.94
8 -16187.21 32500.41 32811.83

The optimum AIC is 32500.41 with cluster 10. The optimum BIC is 32791.94 with cluster 9

Spaghetti plots per cluster

Code
for (G in G.grbf) {
  cat(paste0("##### G = ", G, "\n\n"))
  cat("###### Log-scale, all subjects \n\n")
  grid::grid.newpage()
  spaghettiPlot(fcal,
    models.grbf,
    G = G,
    log = TRUE,
    tmax = 7,
    sizes = TRUE,
    knots = TRUE,
    knot.type = "equal",
    grbf = TRUE,
    var.time = "calpro_time"
  )
}
##### G = 2

###### Log-scale, all subjects 

##### G = 3

###### Log-scale, all subjects 

##### G = 4

###### Log-scale, all subjects 

##### G = 5

###### Log-scale, all subjects 

##### G = 6

###### Log-scale, all subjects 

##### G = 7

###### Log-scale, all subjects 

##### G = 8

###### Log-scale, all subjects 

IBD Associations

Code
cluster <- numeric()

dict.grbf <- subset(dict, ids %in% unique(fcal$ids))

for (id in dict.grbf$ids) {
  cluster <- c(
    cluster,
    subset(models.grbf[[6]]$pprob, ids == id)$class
  )
}


dict.grbf$diagnosis <- as.factor(dict.grbf$diagnosis)
dict.grbf$cluster <- as.factor(cluster)
rm(cluster)
Code
describe_cat("diagnosis", dict.grbf)

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
               544                380                112                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
         0.5250965          0.3667954          0.1081081          0.0000000 
Code
describe_cat("diagnosis", subset(dict.grbf, cluster == 1))

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
               117                 45                 24                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
         0.6290323          0.2419355          0.1290323          0.0000000 
Code
describe_cat("diagnosis", subset(dict.grbf, cluster == 2))

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
               173                 48                 35                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
         0.6757812          0.1875000          0.1367188          0.0000000 
Code
describe_cat("diagnosis", subset(dict.grbf, cluster == 3))

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
                72                 33                 12                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
         0.6153846          0.2820513          0.1025641          0.0000000 
Code
describe_cat("diagnosis", subset(dict.grbf, cluster == 4))

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
                70                193                 18                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
        0.24911032         0.68683274         0.06405694         0.00000000 
Code
describe_cat("diagnosis", subset(dict.grbf, cluster == 5))

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
                83                 44                 23                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
         0.5533333          0.2933333          0.1533333          0.0000000 
Code
describe_cat("diagnosis", subset(dict.grbf, cluster == 6))

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
                29                 17                  0                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
         0.6304348          0.3695652          0.0000000          0.0000000 
Fisher’s exact test
Code
fisher.test(dict.grbf$diagnosis,
  dict.grbf$cluster,
  simulate.p.value = TRUE, # calculate p-values using Monte Carlo
  B = 1e5
) # replicates for Monte Carlo test

    Fisher's Exact Test for Count Data with simulated p-value (based on
    1e+05 replicates)

data:  dict.grbf$diagnosis and dict.grbf$cluster
p-value = 1e-05
alternative hypothesis: two.sided

NCS vs GRBF comparison

Code
pred.df <- data.frame(
  calpro_time = seq(0, 7, by = 0.01),
  diagnosis = "Crohn's Disease"
)
pred.df <- rbind(
  pred.df,
  data.frame(
    calpro_time = seq(0, 7, by = 0.01),
    diagnosis = "Ulcerative colitis"
  )
)
pred.df <- rbind(
  pred.df,
  data.frame(
    calpro_time = seq(0, 7, by = 0.01),
    diagnosis = "IBDU"
  )
)

p.list <- list()

l <- 1.2
centers <- c(7 / 4, 14 / 4, 21 / 4)
pred.df$grbf1 <- exp(-1 * ((pred.df$calpro_time - centers[1])^2) / (2 * l^2))
pred.df$grbf2 <- exp(-1 * ((pred.df$calpro_time - centers[2])^2) / (2 * l^2))
pred.df$grbf3 <- exp(-1 * ((pred.df$calpro_time - centers[3])^2) / (2 * l^2))

NCS.pred <- predictY(models.fcal[[6]], pred.df, "calpro_time", draws = TRUE)$pred
GRBF.pred <- predictY(models.grbf[[6]], pred.df, "calpro_time", draws = TRUE)$pred

NCS.switch <- c(4, 1, 6, 2, 5, 3)

for (g in 1:6) {
  plot.df <- data.frame(
    time = pred.df$calpro_time,
    pred = NCS.pred[, NCS.switch[g]],
    upper = NCS.pred[, NCS.switch[g] + 6],
    lower = NCS.pred[, NCS.switch[g] + 12],
    Model = "NCS"
  )
  plot.df <- rbind(
    plot.df,
    data.frame(
      time = pred.df$calpro_time,
      pred = GRBF.pred[, g],
      upper = GRBF.pred[, g + 6],
      lower = GRBF.pred[, g + 12],
      Model = "GRBF"
    )
  )

  p <- plot.df %>%
    ggplot(aes(x = time, color = Model)) +
    geom_line(aes(y = pred)) +
    geom_line(aes(y = upper), lty = 2) +
    geom_line(aes(y = lower), lty = 2) +
    xlab("Time") +
    ylab("Predicted cluster mean") +
    ylim(3, 8) +
    theme_minimal() +
    scale_color_manual(values = c("#0E6BA8", "#EF6461"))
  p.list[[g]] <- p
}

final.plot <- (p.list[[1]] + p.list[[2]]) /
  (p.list[[3]] + p.list[[4]]) /
  (p.list[[5]] + p.list[[6]]) +
  plot_annotation(tag_levels = "A") +
  plot_layout(guides = "collect") & theme(legend.position = "bottom")


ggsave("paper/comparison.pdf",
  final.plot,
  width = 8.3,
  height = 11.7,
  units = "in"
)
ggsave("paper/comparison.png",
  final.plot,
  width = 8.3,
  height = 11.7,
  units = "in"
)
print(final.plot)

Acknowledgments

This work is funded by the Medical Research Council & University of Edinburgh via a Precision Medicine PhD studentship (MR/N013166/1, to NC-C).

This work has made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF).

Author contributions

References

Constantine-Cooke, Nathan, Karla Monterrubio-Gómez, Nikolas Plevris, Lauranne A. A. P. Derikx, Beatriz Gros, Gareth-Rhys Jones, Riccardo E. Marioni, Charlie W. Lees, and Catalina A. Vallejos. 2023. “Longitudinal Fecal Calprotectin Profiles Characterize Disease Course Heterogeneity in Crohn’s Disease.” Clinical Gastroenterology and Hepatology, March, S1542356523002343. https://doi.org/10.1016/j.cgh.2023.03.026.
Elhakeem, Ahmed, Rachael A. Hughes, Kate Tilling, Diana L. Cousminer, Stefan A. Jackowski, Tim J. Cole, Alex S. F. Kwong, et al. 2022. “Using Linear and Natural Cubic Splines, SITAR, and Latent Trajectory Models to Characterise Nonlinear Longitudinal Growth Trajectories in Cohort Studies.” BMC Medical Research Methodology 22 (1): 68. https://doi.org/10.1186/s12874-022-01542-8.
Harrell, Frank E. 2015. “General Aspects of Fitting Regression Models.” In Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, edited by Jr. Harrell Frank E., 13–44. Springer Series in Statistics. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-19425-7_2.
Proust-Lima, Cécile, Viviane Philipps, and Benoit Liquet. 2017. “Estimation of Extended Mixed Models Using Latent Classes and Latent Processes: The R Package lcmm.” Journal of Statistical Software 78 (2): 1–56. https://doi.org/10.18637/jss.v078.i02.

Reuse

Licensed by CC BY unless otherwise stated.

MRC Human Genetics Unit logoCentre for Genomic & Experimental Medicine logo


Session info

Code

R version 4.4.2 (2024-10-31)

Platform: aarch64-apple-darwin20

locale: en_GB.UTF-8||en_GB.UTF-8||en_GB.UTF-8||C||en_GB.UTF-8||en_GB.UTF-8

attached base packages: splines, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: qqplotr(v.0.0.6), pander(v.0.6.5), ggalluvial(v.0.12.5), ggdist(v.3.3.2), kml(v.2.5.0), longitudinalData(v.2.4.7), misc3d(v.0.9-1), rgl(v.1.3.16), clv(v.0.3-2.4), class(v.7.3-22), cluster(v.2.1.6), lcmm(v.2.1.0), libdr(v.1.0.1), magrittr(v.2.0.3), patchwork(v.1.3.0), lubridate(v.1.9.4), forcats(v.1.0.0), stringr(v.1.5.1), dplyr(v.1.1.4), purrr(v.1.0.2), readr(v.2.1.5), tidyr(v.1.3.1), tibble(v.3.2.1), ggplot2(v.3.5.1) and tidyverse(v.2.0.0)

loaded via a namespace (and not attached): tidyselect(v.1.2.1), viridisLite(v.0.4.2), farver(v.2.1.2), viridis(v.0.6.5), bitops(v.1.0-9), fastmap(v.1.2.0), pracma(v.2.4.4), digest(v.0.6.37), timechange(v.0.3.0), lifecycle(v.1.0.4), survival(v.3.7-0), compiler(v.4.4.2), rlang(v.1.1.4), tools(v.4.4.2), yaml(v.2.3.10), knitr(v.1.49), labeling(v.0.4.3), htmlwidgets(v.1.6.4), plyr(v.1.8.9), withr(v.3.0.2), numDeriv(v.2016.8-1.1), grid(v.4.4.2), opdisDownsampling(v.1.0.1), caTools(v.1.18.3), colorspace(v.2.1-1), extrafontdb(v.1.0), scales(v.1.3.0), iterators(v.1.0.14), MASS(v.7.3-61), cli(v.3.6.3), mvtnorm(v.1.3-2), rmarkdown(v.2.29), ragg(v.1.3.3), marqLevAlg(v.2.0.8), generics(v.0.1.3), twosamples(v.2.0.1), robustbase(v.0.99-4-1), reshape2(v.1.4.4), tzdb(v.0.4.0), parallel(v.4.4.2), base64enc(v.0.1-3), vctrs(v.0.6.5), Matrix(v.1.7-1), jsonlite(v.1.8.9), hms(v.1.1.3), pbmcapply(v.1.5.1), systemfonts(v.1.1.0), foreach(v.1.5.2), glue(v.1.8.0), DEoptimR(v.1.1-3-1), codetools(v.0.2-20), rngWELL(v.0.10-10), distributional(v.0.5.0), stringi(v.1.8.4), gtable(v.0.3.6), quadprog(v.1.5-8), extrafont(v.0.19), randtoolbox(v.2.0.5), munsell(v.0.5.1), pillar(v.1.10.1), htmltools(v.0.5.8.1), R6(v.2.5.1), textshaping(v.0.4.1), tcltk(v.4.4.2), doParallel(v.1.0.17), evaluate(v.1.0.1), lattice(v.0.22-6), qqconf(v.1.3.2), Rcpp(v.1.0.13-1), gridExtra(v.2.3), nlme(v.3.1-166), Rttf2pt1(v.1.3.12), xfun(v.0.50) and pkgconfig(v.2.0.3)