Model selection

Nathan Constantine-Cooke; Karla Monterrubio-Gómez; Catalina A. Vallejos

Model selection

Faecal calprotectin

Authors

Affiliations

Nathan Constantine-Cooke

Centre for Genomic and Experimental Medicine

MRC Human Genetics Unit

Karla Monterrubio-Gómez

MRC Human Genetics Unit

Catalina A. Vallejos

MRC Human Genetics Unit

Published

December 13, 2024

Introduction

Code

set.seed(123)
if (file.exists("/.dockerenv")) { # Check if running in Docker
  # Assume igmm/cvallejo-predicct/libdr/ is passed to the data volume
  prefix <- "data/"
} else {
  # Assume running outside of a Docker container and the IGC(/IGMM) datastore is
  # mounted at /Volumes
  prefix <- "/Volumes/igmm/cvallejo-predicct/libdr/"
}


##########################
#--     Packages       --#
##########################

library(tidyverse)
# Support package (source found in libdr/)
library(libdr)
## Modelling ##
library(lcmm)
library(kml) # K-means
## Presentation ##
library(patchwork)
library(ggdist)
library(ggalluvial)
library(pander)
library(qqplotr)

##########################
#--     Data read      --#
##########################

dict <- readRDS(paste0(prefix, "processed/dict.RDS"))
fcal <- readRDS(paste0(prefix, "processed/fcal.RDS"))

This page describes how the latent class mixed models (LCMMs) were fitted to faecal calprotectin (FC). A separate page is dedicated to C-reactive protein.

In order to not present subject-level data, subjects within each cluster have been collated into groups of six.

Model specification

LCMMs are an extension of linear mixed effects models with an added cluster-specific fixed effect component. We use LCMMs with a natural cubic spline formulation for the fixed effects component. Random effects are specified as intercepts and the multinomial logistic regression model which assigns cluster membership uses IBD type (CD, UC, or IBDU) as a covariate.

For full formal definitions of the models and statistics we have used in the work, please refer to the supplementary material for our paper.

Model statistics

Model loading

Fitting the LCMMs discussed in this report takes multiple days in some cases and uses 25 CPU cores per model. We have fitted these models using Eddie, the University of Edinburgh’s high performance computing solution. The code in the dropdown box below was used to fit the models.

Code used to fit models

Initial LME
LCMM

## Modelling ##
library(lcmm)
library(splines)


fcal <- readRDS("fcal.RDS")
fcal$calpro_result <- log(fcal$calpro_result)

if(!dir.exists("cache")) dir.create("cache")

if(!file.exists("m1.RDS")) {
  m1 <-  hlme(fixed =  calpro_result ~ ns(calpro_time,
                                          df = 4,
                                          Boundary.knots = c(0, 7)),
              random = ~  1,
              subject = "ids",
              data = fcal,
              verbose = TRUE,
              var.time = "calpro_time",
              maxiter = 8000)
  if (m1$conv != 1) stop("LME did not converge \n")
  saveRDS(m1, "m1.RDS")
  message("Finished fitting m1")
}

## Modelling ##
library(lcmm)
library(splines)

args <- commandArgs(trailingOnly = TRUE)
fcal <- readRDS("fcal.RDS")
fcal$calpro_result <- log(fcal$calpro_result)
if(!dir.exists("cache")) dir.create("cache")

rep <- 50
# set the maximum number of iterations
maxiter <- 10

if(file.exists("m1.RDS")) {
  m1 <- readRDS("m1.RDS")
  ng <- as.numeric(args[1])
  # create a cluster
  cl <- parallel::makeForkCluster(parallel::detectCores())
  # export the number of groups to the cluster
  parallel::clusterExport(cl, "ng")
  # run gridsearch
  hlme.fit <- gridsearch(
    rep = rep,
    maxiter = maxiter,
    minit = m1,
    cl = cl,
    hlme(calpro_result ~ ns(calpro_time,
                            df = 4,
                            Boundary.knots = c(0, 7)),
         mixture = ~ ns(calpro_time,
                        df = 4,
                        Boundary.knots = c(0, 7)),
         random = ~  1,
         subject = "ids",
     classmb = ~ 1 + diagnosis,
         ng = ng,
         data = fcal,
         maxiter = 24000,
         verbose = TRUE,
         partialH = FALSE)
  )
  # stop the cluster
  parallel::stopCluster(cl)

  if (hlme.fit$conv == 1) {
    message("Convergence achieved for ", ng, " subgroups ✅ \n")
  } else {
    stop("Convergence NOT achieved for ", ng, " subgroups ⚠️ \n")
  }
  saveRDS(hlme.fit, paste0("cache/fcal-", ng, ".RDS"))
} else {
  stop("m1 not found!")
}

As recommended by Proust-Lima, Philipps, and Liquet (2017), a linear mixed effects model is first fitted to generate initial starting values, (the “Initial LME” tab). A grid search approach is then used to converge the LCMMs towards a global maximum for each assumed number of clusters based on maximum likelihood (the “LCMM” tab).

After the above code is run, the resultant models are saved in cvallejo-predicct/libdr/cache/ for use in this report.

When an LCMM is fitted, the assumed number of clusters (or “classes”) must be specified a priori. Here we consider 4-8 clusters as 4 clusters were already found in our previous work (Constantine-Cooke et al. 2023), and we expect to find at least as many clusters given our inclusion criteria is more relaxed.

A posteriori model statistics and visual investigations must be used to decide upon the optimum number of classes.

Fixed effects specified via Natural Cubic Splines

Code

# set the number of groups
G.fcal <- numeric()
models.fcal <- list()
G.cands <- seq(2, 11)
for (G.cand in G.cands) {
  file.name <- paste0(prefix, "/cache/fcal/ncs/fcal-", G.cand, ".RDS")
  if (file.exists(file.name)) {
    G.fcal <- c(G.fcal, G.cand)
    models.fcal[[G.cand]] <- readRDS(file.name)
  }
}
rm(G.cand)

We firstly consider specifying fixed effects using NCS which have a few notable advantages (Elhakeem et al. 2022):

Less parameters need to be estimated than either a GRBF function regression model or a polynomial regression model with the same flexibility. This reduces the time complexity when fitting the model and in the future may also make extensions more practically feasible.
NCS enforce linearity between t_0 and the first knot and between the last knot and t_\text{max} which ensures the model does not behave erratically in these sometimes problematic areas.
NCS are not highly sensitive to a continuous parameter and instead requires only K, the number of knots to be tuned. NCS are typically robust to where the knots themselves are placed.

Knot choice

To determine the most appropriate number of knots for the natural cubic splines, we considered two, three, and four knots. We only use six-cluster models for this analysis as this reasonably captures heterogeneity without being too expensive computationally.

Our previous work used three knots across a five-year period. As we are now modelling across a seven-year period, we may need to increase the number of knots to four to ensure the model remains flexible. We also considered two knots] in case a more smoooth model is required.

We place the knots at their default location which is at quantiles. We note that the number of knots is reported by Harrell (2015) to be much more important than knot location.

Model trajectories

One knot

Code

knots.1 <- readRDS(paste0(prefix, "/cache/fcal/ncs/1-knots.RDS"))
knots.1.list <- list()
knots.1.list[[6]] <- knots.1
cairo_pdf("paper/1-knots.pdf",
  width = 9,
  height = 9
)
spaghettiPlot(fcal,
  knots.1.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 1,
  var.time = "calpro_time"
)
invisible(dev.off())
spaghettiPlot(fcal,
  knots.1.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 1,
  var.time = "calpro_time"
)

Figure 1: Cluster profiles for the six-cluster model assuming one knot. The vertical teal lines indicate knot location. Mean cluster profiles are denoted by the red curves

Two knots

From Figure 2, we can see specifying only two knots results in very smooth curves for the mean trajectory of each cluster. However, this model appears to be too smooth which results in a relatively poor fit.

Code

knots.2 <- readRDS(paste0(prefix, "/cache/fcal/ncs/2-knots.RDS"))
knots.2.list <- list()
knots.2.list[[6]] <- knots.2
cairo_pdf("paper/2-knots.pdf",
  width = 9,
  height = 9
)
spaghettiPlot(fcal,
  knots.2.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 2,
  var.time = "calpro_time"
)
invisible(dev.off())
spaghettiPlot(fcal,
  knots.2.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 2,
  var.time = "calpro_time"
)

Figure 2: Cluster profiles for the six-cluster model assuming two knots. The vertical teal lines indicate knot location. Mean cluster profiles are denoted by the red curves

Three knots

Three knots appears to be the most appropriate specification (Figure 3). Whilst there is perhaps some evidence of overfitting, this appears to be minimal.

Code

knots.3 <- readRDS(paste0(prefix, "cache/fcal/ncs/fcal-6.RDS"))
knots.3.list <- list()
knots.3.list[[6]] <- knots.3
cairo_pdf("paper/3-knots.pdf",
  width = 9,
  height = 9
)
spaghettiPlot(fcal,
  knots.3.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 3,
  var.time = "calpro_time"
)
invisible(dev.off())
spaghettiPlot(fcal,
  knots.3.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 3,
  var.time = "calpro_time"
)

Figure 3: Cluster profiles for the six-cluster model assuming three knots. The vertical teal lines indicate knot location. Mean cluster profiles are denoted by the red curves

Four knots

Four knots appears to be too many. Figure 4 presents a minor improvement to model fit over the three knot specification, and there are signs of overfitting which is likely driven by the small distances between knots.

Code

knots.4 <- readRDS(paste0(prefix, "/cache/fcal/ncs/4-knots.RDS"))
knots.4.list <- list()
knots.4.list[[6]] <- knots.4
cairo_pdf("paper/4-knots.pdf",
  width = 9,
  height = 9
)
spaghettiPlot(fcal,
  knots.4.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 4,
  var.time = "calpro_time"
)

Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_line()`).

Code

invisible(dev.off())
spaghettiPlot(fcal,
  knots.4.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  n.knots = 4,
  var.time = "calpro_time"
)

Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_line()`).

Figure 4: Cluster profiles for the six-cluster model assuming four knots. The vertical teal lines indicate knot location. Mean cluster profiles are denoted by the red curves

Model statistics

AIC and BIC both suggest a four-knot specification to be optimal (Table 1). However, visual inspection of the mean cluster trajectories (see above) suggests the four-knot approach results in overfitting, especially around knot locations. This will also become more of an issue as the number of clusters increase and additional sparsity is exhibited for some clusters.

As such, we have elected to use natural cubic splines with three knots placed at quantiles.

Code

knots.df <- tibble(
  Knots = c("One knot", "Two knots", "Three knots", "Four knots"),
  loglik = c(
    knots.1$loglik,
    knots.2$loglik,
    knots.3$loglik,
    knots.4$loglik
  ),
  AIC = c(
    knots.1$AIC,
    knots.2$AIC,
    knots.3$AIC,
    knots.4$AIC
  ),
  BIC = c(
    knots.1$BIC,
    knots.2$BIC,
    knots.3$BIC,
    knots.4$BIC
  )
)
knitr::kable(knots.df,
  col.names = c(
    "Knots",
    "Maxmum log likelihood",
    "AIC",
    "BIC"
  )
)
rm(
  knots.1,
  knots.1.list,
  knots.3,
  knots.3.list,
  knots.4,
  knots.4.list
)

Table 1: Model statistics for differing numbers of knots for natural cubic splines

Knots	Maxmum log likelihood	AIC	BIC
One knot	-16448.75	32967.49	33140.50
Two knots	-16230.02	32542.04	32744.71
Three knots	-16141.82	32377.63	32609.96
Four knots	-16125.38	32356.75	32618.74

Number of clusters

After determining the most appriopate number of knots, we must now decide upon the most appropiate number of clusters.

From the alluvial plot comparing FC models (Figure 5), we can see newly formed clusters are quite robust- remaining consistent as the number of clusters increases. However, new clusters are formed from multiple clusters which implies these clusters do not have a high degree of separation from one another. From this figure, six clusters appear to be a reasonable choice.

Code

alluvial.df <- matrix(nrow = 0, ncol = 3)
colnames(alluvial.df) <- c("ids", "class", "G")
for (G in G.fcal) {
  alluvial.df <- rbind(alluvial.df, cbind(models.fcal[[G]]$pprob[, 1:2], G = G))
}
alluvial.df <- as.data.frame(alluvial.df)

alluvial.df$ids <- as.character(alluvial.df$ids)
alluvial.df$class <- as.factor(alluvial.df$class)

# eliminate label switching

alluvial.df[alluvial.df[, "G"] == 3, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 3, "class"],
    from = c(seq(1, 3)),
    to = c(3, 1, 2)
  )

alluvial.df[alluvial.df[, "G"] == 4, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 4, "class"],
    from = c(seq(1, 4)),
    to = c(3, 4, 1, 2)
  )

alluvial.df[alluvial.df[, "G"] == 5, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 5, "class"],
    from = c(seq(1, 5)),
    to = c(2, 1, 5, 3, 4)
  )

alluvial.df[alluvial.df[, "G"] == 6, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 6, "class"],
    from = c(seq(1, 6)),
    to = c(6, 4, 5, 2, 3, 1)
  )

alluvial.df[alluvial.df[, "G"] == 7, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 7, "class"],
    from = c(seq(1, 7)),
    to = c(4, 1, 5, 3, 2, 7, 6)
  )

alluvial.df[alluvial.df[, "G"] == 8, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 8, "class"],
    from = c(seq(1, 8)),
    to = c(7, 5, 1, 3, 6, 4, 8, 2)
  )


alluvial.df[alluvial.df[, "G"] == 9, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 9, "class"],
    from = c(seq(1, 9)),
    to = c(1, 5, 8, 9, 6, 4, 7, 2, 3)
  )


alluvial.df[alluvial.df[, "G"] == 10, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 10, "class"],
    from = c(seq(1, 10)),
    to = c(2, 4, 9, 1, 7, 10, 6, 8, 5, 3)
  )


alluvial.df[alluvial.df[, "G"] == 11, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 11, "class"],
    from = c(seq(1, 11)),
    to = c(6, 10, 7, 2, 11, 3, 9, 4, 5, 1, 8)
  )


p <- ggplot(
  alluvial.df,
  aes(
    x = G,
    stratum = class,
    alluvium = ids,
    fill = class,
    label = class
  )
) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = 0.5) +
  geom_text(stat = "stratum", size = 3) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = rainbow(11)) +
  xlab("Assumed number of clusters") +
  ylab("Frequency")
print(p)

p <- p + ggtitle("", "")
ggsave("paper/alluvial-FCAL-all.png",
  p,
  width = 12,
  height = 6.75,
  units = "in"
)
ggsave("paper/alluvial-FCAL-all.pdf",
  p,
  width = 12,
  height = 6.75,
  units = "in"
)

Figure 5: Alluvial plot of cluster membership across G for FC

Code

alluvial.df <- matrix(nrow = 0, ncol = 3)
colnames(alluvial.df) <- c("ids", "class", "G")
for (G in G.fcal) {
  alluvial.df <- rbind(alluvial.df, cbind(models.fcal[[G]]$pprob[, 1:2], G = G))
}
alluvial.df <- as.data.frame(alluvial.df)

alluvial.df$ids <- as.character(alluvial.df$ids)
alluvial.df$class <- as.factor(alluvial.df$class)

p <- ggplot(
  alluvial.df,
  aes(
    x = G,
    stratum = class,
    alluvium = ids,
    fill = class,
    label = class
  )
) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = 0.5) +
  geom_text(stat = "stratum", size = 3) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = rainbow(11)) +
  xlab("Assumed number of clusters") +
  ylab("Frequency")
print(p)

Figure 6: Alluvial plot of cluster membership across G for FC

Posterior classifications

From the below data, we can see how posterior probabilities change as the number of assumed clusters increase.

	class1	class2
N	505	531
%	48.75	51.25

	prob1	prob2
class1	0.805	0.195
class2	0.1986	0.8014

	class1	class2
prob>0.7	68.71	69.49
prob>0.8	54.26	50.47
prob>0.9	37.23	37.1

	class1	class2	class3
N	501	298	237
%	48.36	28.76	22.88

	prob1	prob2	prob3
class1	0.8072	0.0856	0.1072
class2	0.1203	0.787	0.0927
class3	0.1259	0.1043	0.7698

	class1	class2	class3
prob>0.7	70.66	66.78	63.71
prob>0.8	57.68	54.7	50.63
prob>0.9	43.51	39.6	35.86

	class1	class2	class3	class4
N	324	200	274	238
%	31.27	19.31	26.45	22.97

	prob1	prob2	prob3	prob4
class1	0.7547	0.1307	0.0465	0.0681
class2	0.1486	0.6844	0.0645	0.1025
class3	0.0651	0.0609	0.7858	0.0881
class4	0.0701	0.1005	0.0904	0.7389

	class1	class2	class3	class4
prob>0.7	60.19	43	68.61	58.4
prob>0.8	45.99	32	56.2	47.06
prob>0.9	36.42	21.5	41.24	34.87

	class1	class2	class3	class4	class5
N	255	258	42	337	144
%	24.61	24.9	4.05	32.53	13.9

	prob1	prob2	prob3	prob4	prob5
class1	0.7292	0.0724	0.0374	0.0776	0.0834
class2	0.0901	0.7656	0.0432	0.0502	0.0509
class3	0.0344	0.0776	0.8022	0.0782	0.0075
class4	0.0811	0.0315	0.037	0.747	0.1035
class5	0.0967	0.0679	0.0059	0.1416	0.6878

	class1	class2	class3	class4	class5
prob>0.7	54.9	63.95	71.43	59.05	43.75
prob>0.8	45.1	51.55	59.52	46.88	32.64
prob>0.9	31.76	37.98	42.86	34.42	22.92

	class1	class2	class3	class4	class5	class6
N	141	144	48	234	312	157
%	13.61	13.9	4.63	22.59	30.12	15.15

	prob1	prob2	prob3	prob4	prob5	prob6
class1	0.7231	0.0302	0.028	0.097	0.0124	0.1093
class2	0.0374	0.6684	0.009	0.0911	0.126	0.0681
class3	0.061	0.0094	0.7606	0.0477	0.0726	0.0487
class4	0.0727	0.1064	0.034	0.6842	0.0851	0.0175
class5	0.0068	0.094	0.0393	0.083	0.7422	0.0346
class6	0.1158	0.0547	0.0412	0.04	0.0646	0.6836

	class1	class2	class3	class4	class5	class6
prob>0.7	55.32	42.36	66.67	48.29	58.65	48.41
prob>0.8	44.68	31.94	54.17	37.61	47.76	33.76
prob>0.9	30.5	22.22	39.58	21.79	33.33	21.66

	class1	class2	class3	class4	class5	class6	class7
N	49	177	59	205	191	229	126
%	4.73	17.08	5.69	19.79	18.44	22.1	12.16

	prob1	prob2	prob3	prob4	prob5	prob6	prob7
class1	0.6775	0.0669	0.0022	0.0379	0.0546	0.1242	0.0367
class2	0.0346	0.6893	0.0388	0.034	0.0378	0.0546	0.1108
class3	0.0027	0.0598	0.7169	0.0618	0.0617	0.0363	0.0607
class4	0.0076	0.0305	0.0458	0.7318	0.0358	0.1458	0.0029
class5	0.0256	0.0222	0.0383	0.0342	0.6878	0.1125	0.0794
class6	0.0428	0.0383	0.0226	0.144	0.1246	0.6117	0.016
class7	0.016	0.1025	0.0297	0.0068	0.0838	0.0202	0.741

	class1	class2	class3	class4	class5	class6	class7
prob>0.7	46.94	48.59	54.24	58.54	50.26	31.44	58.73
prob>0.8	40.82	36.16	49.15	44.39	38.22	21.83	50
prob>0.9	30.61	20.9	35.59	27.32	26.18	10.92	34.13

	class1	class2	class3	class4	class5	class6	class7	class8
N	244	64	103	194	140	67	67	157
%	23.55	6.18	9.94	18.73	13.51	6.47	6.47	15.15

	prob1	prob2	prob3	prob4	prob5	prob6	prob7	prob8
class1	0.6255	0.0238	0.0339	0.127	0.0156	0.0661	0.0167	0.0914
class2	0.0312	0.7158	0.0553	0.0692	0.0406	0.006	0.0169	0.0649
class3	0.0548	0.0357	0.7109	0.0362	0.0734	0.0232	0.0573	0.0085
class4	0.1397	0.037	0.021	0.7495	0.0021	0.0189	0.0097	0.0222
class5	0.0175	0.0232	0.1004	0.0055	0.7182	0.0176	0.0432	0.0744
class6	0.1193	0.0137	0.0437	0.0543	0.0271	0.6537	0.0239	0.0643
class7	0.0373	0.0164	0.1082	0.0123	0.0775	0.0298	0.6788	0.0397
class8	0.0929	0.043	0.0102	0.0232	0.0752	0.0436	0.0275	0.6844

	class1	class2	class3	class4	class5	class6	class7	class8
prob>0.7	34.84	54.69	55.34	59.28	53.57	47.76	44.78	49.68
prob>0.8	24.18	48.44	38.83	46.39	47.86	37.31	37.31	39.49
prob>0.9	15.16	28.12	25.24	32.47	31.43	28.36	25.37	29.94

Table continues below
	class1	class2	class3	class4	class5	class6	class7	class8
N	102	60	75	33	129	45	235	164
%	9.85	5.79	7.24	3.19	12.45	4.34	22.68	15.83

	class9
N	193
%	18.63

Table continues below
	prob1	prob2	prob3	prob4	prob5	prob6	prob7
class1	0.7052	0.0337	0.0567	0.002	0.0747	0.0322	0.0527
class2	0.055	0.7411	0.0163	0.0077	0.0229	0.0015	0.0377
class3	0.1003	0.0155	0.6609	0.0053	0.0687	0.0225	0.055
class4	0.0023	0.0124	0.0023	0.7246	3e-04	0.0191	0.1115
class5	0.0985	0.0183	0.0463	0.0011	0.7203	0.0174	0.0145
class6	0.0562	0.0022	0.0205	0.0276	0.029	0.6901	0.1125
class7	0.0333	0.0248	0.0177	0.0688	0.0122	0.0332	0.6131
class8	0.0131	0.051	0.031	0.0309	0.0821	0.0252	0.0812
class9	0.0185	0.0375	0.0046	0.0174	0.0018	0.009	0.1455

	prob8	prob9
class1	0.0116	0.0312
class2	0.0536	0.0642
class3	0.0524	0.0194
class4	0.0864	0.0412
class5	0.0783	0.0052
class6	0.0363	0.0256
class7	0.0824	0.1145
class8	0.6679	0.0177
class9	0.0192	0.7466

Table continues below
	class1	class2	class3	class4	class5	class6	class7
prob>0.7	51.96	56.67	44	57.58	56.59	46.67	37.02
prob>0.8	40.2	53.33	33.33	45.45	47.29	42.22	23.83
prob>0.9	22.55	31.67	26.67	30.3	32.56	33.33	14.89

	class8	class9
prob>0.7	45.12	59.07
prob>0.8	37.2	47.15
prob>0.9	28.66	31.09

Table continues below
	class1	class2	class3	class4	class5	class6	class7	class8
N	102	43	36	94	118	188	133	59
%	9.85	4.15	3.47	9.07	11.39	18.15	12.84	5.69

	class9	class10
N	54	209
%	5.21	20.17

Table continues below
	prob1	prob2	prob3	prob4	prob5	prob6	prob7
class1	0.73	0.0125	0.0207	0.0026	0.0149	0.099	0.0706
class2	0.0206	0.6965	0.0302	0.0467	0.046	0.0962	0.0253
class3	0.0818	0.0227	0.7056	0.0027	0.0599	0.0816	0.006
class4	0.005	0.0265	8e-04	0.7029	0.0226	0.0629	0.0653
class5	0.0291	0.0232	0.0871	0.0133	0.5886	0.0965	0.0037
class6	0.0975	0.0353	0.0398	0.0463	0.09	0.5336	0.0356
class7	0.0636	0.0162	0.0015	0.0791	0.004	0.0549	0.7225
class8	0.0314	0.0205	0.0023	0.084	0.0063	0.0772	0.0692
class9	0.0505	0.0023	0.0088	0.046	0.0046	0.0922	0.03
class10	0.0191	0.0108	0.0163	0.0158	0.1163	0.0539	0.0018

	prob8	prob9	prob10
class1	0.016	0.0267	0.0069
class2	0.0102	3e-04	0.028
class3	8e-04	0.0074	0.0316
class4	0.0501	0.0306	0.0333
class5	0.0032	0.0107	0.1445
class6	0.0309	0.037	0.0541
class7	0.0385	0.0149	0.0048
class8	0.6971	0.007	0.0049
class9	0.004	0.7179	0.0437
class10	0.0041	0.0389	0.7229

Table continues below
	class1	class2	class3	class4	class5	class6	class7
prob>0.7	53.92	48.84	52.78	52.13	30.51	21.28	54.89
prob>0.8	46.08	41.86	47.22	38.3	25.42	13.83	47.37
prob>0.9	36.27	32.56	33.33	25.53	16.95	4.79	33.83

	class8	class9	class10
prob>0.7	49.15	59.26	54.55
prob>0.8	42.37	53.7	43.06
prob>0.9	28.81	31.48	32.54

Table continues below
	class1	class2	class3	class4	class5	class6	class7	class8
N	133	189	118	103	10	199	36	42
%	12.84	18.24	11.39	9.94	0.97	19.21	3.47	4.05

	class9	class10	class11
N	51	97	58
%	4.92	9.36	5.6

Table continues below
	prob1	prob2	prob3	prob4	prob5	prob6	prob7
class1	0.7214	0.0564	0.004	0.0625	5e-04	0.0047	0.0015
class2	0.036	0.5301	0.0906	0.0972	0.012	0.0519	0.0387
class3	0.0037	0.0953	0.5825	0.0285	0.019	0.1379	0.0849
class4	0.0695	0.102	0.016	0.7222	0.005	0.0084	0.0204
class5	0	0.0287	0.087	0	0.7155	0.0924	0
class6	0.0019	0.051	0.1141	0.0173	0.0313	0.6994	0.0161
class7	0.006	0.0824	0.0613	0.0805	0.003	0.0308	0.7028
class8	0.0255	0.0972	0.0482	0.0204	4e-04	0.0194	0.0297
class9	0.0264	0.0875	0.0044	0.048	0.0246	0.043	0.0086
class10	0.0647	0.0672	0.0219	0.0048	0.0057	0.0322	8e-04
class11	0.0655	0.0793	0.0065	0.0309	7e-04	0.0049	0.0023

	prob8	prob9	prob10	prob11
class1	0.0161	0.015	0.0809	0.037
class2	0.0345	0.0346	0.0442	0.0303
class3	0.0227	0.009	0.0133	0.0032
class4	0.0122	0.0258	0.0025	0.0158
class5	0	0.0395	0.0368	0
class6	0.0132	0.0362	0.0151	0.0045
class7	0.0225	0.0072	0.0028	7e-04
class8	0.6994	3e-04	0.0493	0.0101
class9	0.0024	0.7202	0.0308	0.0041
class10	0.0259	0.0263	0.7003	0.0504
class11	0.0203	0.0065	0.0835	0.6997

Table continues below
	class1	class2	class3	class4	class5	class6	class7
prob>0.7	54.89	21.69	29.66	52.43	50	50.75	52.78
prob>0.8	47.37	13.76	23.73	45.63	40	41.21	44.44
prob>0.9	33.83	4.23	16.95	34.95	30	29.15	33.33

	class8	class9	class10	class11
prob>0.7	47.62	56.86	52.58	48.28
prob>0.8	42.86	50.98	40.21	43.1
prob>0.9	33.33	33.33	25.77	29.31

Code

rm(output)

Code

pprobs.2 <- c()
groups.2 <- models.fcal[[2]]
pprobs.3 <- c()
groups.3 <- models.fcal[[3]]
pprobs.4 <- c()
groups.4 <- models.fcal[[4]]
pprobs.5 <- c()
groups.5 <- models.fcal[[5]]
pprobs.6 <- c()
groups.6 <- models.fcal[[6]]
pprobs.7 <- c()
groups.7 <- models.fcal[[7]]
pprobs.8 <- c()
groups.8 <- models.fcal[[8]]
pprobs.9 <- c()
groups.9 <- models.fcal[[9]]
pprobs.10 <- c()
groups.10 <- models.fcal[[10]]
pprobs.11 <- c()
groups.11 <- models.fcal[[11]]


for (i in 1:nrow(models.fcal[[2]]$pprob)) {
  class.2 <- groups.2$pprob[i, 2]
  pprobs.2 <- c(pprobs.2, groups.2$pprob[i, class.2 + 2])
  class.3 <- groups.3$pprob[i, 2]
  pprobs.3 <- c(pprobs.3, groups.3$pprob[i, class.3 + 2])
  class.4 <- groups.4$pprob[i, 2]
  pprobs.4 <- c(pprobs.4, groups.4$pprob[i, class.4 + 2])
  class.5 <- groups.5$pprob[i, 2]
  pprobs.5 <- c(pprobs.5, groups.5$pprob[i, class.5 + 2])
  class.6 <- groups.6$pprob[i, 2]
  pprobs.6 <- c(pprobs.6, groups.6$pprob[i, class.6 + 2])
  class.7 <- groups.7$pprob[i, 2]
  pprobs.7 <- c(pprobs.7, groups.7$pprob[i, class.7 + 2])
  class.8 <- groups.8$pprob[i, 2]
  pprobs.8 <- c(pprobs.8, groups.8$pprob[i, class.8 + 2])
  class.9 <- groups.9$pprob[i, 2]
  pprobs.9 <- c(pprobs.9, groups.9$pprob[i, class.9 + 2])
  class.10 <- groups.10$pprob[i, 2]
  pprobs.10 <- c(pprobs.10, groups.10$pprob[i, class.10 + 2])
  class.11 <- groups.11$pprob[i, 2]
  pprobs.11 <- c(pprobs.11, groups.11$pprob[i, class.11 + 2])
}
pprobs.2 <- tibble(prob = pprobs.2)
pprobs.3 <- tibble(prob = pprobs.3)
pprobs.4 <- tibble(prob = pprobs.4)
pprobs.5 <- tibble(prob = pprobs.5)
pprobs.6 <- tibble(prob = pprobs.6)
pprobs.7 <- tibble(prob = pprobs.7)
pprobs.8 <- tibble(prob = pprobs.8)
pprobs.9 <- tibble(prob = pprobs.9)
pprobs.10 <- tibble(prob = pprobs.10)
pprobs.11 <- tibble(prob = pprobs.11)

pprobs.2$Model <- as.factor(rep("Two clusters", nrow(pprobs.2)))
pprobs.3$Model <- as.factor(rep("Three clusters", nrow(pprobs.3)))
pprobs.4$Model <- as.factor(rep("Four clusters", nrow(pprobs.4)))
pprobs.5$Model <- as.factor(rep("Five clusters", nrow(pprobs.5)))
pprobs.6$Model <- as.factor(rep("Six clusters", nrow(pprobs.6)))
pprobs.7$Model <- as.factor(rep("Seven clusters", nrow(pprobs.7)))
pprobs.8$Model <- as.factor(rep("Eight clusters", nrow(pprobs.8)))
pprobs.9$Model <- as.factor(rep("Nine clusters", nrow(pprobs.9)))
pprobs.10$Model <- as.factor(rep("Ten clusters", nrow(pprobs.10)))
pprobs.11$Model <- as.factor(rep("Eleven clusters", nrow(pprobs.11)))
pprobs <- rbind(
  pprobs.2,
  pprobs.3,
  pprobs.4,
  pprobs.5,
  pprobs.6,
  pprobs.7,
  pprobs.8,
  pprobs.9,
  pprobs.10,
  pprobs.11
)

p <- pprobs %>%
  ggplot(aes(x = prob, y = Model)) +
  stat_slab(
    fill = "#5D5179", color = "gray",
    size = 0.8,
    alpha = 0.7
  ) +
  geom_dots(color = "#4F759B", dotsize = 1) +
  xlab("Posterior probability for cluster membership") +
  ylab("") +
  ggtitle(
    "Distribution of Posterior Probabilities Across Models",
    "Subject-specific posterior probabilities for assigned cluster"
  ) +
  theme_minimal() +
  scale_y_discrete(limits = rev)
ggsave("plots/pprob-distribution.png",
  p,
  width = 8.5,
  height = 7.5,
  units = "in"
)
ggsave("plots/pprob-distribution.pdf",
  p,
  width = 8.5,
  height = 7.5,
  units = "in"
)
print(p)

Residual plots

G = 2

##### G = 3 ##### G = 4 ##### G = 5 ##### G = 6 ##### G = 7 ##### G = 8 ##### G = 9 ##### G = 10 ##### G = 11

Chosen model

Code

p1 <- data.frame(residuals = resid(models.fcal[[8]])) %>%
  ggplot(aes(x = residuals)) +
  geom_histogram(aes(y = after_stat(density)),
    fill = "#D8829D",
    color = "#AF6A80",
    bins = 30
  ) +
  geom_density(color = "#023777", linewidth = 1.2) +
  theme_classic() +
  theme(axis.line = element_line(colour = "gray")) +
  ylab("Density") +
  xlab("Residuals") +
  ggtitle("A")

p2 <- data.frame(residuals = resid(models.fcal[[8]])) %>%
  ggplot(aes(sample = residuals)) +
  stat_qq_band() +
  stat_qq_line(color = "#D8829D") +
  stat_qq_point(color = "#023777") +
  theme_classic() +
  theme(axis.line = element_line(colour = "gray")) +
  ylab("Theoretical Quantiles") +
  xlab("Sample Quantiles") +
  ggtitle("B")
ggsave("paper/Residual-plot-fcal-ncs.pdf",
  plot = p1 / p2,
  width = 8,
  height = 8,
  units = "in"
)
print(p1 / p2)

Figure 7: Plots assessing normality for the distribution of residuals for the chosen NCS model for FCAL. (A) Histrogram; (B) Q-Q plot.

Code

predClass <- models.fcal[[8]]$pprob[, c("ids", "class")]
temp <- merge(models.fcal[[8]]$pred, predClass, by = "ids")
labels <- c("A", "B", "C", "D", "E", "F", "G", "H")
p <- list()

for (i in 1:8) {
  p[[i]] <- subset(temp, class == i) %>%
    ggplot(aes(sample = resid_ss)) +
    stat_qq_band() +
    stat_qq_line(color = "#D8829D") +
    stat_qq_point(color = "#023777") +
    theme_classic() +
    theme(axis.line = element_line(colour = "gray")) +
    ylab("Theoretical Quantiles") +
    xlab("Sample Quantiles") +
    ggtitle(labels[i])
}

ArrangedPlot <- (p[[1]] + p[[2]]) /
  (p[[3]] + p[[4]]) /
  (p[[5]] + p[[6]]) /
  (p[[7]] + p[[8]])

ggsave("paper/cluster-resids-ncs.pdf",
  ArrangedPlot,
  width = 8,
  height = 8,
  units = "in"
)
print(ArrangedPlot)

Figure 8: Q-Q plots assessing normality for the distribution of residuals stratifed by cluster assignment.

Model metrics

Code

summaryplot(models.fcal[[4]],
  models.fcal[[5]],
  models.fcal[[6]],
  models.fcal[[7]],
  models.fcal[[8]],
  models.fcal[[9]],
  models.fcal[[10]],
  models.fcal[[11]],
  which = c("loglik", "AIC", "BIC", "entropy", "ICL"),
  mfrow = c(2, 3),
  axis = "Class"
)

Figure 9: Model metrics for FCAL for G = 4-9

Code

fcal.metrics <- makeMetrics(G.fcal, models.fcal)
buildDT(fcal.metrics)

Clusters	Maximum log-likelihood	AIC	BIC
2	-16532.10	33094.20	33168.35
3	-16354.39	32754.78	32868.47
4	-16260.21	32582.42	32735.66
5	-16201.86	32481.71	32674.49
6	-16141.82	32377.63	32609.96
7	-16109.77	32329.54	32601.42
8	-16076.89	32279.77	32591.19
9	-16039.66	32221.33	32572.29
10	-16015.29	32188.59	32579.09
11	-16005.55	32185.10	32615.16

The optimum AIC is 32185.1 with cluster 13. The optimum BIC is 32572.29 with cluster 11

Spaghetti plots per cluster

Code

for (G in G.fcal) {
  # Data frame to hold processed data
  new.fc <- data.frame(
    ids = numeric(),
    calpro_result = numeric(),
    calpro_time = numeric(),
    class = numeric()
  )

  for (clust in 1:G) {
    ids.clust <- subset(models.fcal[[G]]$pprob, class == clust)$ids
    n.clust <- length(ids.clust)
    rand <- sample(n.clust, n.clust) # Randomise the order of the ids
    iters <- floor(n.clust / 6) # How many groups of six are there?

    # Matrix to hold the smoothed data
    fcal.ma <- matrix(NA, nrow = iters, ncol = 7)
    fcal.time <- matrix(NA, nrow = iters, ncol = 7)

    for (i in 0:(iters - 1)) {
      # Find ids for group of five
      ids.select <- ids.clust[rand[((i * 6) + 1):((i * 6) + 6)]]
      fcal.subset <- subset(fcal, ids %in% ids.select)
      # Median process as per CRP preprocessing
      for (j in seq(0, 6)) {
        if (j == 6) {
          sub.obs <- subset(
            fcal.subset,
            calpro_time >= j - 0.5 & calpro_time <= j + 1
          )
        } else {
          sub.obs <- subset(
            fcal.subset,
            calpro_time >= j - 0.5 & calpro_time < j + 0.5
          )
        }
        if (nrow(sub.obs) > 0) {
          fcal.ma[i + 1, j + 1] <- median(sub.obs$calpro_result)
          fcal.time[i + 1, j + 1] <- median(sub.obs$calpro_time)
        }
      }
    }

    rownames(fcal.ma) <- 1:iters
    fcal.ma <- reshape2::melt(t(fcal.ma),
      id.vars = row.names(fcal.ma),
      na.rm = TRUE
    )
    colnames(fcal.ma) <- c("calpro_time", "ids", "calpro_result")
    fcal.ma <- fcal.ma[, c(2, 3, 1)] # Make ids first column

    rownames(fcal.time) <- 1:iters
    fcal.time <- reshape2::melt(t(fcal.time),
      id.vars = row.names(fcal.time),
      na.rm = TRUE
    )
    colnames(fcal.time) <- c("calpro_time", "ids", "pred_time")
    fcal.time <- fcal.time[, c(2, 3, 1)] # Make ids first column

    fcal.ma <- merge(fcal.ma, fcal.time, by = c("ids", "calpro_time"))
    fcal.ma <- fcal.ma %>%
      mutate(calpro_time = pred_time) %>%
      select(-pred_time)

    fcal.ma$class <- clust # Identify cluster assignment
    new.fc <- rbind(new.fc, fcal.ma)
  }

  if (!dir.exists("plots/spaghetti")) dir.create("plots/spaghetti")
  png(paste0("plots/spaghetti/fcal-ncs-", G, ".png"),
    width = 10,
    height = 17.5,
    units = "in",
    res = 300
  )
  grid::grid.newpage()
  spaghettiPlot(new.fc,
    models.fcal,
    G = G,
    log = TRUE,
    tmax = 7,
    sizes = TRUE,
    knots = FALSE,
    var.time = "calpro_time",
    clusters = TRUE
  )
  invisible(dev.off())

  cairo_pdf(paste0("plots/spaghetti/fcal-ncs-", G, ".pdf"),
    width = 10,
    height = 17.5
  )
  grid::grid.newpage()
  spaghettiPlot(new.fc,
    models.fcal,
    G = G,
    log = TRUE,
    tmax = 7,
    sizes = TRUE,
    knots = FALSE,
    var.time = "calpro_time",
    clusters = TRUE
  )
  invisible(dev.off())

  grid::grid.newpage()
  print(spaghettiPlot(new.fc,
    models.fcal,
    G = G,
    log = TRUE,
    tmax = 7,
    sizes = TRUE,
    knots = FALSE,
    var.time = "calpro_time",
    clusters = TRUE
  ))
}

NULL

NULL

NULL

NULL

NULL

NULL

NULL

NULL

NULL

NULL

Chosen model

Code

new.fc <- data.frame(
  ids = numeric(),
  calpro_result = numeric(),
  calpro_time = numeric(),
  class = numeric()
)

for (clust in 1:8) {
  ids.clust <- subset(models.fcal[[8]]$pprob, class == clust)$ids
  n.clust <- length(ids.clust)
  rand <- sample(n.clust, n.clust) # Randomise the order of the ids
  iters <- floor(n.clust / 6) # How many groups of six are there?

  # Matrix to hold the smoothed data
  fcal.ma <- matrix(NA, nrow = iters, ncol = 7)
  fcal.time <- matrix(NA, nrow = iters, ncol = 7)


  for (i in 0:(iters - 1)) {
    # Find ids for group of five
    ids.select <- ids.clust[rand[((i * 6) + 1):((i * 6) + 6)]]
    fcal.subset <- subset(fcal, ids %in% ids.select)
    # Median process as per CRP preprocessing
    for (j in seq(0, 6)) {
      if (j == 6) {
        sub.obs <- subset(
          fcal.subset,
          calpro_time >= j - 0.5 & calpro_time <= j + 1
        )
      } else {
        sub.obs <- subset(
          fcal.subset,
          calpro_time >= j - 0.5 & calpro_time < j + 0.5
        )
      }
      if (nrow(sub.obs) > 0) {
        fcal.ma[i + 1, j + 1] <- median(sub.obs$calpro_result)
        fcal.time[i + 1, j + 1] <- median(sub.obs$calpro_time)
      }
    }
  }

  rownames(fcal.ma) <- 1:iters
  fcal.ma <- reshape2::melt(t(fcal.ma),
    id.vars = row.names(fcal.ma),
    na.rm = TRUE
  )
  colnames(fcal.ma) <- c("calpro_time", "ids", "calpro_result")
  fcal.ma <- fcal.ma[, c(2, 3, 1)] # Make ids first column


  rownames(fcal.time) <- 1:iters
  fcal.time <- reshape2::melt(t(fcal.time),
    id.vars = row.names(fcal.time),
    na.rm = TRUE
  )
  colnames(fcal.time) <- c("calpro_time", "ids", "pred_time")
  fcal.time <- fcal.time[, c(2, 3, 1)] # Make ids first column

  fcal.ma <- merge(fcal.ma, fcal.time, by = c("ids", "calpro_time"))
  fcal.ma <- fcal.ma %>%
    mutate(calpro_time = pred_time) %>%
    select(-pred_time)

  fcal.ma$class <- clust # Identify cluster assignment
  new.fc <- rbind(new.fc, fcal.ma)
}

As. a parsimonious choice, we have elected to focus on the 8-cluster model as we believe this strikes a reasonable balance between characterising the LIBDR cohort whilst still presenting trends which likely reflect the IBD population in Western Europe.

Labelling

The labels/order of the clusters generated above are random. To improve the readability and assist with interpretation of associations, it would be beneficial to order clusters by some analogue for disease severity. This section explores possible solutions for this task.

Cumulative inflammation across follow-up

One possible solution is to order clusters by cumulative inflammation which is given by the area under the curve for each mean cluster trajectory.

Code

rank.full <- rankCumulative(models.fcal[[8]], tmax = 7)

rank.full %>%
  ggplot(aes(x = paste0("FC", New), y = Area)) +
  geom_bar(stat = "identity", fill = "#745C97", color = "#39375B") +
  ylab("Cumulative inflammation (Area)") +
  xlab("Cluster") +
  theme_minimal()

Code

png(paste0("plots/spaghetti/fcal-ncs-reordered.png"),
  width = 10,
  height = 16,
  units = "in",
  res = 300
)
grid::grid.newpage()
spaghettiPlot(new.fc,
  models.fcal,
  G = 8,
  log = TRUE,
  tmax = 7,
  sizes = TRUE,
  knots = FALSE,
  var.time = "calpro_time",
  clusters = TRUE,
  mapping = rank.full$Original
)
invisible(dev.off())

cairo_pdf(paste0("plots/spaghetti/fcal-ncs-reordered.pdf"),
  width = 10,
  height = 16
)
grid::grid.newpage()
spaghettiPlot(new.fc,
  models.fcal,
  G = 8,
  log = TRUE,
  tmax = 7,
  sizes = TRUE,
  knots = FALSE,
  var.time = "calpro_time",
  clusters = TRUE,
  mapping = rank.full$Original
)
invisible(dev.off())

spaghettiPlot(new.fc,
  models.fcal,
  G = 8,
  log = TRUE,
  tmax = 7,
  sizes = TRUE,
  knots = FALSE,
  var.time = "calpro_time",
  clusters = TRUE,
  mapping = rank.full$Original
)

Code

for (G in (9:10)) {
  new.fc <- data.frame(
    ids = numeric(),
    calpro_result = numeric(),
    calpro_time = numeric(),
    class = numeric()
  )

  for (clust in 1:G) {
    ids.clust <- subset(models.fcal[[G]]$pprob, class == clust)$ids
    n.clust <- length(ids.clust)
    rand <- sample(n.clust, n.clust) # Randomise the order of the ids
    iters <- floor(n.clust / 6) # How many groups of six are there?

    # Matrix to hold the smoothed data
    fcal.ma <- matrix(NA, nrow = iters, ncol = 7)
    fcal.time <- matrix(NA, nrow = iters, ncol = 7)

    for (i in 0:(iters - 1)) {
      # Find ids for group of five
      ids.select <- ids.clust[rand[((i * 6) + 1):((i * 6) + 6)]]
      fcal.subset <- subset(fcal, ids %in% ids.select)
      # Median process as per CRP preprocessing
      for (j in seq(0, 6)) {
        if (j == 6) {
          sub.obs <- subset(
            fcal.subset,
            calpro_time >= j - 0.5 & calpro_time <= j + 1
          )
        } else {
          sub.obs <- subset(
            fcal.subset,
            calpro_time >= j - 0.5 & calpro_time < j + 0.5
          )
        }
        if (nrow(sub.obs) > 0) {
          fcal.ma[i + 1, j + 1] <- median(sub.obs$calpro_result)
          fcal.time[i + 1, j + 1] <- median(sub.obs$calpro_time)
        }
      }
    }

    rownames(fcal.ma) <- 1:iters
    fcal.ma <- reshape2::melt(t(fcal.ma),
      id.vars = row.names(fcal.ma),
      na.rm = TRUE
    )
    colnames(fcal.ma) <- c("calpro_time", "ids", "calpro_result")
    fcal.ma <- fcal.ma[, c(2, 3, 1)] # Make ids first column

    rownames(fcal.time) <- 1:iters
    fcal.time <- reshape2::melt(t(fcal.time),
      id.vars = row.names(fcal.time),
      na.rm = TRUE
    )
    colnames(fcal.time) <- c("calpro_time", "ids", "pred_time")
    fcal.time <- fcal.time[, c(2, 3, 1)] # Make ids first column

    fcal.ma <- merge(fcal.ma, fcal.time, by = c("ids", "calpro_time"))
    fcal.ma <- fcal.ma %>%
      mutate(calpro_time = pred_time) %>%
      select(-pred_time)

    fcal.ma$class <- clust # Identify cluster assignment
    new.fc <- rbind(new.fc, fcal.ma)
  }


  rank.full <- rankCumulative(models.fcal[[G]], tmax = 7)
  png(paste0("plots/spaghetti/fcal-ncs-reordered-", G, ".png"),
    width = 10,
    height = 16,
    units = "in",
    res = 300
  )
  grid::grid.newpage()
  spaghettiPlot(new.fc,
    models.fcal,
    G = G,
    log = TRUE,
    tmax = 7,
    sizes = TRUE,
    knots = FALSE,
    var.time = "calpro_time",
    clusters = TRUE,
    mapping = rank.full$Original
  )
  invisible(dev.off())

  cairo_pdf(paste0("plots/spaghetti/fcal-ncs-reordered-", G, ".pdf"),
    width = 10,
    height = 16
  )
  grid::grid.newpage()
  spaghettiPlot(new.fc,
    models.fcal,
    G = G,
    log = TRUE,
    tmax = 7,
    sizes = TRUE,
    knots = FALSE,
    var.time = "calpro_time",
    clusters = TRUE,
    mapping = rank.full$Original
  )
  invisible(dev.off())
}

Alluvial anchored plot

Having chosen the 8-cluster model to be the most appropriate, we have produced a version of Figure 6 using the labels assigned for the 8-cluster model.

Code

alluvial.df <- matrix(nrow = 0, ncol = 3)
colnames(alluvial.df) <- c("ids", "class", "G")
for (G in 2:10) {
  alluvial.df <- rbind(alluvial.df, cbind(models.fcal[[G]]$pprob[, 1:2], G = G))
}
alluvial.df <- as.data.frame(alluvial.df)

alluvial.df$ids <- as.character(alluvial.df$ids)
alluvial.df$class <- as.factor(alluvial.df$class)

# eliminate label switching

alluvial.df[alluvial.df[, "G"] == 2, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 2, "class"],
    from = c(1, 2),
    to = c(4, 8)
  )

alluvial.df[alluvial.df[, "G"] == 3, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 3, "class"],
    from = c(seq(1, 3)),
    to = c(8, 4, 3)
  )

alluvial.df[alluvial.df[, "G"] == 4, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 4, "class"],
    from = c(seq(1, 4)),
    to = c(8, 7, 4, 3)
  )

alluvial.df[alluvial.df[, "G"] == 5, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 5, "class"],
    from = c(seq(1, 5)),
    to = c(3, 4, 6, 8, 7)
  )

alluvial.df[alluvial.df[, "G"] == 6, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 6, "class"],
    from = c(seq(1, 6)),
    to = c(1, 7, 6, 3, 8, 4)
  )

alluvial.df[alluvial.df[, "G"] == 7, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 7, "class"],
    from = c(seq(1, 7)),
    to = c(5, 4, 6, 8, 3, 7, 1)
  )

alluvial.df[alluvial.df[, "G"] == 8, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 8, "class"],
    from = rank.full$Original,
    to = rank.full$New
  )


alluvial.df[alluvial.df[, "G"] == 9, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 9, "class"],
    from = c(seq(1, 9)),
    to = c(4, 6, 2, 9, 1, 5, 7, 3, 8)
  )


alluvial.df[alluvial.df[, "G"] == 10, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 10, "class"],
    from = c(seq(1, 10)),
    to = c(3, 5, 9, 4, 7, 10, 1, 2, 6, 8)
  )

p <- ggplot(
  alluvial.df,
  aes(
    x = G,
    stratum = class,
    alluvium = ids,
    fill = class,
    label = class
  )
) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = 0.5) +
  geom_text(stat = "stratum", size = 3) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c(viridis::inferno(8), "#F2F3C9", "white")) +
  xlab("Assumed number of clusters") +
  ylab("Frequency")
print(p)

Code

p <- p + ggtitle("", "")
ggsave("paper/alluvial-FCAL-anchored.png",
  p,
  width = 12,
  height = 6.75,
  units = "in"
)
ggsave("paper/alluvial-FCAL-anchored.pdf",
  p,
  width = 12,
  height = 6.75,
  units = "in"
)
saveRDS(p, paste0(prefix, "processed/plots/fc-alluvial.RDS"))

Fixed effects specified via Gaussian radial basis functions

Code

G.grbf <- numeric()
models.grbf <- list()
for (G.cand in 1:8) {
  file.name <- paste0(prefix, "/cache/fcal/grbf/final/fcal-", G.cand, ".RDS")
  if (file.exists(file.name)) {
    G.grbf <- c(G.grbf, G.cand)
    models.grbf[[G.cand]] <- readRDS(file.name)
  }
}
rm(G.cand)

Number of Gaussian radial basis functions

Two GRBFs

Code

knots.grbf.2 <- readRDS(paste0(prefix, "cache/fcal/grbf/k=2/grbf-1.2.RDS"))
knots.grbf.2.list <- list()
knots.grbf.2.list[[6]] <- knots.grbf.2
png("paper/grbf-plots-2.png", width = 8, height = 7, units = "in", res = 300)
spaghettiPlot(fcal,
  knots.grbf.2.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  knot.type = "equal",
  n.knots = 2,
  grbf = TRUE,
  l = 1.2,
  var.time = "calpro_time"
)
invisible(dev.off())
spaghettiPlot(fcal,
  knots.grbf.2.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  knot.type = "equal",
  n.knots = 2,
  grbf = TRUE,
  l = 1.2,
  var.time = "calpro_time"
)

Figure 10: Cluster profiles for the six-cluster model assuming two knots. The vertical teal lines indicate knot location. Mean cluster profiles are denoted by the red curves.

Three GRBFs

Code

knots.grbf.3 <- readRDS(paste0(prefix, "/cache/fcal/grbf/k=3/grbf-1.2.RDS"))
knots.grbf.3.list <- list()
knots.grbf.3.list[[6]] <- knots.grbf.3
png("paper/grbf-plots-3.png", width = 8, height = 7, units = "in", res = 300)
spaghettiPlot(fcal,
  knots.grbf.3.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  knot.type = "equal",
  n.knots = 3,
  grbf = TRUE,
  l = 1.2,
  var.time = "calpro_time"
)
invisible(dev.off())
spaghettiPlot(fcal,
  knots.grbf.3.list,
  G = 6,
  log = TRUE,
  tmax = 7,
  sizes = FALSE,
  knots = TRUE,
  knot.type = "equal",
  n.knots = 3,
  grbf = TRUE,
  l = 1.2,
  var.time = "calpro_time"
)

Figure 11: Cluster profiles for the six-cluster model assuming three knots. The vertical teal lines indicate knot location. Mean cluster profiles are denoted by the red curves.

Code

alluvial.df <- matrix(nrow = 0, ncol = 3)
colnames(alluvial.df) <- c("ids", "class", "G")
for (G in G.grbf) {
  alluvial.df <- rbind(alluvial.df, cbind(models.grbf[[G]]$pprob[, 1:2], G = G))
}
alluvial.df <- as.data.frame(alluvial.df)

alluvial.df$ids <- as.character(alluvial.df$ids)
alluvial.df$class <- as.factor(alluvial.df$class)

# eliminate label switching

alluvial.df[alluvial.df[, "G"] == 3, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 3, "class"],
    from = c(seq(1, 3)),
    to = c(1, 3, 2)
  )
#
alluvial.df[alluvial.df[, "G"] == 4, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 4, "class"],
    from = c(seq(1, 4)),
    to = c(3, 2, 1, 4)
  )

alluvial.df[alluvial.df[, "G"] == 5, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 5, "class"],
    from = c(seq(1, 5)),
    to = c(2, 4, 5, 1, 3)
  )

alluvial.df[alluvial.df[, "G"] == 6, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 6, "class"],
    from = c(seq(1, 6)),
    to = c(5, 3, 2, 4, 1, 6)
  )

alluvial.df[alluvial.df[, "G"] == 7, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 7, "class"],
    from = c(seq(1, 7)),
    to = c(5, 1, 4, 7, 3, 6, 2)
  )
#
alluvial.df[alluvial.df[, "G"] == 8, "class"] <-
  plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == 8, "class"],
    from = c(seq(1, 8)),
    to = c(6, 2, 4, 5, 1, 3, 7, 8)
  )


p <- ggplot(
  alluvial.df,
  aes(
    x = G,
    stratum = class,
    alluvium = ids,
    fill = class,
    label = class
  )
) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = 0.5) +
  geom_text(stat = "stratum", size = 3) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = rainbow(8)) +
  xlab("Assumed number of clusters") +
  ylab("Frequency")
print(p)

p <- p + ggtitle("", "")
ggsave("paper/alluvial-FCAL-grbf.png",
  p,
  width = 12,
  height = 6.75,
  units = "in"
)
ggsave("paper/alluvial-FCAL-grbf.pdf",
  p,
  width = 12,
  height = 6.75,
  units = "in"
)

Figure 12: Alluvial plot of cluster membership across G for FCAL

Code

knots.df <- tibble(
  Knots = c("2 GRBFs", "3 GRBFs"),
  loglik = c(
    knots.grbf.2$loglik,
    knots.grbf.3$loglik
  ),
  AIC = c(
    knots.grbf.2$AIC,
    knots.grbf.3$AIC
  ),
  BIC = c(
    knots.grbf.2$BIC,
    knots.grbf.3$BIC
  )
)
knitr::kable(knots.df,
  col.names = c(
    "Knots",
    "Maxmum log likelihood",
    "AIC",
    "BIC"
  )
)
rm(
  knots.grbf.2.list,
  knots.grbf.3.list
)

Table 2: Model statistics for differing numbers of knots for natural cubic splines

Knots	Maxmum log likelihood	AIC	BIC
2 GRBFs	-16421.02	32924.03	33126.70
3 GRBFs	-16237.73	32569.45	32801.78

Number of clusters

Posterior classifications

From the below data, we can see how posterior probabilities change as the number of assumed clusters increase

	class1	class2
N	496	540
%	47.88	52.12

	prob1	prob2
class1	0.8024	0.1976
class2	0.2024	0.7976

	class1	class2
prob>0.7	68.35	67.59
prob>0.8	55.24	51.48
prob>0.9	36.09	34.81

	class1	class2	class3
N	310	497	229
%	29.92	47.97	22.1

	prob1	prob2	prob3
class1	0.7767	0.1262	0.0971
class2	0.0985	0.7971	0.1044
class3	0.1083	0.13	0.7617

	class1	class2	class3
prob>0.7	66.13	68.21	60.26
prob>0.8	52.58	55.73	48.03
prob>0.9	34.19	39.64	35.81

	class1	class2	class3	class4
N	277	195	311	253
%	26.74	18.82	30.02	24.42

	prob1	prob2	prob3	prob4
class1	0.7524	0.0494	0.0663	0.1319
class2	0.0553	0.7415	0.0991	0.1041
class3	0.0648	0.0963	0.7746	0.0643
class4	0.1593	0.1106	0.0696	0.6606

	class1	class2	class3	class4
prob>0.7	59.57	57.44	67.2	39.13
prob>0.8	48.01	47.69	55.31	27.27
prob>0.9	32.85	34.36	35.69	19.37

	class1	class2	class3	class4	class5
N	114	268	204	166	284
%	11	25.87	19.69	16.02	27.41

	prob1	prob2	prob3	prob4	prob5
class1	0.7472	0.0951	0.0911	0.0093	0.0572
class2	0.0985	0.6457	0.0624	0.0462	0.1472
class3	0.0937	0.0659	0.7086	0.1028	0.029
class4	0.0181	0.0503	0.1333	0.7194	0.0788
class5	0.0527	0.13	0.0236	0.0526	0.7411

	class1	class2	class3	class4	class5
prob>0.7	57.02	35.82	55.39	53.01	59.51
prob>0.8	49.12	25	40.69	39.76	47.89
prob>0.9	33.33	19.78	27.45	27.71	32.04

	class1	class2	class3	class4	class5	class6
N	186	256	117	281	150	46
%	17.95	24.71	11.29	27.12	14.48	4.44

	prob1	prob2	prob3	prob4	prob5	prob6
class1	0.6971	0.0182	0.0826	0.0555	0.0902	0.0564
class2	0.0114	0.7392	0.0467	0.1273	0.0372	0.0382
class3	0.0834	0.0391	0.7249	0.1022	0.0093	0.0411
class4	0.0578	0.1309	0.1022	0.6399	0.0436	0.0256
class5	0.1135	0.054	0.0201	0.053	0.713	0.0464
class6	0.0639	0.0941	0.0575	0.0176	0.063	0.7039

	class1	class2	class3	class4	class5	class6
prob>0.7	52.15	58.2	53.85	36.3	52	50
prob>0.8	38.17	46.09	47.86	24.56	42.67	41.3
prob>0.9	26.88	33.59	34.19	18.86	28	19.57

	class1	class2	class3	class4	class5	class6	class7
N	139	149	274	33	255	60	126
%	13.42	14.38	26.45	3.19	24.61	5.79	12.16

	prob1	prob2	prob3	prob4	prob5	prob6	prob7
class1	0.7097	0.0839	0.0394	0.0193	0.0117	0.0553	0.0807
class2	0.0956	0.6875	0.059	0.0554	0.0344	0.0357	0.0325
class3	0.0391	0.0405	0.6494	0.0112	0.1266	0.027	0.1062
class4	0.0458	0.1192	0.0446	0.7189	0.0409	0.0301	5e-04
class5	0.0067	0.025	0.1272	0.012	0.7453	0.0377	0.046
class6	0.0685	0.059	0.031	0.0247	0.088	0.6833	0.0455
class7	0.0703	0.0331	0.093	4e-04	0.033	0.0447	0.7255

	class1	class2	class3	class4	class5	class6	class7
prob>0.7	56.12	48.32	40.51	51.52	61.18	46.67	52.38
prob>0.8	41.01	32.21	26.64	45.45	47.45	40	46.83
prob>0.9	27.34	18.79	18.98	36.36	33.73	26.67	33.33

	class1	class2	class3	class4	class5	class6	class7	class8
N	70	131	140	166	114	208	66	141
%	6.76	12.64	13.51	16.02	11	20.08	6.37	13.61

	prob1	prob2	prob3	prob4	prob5	prob6	prob7	prob8
class1	0.644	0.067	0.0342	0.0562	0.0262	0.0728	0.0586	0.041
class2	0.0451	0.6683	0.0792	0.0712	0.0344	0.0258	0.0059	0.0702
class3	0.0283	0.0741	0.5651	0.0419	0.0306	0.0622	0.0347	0.1632
class4	0.0474	0.0777	0.033	0.6829	0.0674	0.0119	0.056	0.0237
class5	0.0297	0.0314	0.038	0.1	0.6569	0.0177	0.1012	0.0251
class6	0.0385	0.0292	0.0608	0.0029	0.0087	0.7125	0.021	0.1264
class7	0.0443	0.0015	0.0264	0.036	0.1159	0.0453	0.697	0.0337
class8	0.0291	0.0683	0.1686	0.0174	0.0117	0.1331	0.0165	0.5553

	class1	class2	class3	class4	class5	class6	class7	class8
prob>0.7	40	47.33	30	51.2	45.61	52.4	51.52	24.11
prob>0.8	32.86	41.22	22.86	42.77	29.82	42.79	34.85	19.86
prob>0.9	21.43	31.3	12.14	28.31	13.16	29.33	28.79	10.64

Code

rm(output)

Code

pprobs.2 <- c()
groups.2 <- models.grbf[[2]]
pprobs.3 <- c()
groups.3 <- models.grbf[[3]]
pprobs.4 <- c()
groups.4 <- models.grbf[[4]]
pprobs.5 <- c()
groups.5 <- models.grbf[[5]]
pprobs.6 <- c()
groups.6 <- models.grbf[[6]]
pprobs.7 <- c()
groups.7 <- models.grbf[[7]]
pprobs.8 <- c()
groups.8 <- models.grbf[[8]]


for (i in 1:nrow(models.grbf[[2]]$pprob)) {
  class.2 <- groups.2$pprob[i, 2]
  pprobs.2 <- c(pprobs.2, groups.2$pprob[i, class.2 + 2])
  class.3 <- groups.3$pprob[i, 2]
  pprobs.3 <- c(pprobs.3, groups.3$pprob[i, class.3 + 2])
  class.4 <- groups.4$pprob[i, 2]
  pprobs.4 <- c(pprobs.4, groups.4$pprob[i, class.4 + 2])
  class.5 <- groups.5$pprob[i, 2]
  pprobs.5 <- c(pprobs.5, groups.5$pprob[i, class.5 + 2])
  class.6 <- groups.6$pprob[i, 2]
  pprobs.6 <- c(pprobs.6, groups.6$pprob[i, class.6 + 2])
  class.7 <- groups.7$pprob[i, 2]
  pprobs.7 <- c(pprobs.7, groups.7$pprob[i, class.7 + 2])
  class.8 <- groups.8$pprob[i, 2]
  pprobs.8 <- c(pprobs.8, groups.8$pprob[i, class.8 + 2])
}
pprobs.2 <- tibble(prob = pprobs.2)
pprobs.3 <- tibble(prob = pprobs.3)
pprobs.4 <- tibble(prob = pprobs.4)
pprobs.5 <- tibble(prob = pprobs.5)
pprobs.6 <- tibble(prob = pprobs.6)
pprobs.7 <- tibble(prob = pprobs.7)
pprobs.8 <- tibble(prob = pprobs.8)

pprobs.2$Model <- as.factor(rep("Two clusters", nrow(pprobs.2)))
pprobs.3$Model <- as.factor(rep("Three clusters", nrow(pprobs.3)))
pprobs.4$Model <- as.factor(rep("Four clusters", nrow(pprobs.4)))
pprobs.5$Model <- as.factor(rep("Five clusters", nrow(pprobs.5)))
pprobs.6$Model <- as.factor(rep("Six clusters", nrow(pprobs.6)))
pprobs.7$Model <- as.factor(rep("Seven clusters", nrow(pprobs.7)))
pprobs.8$Model <- as.factor(rep("Eight clusters", nrow(pprobs.8)))
pprobs <- rbind(
  pprobs.2,
  pprobs.3,
  pprobs.4,
  pprobs.5,
  pprobs.6,
  pprobs.7,
  pprobs.8
)

p <- pprobs %>%
  ggplot(aes(x = prob, y = Model)) +
  # geom_histogram(bins = 40, fill = NA, position="identity")
  stat_slab(
    fill = "#5D5179", color = "gray",
    size = 0.8,
    alpha = 0.7
  ) +
  geom_dots(color = "#4F759B", dotsize = 1) +
  xlab("Posterior probability for cluster membership") +
  ylab("") +
  ggtitle(
    "Distribution of Posterior Probabilities Across Models",
    "Subject-specific posterior probabilities for assigned cluster"
  ) +
  theme_minimal() +
  scale_y_discrete(limits = rev)
ggsave("plots/pprob-distribution-grbf.png",
  p,
  width = 8.5,
  height = 7.5,
  units = "in"
)
ggsave("plots/pprob-distribution-grbf.pdf",
  p,
  width = 8.5,
  height = 7.5,
  units = "in"
)
print(p)

Residual plots

G = 2

##### G = 3 ##### G = 4 ##### G = 5 ##### G = 6 ##### G = 7 ##### G = 8

Chosen model

Code

p1 <- data.frame(residuals = resid(models.grbf[[6]])) %>%
  ggplot(aes(x = residuals)) +
  geom_histogram(aes(y = after_stat(density)),
    fill = "#D8829D",
    color = "#AF6A80",
    bins = 30
  ) +
  geom_density(color = "#023777", size = 1.2) +
  theme_classic() +
  theme(axis.line = element_line(colour = "gray")) +
  ylab("Density") +
  xlab("Residuals") +
  ggtitle("A")

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Code

p2 <- data.frame(residuals = resid(models.grbf[[6]])) %>%
  ggplot(aes(sample = residuals)) +
  stat_qq_band() +
  stat_qq_line(color = "#D8829D") +
  stat_qq_point(color = "#023777") +
  theme_classic() +
  theme(axis.line = element_line(colour = "gray")) +
  ylab("Theoretical Quantiles") +
  xlab("Sample Quantiles") +
  ggtitle("B")
ggsave("paper/Residual-plot-fcal-grbf.pdf",
  plot = p1 / p2,
  width = 8,
  height = 8,
  units = "in"
)
print(p1 / p2)

Figure 13: Plots assessing normality for the distribution of residuals for the chosen GRBF model for FCAL. (A) Histrogram; (B) Q-Q plot.

Code

predClass <- models.grbf[[6]]$pprob[, c("ids", "class")]
temp <- merge(models.grbf[[6]]$pred, predClass, by = "ids")
labels <- c("A", "B", "C", "D", "E", "F")
p <- list()

for (i in 1:6) {
  p[[i]] <- subset(temp, class == i) %>%
    ggplot(aes(sample = resid_ss)) +
    stat_qq_band() +
    stat_qq_line(color = "#D8829D") +
    stat_qq_point(color = "#023777") +
    theme_classic() +
    theme(axis.line = element_line(colour = "gray")) +
    ylab("Theoretical Quantiles") +
    xlab("Sample Quantiles") +
    ggtitle(labels[i])
}

ArrangedPlot <- (p[[1]] + p[[2]]) / (p[[3]] + p[[4]]) / (p[[5]] + p[[6]])

ggsave("paper/cluster-resids-grbf.pdf",
  ArrangedPlot,
  width = 8,
  height = 8,
  units = "in"
)
print(ArrangedPlot)

Figure 14: Q-Q plots assessing normality for the distribution of residuals stratifed by cluster assignment.

Model metrics

Code

summaryplot(models.grbf[[2]],
  models.grbf[[3]],
  models.grbf[[4]],
  models.grbf[[5]],
  models.grbf[[6]],
  models.grbf[[7]],
  #  models.grbf[[8]],
  which = c("loglik", "AIC", "BIC", "entropy", "ICL"),
  mfrow = c(2, 3),
  axis = "Class"
)

Figure 15: Model metrics for FCAL for G = 2-8

Code

grbf.metrics <- makeMetrics(G.grbf, models.grbf)
buildDT(grbf.metrics)

Clusters	Maximum log-likelihood	AIC	BIC
2	-16590.47	33210.94	33285.09
3	-16422.90	32891.80	33005.49
4	-16340.79	32743.58	32896.81
5	-16272.49	32622.98	32815.76
6	-16237.73	32569.45	32801.78
7	-16205.04	32520.07	32791.94
8	-16187.21	32500.41	32811.83

The optimum AIC is 32500.41 with cluster 10. The optimum BIC is 32791.94 with cluster 9

Spaghetti plots per cluster

Code

for (G in G.grbf) {
  cat(paste0("##### G = ", G, "\n\n"))
  cat("###### Log-scale, all subjects \n\n")
  grid::grid.newpage()
  spaghettiPlot(fcal,
    models.grbf,
    G = G,
    log = TRUE,
    tmax = 7,
    sizes = TRUE,
    knots = TRUE,
    knot.type = "equal",
    grbf = TRUE,
    var.time = "calpro_time"
  )
}

##### G = 2

###### Log-scale, all subjects

##### G = 3

###### Log-scale, all subjects

##### G = 4

###### Log-scale, all subjects

##### G = 5

###### Log-scale, all subjects

##### G = 6

###### Log-scale, all subjects

##### G = 7

###### Log-scale, all subjects

##### G = 8

###### Log-scale, all subjects

IBD Associations

Code

cluster <- numeric()

dict.grbf <- subset(dict, ids %in% unique(fcal$ids))

for (id in dict.grbf$ids) {
  cluster <- c(
    cluster,
    subset(models.grbf[[6]]$pprob, ids == id)$class
  )
}


dict.grbf$diagnosis <- as.factor(dict.grbf$diagnosis)
dict.grbf$cluster <- as.factor(cluster)
rm(cluster)

Code

describe_cat("diagnosis", dict.grbf)


   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
               544                380                112                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
         0.5250965          0.3667954          0.1081081          0.0000000

Code

describe_cat("diagnosis", subset(dict.grbf, cluster == 1))


   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
               117                 45                 24                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
         0.6290323          0.2419355          0.1290323          0.0000000

Code

describe_cat("diagnosis", subset(dict.grbf, cluster == 2))


   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
               173                 48                 35                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
         0.6757812          0.1875000          0.1367188          0.0000000

Code

describe_cat("diagnosis", subset(dict.grbf, cluster == 3))


   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
                72                 33                 12                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
         0.6153846          0.2820513          0.1025641          0.0000000

Code

describe_cat("diagnosis", subset(dict.grbf, cluster == 4))


   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
                70                193                 18                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
        0.24911032         0.68683274         0.06405694         0.00000000

Code

describe_cat("diagnosis", subset(dict.grbf, cluster == 5))


   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
                83                 44                 23                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
         0.5533333          0.2933333          0.1533333          0.0000000

Code

describe_cat("diagnosis", subset(dict.grbf, cluster == 6))


   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
                29                 17                  0                  0 

   Crohn's Disease Ulcerative Colitis               IBDU               <NA> 
         0.6304348          0.3695652          0.0000000          0.0000000

Fisher’s exact test

Code

fisher.test(dict.grbf$diagnosis,
  dict.grbf$cluster,
  simulate.p.value = TRUE, # calculate p-values using Monte Carlo
  B = 1e5
) # replicates for Monte Carlo test


    Fisher's Exact Test for Count Data with simulated p-value (based on
    1e+05 replicates)

data:  dict.grbf$diagnosis and dict.grbf$cluster
p-value = 1e-05
alternative hypothesis: two.sided

NCS vs GRBF comparison

Code

pred.df <- data.frame(
  calpro_time = seq(0, 7, by = 0.01),
  diagnosis = "Crohn's Disease"
)
pred.df <- rbind(
  pred.df,
  data.frame(
    calpro_time = seq(0, 7, by = 0.01),
    diagnosis = "Ulcerative colitis"
  )
)
pred.df <- rbind(
  pred.df,
  data.frame(
    calpro_time = seq(0, 7, by = 0.01),
    diagnosis = "IBDU"
  )
)

p.list <- list()

l <- 1.2
centers <- c(7 / 4, 14 / 4, 21 / 4)
pred.df$grbf1 <- exp(-1 * ((pred.df$calpro_time - centers[1])^2) / (2 * l^2))
pred.df$grbf2 <- exp(-1 * ((pred.df$calpro_time - centers[2])^2) / (2 * l^2))
pred.df$grbf3 <- exp(-1 * ((pred.df$calpro_time - centers[3])^2) / (2 * l^2))

NCS.pred <- predictY(models.fcal[[6]], pred.df, "calpro_time", draws = TRUE)$pred
GRBF.pred <- predictY(models.grbf[[6]], pred.df, "calpro_time", draws = TRUE)$pred

NCS.switch <- c(4, 1, 6, 2, 5, 3)

for (g in 1:6) {
  plot.df <- data.frame(
    time = pred.df$calpro_time,
    pred = NCS.pred[, NCS.switch[g]],
    upper = NCS.pred[, NCS.switch[g] + 6],
    lower = NCS.pred[, NCS.switch[g] + 12],
    Model = "NCS"
  )
  plot.df <- rbind(
    plot.df,
    data.frame(
      time = pred.df$calpro_time,
      pred = GRBF.pred[, g],
      upper = GRBF.pred[, g + 6],
      lower = GRBF.pred[, g + 12],
      Model = "GRBF"
    )
  )

  p <- plot.df %>%
    ggplot(aes(x = time, color = Model)) +
    geom_line(aes(y = pred)) +
    geom_line(aes(y = upper), lty = 2) +
    geom_line(aes(y = lower), lty = 2) +
    xlab("Time") +
    ylab("Predicted cluster mean") +
    ylim(3, 8) +
    theme_minimal() +
    scale_color_manual(values = c("#0E6BA8", "#EF6461"))
  p.list[[g]] <- p
}

final.plot <- (p.list[[1]] + p.list[[2]]) /
  (p.list[[3]] + p.list[[4]]) /
  (p.list[[5]] + p.list[[6]]) +
  plot_annotation(tag_levels = "A") +
  plot_layout(guides = "collect") & theme(legend.position = "bottom")


ggsave("paper/comparison.pdf",
  final.plot,
  width = 8.3,
  height = 11.7,
  units = "in"
)
ggsave("paper/comparison.png",
  final.plot,
  width = 8.3,
  height = 11.7,
  units = "in"
)
print(final.plot)

Acknowledgments

This work is funded by the Medical Research Council & University of Edinburgh via a Precision Medicine PhD studentship (MR/N013166/1, to NC-C).

This work has made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF).

Author contributions

References

Constantine-Cooke, Nathan, Karla Monterrubio-Gómez, Nikolas Plevris, Lauranne A. A. P. Derikx, Beatriz Gros, Gareth-Rhys Jones, Riccardo E. Marioni, Charlie W. Lees, and Catalina A. Vallejos. 2023. “Longitudinal Fecal Calprotectin Profiles Characterize Disease Course Heterogeneity in Crohn’s Disease.” Clinical Gastroenterology and Hepatology, March, S1542356523002343. https://doi.org/10.1016/j.cgh.2023.03.026.

Elhakeem, Ahmed, Rachael A. Hughes, Kate Tilling, Diana L. Cousminer, Stefan A. Jackowski, Tim J. Cole, Alex S. F. Kwong, et al. 2022. “Using Linear and Natural Cubic Splines, SITAR, and Latent Trajectory Models to Characterise Nonlinear Longitudinal Growth Trajectories in Cohort Studies.” BMC Medical Research Methodology 22 (1): 68. https://doi.org/10.1186/s12874-022-01542-8.

Harrell, Frank E. 2015. “General Aspects of Fitting Regression Models.” In Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, edited by Jr. Harrell Frank E., 13–44. Springer Series in Statistics. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-19425-7_2.

Proust-Lima, Cécile, Viviane Philipps, and Benoit Liquet. 2017. “Estimation of Extended Mixed Models Using Latent Classes and Latent Processes: The R Package lcmm.” Journal of Statistical Software 78 (2): 1–56. https://doi.org/10.18637/jss.v078.i02.

Reuse

Licensed by CC BY unless otherwise stated.

MRC Human Genetics Unit logo

Session info

Code

pander(sessionInfo())

R version 4.4.2 (2024-10-31)

Platform: aarch64-apple-darwin20

locale: en_GB.UTF-8||en_GB.UTF-8||en_GB.UTF-8||C||en_GB.UTF-8||en_GB.UTF-8

attached base packages: splines, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: qqplotr(v.0.0.6), pander(v.0.6.5), ggalluvial(v.0.12.5), ggdist(v.3.3.2), kml(v.2.5.0), longitudinalData(v.2.4.7), misc3d(v.0.9-1), rgl(v.1.3.14), clv(v.0.3-2.4), class(v.7.3-22), cluster(v.2.1.6), lcmm(v.2.1.0), libdr(v.1.0.0), magrittr(v.2.0.3), patchwork(v.1.3.0), lubridate(v.1.9.3), forcats(v.1.0.0), stringr(v.1.5.1), dplyr(v.1.1.4), purrr(v.1.0.2), readr(v.2.1.5), tidyr(v.1.3.1), tibble(v.3.2.1), ggplot2(v.3.5.1) and tidyverse(v.2.0.0)

loaded via a namespace (and not attached): tidyselect(v.1.2.1), viridisLite(v.0.4.2), farver(v.2.1.2), viridis(v.0.6.5), bitops(v.1.0-9), fastmap(v.1.2.0), pracma(v.2.4.4), digest(v.0.6.37), timechange(v.0.3.0), lifecycle(v.1.0.4), survival(v.3.7-0), compiler(v.4.4.2), rlang(v.1.1.4), tools(v.4.4.2), utf8(v.1.2.4), yaml(v.2.3.10), knitr(v.1.49), labeling(v.0.4.3), htmlwidgets(v.1.6.4), plyr(v.1.8.9), withr(v.3.0.2), numDeriv(v.2016.8-1.1), grid(v.4.4.2), fansi(v.1.0.6), opdisDownsampling(v.1.0.1), caTools(v.1.18.3), colorspace(v.2.1-1), extrafontdb(v.1.0), scales(v.1.3.0), iterators(v.1.0.14), MASS(v.7.3-61), cli(v.3.6.3), mvtnorm(v.1.3-2), rmarkdown(v.2.29), ragg(v.1.3.3), marqLevAlg(v.2.0.8), generics(v.0.1.3), twosamples(v.2.0.1), robustbase(v.0.99-4-1), reshape2(v.1.4.4), tzdb(v.0.4.0), parallel(v.4.4.2), base64enc(v.0.1-3), vctrs(v.0.6.5), Matrix(v.1.7-1), jsonlite(v.1.8.9), hms(v.1.1.3), pbmcapply(v.1.5.1), systemfonts(v.1.1.0), foreach(v.1.5.2), glue(v.1.8.0), DEoptimR(v.1.1-3-1), codetools(v.0.2-20), rngWELL(v.0.10-10), distributional(v.0.5.0), stringi(v.1.8.4), gtable(v.0.3.6), quadprog(v.1.5-8), extrafont(v.0.19), randtoolbox(v.2.0.5), munsell(v.0.5.1), pillar(v.1.9.0), htmltools(v.0.5.8.1), R6(v.2.5.1), textshaping(v.0.4.0), tcltk(v.4.4.2), doParallel(v.1.0.17), evaluate(v.1.0.1), lattice(v.0.22-6), qqconf(v.1.3.2), Rcpp(v.1.0.13-1), gridExtra(v.2.3), nlme(v.3.1-166), Rttf2pt1(v.1.3.12), xfun(v.0.49) and pkgconfig(v.2.0.3)