Preprocessing

Nathan Constantine-Cooke; Karla Monterrubio-Gómez; Catalina A. Vallejos

Preprocessing

Authors

Affiliations

Nathan Constantine-Cooke

Centre for Genomic and Experimental Medicine

MRC Human Genetics Unit

Karla Monterrubio-Gómez

MRC Human Genetics Unit

Centre for Genomic and Experimental Medicine

Catalina A. Vallejos

MRC Human Genetics Unit

Published

November 27, 2024

Introduction

Code

set.seed(123)
### Required packages
library(tidyverse) # ggplot2, dplyr, and magrittr
library(knitr) # Markdown utilities
library(pander) # Pretty markdown rendering
library(datefixR) # Standardising dates
library(lubridate) # Date handling
library(patchwork) # Arranging plots

This page details preprocessing steps undertaken prior to model fitting. These steps include examining data quality, performing data cleaning, and deriving the study cohort.

The biomarker (FC and CRP) data used in this pipeline have been primarily obtained from TRAK, a system used by NHS Lothian for the electronic ordering of laboratory tests. However, we also use phenotyping data manually curated by the Lees group for previous studies (see Jenkinson et al. (2020)).

Prior to this report, all subject CHI (community health index) numbers, which uniquely identifies a patient when they interact with NHS Scotland services, were pseudoanonymised. Each CHI has been replaced with a unique random number.

Inclusion/exclusion criteria definitions

The inclusion criteria for this study can broadly be categorised into two classifications: baseline and longitudinal. Baseline criteria can be applied using information known at diagnosis whilst longitudinal criteria are based on biomarker measurements taken over time.

The longitudinal criteria for each biomarker (FC and CRP) are considered independently. If a subject meets the below criteria for FC but not CRP, then only the FC for the subject will be modelled and vice versa.

Baseline
Longitudinal

Diagnosis of Inflammatory bowel disease

As inflammatory bowel disease (IBD) is the disease of interest for this study, subjects are required to have a confirmed diagnosis of either Crohn’s disease, ulcerative colitis, or inflammatory bowel disease unclassified.

Diagnosis date

Study subjects are required to have been diagnosed with IBD between October 2004 and December 2017. The lower bound is required as FC tests were not introduced until 2005 and we require study subjects to have a FC available close to diagnosis (see the longitudinal tab). The upper bound is required to ensure subjects have the opportunity to have at least five years of follow-up. The most recent FC observation in this dataset is January 2023 which informed when deciding upon this cut-off.

Diagnostic measurement

We require study subjects to have a biomarker measurement taken ±90 days of reported date of diagnosis.

At least three non-censored measurements

To ensure stability and interpretability, study subjects are also required to have at three non-censored biomarker measurements taken across follow-up. Like most biomarkers, the biomarkers used in this study are subject to censoring. However, these censored observations make it impossible to detect changes over time and we therefore do not wish to be reliant on censored observations.

Datasets

Code

if (file.exists("/.dockerenv")) { # Check if running in Docker
  # Assume igmm/Vallejo-predict/libdr/ is passed to the data volume
  prefix <- "data/"
} else {
  # Assume running outside of a Docker container and the IGC(/IGMM) datastore is
  # mounted at /Volumes
  prefix <- "/Volumes/igmm/cvallejo-predicct/libdr/"
}

fcal.pheno <- read.csv(file.path(prefix, "2024-10-03/fcal-cleaned.csv"))

# Extract from TRAK which also now introduces CRP.
labs <- read.csv(file.path(prefix, "2024-10-03/markers-cleaned.csv"))
fcal <- subset(labs, TEST == "f-Calprotectin-ALP")

# Add sex and diagnosis type from fcal.pheno
fcal <- fcal.pheno %>%
  distinct(ids, .keep_all = TRUE) %>%
  select(ids, sex, diagnosis, diagnosis_date) %>%
  merge(x = fcal, by = "ids", all.x = TRUE, all.y = FALSE)

crp <- subset(labs, TEST == "C-Reactive Prot")
updated <- read.csv(file.path(prefix, "2024-10-03/allPatientsNathanCleaned.csv"))
outcomes <- read.csv(file.path(prefix, "2024-10-03/cd-cleaned.csv"))

non.ibd <- read.csv(file.path (prefix, "2024-10-24/non-ibd.csv"))

Multiple datasets are used in this report with varying degrees of previous curation.

markers.csv is a 2024 extraction from TRAKcare describing 286294 test results (FC: 60866; CRP: 225428). Each row represents a test result with columns indicating the corresponding subject ID, the test type (FC or CRP), the time of the test, and the test result. This dataset contains the most recent data out of the other datasets. However, there has been no manual curation of these data, and no information about the patient’s characteristics are available (other than their ID).
fcal-cleaned.csv does not describe CRP data. However, these data have been curated by the Lees group. In addition to 60866 FC test results, each row also provides the sex, IBD type, date of diagnosis, and date of death (if applicable) for the corresponding subject.
allPatientsNathanCleaned.csv provides sex, IBD type, and diagnosis date for most (but not all) subjects in labs.csv not described by fcal-cleaned.csv.
cd-cleaned.csv describes CD patient data (n = 1753). In addition to the same basic patient characteristic variables in other datasets, this dataset also describes surgical outcomes and biologic treatments prescribed to the subject.
non-ibd.csv describes subjects who were found to not have IBD during the phenotyping process. This dataset is used to remove subjects who do not have IBD from the analysis.

Note

The “cleaned” in the above file names refers to the CHI numbers having been replaced. No further processing of the data has been undertaken for this study prior to the steps outlined in this report.

Subject data

We will create a “dictionary” which, for each subject, lists their subject ID, their IBD type, and their date of diagnosis. In addition to ensuring inclusion criteria are met, date of diagnosis will be used to retime all biomarker measurements of a subject such that t=0 corresponds to their diagnosis date. IBD types are used to test if clusters are enriched by IBD type and may be used to adjust the clustering.

There are multiple potential sources for date of diagnosis and IBD type: the CD outcomes dataset, a demographics dataset maintained by the Lees group, and a raw TRAK extraction. For every subject listed in at least one of these datasets, we have tried to extract their date of diagnosis and diagnosis type. We have assumed a hierarchy of datasets where manually curated datasets are preferred for extracting data over a dataset generated with minimal human involvement.

Diagnosis and diagnosis date

Code

ids <- unique(c(outcomes$ids, fcal$ids, updated$ids))
diagnosis <- character()
date.of.diag <- character()
for (id in ids) {
  if (id %in% outcomes$ids) {
    # Outcomes data only contains CD subjects
    diagnosis <- c(diagnosis, "Crohn's Disease")
    date.of.diag <- c(
      date.of.diag,
      subset(outcomes, ids == id)$diagnosisDate
    )
  } else if (id %in% updated$ids) {
    diagnosis <- c(diagnosis, subset(updated, ids == id)[1, "diagnosis"])
    date.of.diag <- c(
      date.of.diag,
      subset(updated, ids == id)[1, "diagnosisDate"]
    )
  } else if (id %in% fcal$ids) {
    diagnosis <- c(diagnosis, subset(fcal, ids == id)[1, "diagnosis"])
    date.of.diag <- c(
      date.of.diag,
      subset(fcal, ids == id)[1, "diagnosis_date"]
    )
  } else {
    diagnosis <- c(diagnosis, "Unknown")
    date.of.diag <- c(
      date.of.diag,
      "Unknown"
    )
  }
}

dict <- data.frame(
  ids = ids,
  diagnosis = diagnosis,
  date.of.diag = fix_date_char(date.of.diag)
)


rm(id, ids, diagnosis, date.of.diag) # clean up
dict <- fix_date_df(dict, "date.of.diag")

Before filtering by inclusion/exclusion criteria, there are 7 individuals for which a diagnostic date is missing. However, 0 missing values are observed for diagnosis (IBD type).

As can be seen in the below table, two spellings of Crohn’s disease (with and without an apostrophe (’)) and three names for IBDU are used. There are also subjects reported to not have IBD. These will be removed, leaving 10153 subjects included in the subsequent analyses.

Code

kable(unique(dict$diagnosis),
  col.names = c("Diagnosis")
)
dict <- dict %>%
  subset(diagnosis != "Not IBD") %>%
  subset(!(ids %in% non.ibd$ids))

Table 1: Reported IBD types

Diagnosis
Crohn’s Disease
Inflamatory Bowel Disease - Unknown Subtype
Crohns Disease
Ulcerative Colitis
Inflamatory Bowel Disease
IBDU
Not IBD

The Crohn’s disease and IBDU names have been standardised (IBDU, Inflammatory Bowel Disease, Inflammatory Bowel Disease - Unknown Subtype are assumed to be equivalent). The subject reported to not have IBD has been removed.

Code

dict$old <- dict$diagnosis
dict$diagnosis <- plyr::mapvalues(
  dict$diagnosis,
  from = c(
    "Crohns Disease",
    "Inflamatory Bowel Disease",
    "Inflamatory Bowel Disease - Unknown Subtype"
  ),
  to = c(
    "Crohn's Disease",
    "IBDU",
    "IBDU"
  )
)

kable(addmargins(table(dict$diagnosis, dict$old), margin = 2))
dict$old <- NULL

Table 2: Contingency table showing mapping of IBD types to a standardised format

	Crohn’s Disease	Crohns Disease	IBDU	Inflamatory Bowel Disease	Inflamatory Bowel Disease - Unknown Subtype	Ulcerative Colitis	Sum
Crohn’s Disease	2471	1328	0	0	0	0	3799
IBDU	0	0	1	242	841	0	1084
Ulcerative Colitis	0	0	0	0	0	5270	5270

Age and sex

Code

# Merge with fcal to add sex information
dict <- fcal.pheno[, c("ids", "sex")] %>%
  distinct(ids,
    .keep_all = TRUE
  ) %>%
  merge(x = dict, by = "ids", all.x = TRUE, all.y = FALSE)

# Update NA sex if sex available from updated
dict <- merge(dict,
  updated[, c("ids", "sex")],
  by = "ids",
  all.x = TRUE,
  all.y = FALSE
)

for (i in 1:nrow(dict)) {
  if (is.na(dict[i, "sex.x"]) & !is.na(dict[i, "sex.y"])) {
    dict[i, "sex.x"] <- dict[i, "sex.y"]
  }
}

dict$sex <- dict$sex.x
dict$sex.x <- dict$sex.y <- NULL

# Add age at IBD diagnosis
updated <- fix_date_df(updated, "diagnosisDate")
updated$age <- with(updated, year(diagnosisDate) - dateOfBirth)
dict <- merge(dict,
  updated[, c("ids", "age")],
  by = "ids",
  all.x = TRUE,
  all.y = FALSE
)

Note that there 7 individuals for which age at diagnosis is missing (this is due to missing diagnostic dates). However, there are 0 individuals for which sex is missing.

Date of death

Code

dict <- merge(dict,
  updated[, c("ids", "death")],
  by = "ids",
  all.x = TRUE,
  all.y = FALSE
) %>%
  rename(date.of.death = death)

dict <- fix_date_df(dict, "date.of.death")

Date of death information is available for 1136 individuals. We assume that all remaining individuals have not died by the date of data extraction.

Exclusion criteria according to diagnosis and date of diagnosis

Date of diagnosis in an integral aspect of the inclusion criteria and also used to retime biomarker measurements.

Firstly, we remove 7 individuals with a missing date of diagnosis.

Code

dict <- dict[!is.na(dict$date.of.diag), ]

This leaves 10146 for subsequent analyses.

Code

# no. subjects over upper bound
max_fup <- max(labs$COLLECTION_DATE)
max_fup <- as.Date(max_fup)
n.upper <- nrow(subset(dict, max_fup - date.of.diag < 5 * 365.25))

# no. subjects under lower bound
n.lower <- nrow(subset(dict, (year(date.of.diag) < 2004) |
  (year(date.of.diag) == 2004 & month(date.of.diag) <= 9)))

# subset to subjects meeting the criteria.
dict <- subset(dict, max_fup - date.of.diag >= 5 * 365.25)
dict <- subset(dict, (year(date.of.diag) >= 2005) |
  (year(date.of.diag) == 2004 & month(date.of.diag) >= 10))

As described in the inclusion/exclusion criteria section, the inclusion criteria considers upper and lower bounds for the diagnostic dates. In this case, we only consider subjects diagnosed after October 2004 and 13 August 2019 (this ensures individuals can be followed up by at least 5 year by the time of the last longitudinal measument obtained for the cohort; 2024-08-13). As such, subjects who do not meet this criteria will be excluded. 77 subjects were diagnosed on 13 August 2019 or later and were removed, and 4561 subjects were diagnosed prior to October 2004 and were likewise removed. This results in a cohort size of 5508 subjects.

Exploration

We now explore date of diagnosis to ensure data quality is at a suitable standard.

Year of diagnosis

From Figure 1, we can see the number of IBD diagnoses each year is relatively static with the exception of 2004 and 2005. The former describes only three months of the year, and incidence was likely still increasing across both of these two years. Jones et al. (2019) found IBD incidence from 2008 onwards to be consistent over time for patients diagnosed by NHS Lothian which is in agreement with our findings.

Code

dict %>%
  ggplot(aes(x = year(date.of.diag))) +
  geom_histogram(fill = "#B8F2E6", color = "black", binwidth = 1) +
  theme_minimal() +
  xlab("Year of IBD diagnosis") +
  ylab("Frequency")

Figure 1: Distribution of year of diagnosis

Month of diagnosis

It appears there are subjects for whom only year of diagnosis was available. This has resulted in the 1^st of January for that year being recorded as the date of diagnosis for that subject. As such, far more diagnoses are reported in January than in other months (Figure 3). There are 1062 subjects reported to have been diagnosed on New Year’s day .

If a subject’s exact date of diagnosis is not known, then this is most likely because the subject was not diagnosed by NHS Lothian. Instead, the date of diagnosis would have needed to be recalled by the subject which introduces the inaccuracy observed.

As subjects are require a diagnostic biomarker measurement to be available within 90 days of reported date of diagnosis to be included in this study and patients diagnosed outside of NHS Lothian will not have a biomarker test result available within this period, the effect on our study should be minimal. However, we will revisit month of diagnosis after filtering by existence of diagnostic biomarker measurements to confirm our assumption.

Code

dict %>%
  ggplot(aes(x = as.factor(year(date.of.diag)))) +
  geom_bar(color = "black", fill = "#FEC601", linewidth = 0.3) +
  theme_minimal() +
  ylab("Frequency") +
  xlab("Year of IBD Diagnosis")

Code

dict %>%
  ggplot(aes(x = as.factor(month(date.of.diag, label = TRUE)))) +
  geom_bar(color = "black", fill = "#FEC601", linewidth = 0.3) +
  theme_minimal() +
  ylab("Frequency") +
  xlab("Month of IBD Diagnosis")

Figure 3: Bar plot of month of diagnosis

Code

diag.ts <- dict %>%
  mutate(
    date.of.diag.month =
      as.Date(paste0(format(date.of.diag, "%Y-%m"), "-01"), format = "%Y-%m-%d")
  ) %>%
  group_by(date.of.diag.month) %>%
  summarise(count = n()) %>%
  ungroup()

ggplot(diag.ts, aes(x = as.Date(date.of.diag.month, "%Y-%m"), y = count, group = 1)) +
  geom_line() +
  xlab("Diagnosis date") +
  ylab("Number of diagnosis") +
  scale_x_date(date_labels = "%Y-%m") +
  theme_minimal()

Faecal calprotectin

Faecal calprotectin (FC), a marker of intestinal inflammation, has been obtained from two datasets. The first dataset has been curated by members of the Lees group whilst the second dataset is a direct extract from TRAK, a patient monitoring system used by NHS Lothian.

Incorporating later extract

Data from these datasets has been merged and duplicates of data (same subject ID, measurement date, and recorded value) were removed. We also reduced the FC dataset to only describe subjects for whom IBD type and date of diagnosis are available for.

For the TRAK extract, times for test results are given in datetime format (where both date and time of the day are provided). The times have been dropped as this degree of granularity is not required. Duplicate tests (same id, date and test value) are removed here.

Code

fcal <- fcal[, c(
  "ids",
  "COLLECTION_DATE",
  "TEST_DATA",
  "sex",
  "diagnosis",
  "diagnosis_date"
)]

# Subset to only include those that passed the earlier inclusion/exclusion
fcal <- subset(fcal, ids %in% dict$ids)

# Collection dates include collection times which are not required. Discarding.

fcal$COLLECTION_DATE <- readr::parse_date(
  stringr::str_split_fixed(fcal$COLLECTION_DATE, " ", n = 2)[, 1],
  format = "%Y-%m-%d"
)

colnames(fcal)[1:3] <- c("ids", "calpro_date", "calpro_result")

fcal <- subset(fcal, ids %in% dict$ids)
fcal <- fcal %>%
  select(-diagnosis)

fcal <- merge(fcal,
  dict[, c("ids", "diagnosis", "date.of.diag")],
  by = "ids",
  all.x = TRUE
)

Pre-processing of test results

Code

## Some values cannot be directly coerced as numeric

fcal$calpro_result <- plyr::mapvalues(
  fcal$calpro_result,
  from = c(
    "<20",
    "<25",
    "<50",
    ">1250",
    ">2500",
    ">3000",
    ">6000"
  ),
  to = c(
    "20",
    "25",
    "50",
    "1250",
    "2500",
    "3000",
    "6000"
  )
)

# Here, values that cannot be converted into a numeric value will be excluded
# (they are converted to NA)
fcal$calpro_result <- as.numeric(fcal$calpro_result) # Remove error codes
fcal <- fcal[!is.na(fcal[, "calpro_result"]), ]

# Apply limits of detection
fcal[fcal[, "calpro_result"] < 20, "calpro_result"] <- 20
fcal[fcal[, "calpro_result"] > 1250, "calpro_result"] <- 1250

FC data can be both left and right censored. FC recorded as “<20” were mapped to “20”. FC recorded as “>1250”, “>2500”, or “>6000” were all mapped to “1250”. Any FC tests given an error code (e.g. marked as an insufficient sample by the laboratory) have been removed.

More information on FC censoring

The lower limit of detection is <20 \mu g/g (20 \mu g of calprotectin per g of stool). By reducing the upper limit, it is possible to run more tests in parallel. As a higher throughput has been required over time, the upper threshold for tests has become lower. Initially test results over 6000 \mu g/g were censored, then 2500 \mu g/g and now 1250 \mu g /g is the upper limit for FC tests. This change has resulted in minimal impact in clinics as 1250 \mu g/g is still considered a high result. However, this change has potential implications for research.

Code

fcal %>%
  ggplot(aes(x = calpro_result)) +
  geom_density(
    linewidth = 0.8,
    alpha = 0.5,
    fill = "#9FD8CB",
    color = "#517664"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("FCAL (µg/g)") +
  ylab("Density") +
  geom_vline(xintercept = 250, colour = "red")

Figure 4: Density plot of FCAL measurements by observed value

FC test results on the original measurement scale are heavily right-skewed towards 100 \mu g/g. These data will be log transformed- resulting in the multi-modal distribution seen in Figure 5.

Code

fcal %>%
  ggplot(aes(x = log(calpro_result))) +
  geom_density(
    linewidth = 0.8,
    alpha = 0.5,
    fill = "#9FD8CB",
    color = "#517664"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("log(FCAL (µg/g))") +
  ylab("Density") +
  geom_vline(xintercept = log(250), colour = "red")

Figure 5: Density plot of logged FCAL test results

Removal of duplicate FCAL measurements

Code

duplicated.ids <- c()

for (id in unique(fcal$ids)) {
  sub.fcal <- subset(fcal, ids == id) # Get FC data for a subject
  sub.fcal <- sub.fcal[order(sub.fcal$calpro_date), ] # Order by dates
  time.diff <- diff(sub.fcal$calpro_date) # Find time between ordered dates
  value.diff <- diff(sub.fcal$calpro_result) # Find difference in observed values
  # If two measurements are within 10 days of each other and have the same value
  if (any(time.diff <= 10 & value.diff == 0)) {
    duplicated.ids <- c(duplicated.ids, id)
    # Remove suspected duplicates (taking into account difference is lagged)
    sub.fcal <- sub.fcal[c(TRUE, !(time.diff <= 10 & value.diff == 0)), ]
    # Remove data for subject with duplicates
    fcal <- subset(fcal, ids != id)
    # Add non duplicated data back
    fcal <- rbind(fcal, sub.fcal)
  }
}

Given FC was retrospectively collected from observational data, it is possible duplicate tests have been recorded in the extract. As it is NHS Lothian policy to not test FC days apart, we can assume that any tests with the same observation within a small time frame are likely to be duplicates. This may occur when collection dates and testing dates have been used interchangeably.

If a subsequent FC test for a subject was dated within 10 days of a previous test and has the same observed value, this observation was dropped.

Code

htmltools::p(paste(
  "There are",
  length(duplicated.ids),
  "subjects who appear to have duplicated FC measurements"
))

There are 27 subjects who appear to have duplicated FC measurements

Retiming - part 1

Time of FC measurements were retimed and scaled to be the number of years since IBD diagnosis.

Code

# Dates have already been converted to Date class by fix_date_char() for dict
fcal$calpro_time <- as.numeric(fcal$calpro_date - fcal$date.of.diag) / 365.25

fcal %>%
  ggplot(aes(x = calpro_time, y = log(calpro_result), color = factor(ids))) +
  geom_line(alpha = 0.2) +
  geom_point(alpha = 0.6) +
  theme_minimal() +
  scale_color_manual(values = viridis::viridis(length(unique(fcal$ids)))) +
  guides(color = "none") +
  xlab("Time (years)") +
  ylab("Log(FCAL (µg/g))") +
  ggtitle("")

Figure 6: Spaghetti plot of FC trajectories (preprocessed)

After retiming, it is clear some FC observations are earlier than diagnosis and in some cases substantially earlier (Figure 6). Tests taken close to the reported date of diagnosis are likely “early” as a result of diagnostic delay whereas tests from much earlier were likely requested due to other conditions.

As such, FC results earlier than 90 days prior to diagnosis will be discarded. If a subject has an FC within 90 days before diagnosis, then all of their FC tests will be retimed such that their earliest FC within this period is equal to 0 and all later measurements are shifted accordingly to maintain the same differences between measurement times.

Code

fcal <- subset(fcal, calpro_time >= -0.25)

Inclusion/exclusion: removal of subjects without a diagnostic FC test

As indicated in our inclusion/exclusion criteria, we reduce the dataset to only subjects with a diagnostic FC. This equates to subjects with at least one FCAL measurement within 3 months of diagnosis t \leq 0.25.

Code

diagnostic <- fcal %>%
  group_by(ids) %>%
  summarise(n = sum(calpro_time < 0.25)) %>%
  subset(n > 0)

fcal <- subset(fcal, ids %in% diagnostic$ids)

After this exclusion, 1554 subjects remain in the data.

Code

##The following code is used to save the cleaned data generated so far. 

diag.time <- c()
fc.ids <-  unique(fcal$ids)

for (id in fc.ids) {
  temp <- subset(fcal, ids == id)
  temp <- temp[order(temp$calpro_time), ]
  diag.time <- c(diag.time, temp[1, "calpro_time"])
}


fc.dist <- data.frame(ids = fc.ids, diagnostic = diag.time)

if (!dir.exists(paste0(prefix, "processed"))) {
  dir.create(paste0(prefix, "processed"))
}

saveRDS(fc.dist, paste0(prefix, "processed/fc-diag-dist.RDS"))

Retiming - part 2

Here, we show the distribution of the diagnostic FCAL measurements with respect to diagnosis date (in days).

Code

p <- fcal %>%
  group_by(ids) %>%
  filter(calpro_time == min(calpro_time)) %>%
  ggplot(aes(x = calpro_time * 365.25)) +
  geom_density(fill = "#20A39E", color = "#187370") +
  theme_minimal() +
  labs(
    y = "Density",
    x = "Time from diagnosis to first faecal calprotectin (days)"
  ) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red")

ggsave("plots/fc-diagnostic-dist.png",
  p,
  width = 16 * 2 / 3,
  height = 6,
  units = "in"
)

Although most diagnostic measurements were observed around the recorded diagnosis time (near zero), this is not always the case. As such, we decided to retime measurements/diagnosis time as described above (i.e. time = 0 matches the time of first available FCAL measurement). This is only applied to individuals for which the first available CRP is prior to the recorded diagnosis time.

Code

# Retime so that t_0 = 0.
for (id in unique(fcal$ids)) {
  temp <- subset(fcal, ids == id)
  if (any(temp$calpro_time < 0)) {
    add <- sort(temp$calpro_time)[1]
    fcal[fcal[, "ids"] == id, "calpro_time"] <-
      fcal[fcal[, "ids"] == id, "calpro_time"] + abs(add)
  }
}

length(unique(fcal$ids))

[1] 1554

Code

fcal %>%
  ggplot(aes(x = calpro_time, y = log(calpro_result), color = factor(ids))) +
  geom_line(alpha = 0.2) +
  geom_point(alpha = 0.6) +
  theme_minimal() +
  scale_color_manual(values = viridis::viridis(length(unique(fcal$ids)))) +
  guides(color = "none") +
  xlab("Time (years)") +
  ylab("Log(FCAL (µg/g))") +
  ggtitle("")

Figure 7: Spaghetti plot of FC trajectories (processed)

At this stage, we also exclude FCAL measurements recorded beyond 7 years post diagnosis.

Code

fcal <- subset(fcal, calpro_time <= 7)
length(unique(fcal$ids))

[1] 1554

At this point, the data contains 11582 FCAL measurements across all 1554 individuals.

Inclusion/exclusion: a minimum number of FCAL measurements

Here, we summarise the number of observations per individual (with and without considering censored values) as well as length of follow-up. The latter is defined as the difference (in years) between diagnosis time and the time of the latest FCAL measurement.

Code

countsDF <- fcal %>%
  group_by(ids) %>%
  summarise(
    n.total = n(),
    censored.left = sum(calpro_result == 20),
    censored.right = sum(calpro_result == 1250),
    n.noncensored = n.total - censored.left - censored.right,
    n.negtime.nondiag = sum(calpro_time != 0 & calpro_date - date.of.diag < 0),
    followup = max(calpro_time)
  )

Code

p1 <- countsDF %>%
  ggplot(aes(x = n.total)) +
  geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
  theme_minimal() +
  ylab("Count") +
  xlab("Measurements per subject")
p2 <- countsDF %>%
  ggplot(aes(x = followup)) +
  geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
  theme_minimal() +
  ylab("Count") +
  xlab("Follow-up (years)")
p3 <- countsDF %>%
  ggplot(aes(y = n.total, x = followup)) +
  geom_hex() +
  xlab("Follow-up (years since diagnosis)") +
  ylab("Number of FC measurements") +
  theme_minimal() +
  labs(fill = "Count")

p1 + p2 + p3 + plot_annotation(tag_levels = "A")

Figure 8: (A) Histogram of the number of FCAL measurements per subject. (B) Histogram of FCAL follow-up per subject (before exclusions). (C) Scatterplot with number of FCAL measurements vs follow-up.

Summary observations:

The number of FC measurements per subject is skewed towards low values. However, there are also subjects with many FC observations. These subjects were investigated and found to often have a complex disease course, such as acute severe ulcerative colitis, and required close monitoring as a result.
We do not directly require a minimum follow-up for a subject as this may bias our findings. For example, UC patients who undergo a proctocolectomy (surgical removal of the colon and rectum) are less likely to have FC measurements after their surgery than patients who did not require surgery. By applying criteria based on follow-up, we could indirectly exclude subjects based on disease outcomes.
However, the requirement of at least 3 FCAL measurements (within 7 years from diagnosis) will indirectly remove individuals with a very short follow-up time.

Note that our inclusion criteria requires at least 3 non-censored FCAL measurements per individual. As such, censoring needs to be taken into account before applying filtering out individuals.

Code

p1 <- countsDF %>%
  ggplot(aes(x = n.noncensored)) +
  geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
  theme_minimal() +
  ylab("Count") +
  xlab("Number of non-censored measurements per subject")

p2 <- countsDF %>%
  ggplot(aes(y = n.noncensored, x = followup)) +
  geom_point(color = "#FF4F79", size = 2) +
  xlab("Follow-up (years since diagnosis)") +
  ylab("Number of non-censored FCAL measurements") +
  theme_minimal() +
  geom_hline(yintercept = 3)

p1 + p2 + plot_annotation(tag_levels = "A")

Code

# Number of individuals with at least 3 FCAL measurements
sum(countsDF$n.total >= 3)

[1] 1155

Code

# Number of individuals with at least 3 non-censored FCAL measurements
sum(countsDF$n.noncensored >= 3)

[1] 1036

Code

# Number of individuals with at least 3 non-censored FCAL measurements after
# discarding non-diagnostic values with negative calpro_time
sum(countsDF$n.noncensored - countsDF$n.negtime.nondiag >= 3)

[1] 1028

The following is used to select individuals with at least 3 non-censored FCAL measurements.

Code

fcal <- fcal %>%
  subset(ids %in% countsDF$ids[countsDF$n.noncensored >= 3])

length(unique(fcal$ids))

[1] 1036

This leaves a cohort with 1036 individuals.

Code

p1 <- countsDF %>%
  subset(n.noncensored >= 3) %>%
  ggplot(aes(x = n.total)) +
  geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
  theme_minimal() +
  ylab("Count") +
  xlab("Measurements per subject")
p2 <- countsDF %>%
  subset(n.noncensored >= 3) %>%
  ggplot(aes(x = followup)) +
  geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
  theme_minimal() +
  ylab("Count") +
  xlab("Follow-up (years)")
p3 <- countsDF %>%
  subset(n.noncensored >= 3) %>%
  ggplot(aes(y = n.total, x = followup)) +
  geom_hex() +
  xlab("Follow-up (years since diagnosis)") +
  ylab("Number of FC measurements") +
  theme_minimal() +
  labs(fill = "Count")

p1 + p2 + p3 + plot_annotation(tag_levels = "A")

Figure 9: (A) Histogram of the number of FCAL measurements per subject. (B) Histogram of FCAL follow-up per subject (before exclusions). (C) Scatterplot with number of FCAL measurements vs follow-up. In all cases, only individuals with at least 3 non-censored measurements are included.

Code

fcal %>%
  select(ids) %>%
  table() %>%
  quantile(probs = c(0.25, 0.5, 0.75))

  25%   50%   75% 
 5.75  9.00 13.00

Revisiting month of diagnosis

After filtering by subjects who have a diagnostic FC available, January is no longer over-represented for month of diagnosis as we hypothesised (Figure 10)

Code

dict.temp <- subset(dict, ids %in% unique(fcal$ids))
dict.temp %>%
  ggplot(aes(x = as.factor(month(date.of.diag, label = TRUE)))) +
  geom_bar(color = "black", fill = "#FEC601", linewidth = 0.3) +
  theme_minimal() +
  ylab("Frequency") +
  xlab("Month of IBD Diagnosis")

Figure 10: Bar plot of month of diagnosis

C-reactive protein

Code

crp <- crp[, c("ids", "COLLECTION_DATE", "TEST_DATA")] %>%
  subset(ids %in% dict$ids)

# Collection dates include collection times which are not required. Discarding.
crp$COLLECTION_DATE <- readr::parse_date(
  stringr::str_split_fixed(crp$COLLECTION_DATE, " ", n = 2)[, 1],
  format = "%Y-%m-%d"
)

colnames(crp) <- c("ids", "crp_date", "crp_result")

crp <- merge(crp,
  dict[, c("ids", "diagnosis")],
  by = "ids",
  all.x = TRUE
)

C-reactive protein (CRP) is a blood-based marker of inflammation. Unlike FCAL, which is highly specific to gastrointestinal inflammation, CRP levels are indicative of overall inflammation and are more likely to be affected by extra-intestinal factors such as infections. However, patient adherence to CRP testing is very strong, and CRP has seen historic usage in the field of IBD. Although CRP has been used for many years, we unfortunately only have test results from as early as 2008 due to the transition towards using TRAK for storing CRP results instead of the previous system.

As there is only one extract for CRP data (obtained from TRAK), merging multiple datasets is not required. As with the FCAL data, the recorded time of day for each test has been discarded in favour of the date. Any potential duplicate results have been removed.

Pre-processing of test results

Code

crp$crp_result <- as.numeric(
  plyr::mapvalues(
    crp$crp_result,
    from = c(
      "<0.2", "<1", "<1.", "<1.0", "<2", "<3", "<3.0", "<5",
      ">90", ">320"
    ),
    to = c(1, 1, 1, 1, 2, 3, 3, 5, 90, 320)
  )
)

# Lower-bound censoring
crp <- crp %>%
  mutate(crp_result = if_else(crp_result < 1, 1, crp_result))

# Remove error test results
crp <- crp[!is.na(crp[, "crp_result"]), ]
# Map numerical <1 test results (e.g 0.2) to 1.

CRP is left-censored, but does not appear to be right-censored. Multiple cut-offs are reported (1.1, 1.5, 1.6, 2, 3, 3.5, 4) with there also being variations for how < 1mg/L is reported (1.5, 1.6, 2). We map every observation < 1 to 1 and map <2, <3, and < 5 to 2, 3, and 5 respectively. Any failed tests (e.g due to contamination or an insufficient sample being provided) are discarded. As such, 1mg/L is used as a left censoring value for CRP.

As can be seen in Figure 11, extremely elevated CRP has been observed in some cases. These observations are physiologically valid. However, these data may skew our models and may be candidates for removal.

Code

crp %>%
  ggplot(aes(x = crp_result)) +
  geom_density(
    linewidth = 0.8,
    alpha = 0.5,
    fill = "#FFC800",
    color = "#FF8427"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("CRP (mg/L)") +
  ylab("Density")

Figure 11: Density plot of FC measurements by observed value

As with the FC data, the CRP data is also log transformed (Figure 12).

Code

crp %>%
  ggplot(aes(x = log(crp_result))) +
  geom_density(
    linewidth = 0.8,
    alpha = 0.5,
    fill = "#FFC800",
    color = "#FF8427"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("log(CRP (mg/L))") +
  ylab("Density")

Figure 12: Density plot of logged CRP test results

Observing Figure 12, it would appear the influence of extremely elevated CRP results will be minimal.

Removal of duplicates

Code

crp <- crp %>% distinct(ids,
  crp_date,
  crp_result,
  .keep_all = TRUE
)

As multiple CRP measurements are possible within a short time-frame, we only consider as duplicates those with an exact match by ids, crp_date and crp_result.

Retiming - part 1

As with the FCAL data, the times of CRP measurements are retimed and scaled to be the number of years since IBD diagnosis.

Code

crp <- merge(crp, dict[, c("ids", "date.of.diag")],
  by = "ids", all.x = TRUE, all.y = FALSE
)

# Dates have already been converted to Date class by fix_date_char() for dict
crp$crp_time <- as.numeric(crp$crp_date - crp$date.of.diag) / 365.25

crp %>%
  ggplot(aes(x = crp_time, y = log(crp_result), color = factor(ids))) +
  geom_line(alpha = 0.2) +
  geom_point(alpha = 0.3) +
  theme_minimal() +
  scale_color_manual(values = viridis::viridis(length(unique(crp$ids)))) +
  guides(color = "none") +
  xlab("Time (years)") +
  ylab("Log(CRP (mg/L))") +
  ggtitle("")

Figure 13: Spaghetti plot of CRP trajectories (preprocessed)

As should be expected given the ubiquitous nature of CRP testing, many CRP observations are earlier than diagnosis (Figure 13). Tests taken close to reported date of diagnosis are likely “early” as a result of diagnostic delay whereas tests from much earlier were likely requested due to other conditions.

As with FCAL, CRP results earlier than 90 days prior to diagnosis will be discarded. If a subject has a CRP within 90 days before diagnosis, then all of their CRP tests will be retimed such that their earliest CRP within this period is equal to 0 and all later measurements are shifted accordingly to maintain the same differences between measurement times.

Note

This may mean t_0 corresponds to slightly different dates across FCAL and CRP for the same subject.

We also reduce the dataset to only subjects with a diagnostic CRP which equates to subjects with a CRP measurement within 3 months of diagnosis t \leq 0.25

Code

crp <- subset(crp, crp_time >= -0.25)

Inclusion/exclusion: removal of subjects without a diagnostic CRP test

Code

diagnostic <- crp %>%
  group_by(ids) %>%
  summarise(n = sum(crp_time < 0.25)) %>%
  subset(n > 0)

crp <- subset(crp, ids %in% diagnostic$ids)

After this exclusion, 2406 remain in the data.

Code

diag.time <- c()
crp.ids <-  unique(crp$ids)

for (id in crp.ids) {
  temp <- subset(crp, ids == id)
  temp <- temp[order(temp$crp_time), ]
  diag.time <- c(diag.time, temp[1, "crp_time"])
}


crp.dist <- data.frame(ids = crp.ids, diagnostic = diag.time)

saveRDS(crp.dist, paste0(prefix, "processed/crp-diag-dist.RDS"))

Retiming - part 2

Here, we show the distribution of the diagnostic CRP measurements with respect to diagnosis date (in days).

Code

p <- crp %>%
  group_by(ids) %>%
  filter(crp_time == min(crp_time)) %>%
  ggplot(aes(x = crp_time * 365.25)) +
  geom_density(fill = "#415A77", color = "#1B263B") +
  theme_minimal() +
  labs(
    y = "Density",
    x = "Time from diagnosis to first CRP (days)"
  ) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red")

ggsave("plots/crp-diagnostic-dist.png",
  p,
  width = 16 * 2 / 3,
  height = 6,
  units = "in"
)

Similar to what was observed for FC, most diagnostic measurements were around the recorded diagnosis time (near zero). Here, we also retime measurements/diagnosis time as described above (i.e. time = 0 matches the time of first available CRP measurement). This is only applied to individuals for which the first available CRP is prior to the recorded diagnosis time.

Code

# Retime so that t_0 = 0.
for (id in unique(crp$ids)) {
  temp <- subset(crp, ids == id)
  if (any(temp$crp_time < 0)) {
    add <- sort(temp$crp_time)[1]
    crp[crp[, "ids"] == id, "crp_time"] <-
      crp[crp[, "ids"] == id, "crp_time"] + abs(add)
  }
}

length(unique(crp$ids))

[1] 2406

Code

crp %>%
  ggplot(aes(x = crp_time, y = log(crp_result), color = factor(ids))) +
  geom_line(alpha = 0.2) +
  geom_point(alpha = 0.3) +
  theme_minimal() +
  scale_color_manual(values = viridis::viridis(length(unique(crp$ids)))) +
  guides(color = "none") +
  xlab("Time (years)") +
  ylab("Log(CRP (mg/L))") +
  ggtitle("")

At this stage, we also exclude CRP measurements recorded beyond 7-years post diagnosis.

Code

crp <- subset(crp, crp_time <= 7)
length(unique(crp$ids))

[1] 2406

At this point, the data contains 10545 FCAL measurements across all 1036 individuals.

Additional pre-processing

Code

countsDF <- crp %>%
  group_by(ids) %>%
  summarise(
    n.total = n(),
    censored.left = sum(crp_result == 1),
    n.noncensored = n.total - censored.left,
    n.negtime.nondiag = sum(crp_time != 0 & crp_date - date.of.diag < 0),
    followup = max(crp_time)
  )

p1 <- countsDF %>%
  ggplot(aes(x = n.total)) +
  geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
  theme_minimal() +
  ylab("Count") +
  xlab("Measurements per subject")
p2 <- countsDF %>%
  ggplot(aes(x = followup)) +
  geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
  theme_minimal() +
  ylab("Count") +
  xlab("Follow-up (years)")
p3 <- countsDF %>%
  ggplot(aes(y = n.total, x = followup)) +
  geom_hex() +
  xlab("Follow-up (years since diagnosis)") +
  ylab("Number of CRP measurements") +
  theme_minimal() +
  labs(fill = "Count")

p1 + p2 + p3 + plot_annotation(tag_levels = "A")

Figure 14: (A) Histogram of the number of CRP measurements per subject. (B) Histogram of CRP follow-up per subject (before exclusions). (C) Scatterplot with number of CRP measurements vs follow-up.

Our observations are very similar to what we had for FCAL. However, as CRP measurement frequency is higher, we find a lower proportion of individuals with a low follow-up time.

As CRP is not specific to gastrointestinal inflammation, a variety of factors (e.g. infection) can affect the measurements. To focus on long-term trends, further processing was applied to smooth out short-term fluctuations. Measurements were grouped into intervals of t: [0, 0.5), [0.5, 1.5), [1.5, 2.5), [2.5, 3.5), [3.5, 4.5), [4.5, 5.5), [5.5, 7], where t = 0 (years) is the time of diagnosis. The median CRP for each interval was calculated for each subject and used as input for subsequent analyses. The centre of each interval was used as the corresponding observation time.

Code

id.list <- unique(crp$ids)
crp.ma <- matrix(NA, nrow = length(id.list), ncol = 7)

for (i in seq_along(id.list)) {
  subject_data <- subset(crp, ids == id.list[i])
  for (j in seq(0, 6)) {
    if (j == 6) {
      sub.obs <- subset(
        subject_data,
        crp_time >= j - 0.5 & crp_time <= j + 1
      )
    } else {
      sub.obs <- subset(
        subject_data,
        crp_time >= j - 0.5 & crp_time < j + 0.5
      )
    }
    if (nrow(sub.obs) > 0) {
      crp.ma[i, j + 1] <- median(sub.obs$crp_result)
    }
  }
}
rownames(crp.ma) <- id.list

# Convert back to a long-format dataset and rename columns
crp.ma <- reshape2::melt(t(crp.ma), id.vars = row.names(crp.ma), na.rm = TRUE)
colnames(crp.ma) <- c("crp_time", "ids", "crp_result")
crp.ma <- crp.ma[, c(2, 3, 1)]

# Reset the time to start at zero
crp.ma$crp_time <- crp.ma$crp_time - 1

# Take into account uneven spacing at start and end
crp.ma$crp_time <- plyr::mapvalues(crp.ma$crp_time,
  from = c(0, 6),
  to = c(0.25, 6.25)
)

# Add diagnosis type back
crp.ma <- merge(crp.ma, dict[, -3], by = "ids", all.x = TRUE, all.y = FALSE)

Inclusion/exclusion: a minimum number of CRP measurements

Note that our inclusion criteria requires at least 3 CRP measurements per individual. This is applied after the additional pre-processing applied in the previous section. Here, we re-calculate relevant summary statistics.

Code

countsDF <- crp.ma %>%
  group_by(ids) %>%
  summarise(
    n.total = n(),
    censored.left = sum(crp_result == 0), # 0 because it was log transformed
    n.noncensored = n.total - censored.left,
    followup = max(crp_time),
    crp.var = var(crp_result, na.rm = TRUE)
  )

p1 <- countsDF %>%
  ggplot(aes(x = n.total)) +
  geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
  theme_minimal() +
  ylab("Count") +
  xlab("Measurements per subject")
p2 <- countsDF %>%
  ggplot(aes(x = followup)) +
  geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
  theme_minimal() +
  ylab("Count") +
  xlab("Follow-up (years)")
p3 <- countsDF %>%
  ggplot(aes(y = n.total, x = followup)) +
  geom_hex() +
  xlab("Follow-up (years since diagnosis)") +
  ylab("Number of CRP measurements") +
  theme_minimal() +
  labs(fill = "Count")

p1 + p2 + p3 + plot_annotation(tag_levels = "A")

Figure 15: (A) Histogram of the number of CRP measurements per subject. (B) Histogram of CRP follow-up per subject (before exclusions). (C) Scatterplot with number of CRP measurements vs follow-up.

Code

# Number of individuals with at least 3 CRP measurements
sum(countsDF$n.total >= 3)

[1] 1939

Code

# Number of individuals with at least 3 CRP measurements and non zero variance
sum(countsDF$n.total >= 3 & countsDF$crp.var != 0)

[1] 1838

Code

# Number of individuals with at least 3 non-censored FCAL measurements
sum(countsDF$n.noncensored >= 3)

[1] 1939

The following is used to select individuals with at least 3 CRP measurements and non-zero variance.

Code

crp.ma.0 <- crp.ma %>%
  subset(ids %in% countsDF$ids[countsDF$n.total >= 3 & countsDF$crp.var == 0])
saveRDS(crp.ma.0, paste0(prefix, "processed/crp-ma-0.RDS"))

crp.ma <- crp.ma %>%
  subset(ids %in% countsDF$ids[countsDF$n.total >= 3 & countsDF$crp.var != 0])

length(unique(crp.ma$ids))

[1] 1838

This leaves a cohort with 2406 individuals.

Code

p1 <- countsDF %>%
  subset(n.total >= 3 & crp.var != 0) %>%
  ggplot(aes(x = n.total)) +
  geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
  theme_minimal() +
  ylab("Count") +
  xlab("Measurements per subject")
p2 <- countsDF %>%
  subset(n.total >= 3 & crp.var != 0) %>%
  ggplot(aes(x = followup)) +
  geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) +
  theme_minimal() +
  ylab("Count") +
  xlab("Follow-up (years)")
p3 <- countsDF %>%
  subset(n.total >= 3 & crp.var != 0) %>%
  ggplot(aes(y = n.total, x = followup)) +
  geom_point(color = "#FF4F79", size = 2) +
  xlab("Follow-up (years since diagnosis)") +
  ylab("Number of FCAL measurements") +
  theme_minimal() +
  geom_hline(yintercept = 3)

p1 + p2 + p3 + plot_annotation(tag_levels = "A")

Figure 16: (A) Histogram of the number of CRP measurements per subject. (B) Histogram of FCAL follow-up per subject (before exclusions). (C) Scatterplot with number of CRP measurements vs follow-up. In all cases, only individuals with at least 3 measurements and non-zero variance are included.

Additional pre-processing

Some subjects have the same median log-transformed CRP across all of their observations, effectively having a variance of 0. These subjects will be extracted and considered as their own cluster, as they may cause numerical issues when model fitting but remain of note.

Code

p1 <- crp.ma %>%
  ggplot(aes(x = crp_result)) +
  geom_density(
    linewidth = 0.8,
    alpha = 0.5,
    fill = "#EE92C2",
    color = "#7B5D6F"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("CRP (mg/L)") +
  ylab("Density")

p2 <- crp.ma %>%
  ggplot(aes(x = log(crp_result))) +
  geom_density(
    linewidth = 0.8,
    alpha = 0.5,
    fill = "#ADF1D2",
    color = "#50635A"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("log(CRP (mg/L))") +
  ylab("Density")

print(p1 + p2)

Code

ggsave("plots/crp-dist-ma.pdf", p1 + p2, width = 8, height = 4.5, units = "in")

There are 9898 total data points after the additional preprocessing describing 1838 subjects.

Code

crp.ma %>%
  select(ids) %>%
  table() %>%
  quantile(probs = c(0.25, 0.5, 0.75))

Table 3: Quantiles for number of CRP observations per subject after additional preprocessing.

25% 50% 75% 
  4   6   7

Revisiting month of diagnosis

After filtering by subjects who have a diagnostic CRP available, January is no longer over-represented for month of diagnosis. This is a comparable finding to FC (Figure 10).

Code

dict.temp <- subset(dict, ids %in% unique(fcal$ids))
dict.temp %>%
  ggplot(aes(x = as.factor(month(date.of.diag, label = TRUE)))) +
  geom_bar(color = "black", fill = "#FEC601", linewidth = 0.3) +
  theme_minimal() +
  ylab("Frequency") +
  xlab("Month of IBD Diagnosis")

Figure 17: Bar plot of month of diagnosis

Updated density plots

The following plots are updated versions of previous density plots (Figure 4, Figure 5, Figure 11 and Figure 12). These plots only include subjects who have met the inclusion criteria.

Code

p1 <- fcal %>%
  ggplot(aes(x = calpro_result)) +
  geom_density(
    linewidth = 0.8,
    alpha = 0.5,
    fill = "#2E86AB",
    color = "#07004D"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("FC (µg/g)") +
  ylab("Density")
print(p1)

Figure 18: Updated density plot of FC measurements by observed value

Code

p2 <- fcal %>%
  ggplot(aes(x = log(calpro_result))) +
  geom_density(
    linewidth = 0.8,
    alpha = 0.5,
    fill = "#F7A072",
    color = "#936136"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("log(FC (µg/g))") +
  ylab("Density")
print(p2)

ggsave("plots/fcal-dist.pdf", p1 + p2, width = 8, height = 4.5, units = "in")

Figure 19: Updated density plot of logged FC test results

Code

p3 <- crp %>%
  ggplot(aes(x = crp_result)) +
  geom_density(
    linewidth = 0.8,
    alpha = 0.5,
    fill = "#EE92C2",
    color = "#7B5D6F"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("CRP (mg/L)") +
  ylab("Density")
print(p3)

Figure 20: Updated density plot of CRP measurements by observed value

Code

p4 <- crp %>%
  ggplot(aes(x = log(crp_result))) +
  geom_density(
    linewidth = 0.8,
    alpha = 0.5,
    fill = "#ADF1D2",
    color = "#50635A"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("log(CRP (mg/L))") +
  ylab("Density")
print(p4)
ggsave("plots/crp-dist.pdf", p3 + p4, width = 8, height = 4.5, units = "in")

Figure 21: Updated density plot of logged CRP test results

Code

p5 <- crp.ma %>%
  ggplot(aes(x = crp_result)) +
  geom_density(
    linewidth = 0.8,
    alpha = 0.5,
    fill = "#EE92C2",
    color = "#7B5D6F"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("Processed CRP (mg/L)") +
  ylab("Density")
print(p5)

Figure 22: Updated density plot of CRP after additional preprocessing

Code

p6 <- crp.ma %>%
  ggplot(aes(x = log(crp_result))) +
  geom_density(
    linewidth = 0.8,
    alpha = 0.5,
    fill = "#ADF1D2",
    color = "#50635A"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("log(processed CRP (mg/L))") +
  ylab("Density")
print(p6)
ggsave("plots/crp-processed-dist.pdf",
  p5 + p6,
  width = 8,
  height = 4.5,
  units = "in"
)


p <- (p1 + p2) / (p5 + p6) +
  plot_annotation(tag_levels = "A") &
  theme(plot.tag = element_text(size = 18, face = "bold"))


ggsave("plots/Sup-Figure-1.pdf",
  p,
  width = 8,
  height = 8,
  units = "in"
)

Figure 23: Updated density plot of logged CRP after additional preprocessing

Code

# Number of CRP observations (unprocessed) per subject
crp %>%
  filter(ids %in% unique(crp.ma$ids)) %>%
  select(ids) %>%
  table() %>%
  quantile(probs = c(0.25, 0.5, 0.75))

25% 50% 75% 
 10  20  36

Code

crp %>%
  filter(ids %in% unique(crp.ma$ids)) %>%
  select(ids) %>%
  nrow()

[1] 49364

Code

# Number of CRP observations (unprocessed) per subject for the overlap cohort
crp %>%
  filter(ids %in% unique(crp.ma$ids) & ids %in% unique(fcal$ids)  ) %>%
  select(ids) %>%
  table() %>%
  quantile(probs = c(0.25, 0.5, 0.75))

25% 50% 75% 
 14  26  43

Data saving

The processed subject dictionary and FC and CRP datasets have been saved to files, so they can be used for the primary analysis.

Code

if (!dir.exists(paste0(prefix, "processed"))) {
  dir.create(paste0(prefix, "processed"))
}

fcal$calpro_result <- log(fcal$calpro_result)
crp.ma$crp_result <- log(crp.ma$crp_result)


saveRDS(dict, paste0(prefix, "processed/dict-initial.RDS"))
saveRDS(fcal, paste0(prefix, "processed/fcal.RDS"))
saveRDS(crp, paste0(prefix, "processed/crp.RDS"))
saveRDS(crp.ma, paste0(prefix, "processed/median-crp.RDS"))

Code

fcal %>%
  filter(ids %in% crp.ma$ids) %>%
  select(ids) %>%
  table() %>%
  quantile(probs = c(0.25, 0.5, 0.75))

25% 50% 75% 
  6   9  14

Code

crp.ma %>%
  filter(ids %in% fcal$ids) %>%
  select(ids) %>%
  table() %>%
  quantile(probs = c(0.25, 0.5, 0.75))

25% 50% 75% 
  5   6   7

Acknowledgments

This work is funded by the Medical Research Council & University of Edinburgh via a Precision Medicine PhD studentship (MR/N013166/1, to NC-C).

Author contributions

NC-C performed the processing and wrote the text. NP, ML, curated and provided datasets. KM-G, CWL, and CAV provided supervisory support and assisted with revisions.

References

Jenkinson, P W, N Plevris, S Siakavellas, M Lyons, I D Arnott, D Wilson, A J M Watson, G-R Jones, and C W Lees. 2020. “Temporal Trends in Surgical Resection Rates and Biologic Prescribing in Crohn’s Disease: A Population-Based Cohort Study.” Journal of Crohns and Colitis 14 (9): 1241–47. https://doi.org/10.1093/ecco-jcc/jjaa044.

Jones, Gareth-Rhys, Mathew Lyons, Nikolas Plevris, Philip W Jenkinson, Cathy Bisset, Christopher Burgess, Shahida Din, et al. 2019. “IBD Prevalence in Lothian, Scotland, Derived by Capture-Recapture Methodology.” Gut 68 (11): 1953–60. https://doi.org/10.1136/gutjnl-2019-318936.

Reuse

Licensed by CC BY unless otherwise stated.

MRC Human Genetics Unit logo

Session info

Code

pander(sessionInfo())

R version 4.4.2 (2024-10-31)

Platform: aarch64-apple-darwin20

locale: en_GB.UTF-8||en_GB.UTF-8||en_GB.UTF-8||C||en_GB.UTF-8||en_GB.UTF-8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: patchwork(v.1.3.0), datefixR(v.1.7.0), pander(v.0.6.5), knitr(v.1.49), lubridate(v.1.9.3), forcats(v.1.0.0), stringr(v.1.5.1), dplyr(v.1.1.4), purrr(v.1.0.2), readr(v.2.1.5), tidyr(v.1.3.1), tibble(v.3.2.1), ggplot2(v.3.5.1) and tidyverse(v.2.0.0)

loaded via a namespace (and not attached): viridis(v.0.6.5), utf8(v.1.2.4), generics(v.0.1.3), lattice(v.0.22-6), stringi(v.1.8.4), hms(v.1.1.3), digest(v.0.6.37), magrittr(v.2.0.3), evaluate(v.1.0.1), grid(v.4.4.2), timechange(v.0.3.0), fastmap(v.1.2.0), plyr(v.1.8.9), jsonlite(v.1.8.9), gridExtra(v.2.3), fansi(v.1.0.6), viridisLite(v.0.4.2), scales(v.1.3.0), textshaping(v.0.4.0), codetools(v.0.2-20), cli(v.3.6.3), rlang(v.1.1.4), munsell(v.0.5.1), withr(v.3.0.2), yaml(v.2.3.10), tools(v.4.4.2), reshape2(v.1.4.4), tzdb(v.0.4.0), colorspace(v.2.1-1), vctrs(v.0.6.5), R6(v.2.5.1), lifecycle(v.1.0.4), htmlwidgets(v.1.6.4), ragg(v.1.3.3), pkgconfig(v.2.0.3), hexbin(v.1.28.5), pillar(v.1.9.0), gtable(v.0.3.6), glue(v.1.8.0), Rcpp(v.1.0.13-1), systemfonts(v.1.1.0), xfun(v.0.49), tidyselect(v.1.2.1), farver(v.2.1.2), htmltools(v.0.5.8.1), rmarkdown(v.2.29), labeling(v.0.4.3) and compiler(v.4.4.2)

--- title: "Preprocessing" author: - name: "Nathan Constantine-Cooke" corresponding: true url: https://scholar.google.com/citations?user=2emHWR0AAAAJ&hl=en&oi=ao affiliations: - ref: CGEM - ref: HGU - name: "Karla Monterrubio-Gómez" url: https://scholar.google.com/citations?user=YmyxSXAAAAAJ&hl=en affiliations: - ref: HGU - ref: CGEM - name: "Catalina A. Vallejos" url: https://scholar.google.com/citations?user=lkdrwm0AAAAJ&hl=en&oi=ao affiliations: - ref: HGU #comments: # giscus: # repo: quarto-dev/quarto-docs editor_options: markdown: wrap: 72 --- ## Introduction ```{R Package load} #| message: false #| warning: false #| cache: false set.seed(123) ### Required packages library(tidyverse) # ggplot2, dplyr, and magrittr library(knitr) # Markdown utilities library(pander) # Pretty markdown rendering library(datefixR) # Standardising dates library(lubridate) # Date handling library(patchwork) # Arranging plots ``` This page details preprocessing steps undertaken prior to model fitting. These steps include examining data quality, performing data cleaning, and deriving the study cohort. The biomarker (FC and CRP) data used in this pipeline have been primarily obtained from TRAK, a system used by NHS Lothian for the electronic ordering of laboratory tests. However, we also use phenotyping data manually curated by the Lees group for previous studies (see @Jenkinson2020). Prior to this report, all subject CHI (community health index) numbers, which uniquely identifies a patient when they interact with NHS Scotland services, were pseudoanonymised. Each CHI has been replaced with a unique random number. ### Inclusion/exclusion criteria definitions {#inclusionexclusion-criteria-definitions} The inclusion criteria for this study can broadly be categorised into two classifications: baseline and longitudinal. Baseline criteria can be applied using information known at diagnosis whilst longitudinal criteria are based on biomarker measurements taken over time. The longitudinal criteria for each biomarker (FC and CRP) are considered independently. If a subject meets the below criteria for FC but not CRP, then only the FC for the subject will be modelled and vice versa. ::: panel-tabset #### Baseline ##### Diagnosis of Inflammatory bowel disease As inflammatory bowel disease (IBD) is the disease of interest for this study, subjects are required to have a confirmed diagnosis of either Crohn's disease, ulcerative colitis, or inflammatory bowel disease unclassified. ##### Diagnosis date Study subjects are required to have been diagnosed with IBD between October 2004 and December 2017. The lower bound is required as FC tests were not introduced until 2005 and we require study subjects to have a FC available close to diagnosis (see the longitudinal tab). The upper bound is required to ensure subjects have the opportunity to have at least five years of follow-up. The most recent FC observation in this dataset is January 2023 which informed when deciding upon this cut-off. #### Longitudinal ##### Diagnostic measurement We require study subjects to have a biomarker measurement taken ±90 days of reported date of diagnosis. ##### At least three non-censored measurements To ensure stability and interpretability, study subjects are also required to have at three non-censored biomarker measurements taken across follow-up. Like most biomarkers, the biomarkers used in this study are subject to censoring. However, these censored observations make it impossible to detect changes over time and we therefore do not wish to be reliant on censored observations. ::: ### Datasets ```{R Read files} if (file.exists("/.dockerenv")) { # Check if running in Docker # Assume igmm/Vallejo-predict/libdr/ is passed to the data volume prefix <- "data/" } else { # Assume running outside of a Docker container and the IGC(/IGMM) datastore is # mounted at /Volumes prefix <- "/Volumes/igmm/cvallejo-predicct/libdr/" } fcal.pheno <- read.csv(file.path(prefix, "2024-10-03/fcal-cleaned.csv")) # Extract from TRAK which also now introduces CRP. labs <- read.csv(file.path(prefix, "2024-10-03/markers-cleaned.csv")) fcal <- subset(labs, TEST == "f-Calprotectin-ALP") # Add sex and diagnosis type from fcal.pheno fcal <- fcal.pheno %>% distinct(ids, .keep_all = TRUE) %>% select(ids, sex, diagnosis, diagnosis_date) %>% merge(x = fcal, by = "ids", all.x = TRUE, all.y = FALSE) crp <- subset(labs, TEST == "C-Reactive Prot") updated <- read.csv(file.path(prefix, "2024-10-03/allPatientsNathanCleaned.csv")) outcomes <- read.csv(file.path(prefix, "2024-10-03/cd-cleaned.csv")) non.ibd <- read.csv(file.path (prefix, "2024-10-24/non-ibd.csv")) ``` Multiple datasets are used in this report with varying degrees of previous curation. - `markers.csv` is a 2024 extraction from TRAKcare describing `r nrow(labs)` test results (FC: `r nrow(fcal)`; CRP: `r nrow(crp)`). Each row represents a test result with columns indicating the corresponding subject ID, the test type (FC or CRP), the time of the test, and the test result. This dataset contains the most recent data out of the other datasets. However, there has been no manual curation of these data, and no information about the patient's characteristics are available (other than their ID). - `fcal-cleaned.csv` does not describe CRP data. However, these data have been curated by the Lees group. In addition to `r nrow(fcal)` FC test results, each row also provides the sex, IBD type, date of diagnosis, and date of death (if applicable) for the corresponding subject. - `allPatientsNathanCleaned.csv` provides sex, IBD type, and diagnosis date for most (but not all) subjects in `labs.csv` not described by `fcal-cleaned.csv`. - `cd-cleaned.csv` describes CD patient data (n = `r nrow(outcomes)`). In addition to the same basic patient characteristic variables in other datasets, this dataset also describes surgical outcomes and biologic treatments prescribed to the subject. - `non-ibd.csv` describes subjects who were found to not have IBD during the phenotyping process. This dataset is used to remove subjects who do not have IBD from the analysis. ::: callout-note The "cleaned" in the above file names refers to the CHI numbers having been replaced. No further processing of the data has been undertaken for this study prior to the steps outlined in this report. ::: ## Subject data We will create a "dictionary" which, for each subject, lists their subject ID, their IBD type, and their date of diagnosis. In addition to ensuring inclusion criteria are met, date of diagnosis will be used to retime all biomarker measurements of a subject such that $t=0$ corresponds to their diagnosis date. IBD types are used to test if clusters are enriched by IBD type and may be used to adjust the clustering. There are multiple potential sources for date of diagnosis and IBD type: the CD outcomes dataset, a demographics dataset maintained by the Lees group, and a raw TRAK extraction. For every subject listed in at least one of these datasets, we have tried to extract their date of diagnosis and diagnosis type. We have assumed a hierarchy of datasets where manually curated datasets are preferred for extracting data over a dataset generated with minimal human involvement. ### Diagnosis and diagnosis date ```{R Create subject dictionary} ids <- unique(c(outcomes$ids, fcal$ids, updated$ids)) diagnosis <- character() date.of.diag <- character() for (id in ids) { if (id %in% outcomes$ids) { # Outcomes data only contains CD subjects diagnosis <- c(diagnosis, "Crohn's Disease") date.of.diag <- c( date.of.diag, subset(outcomes, ids == id)$diagnosisDate ) } else if (id %in% updated$ids) { diagnosis <- c(diagnosis, subset(updated, ids == id)[1, "diagnosis"]) date.of.diag <- c( date.of.diag, subset(updated, ids == id)[1, "diagnosisDate"] ) } else if (id %in% fcal$ids) { diagnosis <- c(diagnosis, subset(fcal, ids == id)[1, "diagnosis"]) date.of.diag <- c( date.of.diag, subset(fcal, ids == id)[1, "diagnosis_date"] ) } else { diagnosis <- c(diagnosis, "Unknown") date.of.diag <- c( date.of.diag, "Unknown" ) } } dict <- data.frame( ids = ids, diagnosis = diagnosis, date.of.diag = fix_date_char(date.of.diag) ) rm(id, ids, diagnosis, date.of.diag) # clean up dict <- fix_date_df(dict, "date.of.diag") ``` Before filtering by inclusion/exclusion criteria, there are `r sum(is.na(dict$date.of.diag))` individuals for which a diagnostic date is missing. However, `r sum(is.na(dict$diagnosis))` missing values are observed for diagnosis (IBD type). As can be seen in the below table, two spellings of Crohn's disease (with and without an apostrophe (')) and three names for IBDU are used. There are also subjects reported to not have IBD. These will be removed, leaving `r sum(dict$diagnosis != "Not IBD") - nrow(non.ibd)` subjects included in the subsequent analyses. ```{R Frequency table to diagnosis pre-process} #| label: tbl-diag-pre #| tbl-cap: Reported IBD types kable(unique(dict$diagnosis), col.names = c("Diagnosis") ) dict <- dict %>% subset(diagnosis != "Not IBD") %>% subset(!(ids %in% non.ibd$ids)) ``` The Crohn's disease and IBDU names have been standardised (IBDU, Inflammatory Bowel Disease, Inflammatory Bowel Disease - Unknown Subtype are assumed to be equivalent). The subject reported to not have IBD has been removed. ```{R Contingency table of IBD diagnosis pre and post processing} #| label: tbl-diag-post #| tbl-cap: Contingency table showing mapping of IBD types to a standardised format #| results: "hold" dict$old <- dict$diagnosis dict$diagnosis <- plyr::mapvalues( dict$diagnosis, from = c( "Crohns Disease", "Inflamatory Bowel Disease", "Inflamatory Bowel Disease - Unknown Subtype" ), to = c( "Crohn's Disease", "IBDU", "IBDU" ) ) kable(addmargins(table(dict$diagnosis, dict$old), margin = 2)) dict$old <- NULL ``` ### Age and sex ```{r add sex to dict} # Merge with fcal to add sex information dict <- fcal.pheno[, c("ids", "sex")] %>% distinct(ids, .keep_all = TRUE ) %>% merge(x = dict, by = "ids", all.x = TRUE, all.y = FALSE) # Update NA sex if sex available from updated dict <- merge(dict, updated[, c("ids", "sex")], by = "ids", all.x = TRUE, all.y = FALSE ) for (i in 1:nrow(dict)) { if (is.na(dict[i, "sex.x"]) & !is.na(dict[i, "sex.y"])) { dict[i, "sex.x"] <- dict[i, "sex.y"] } } dict$sex <- dict$sex.x dict$sex.x <- dict$sex.y <- NULL # Add age at IBD diagnosis updated <- fix_date_df(updated, "diagnosisDate") updated$age <- with(updated, year(diagnosisDate) - dateOfBirth) dict <- merge(dict, updated[, c("ids", "age")], by = "ids", all.x = TRUE, all.y = FALSE ) ``` Note that there `r sum(is.na(dict$age))` individuals for which age at diagnosis is missing (this is due to missing diagnostic dates). However, there are `r sum(is.na(dict$sex))` individuals for which sex is missing. ### Date of death ```{r add date.of.death} dict <- merge(dict, updated[, c("ids", "death")], by = "ids", all.x = TRUE, all.y = FALSE ) %>% rename(date.of.death = death) dict <- fix_date_df(dict, "date.of.death") ``` Date of death information is available for `r sum(!is.na(dict$date.of.death))` individuals. We assume that all remaining individuals have not died by the date of data extraction. ## Exclusion criteria according to diagnosis and date of diagnosis Date of diagnosis in an integral aspect of the inclusion criteria and also used to retime biomarker measurements. Firstly, we remove `r sum(is.na(dict$date.of.diag))` individuals with a missing date of diagnosis. ```{R Remove NA date of diagnosis} dict <- dict[!is.na(dict$date.of.diag), ] ``` This leaves `r nrow(dict)` for subsequent analyses. ```{R Apply date of diagnosis exclusion} # no. subjects over upper bound max_fup <- max(labs$COLLECTION_DATE) max_fup <- as.Date(max_fup) n.upper <- nrow(subset(dict, max_fup - date.of.diag < 5 * 365.25)) # no. subjects under lower bound n.lower <- nrow(subset(dict, (year(date.of.diag) < 2004) | (year(date.of.diag) == 2004 & month(date.of.diag) <= 9))) # subset to subjects meeting the criteria. dict <- subset(dict, max_fup - date.of.diag >= 5 * 365.25) dict <- subset(dict, (year(date.of.diag) >= 2005) | (year(date.of.diag) == 2004 & month(date.of.diag) >= 10)) ``` As described in the [inclusion/exclusion criteria section](#inclusionexclusion-criteria-definitions), the inclusion criteria considers upper and lower bounds for the diagnostic dates. In this case, we only consider subjects diagnosed after October 2004 and 13 August 2019 (this ensures individuals can be followed up by at least 5 year by the time of the last longitudinal measument obtained for the cohort; `r max_fup`). As such, subjects who do not meet this criteria will be excluded. `r n.upper` subjects were diagnosed on 13 August 2019 or later and were removed, and `r n.lower` subjects were diagnosed prior to October 2004 and were likewise removed. This results in a cohort size of `r nrow(dict)` subjects. ```{R Clean up excluded date of diagnosis counts} #| include: false rm(n.upper, n.lower) # clean up ``` ## Exploration  We now explore date of diagnosis to ensure data quality is at a suitable standard. ##### Year of diagnosis From @fig-diag-year, we can see the number of IBD diagnoses each year is relatively static with the exception of 2004 and 2005. The former describes only three months of the year, and incidence was likely still increasing across both of these two years. @Jones2019 found IBD incidence from 2008 onwards to be consistent over time for patients diagnosed by NHS Lothian which is in agreement with our findings. ```{R Year of diagnosis} #| label: fig-diag-year #| fig-cap: "Distribution of year of diagnosis" dict %>% ggplot(aes(x = year(date.of.diag))) + geom_histogram(fill = "#B8F2E6", color = "black", binwidth = 1) + theme_minimal() + xlab("Year of IBD diagnosis") + ylab("Frequency") ``` ##### Month of diagnosis It appears there are subjects for whom only year of diagnosis was available. This has resulted in the 1<sup>st</sup> of January for that year being recorded as the date of diagnosis for that subject. As such, far more diagnoses are reported in January than in other months (@fig-diag-month). There are `r sum(day(dict$date.of.diag) == 1 & month(dict$date.of.diag) == 1)` subjects reported to have been diagnosed on New Year's day . If a subject's exact date of diagnosis is not known, then this is most likely because the subject was not diagnosed by NHS Lothian. Instead, the date of diagnosis would have needed to be recalled by the subject which introduces the inaccuracy observed. As subjects are require a diagnostic biomarker measurement to be available within 90 days of reported date of diagnosis to be included in this study and patients diagnosed outside of NHS Lothian will not have a biomarker test result available within this period, the effect on our study should be minimal. However, we will revisit month of diagnosis after filtering by existence of diagnostic biomarker measurements to confirm our assumption. ```{R Year of diagnosis (preprocessed)} #| label: fig-diag-year-processed #| fig-cap: "Bar plot of year of diagnosis" dict %>% ggplot(aes(x = as.factor(year(date.of.diag)))) + geom_bar(color = "black", fill = "#FEC601", linewidth = 0.3) + theme_minimal() + ylab("Frequency") + xlab("Year of IBD Diagnosis") ``` ```{R Month of diagnosis (preprocessed)} #| label: fig-diag-month #| fig-cap: "Bar plot of month of diagnosis" dict %>% ggplot(aes(x = as.factor(month(date.of.diag, label = TRUE)))) + geom_bar(color = "black", fill = "#FEC601", linewidth = 0.3) + theme_minimal() + ylab("Frequency") + xlab("Month of IBD Diagnosis") ``` ```{r diagnostic trend} diag.ts <- dict %>% mutate( date.of.diag.month = as.Date(paste0(format(date.of.diag, "%Y-%m"), "-01"), format = "%Y-%m-%d") ) %>% group_by(date.of.diag.month) %>% summarise(count = n()) %>% ungroup() ggplot(diag.ts, aes(x = as.Date(date.of.diag.month, "%Y-%m"), y = count, group = 1)) + geom_line() + xlab("Diagnosis date") + ylab("Number of diagnosis") + scale_x_date(date_labels = "%Y-%m") + theme_minimal() ``` ## Faecal calprotectin Faecal calprotectin (FC), a marker of intestinal inflammation, has been obtained from two datasets. The first dataset has been curated by members of the Lees group whilst the second dataset is a direct extract from TRAK, a patient monitoring system used by NHS Lothian. ### Incorporating later extract Data from these datasets has been merged and duplicates of data (same subject ID, measurement date, and recorded value) were removed. We also reduced the FC dataset to only describe subjects for whom IBD type and date of diagnosis are available for. For the TRAK extract, times for test results are given in datetime format (where both date and time of the day are provided). The times have been dropped as this degree of granularity is not required. Duplicate tests (same id, date and test value) are removed here. ```{R FCAL merge} fcal <- fcal[, c( "ids", "COLLECTION_DATE", "TEST_DATA", "sex", "diagnosis", "diagnosis_date" )] # Subset to only include those that passed the earlier inclusion/exclusion fcal <- subset(fcal, ids %in% dict$ids) # Collection dates include collection times which are not required. Discarding. fcal$COLLECTION_DATE <- readr::parse_date( stringr::str_split_fixed(fcal$COLLECTION_DATE, " ", n = 2)[, 1], format = "%Y-%m-%d" ) colnames(fcal)[1:3] <- c("ids", "calpro_date", "calpro_result") fcal <- subset(fcal, ids %in% dict$ids) fcal <- fcal %>% select(-diagnosis) fcal <- merge(fcal, dict[, c("ids", "diagnosis", "date.of.diag")], by = "ids", all.x = TRUE ) ``` ### Pre-processing of test results ```{R FCAL censor mapping} #| warning: false ## Some values cannot be directly coerced as numeric fcal$calpro_result <- plyr::mapvalues( fcal$calpro_result, from = c( "<20", "<25", "<50", ">1250", ">2500", ">3000", ">6000" ), to = c( "20", "25", "50", "1250", "2500", "3000", "6000" ) ) # Here, values that cannot be converted into a numeric value will be excluded # (they are converted to NA) fcal$calpro_result <- as.numeric(fcal$calpro_result) # Remove error codes fcal <- fcal[!is.na(fcal[, "calpro_result"]), ] # Apply limits of detection fcal[fcal[, "calpro_result"] < 20, "calpro_result"] <- 20 fcal[fcal[, "calpro_result"] > 1250, "calpro_result"] <- 1250 ``` FC data can be both left and right censored. FC recorded as "\<20" were mapped to "20". FC recorded as "\>1250", "\>2500", or "\>6000" were all mapped to "1250". Any FC tests given an error code (e.g. marked as an insufficient sample by the laboratory) have been removed. ::: {.callout-note collapse="true"} #### More information on FC censoring The lower limit of detection is $<20 \mu g/g$ ($20 \mu g$ of calprotectin per $g$ of stool). By reducing the upper limit, it is possible to run more tests in parallel. As a higher throughput has been required over time, the upper threshold for tests has become lower. Initially test results over $6000 \mu g/g$ were censored, then $2500 \mu g/g$ and now $1250 \mu g /g$ is the upper limit for FC tests. This change has resulted in minimal impact in clinics as $1250 \mu g/g$ is still considered a high result. However, this change has potential implications for research. ::: ```{r fig-fcal-meas} #| label: "fig-fcal-meas" #| fig-cap: "Density plot of FCAL measurements by observed value" #| warning: false fcal %>% ggplot(aes(x = calpro_result)) + geom_density( linewidth = 0.8, alpha = 0.5, fill = "#9FD8CB", color = "#517664" ) + theme_minimal() + theme(axis.text.y = element_blank()) + xlab("FCAL (µg/g)") + ylab("Density") + geom_vline(xintercept = 250, colour = "red") ``` FC test results on the original measurement scale are heavily right-skewed towards $100 \mu g/g$. These data will be log transformed- resulting in the multi-modal distribution seen in @fig-logfcal-meas. ```{R} #| label: "fig-logfcal-meas" #| fig-cap: "Density plot of logged FCAL test results" fcal %>% ggplot(aes(x = log(calpro_result))) + geom_density( linewidth = 0.8, alpha = 0.5, fill = "#9FD8CB", color = "#517664" ) + theme_minimal() + theme(axis.text.y = element_blank()) + xlab("log(FCAL (µg/g))") + ylab("Density") + geom_vline(xintercept = log(250), colour = "red") ``` ### Removal of duplicate FCAL measurements ```{R} duplicated.ids <- c() for (id in unique(fcal$ids)) { sub.fcal <- subset(fcal, ids == id) # Get FC data for a subject sub.fcal <- sub.fcal[order(sub.fcal$calpro_date), ] # Order by dates time.diff <- diff(sub.fcal$calpro_date) # Find time between ordered dates value.diff <- diff(sub.fcal$calpro_result) # Find difference in observed values # If two measurements are within 10 days of each other and have the same value if (any(time.diff <= 10 & value.diff == 0)) { duplicated.ids <- c(duplicated.ids, id) # Remove suspected duplicates (taking into account difference is lagged) sub.fcal <- sub.fcal[c(TRUE, !(time.diff <= 10 & value.diff == 0)), ] # Remove data for subject with duplicates fcal <- subset(fcal, ids != id) # Add non duplicated data back fcal <- rbind(fcal, sub.fcal) } } ``` Given FC was retrospectively collected from observational data, it is possible duplicate tests have been recorded in the extract. As it is NHS Lothian policy to not test FC days apart, we can assume that any tests with the same observation within a small time frame are likely to be duplicates. This may occur when collection dates and testing dates have been used interchangeably. If a subsequent FC test for a subject was dated within 10 days of a previous test and has the same observed value, this observation was dropped. ```{R} htmltools::p(paste( "There are", length(duplicated.ids), "subjects who appear to have duplicated FC measurements" )) ``` ### Retiming - part 1 Time of FC measurements were retimed and scaled to be the number of years since IBD diagnosis. ```{R FCAL time mapping} #| label: fig-fcal-spag-pre #| fig-cap: "Spaghetti plot of FC trajectories (preprocessed)" # Dates have already been converted to Date class by fix_date_char() for dict fcal$calpro_time <- as.numeric(fcal$calpro_date - fcal$date.of.diag) / 365.25 fcal %>% ggplot(aes(x = calpro_time, y = log(calpro_result), color = factor(ids))) + geom_line(alpha = 0.2) + geom_point(alpha = 0.6) + theme_minimal() + scale_color_manual(values = viridis::viridis(length(unique(fcal$ids)))) + guides(color = "none") + xlab("Time (years)") + ylab("Log(FCAL (µg/g))") + ggtitle("") ``` After retiming, it is clear some FC observations are earlier than diagnosis and in some cases substantially earlier (@fig-fcal-spag-pre). Tests taken close to the reported date of diagnosis are likely "early" as a result of diagnostic delay whereas tests from much earlier were likely requested due to other conditions. As such, FC results earlier than 90 days prior to diagnosis will be discarded. If a subject has an FC within 90 days before diagnosis, then all of their FC tests will be retimed such that their earliest FC within this period is equal to 0 and all later measurements are shifted accordingly to maintain the same differences between measurement times. ```{r remove prediag fcal} fcal <- subset(fcal, calpro_time >= -0.25) ``` ### Inclusion/exclusion: removal of subjects without a diagnostic FC test As indicated in our inclusion/exclusion criteria, we reduce the dataset to only subjects with a diagnostic FC. This equates to subjects with at least one FCAL measurement within 3 months of diagnosis $t \leq 0.25$. ```{R FCAL retiming} diagnostic <- fcal %>% group_by(ids) %>% summarise(n = sum(calpro_time < 0.25)) %>% subset(n > 0) fcal <- subset(fcal, ids %in% diagnostic$ids) ``` After this exclusion, `r length(unique(fcal$ids))` subjects remain in the data. ```{R} ##The following code is used to save the cleaned data generated so far. diag.time <- c() fc.ids <- unique(fcal$ids) for (id in fc.ids) { temp <- subset(fcal, ids == id) temp <- temp[order(temp$calpro_time), ] diag.time <- c(diag.time, temp[1, "calpro_time"]) } fc.dist <- data.frame(ids = fc.ids, diagnostic = diag.time) if (!dir.exists(paste0(prefix, "processed"))) { dir.create(paste0(prefix, "processed")) } saveRDS(fc.dist, paste0(prefix, "processed/fc-diag-dist.RDS")) ``` ### Retiming - part 2 Here, we show the distribution of the diagnostic FCAL measurements with respect to diagnosis date (in days). ```{r fcal days before diag} p <- fcal %>% group_by(ids) %>% filter(calpro_time == min(calpro_time)) %>% ggplot(aes(x = calpro_time * 365.25)) + geom_density(fill = "#20A39E", color = "#187370") + theme_minimal() + labs( y = "Density", x = "Time from diagnosis to first faecal calprotectin (days)" ) + geom_vline(xintercept = 0, linetype = "dashed", color = "red") ggsave("plots/fc-diagnostic-dist.png", p, width = 16 * 2 / 3, height = 6, units = "in" ) ``` Although most diagnostic measurements were observed around the recorded diagnosis time (near zero), this is not always the case. As such, we decided to retime measurements/diagnosis time as described above (i.e. time = 0 matches the time of first available FCAL measurement). This is only applied to individuals for which the first available CRP is prior to the recorded diagnosis time. ```{R} #| label: fig-fcal-spag-post #| fig-cap: "Spaghetti plot of FC trajectories (processed)" # Retime so that t_0 = 0. for (id in unique(fcal$ids)) { temp <- subset(fcal, ids == id) if (any(temp$calpro_time < 0)) { add <- sort(temp$calpro_time)[1] fcal[fcal[, "ids"] == id, "calpro_time"] <- fcal[fcal[, "ids"] == id, "calpro_time"] + abs(add) } } length(unique(fcal$ids)) fcal %>% ggplot(aes(x = calpro_time, y = log(calpro_result), color = factor(ids))) + geom_line(alpha = 0.2) + geom_point(alpha = 0.6) + theme_minimal() + scale_color_manual(values = viridis::viridis(length(unique(fcal$ids)))) + guides(color = "none") + xlab("Time (years)") + ylab("Log(FCAL (µg/g))") + ggtitle("") ``` At this stage, we also exclude FCAL measurements recorded beyond 7 years post diagnosis. ```{r remove fcal after 7 years} fcal <- subset(fcal, calpro_time <= 7) length(unique(fcal$ids)) ``` At this point, the data contains `r nrow(fcal)` FCAL measurements across all `r length(unique(fcal$ids))` individuals. ### Inclusion/exclusion: a minimum number of FCAL measurements Here, we summarise the number of observations per individual (with and without considering censored values) as well as length of follow-up. The latter is defined as the difference (in years) between diagnosis time and the time of the latest FCAL measurement. ```{r} countsDF <- fcal %>% group_by(ids) %>% summarise( n.total = n(), censored.left = sum(calpro_result == 20), censored.right = sum(calpro_result == 1250), n.noncensored = n.total - censored.left - censored.right, n.negtime.nondiag = sum(calpro_time != 0 & calpro_date - date.of.diag < 0), followup = max(calpro_time) ) ``` ```{R fcal followup pre exclusions} #| label: fig-fcal-follow-up #| fig-cap: "(A) Histogram of the number of FCAL measurements per subject. (B) Histogram of FCAL follow-up per subject (before exclusions). (C) Scatterplot with number of FCAL measurements vs follow-up." #| fig-width: 12 #| column: body-outset p1 <- countsDF %>% ggplot(aes(x = n.total)) + geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) + theme_minimal() + ylab("Count") + xlab("Measurements per subject") p2 <- countsDF %>% ggplot(aes(x = followup)) + geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) + theme_minimal() + ylab("Count") + xlab("Follow-up (years)") p3 <- countsDF %>% ggplot(aes(y = n.total, x = followup)) + geom_hex() + xlab("Follow-up (years since diagnosis)") + ylab("Number of FC measurements") + theme_minimal() + labs(fill = "Count") p1 + p2 + p3 + plot_annotation(tag_levels = "A") ``` Summary observations: - The number of FC measurements per subject is skewed towards low values. However, there are also subjects with many FC observations. These subjects were investigated and found to often have a complex disease course, such as acute severe ulcerative colitis, and required close monitoring as a result. - We do not directly require a minimum follow-up for a subject as this may bias our findings. For example, UC patients who undergo a proctocolectomy (surgical removal of the colon and rectum) are less likely to have FC measurements after their surgery than patients who did not require surgery. By applying criteria based on follow-up, we could indirectly exclude subjects based on disease outcomes. - However, the requirement of at least 3 FCAL measurements (within 7 years from diagnosis) will indirectly remove individuals with a very short follow-up time. Note that our inclusion criteria requires at least 3 non-censored FCAL measurements per individual. As such, censoring needs to be taken into account before applying filtering out individuals. ```{r} p1 <- countsDF %>% ggplot(aes(x = n.noncensored)) + geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) + theme_minimal() + ylab("Count") + xlab("Number of non-censored measurements per subject") p2 <- countsDF %>% ggplot(aes(y = n.noncensored, x = followup)) + geom_point(color = "#FF4F79", size = 2) + xlab("Follow-up (years since diagnosis)") + ylab("Number of non-censored FCAL measurements") + theme_minimal() + geom_hline(yintercept = 3) p1 + p2 + plot_annotation(tag_levels = "A") ``` ```{R calculate number of non censored FCAL} # Number of individuals with at least 3 FCAL measurements sum(countsDF$n.total >= 3) # Number of individuals with at least 3 non-censored FCAL measurements sum(countsDF$n.noncensored >= 3) # Number of individuals with at least 3 non-censored FCAL measurements after # discarding non-diagnostic values with negative calpro_time sum(countsDF$n.noncensored - countsDF$n.negtime.nondiag >= 3) ``` The following is used to select individuals with at least 3 non-censored FCAL measurements. ```{R FCAL frequency} fcal <- fcal %>% subset(ids %in% countsDF$ids[countsDF$n.noncensored >= 3]) length(unique(fcal$ids)) ``` This leaves a cohort with `r length(unique(fcal$ids))` individuals. ```{r fcal followup post exclusions} #| label: fig-fcal-follow-up-post #| fig-cap: "(A) Histogram of the number of FCAL measurements per subject. (B) Histogram of FCAL follow-up per subject (before exclusions). (C) Scatterplot with number of FCAL measurements vs follow-up. In all cases, only individuals with at least 3 non-censored measurements are included." #| fig-width: 12 #| column: body-outset p1 <- countsDF %>% subset(n.noncensored >= 3) %>% ggplot(aes(x = n.total)) + geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) + theme_minimal() + ylab("Count") + xlab("Measurements per subject") p2 <- countsDF %>% subset(n.noncensored >= 3) %>% ggplot(aes(x = followup)) + geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) + theme_minimal() + ylab("Count") + xlab("Follow-up (years)") p3 <- countsDF %>% subset(n.noncensored >= 3) %>% ggplot(aes(y = n.total, x = followup)) + geom_hex() + xlab("Follow-up (years since diagnosis)") + ylab("Number of FC measurements") + theme_minimal() + labs(fill = "Count") p1 + p2 + p3 + plot_annotation(tag_levels = "A") ```  ```{R} fcal %>% select(ids) %>% table() %>% quantile(probs = c(0.25, 0.5, 0.75)) ``` ### Revisiting month of diagnosis After filtering by subjects who have a diagnostic FC available, January is no longer over-represented for month of diagnosis as we hypothesised (@fig-diag-month-redux) ```{R Month of diagnosis (postprocessed)} #| label: fig-diag-month-redux #| fig-cap: "Bar plot of month of diagnosis" dict.temp <- subset(dict, ids %in% unique(fcal$ids)) dict.temp %>% ggplot(aes(x = as.factor(month(date.of.diag, label = TRUE)))) + geom_bar(color = "black", fill = "#FEC601", linewidth = 0.3) + theme_minimal() + ylab("Frequency") + xlab("Month of IBD Diagnosis") ``` ## C-reactive protein ```{R crp preprocess} crp <- crp[, c("ids", "COLLECTION_DATE", "TEST_DATA")] %>% subset(ids %in% dict$ids) # Collection dates include collection times which are not required. Discarding. crp$COLLECTION_DATE <- readr::parse_date( stringr::str_split_fixed(crp$COLLECTION_DATE, " ", n = 2)[, 1], format = "%Y-%m-%d" ) colnames(crp) <- c("ids", "crp_date", "crp_result") crp <- merge(crp, dict[, c("ids", "diagnosis")], by = "ids", all.x = TRUE ) ``` C-reactive protein (CRP) is a blood-based marker of inflammation. Unlike FCAL, which is highly specific to gastrointestinal inflammation, CRP levels are indicative of overall inflammation and are more likely to be affected by extra-intestinal factors such as infections. However, patient adherence to CRP testing is very strong, and CRP has seen historic usage in the field of IBD. Although CRP has been used for many years, we unfortunately only have test results from as early as 2008 due to the transition towards using TRAK for storing CRP results instead of the previous system. As there is only one extract for CRP data (obtained from TRAK), merging multiple datasets is not required. As with the FCAL data, the recorded time of day for each test has been discarded in favour of the date. Any potential duplicate results have been removed. ### Pre-processing of test results ```{R crp censor mapping} #| warning: false crp$crp_result <- as.numeric( plyr::mapvalues( crp$crp_result, from = c( "<0.2", "<1", "<1.", "<1.0", "<2", "<3", "<3.0", "<5", ">90", ">320" ), to = c(1, 1, 1, 1, 2, 3, 3, 5, 90, 320) ) ) # Lower-bound censoring crp <- crp %>% mutate(crp_result = if_else(crp_result < 1, 1, crp_result)) # Remove error test results crp <- crp[!is.na(crp[, "crp_result"]), ] # Map numerical <1 test results (e.g 0.2) to 1. ``` CRP is left-censored, but does not appear to be right-censored. Multiple cut-offs are reported (`r unique(as.numeric(gsub("<", "", names(table(crp$crp_result))[2:8])))`) with there also being variations for how $< 1mg/L$ is reported (`r names(table(crp$crp_result))[3:5]`). We map every observation $< 1$ to 1 and map \<2, \<3, and \< 5 to 2, 3, and 5 respectively. Any failed tests (e.g due to contamination or an insufficient sample being provided) are discarded. As such, $1mg/L$ is used as a left censoring value for CRP. As can be seen in @fig-crp-meas, extremely elevated CRP has been observed in some cases. These observations are physiologically valid. However, these data may skew our models and may be candidates for removal. ```{R} #| label: "fig-crp-meas" #| fig-cap: "Density plot of FC measurements by observed value" crp %>% ggplot(aes(x = crp_result)) + geom_density( linewidth = 0.8, alpha = 0.5, fill = "#FFC800", color = "#FF8427" ) + theme_minimal() + theme(axis.text.y = element_blank()) + xlab("CRP (mg/L)") + ylab("Density") ``` As with the FC data, the CRP data is also log transformed (@fig-logcrp-meas). ```{R} #| label: "fig-logcrp-meas" #| fig-cap: "Density plot of logged CRP test results" crp %>% ggplot(aes(x = log(crp_result))) + geom_density( linewidth = 0.8, alpha = 0.5, fill = "#FFC800", color = "#FF8427" ) + theme_minimal() + theme(axis.text.y = element_blank()) + xlab("log(CRP (mg/L))") + ylab("Density") ``` Observing @fig-logcrp-meas, it would appear the influence of extremely elevated CRP results will be minimal. ### Removal of duplicates ```{R} crp <- crp %>% distinct(ids, crp_date, crp_result, .keep_all = TRUE ) ``` As multiple CRP measurements are possible within a short time-frame, we only consider as duplicates those with an exact match by `ids`, `crp_date` and `crp_result`. ### Retiming - part 1 As with the FCAL data, the times of CRP measurements are retimed and scaled to be the number of years since IBD diagnosis. ```{R CRP time mapping} #| label: fig-crp-spag-pre #| fig-cap: "Spaghetti plot of CRP trajectories (preprocessed)" crp <- merge(crp, dict[, c("ids", "date.of.diag")], by = "ids", all.x = TRUE, all.y = FALSE ) # Dates have already been converted to Date class by fix_date_char() for dict crp$crp_time <- as.numeric(crp$crp_date - crp$date.of.diag) / 365.25 crp %>% ggplot(aes(x = crp_time, y = log(crp_result), color = factor(ids))) + geom_line(alpha = 0.2) + geom_point(alpha = 0.3) + theme_minimal() + scale_color_manual(values = viridis::viridis(length(unique(crp$ids)))) + guides(color = "none") + xlab("Time (years)") + ylab("Log(CRP (mg/L))") + ggtitle("") ``` As should be expected given the ubiquitous nature of CRP testing, many CRP observations are earlier than diagnosis (@fig-crp-spag-pre). Tests taken close to reported date of diagnosis are likely "early" as a result of diagnostic delay whereas tests from much earlier were likely requested due to other conditions. As with FCAL, CRP results earlier than 90 days prior to diagnosis will be discarded. If a subject has a CRP within 90 days before diagnosis, then all of their CRP tests will be retimed such that their earliest CRP within this period is equal to 0 and all later measurements are shifted accordingly to maintain the same differences between measurement times. ::: callout-note This may mean $t_0$ corresponds to slightly different dates across FCAL and CRP for the same subject. ::: We also reduce the dataset to only subjects with a diagnostic CRP which equates to subjects with a CRP measurement within 3 months of diagnosis $t \leq 0.25$ ```{r} crp <- subset(crp, crp_time >= -0.25) ``` ### Inclusion/exclusion: removal of subjects without a diagnostic CRP test As indicated in our inclusion/exclusion criteria, we reduce the dataset to only subjects with a diagnostic FC. This equates to subjects with at least one FCAL measurement within 3 months of diagnosis $t \leq 0.25$. ```{R CRP retiming} diagnostic <- crp %>% group_by(ids) %>% summarise(n = sum(crp_time < 0.25)) %>% subset(n > 0) crp <- subset(crp, ids %in% diagnostic$ids) ``` After this exclusion, `r length(unique(crp$ids))` remain in the data. ```{R} diag.time <- c() crp.ids <- unique(crp$ids) for (id in crp.ids) { temp <- subset(crp, ids == id) temp <- temp[order(temp$crp_time), ] diag.time <- c(diag.time, temp[1, "crp_time"]) } crp.dist <- data.frame(ids = crp.ids, diagnostic = diag.time) saveRDS(crp.dist, paste0(prefix, "processed/crp-diag-dist.RDS")) ``` ### Retiming - part 2 Here, we show the distribution of the diagnostic CRP measurements with respect to diagnosis date (in days). ```{r crp days before diag} p <- crp %>% group_by(ids) %>% filter(crp_time == min(crp_time)) %>% ggplot(aes(x = crp_time * 365.25)) + geom_density(fill = "#415A77", color = "#1B263B") + theme_minimal() + labs( y = "Density", x = "Time from diagnosis to first CRP (days)" ) + geom_vline(xintercept = 0, linetype = "dashed", color = "red") ggsave("plots/crp-diagnostic-dist.png", p, width = 16 * 2 / 3, height = 6, units = "in" ) ``` Similar to what was observed for FC, most diagnostic measurements were around the recorded diagnosis time (near zero). Here, we also retime measurements/diagnosis time as described above (i.e. time = 0 matches the time of first available CRP measurement). This is only applied to individuals for which the first available CRP is prior to the recorded diagnosis time. ```{R} # Retime so that t_0 = 0. for (id in unique(crp$ids)) { temp <- subset(crp, ids == id) if (any(temp$crp_time < 0)) { add <- sort(temp$crp_time)[1] crp[crp[, "ids"] == id, "crp_time"] <- crp[crp[, "ids"] == id, "crp_time"] + abs(add) } } length(unique(crp$ids)) crp %>% ggplot(aes(x = crp_time, y = log(crp_result), color = factor(ids))) + geom_line(alpha = 0.2) + geom_point(alpha = 0.3) + theme_minimal() + scale_color_manual(values = viridis::viridis(length(unique(crp$ids)))) + guides(color = "none") + xlab("Time (years)") + ylab("Log(CRP (mg/L))") + ggtitle("") ``` At this stage, we also exclude CRP measurements recorded beyond 7-years post diagnosis. ```{r remove crp after 7 years} crp <- subset(crp, crp_time <= 7) length(unique(crp$ids)) ``` At this point, the data contains `r nrow(fcal)` FCAL measurements across all `r length(unique(fcal$ids))` individuals. ### Additional pre-processing Here, we summarise the number of observations per individual (with and without considering censored values) as well as length of follow-up. The latter is defined as the difference (in years) between diagnosis time and the time of the latest CRP measurement. ```{R crp followup pre exclusions} #| label: fig-crp-follow-up #| fig-cap: "(A) Histogram of the number of CRP measurements per subject. (B) Histogram of CRP follow-up per subject (before exclusions). (C) Scatterplot with number of CRP measurements vs follow-up." #| fig-width: 12 #| column: body-outset countsDF <- crp %>% group_by(ids) %>% summarise( n.total = n(), censored.left = sum(crp_result == 1), n.noncensored = n.total - censored.left, n.negtime.nondiag = sum(crp_time != 0 & crp_date - date.of.diag < 0), followup = max(crp_time) ) p1 <- countsDF %>% ggplot(aes(x = n.total)) + geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) + theme_minimal() + ylab("Count") + xlab("Measurements per subject") p2 <- countsDF %>% ggplot(aes(x = followup)) + geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) + theme_minimal() + ylab("Count") + xlab("Follow-up (years)") p3 <- countsDF %>% ggplot(aes(y = n.total, x = followup)) + geom_hex() + xlab("Follow-up (years since diagnosis)") + ylab("Number of CRP measurements") + theme_minimal() + labs(fill = "Count") p1 + p2 + p3 + plot_annotation(tag_levels = "A") ``` Our observations are very similar to what we had for FCAL. However, as CRP measurement frequency is higher, we find a lower proportion of individuals with a low follow-up time. As CRP is not specific to gastrointestinal inflammation, a variety of factors (e.g. infection) can affect the measurements. To focus on long-term trends, further processing was applied to smooth out short-term fluctuations. Measurements were grouped into intervals of t: [0, 0.5), [0.5, 1.5), [1.5, 2.5), [2.5, 3.5), [3.5, 4.5), [4.5, 5.5), [5.5, 7], where t = 0 (years) is the time of diagnosis. The median CRP for each interval was calculated for each subject and used as input for subsequent analyses. The centre of each interval was used as the corresponding observation time. ```{r crp additional preproc} id.list <- unique(crp$ids) crp.ma <- matrix(NA, nrow = length(id.list), ncol = 7) for (i in seq_along(id.list)) { subject_data <- subset(crp, ids == id.list[i]) for (j in seq(0, 6)) { if (j == 6) { sub.obs <- subset( subject_data, crp_time >= j - 0.5 & crp_time <= j + 1 ) } else { sub.obs <- subset( subject_data, crp_time >= j - 0.5 & crp_time < j + 0.5 ) } if (nrow(sub.obs) > 0) { crp.ma[i, j + 1] <- median(sub.obs$crp_result) } } } rownames(crp.ma) <- id.list # Convert back to a long-format dataset and rename columns crp.ma <- reshape2::melt(t(crp.ma), id.vars = row.names(crp.ma), na.rm = TRUE) colnames(crp.ma) <- c("crp_time", "ids", "crp_result") crp.ma <- crp.ma[, c(2, 3, 1)] # Reset the time to start at zero crp.ma$crp_time <- crp.ma$crp_time - 1 # Take into account uneven spacing at start and end crp.ma$crp_time <- plyr::mapvalues(crp.ma$crp_time, from = c(0, 6), to = c(0.25, 6.25) ) # Add diagnosis type back crp.ma <- merge(crp.ma, dict[, -3], by = "ids", all.x = TRUE, all.y = FALSE) ``` ### Inclusion/exclusion: a minimum number of CRP measurements Note that our inclusion criteria requires at least 3 CRP measurements per individual. This is applied after the additional pre-processing applied in the previous section. Here, we re-calculate relevant summary statistics. ```{R crp followup pre exclusions post processed} #| label: fig-crp-follow-up-postproc #| fig-cap: "(A) Histogram of the number of CRP measurements per subject. (B) Histogram of CRP follow-up per subject (before exclusions). (C) Scatterplot with number of CRP measurements vs follow-up." #| fig-width: 12 #| column: body-outset countsDF <- crp.ma %>% group_by(ids) %>% summarise( n.total = n(), censored.left = sum(crp_result == 0), # 0 because it was log transformed n.noncensored = n.total - censored.left, followup = max(crp_time), crp.var = var(crp_result, na.rm = TRUE) ) p1 <- countsDF %>% ggplot(aes(x = n.total)) + geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) + theme_minimal() + ylab("Count") + xlab("Measurements per subject") p2 <- countsDF %>% ggplot(aes(x = followup)) + geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) + theme_minimal() + ylab("Count") + xlab("Follow-up (years)") p3 <- countsDF %>% ggplot(aes(y = n.total, x = followup)) + geom_hex() + xlab("Follow-up (years since diagnosis)") + ylab("Number of CRP measurements") + theme_minimal() + labs(fill = "Count") p1 + p2 + p3 + plot_annotation(tag_levels = "A") ``` ```{R calculate number of non censored CRP} # Number of individuals with at least 3 CRP measurements sum(countsDF$n.total >= 3) # Number of individuals with at least 3 CRP measurements and non zero variance sum(countsDF$n.total >= 3 & countsDF$crp.var != 0) # Number of individuals with at least 3 non-censored FCAL measurements sum(countsDF$n.noncensored >= 3) ``` The following is used to select individuals with at least 3 CRP measurements and non-zero variance. ```{R CRP frequency} crp.ma.0 <- crp.ma %>% subset(ids %in% countsDF$ids[countsDF$n.total >= 3 & countsDF$crp.var == 0]) saveRDS(crp.ma.0, paste0(prefix, "processed/crp-ma-0.RDS")) crp.ma <- crp.ma %>% subset(ids %in% countsDF$ids[countsDF$n.total >= 3 & countsDF$crp.var != 0]) length(unique(crp.ma$ids)) ``` This leaves a cohort with `r length(unique(crp$ids))` individuals.  ```{r crp followup post exclusions} #| label: fig-crp-follow-up-post-exclusions #| fig-cap: "(A) Histogram of the number of CRP measurements per subject. (B) Histogram of FCAL follow-up per subject (before exclusions). (C) Scatterplot with number of CRP measurements vs follow-up. In all cases, only individuals with at least 3 measurements and non-zero variance are included." p1 <- countsDF %>% subset(n.total >= 3 & crp.var != 0) %>% ggplot(aes(x = n.total)) + geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) + theme_minimal() + ylab("Count") + xlab("Measurements per subject") p2 <- countsDF %>% subset(n.total >= 3 & crp.var != 0) %>% ggplot(aes(x = followup)) + geom_histogram(color = "black", fill = "#5BBA6F", binwidth = 1) + theme_minimal() + ylab("Count") + xlab("Follow-up (years)") p3 <- countsDF %>% subset(n.total >= 3 & crp.var != 0) %>% ggplot(aes(y = n.total, x = followup)) + geom_point(color = "#FF4F79", size = 2) + xlab("Follow-up (years since diagnosis)") + ylab("Number of FCAL measurements") + theme_minimal() + geom_hline(yintercept = 3) p1 + p2 + p3 + plot_annotation(tag_levels = "A") ``` ### Additional pre-processing Some subjects have the same median log-transformed CRP across all of their observations, effectively having a variance of 0. These subjects will be extracted and considered as their own cluster, as they may cause numerical issues when model fitting but remain of note. ```{R} p1 <- crp.ma %>% ggplot(aes(x = crp_result)) + geom_density( linewidth = 0.8, alpha = 0.5, fill = "#EE92C2", color = "#7B5D6F" ) + theme_minimal() + theme(axis.text.y = element_blank()) + xlab("CRP (mg/L)") + ylab("Density") p2 <- crp.ma %>% ggplot(aes(x = log(crp_result))) + geom_density( linewidth = 0.8, alpha = 0.5, fill = "#ADF1D2", color = "#50635A" ) + theme_minimal() + theme(axis.text.y = element_blank()) + xlab("log(CRP (mg/L))") + ylab("Density") print(p1 + p2) ggsave("plots/crp-dist-ma.pdf", p1 + p2, width = 8, height = 4.5, units = "in") ``` There are `r nrow(crp.ma)` total data points after the additional preprocessing describing `r length(unique(crp.ma$ids))` subjects. ```{R} #| label: tbl-crp-ma-freq #| tbl-cap: "Quantiles for number of CRP observations per subject after additional preprocessing." crp.ma %>% select(ids) %>% table() %>% quantile(probs = c(0.25, 0.5, 0.75)) ``` ### Revisiting month of diagnosis After filtering by subjects who have a diagnostic CRP available, January is no longer over-represented for month of diagnosis. This is a comparable finding to FC (@fig-diag-month-redux). ```{R Month of diagnosis (postprocessed)} #| label: fig-diag-month-redux-crp #| fig-cap: "Bar plot of month of diagnosis" dict.temp <- subset(dict, ids %in% unique(fcal$ids)) dict.temp %>% ggplot(aes(x = as.factor(month(date.of.diag, label = TRUE)))) + geom_bar(color = "black", fill = "#FEC601", linewidth = 0.3) + theme_minimal() + ylab("Frequency") + xlab("Month of IBD Diagnosis") ``` ## Updated density plots The following plots are updated versions of previous density plots (@fig-fcal-meas, @fig-logfcal-meas, @fig-crp-meas and @fig-logcrp-meas). These plots only include subjects who have met the inclusion criteria. ```{R Updated FCAL distribution} #| label: "fig-fcal-meas-update" #| fig-cap: "Updated density plot of FC measurements by observed value" #| warning: false p1 <- fcal %>% ggplot(aes(x = calpro_result)) + geom_density( linewidth = 0.8, alpha = 0.5, fill = "#2E86AB", color = "#07004D" ) + theme_minimal() + theme(axis.text.y = element_blank()) + xlab("FC (µg/g)") + ylab("Density") print(p1) ``` ```{R} #| label: "fig-logfcal-meas-update" #| fig-cap: "Updated density plot of logged FC test results" p2 <- fcal %>% ggplot(aes(x = log(calpro_result))) + geom_density( linewidth = 0.8, alpha = 0.5, fill = "#F7A072", color = "#936136" ) + theme_minimal() + theme(axis.text.y = element_blank()) + xlab("log(FC (µg/g))") + ylab("Density") print(p2) ggsave("plots/fcal-dist.pdf", p1 + p2, width = 8, height = 4.5, units = "in") ``` ```{R Updated CRP distribution} #| label: "fig-crp-meas-update" #| fig-cap: "Updated density plot of CRP measurements by observed value" #| warning: false p3 <- crp %>% ggplot(aes(x = crp_result)) + geom_density( linewidth = 0.8, alpha = 0.5, fill = "#EE92C2", color = "#7B5D6F" ) + theme_minimal() + theme(axis.text.y = element_blank()) + xlab("CRP (mg/L)") + ylab("Density") print(p3) ``` ```{R} #| label: "fig-logcrp-meas-update" #| fig-cap: "Updated density plot of logged CRP test results" p4 <- crp %>% ggplot(aes(x = log(crp_result))) + geom_density( linewidth = 0.8, alpha = 0.5, fill = "#ADF1D2", color = "#50635A" ) + theme_minimal() + theme(axis.text.y = element_blank()) + xlab("log(CRP (mg/L))") + ylab("Density") print(p4) ggsave("plots/crp-dist.pdf", p3 + p4, width = 8, height = 4.5, units = "in") ``` ```{R Updated CRP distribution} #| label: "fig-crp-meas-processed" #| fig-cap: "Updated density plot of CRP after additional preprocessing" #| warning: false p5 <- crp.ma %>% ggplot(aes(x = crp_result)) + geom_density( linewidth = 0.8, alpha = 0.5, fill = "#EE92C2", color = "#7B5D6F" ) + theme_minimal() + theme(axis.text.y = element_blank()) + xlab("Processed CRP (mg/L)") + ylab("Density") print(p5) ``` ```{R} #| label: "fig-logcrp-meas-processed" #| fig-cap: "Updated density plot of logged CRP after additional preprocessing" p6 <- crp.ma %>% ggplot(aes(x = log(crp_result))) + geom_density( linewidth = 0.8, alpha = 0.5, fill = "#ADF1D2", color = "#50635A" ) + theme_minimal() + theme(axis.text.y = element_blank()) + xlab("log(processed CRP (mg/L))") + ylab("Density") print(p6) ggsave("plots/crp-processed-dist.pdf", p5 + p6, width = 8, height = 4.5, units = "in" ) p <- (p1 + p2) / (p5 + p6) + plot_annotation(tag_levels = "A") & theme(plot.tag = element_text(size = 18, face = "bold")) ggsave("plots/Sup-Figure-1.pdf", p, width = 8, height = 8, units = "in" ) ``` ```{R} # Number of CRP observations (unprocessed) per subject crp %>% filter(ids %in% unique(crp.ma$ids)) %>% select(ids) %>% table() %>% quantile(probs = c(0.25, 0.5, 0.75)) ``` ```{R} crp %>% filter(ids %in% unique(crp.ma$ids)) %>% select(ids) %>% nrow() ``` ```{R} # Number of CRP observations (unprocessed) per subject for the overlap cohort crp %>% filter(ids %in% unique(crp.ma$ids) & ids %in% unique(fcal$ids) ) %>% select(ids) %>% table() %>% quantile(probs = c(0.25, 0.5, 0.75)) ``` ## Data saving The processed subject dictionary and FC and CRP datasets have been saved to files, so they can be used for the primary analysis. ```{R} if (!dir.exists(paste0(prefix, "processed"))) { dir.create(paste0(prefix, "processed")) } fcal$calpro_result <- log(fcal$calpro_result) crp.ma$crp_result <- log(crp.ma$crp_result) saveRDS(dict, paste0(prefix, "processed/dict-initial.RDS")) saveRDS(fcal, paste0(prefix, "processed/fcal.RDS")) saveRDS(crp, paste0(prefix, "processed/crp.RDS")) saveRDS(crp.ma, paste0(prefix, "processed/median-crp.RDS")) ``` ```{R} fcal %>% filter(ids %in% crp.ma$ids) %>% select(ids) %>% table() %>% quantile(probs = c(0.25, 0.5, 0.75)) ``` ```{R} crp.ma %>% filter(ids %in% fcal$ids) %>% select(ids) %>% table() %>% quantile(probs = c(0.25, 0.5, 0.75)) ``` ## Acknowledgments {.appendix} This work is funded by the Medical Research Council & University of Edinburgh via a Precision Medicine PhD studentship (MR/N013166/1, to **NC-C**). ## Author contributions {.appendix} **NC-C** performed the processing and wrote the text. **NP**, **ML**, curated and provided datasets. **KM-G**, **CWL**, and **CAV** provided supervisory support and assisted with revisions. ## References {.appendix} ::: {#refs} ::: ## Reuse {.appendix} Licensed by <a href="https://creativecommons.org/licenses/by/4.0/">CC BY</a> unless otherwise stated. ## {.appendix} ::: center <img src="../../images/MRC_HGU_Edinburgh RGB.png" alt="MRC Human Genetics Unit logo" class="center" height="50px"/> <img src="../../images/cgem-logo.png" alt="Centre for Genomic & Experimental Medicine logo" height="50px"/> ::: ------------------------------------------------------------------------ ## Session info {.appendix} ```{R Session info} pander(sessionInfo()) ```