- Improving Risk Stratification for Patients With Type 2 Myocardial Infarction. Journal of the American College of Cardiology, 2023
- Latent Crohn’s Disease Subgroups are Identified by Longitudinal Faecal Calprotectin Profiles. medRxiv, Aug 2022
Background: High faecal calprotectin is associated with poor outcomes in Crohn’s disease. Monitoring of faecal calprotectin trajectories could characterise disease progression before severe complications occur. Aims: We undertook an unbiased assessment of a retrospective incident Crohn’s disease cohort to assess inter-individual variability in faecal calprotectin levels over time. We aimed to explore whether latent classes of such profiles are associated with a composite endpoint consisting of surgery, hospitalisation, or Montreal behaviour progression, and with other clinical information. Methods: Latent class mixed models were used to model faecal calprotectin trajectories within five years of diagnosis. The Akaike information criterion, Bayesian information criterion, alluvial plots, and class-specific trajectories were used to decide the optimal number of classes. Log-rank tests of Kaplan-Meier estimators were used to test for associations between class membership and outcomes. Results: Our study cohort comprised 365 subjects and 2856 faecal calprotectin measurements (median 7 per subject). Four latent classes were found, broadly described as one class with consistently high faecal calprotectin and three classes characterised by downward calprotectin trends. Class membership was significantly associated with the composite endpoint and, separately, with hospitalisation and Montreal disease progression, but not surgery. Early biologic therapy was strongly associated with class membership. Conclusions: Our analysis provides a novel stratification approach for Crohn’s disease patients based on faecal calprotectin trajectories. Characterising this heterogeneity helps to better understand different patterns of disease progression and to identify those with a higher risk of worse outcomes. Ultimately, this information will assist the design of more targeted interventions.
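The class-versus-outcome comparisons above rely on the standard two-sample log-rank test of Kaplan-Meier estimators. As a self-contained sketch of that test — a textbook construction in Python, not the authors' own code, with illustrative variable names:

```python
import math

def logrank_test(times1, events1, times2, events2):
    """Two-sample log-rank test (standard textbook construction).

    times*: follow-up times; events*: 1 if the event occurred, 0 if censored.
    Returns the chi-square statistic (1 df) and its p-value.
    """
    event_times = sorted({t for t, e in zip(times1 + times2, events1 + events2) if e})
    obs1 = exp1 = var = 0.0
    for t in event_times:
        n1 = sum(1 for x in times1 if x >= t)               # at risk, group 1
        n2 = sum(1 for x in times2 if x >= t)               # at risk, group 2
        d1 = sum(1 for x, e in zip(times1, events1) if x == t and e)
        d2 = sum(1 for x, e in zip(times2, events2) if x == t and e)
        n, d = n1 + n2, d1 + d2
        if n < 2:
            continue                                        # variance undefined
        obs1 += d1
        exp1 += d * n1 / n
        var += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    chi2 = (obs1 - exp1) ** 2 / var
    p = math.erfc(math.sqrt(chi2 / 2))                      # chi-square(1) tail
    return chi2, p
```

In practice one would use an established survival package; the point here is only that the statistic compares observed versus expected event counts per group across pooled event times.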
- A review on competing risks methods for survival analysis. arXiv, Dec 2022
When modelling competing risks survival data, several techniques have been proposed in both the statistical and machine learning literature. State-of-the-art methods have extended classical approaches with more flexible assumptions that can improve predictive performance, allow high dimensional data and missing values, among others. Despite this, modern approaches have not been widely employed in applied settings. This article aims to aid the uptake of such methods by providing a condensed compendium of competing risks survival methods with a unified notation and interpretation across approaches. We highlight available software and, when possible, demonstrate their usage via reproducible R vignettes. Moreover, we discuss two major concerns that can affect benchmark studies in this context: the choice of performance metrics and reproducibility.
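Among the classical estimators such a review covers is the cumulative incidence function via the Aalen-Johansen approach. A minimal Python sketch, assuming status codes 0 = censored and 1, 2, … = failure causes (a simplified illustration, not drawn from the article's vignettes):

```python
def cumulative_incidence(times, status, cause, horizon):
    """Aalen-Johansen cumulative incidence for one competing cause.

    status: 0 = censored, otherwise the cause (1, 2, ...) of failure.
    Returns the estimated probability of failing from `cause` by `horizon`.
    """
    data = sorted(zip(times, status))
    n = len(data)
    surv = 1.0      # overall Kaplan-Meier survival just before the current time
    cif = 0.0
    i = 0
    while i < n and data[i][0] <= horizon:
        t = data[i][0]
        n_t = sum(1 for tt, _ in data if tt >= t)            # at risk at t
        d_cause = sum(1 for tt, s in data if tt == t and s == cause)
        d_all = sum(1 for tt, s in data if tt == t and s != 0)
        cif += surv * d_cause / n_t
        surv *= 1 - d_all / n_t
        while i < n and data[i][0] == t:                     # skip ties at t
            i += 1
    return cif

# with no censoring and every subject failing, the cause-specific
# incidences sum to one at the last event time
c1 = cumulative_incidence([1, 2, 3, 4], [1, 2, 1, 2], cause=1, horizon=4)
c2 = cumulative_incidence([1, 2, 3, 4], [1, 2, 1, 2], cause=2, horizon=4)
```

Naively applying one-minus-Kaplan-Meier per cause would overestimate each incidence; weighting by the overall survival probability is what makes the competing-risks estimator consistent.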
- scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution. Genome Biology, Dec 2021
High-throughput single-cell measurements of DNA methylomes can quantify methylation heterogeneity and uncover its role in gene regulation. However, technical limitations and sparse coverage can preclude this task. scMET is a hierarchical Bayesian model which overcomes sparsity, sharing information across cells and genomic features to robustly quantify genuine biological heterogeneity. scMET can identify highly variable features that drive epigenetic heterogeneity, and perform differential methylation and variability analyses. We illustrate how scMET facilitates the characterization of epigenetically distinct cell populations and how it enables the formulation of novel hypotheses on the epigenetic regulation of gene expression. scMET is available at https://github.com/andreaskapou/scMET.
- DNA Methylation scores augment 10-year risk prediction of diabetes. medRxiv, Dec 2021
Type 2 diabetes mellitus (T2D) is one of the most prevalent diseases in the world and presents a major health and economic burden, a notable proportion of which could be alleviated with improved early prediction and intervention. While standard risk factors including age, obesity, and hypertension have shown good predictive performance, we show that the use of CpG DNA methylation information leads to a significant improvement in the prediction of 10-year T2D incidence risk. Whilst previous studies have been largely constrained by linear assumptions and the use of CpGs one at a time, we have adopted a more flexible approach based on a range of linear and tree-ensemble models for classification and time-to-event prediction. Using the Generation Scotland cohort (n=9,537) our best performing model (Area Under the Curve (AUC)=0.880, Precision Recall AUC (PRAUC)=0.539, McFadden’s R2=0.316) used a LASSO Cox proportional-hazards predictor and showed notable improvement in onset prediction, above and beyond standard risk factors (AUC=0.860, PRAUC=0.444, R2=0.261). Replication of the main finding was observed in an external test dataset (the German-based KORA study, p=3.7×10⁻⁴). Tree-ensemble methods provided comparable performance and future improvements to these models are discussed. Finally, we introduce MethylPipeR, an R package with accompanying user interface, for systematic and reproducible development of complex trait and incident disease predictors. While MethylPipeR was applied to incident T2D prediction with DNA methylation in our experiments, the package is designed for generalised development of predictive models and is applicable to a wide range of omics data and target traits.
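The best-performing predictor above is a LASSO-penalised Cox model. The authors' pipeline is the R package MethylPipeR, but the underlying idea can be sketched with a toy proximal-gradient (ISTA) fit of the L1-penalised Breslow partial likelihood. All names, data, and tuning values below are illustrative, not from the paper:

```python
import math, random

def lasso_cox(times, events, X, lam=0.1, lr=0.01, iters=400):
    """Toy L1-penalised Cox fit via proximal gradient (ISTA).

    Uses the Breslow partial likelihood without tie handling; a sketch of
    the idea behind penalised Cox predictors, not a production fit.
    """
    n, p = len(times), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        for i in range(n):
            if not events[i]:
                continue
            risk = [j for j in range(n) if times[j] >= times[i]]
            w = [math.exp(sum(b * x for b, x in zip(beta, X[j]))) for j in risk]
            tot = sum(w)
            for k in range(p):
                xbar = sum(wj * X[j][k] for wj, j in zip(w, risk)) / tot
                grad[k] += xbar - X[i][k]   # gradient of neg log partial lik.
        for k in range(p):                  # gradient step + soft-thresholding
            b = beta[k] - lr * grad[k]
            beta[k] = math.copysign(max(abs(b) - lr * lam, 0.0), b)
    return beta

# simulated data: feature 0 triples the hazard, feature 1 is pure noise
rng = random.Random(42)
times, events, X = [], [], []
for i in range(60):
    x0 = float(i % 2)
    times.append(rng.expovariate(3.0 if x0 else 1.0))
    events.append(1)
    X.append([x0, rng.uniform(-1, 1)])
beta = lasso_cox(times, events, X)
```

For real methylation-scale data one would use an optimised solver (e.g. glmnet in R); the soft-thresholding step is what produces the sparse coefficient vector that makes LASSO Cox tractable with hundreds of thousands of CpGs.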
- Model updating after interventions paradoxically introduces bias. arXiv, Dec 2021
Machine learning is increasingly being used to generate prediction models for use in a number of real-world settings, from credit risk assessment to clinical decision support. Recent discussions have highlighted potential problems in the updating of a predictive score for a binary outcome when an existing predictive score forms part of the standard workflow, driving interventions. In this setting, the existing score induces an additional causative pathway which leads to miscalibration when the original score is replaced. We propose a general causal framework to describe and address this problem, and demonstrate an equivalent formulation as a partially observed Markov decision process. We use this model to demonstrate the impact of such ‘naive updating’ when performed repeatedly. Namely, we show that successive predictive scores may converge to a point where they predict their own effect, or may eventually tend toward a stable oscillation between two values, and we argue that neither outcome is desirable. Furthermore, we demonstrate that even if model-fitting procedures improve, actual performance may worsen. We complement these findings with a discussion of several potential routes to overcome these issues.
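The oscillation described above can be reproduced in a deliberately simplified scalar simulation: a score triggers an intervention whenever it exceeds a threshold, and each "naive" refit sets the new score to the event rate observed under the previous score. Numbers and names here are illustrative, not taken from the paper:

```python
def naive_updates(base_rate, effect, threshold, steps):
    """Scalar caricature of repeated 'naive' model updating.

    A score above `threshold` triggers an intervention that multiplies the
    event rate by `effect`; each refit sets the score to the event rate
    observed under the previous score, ignoring the intervention.
    """
    score = base_rate                 # initial fit on pre-intervention data
    history = [score]
    for _ in range(steps):
        observed = base_rate * effect if score > threshold else base_rate
        score = observed              # the 'naive' update
        history.append(score)
    return history

h = naive_updates(base_rate=0.5, effect=0.4, threshold=0.3, steps=6)
# with these numbers the score never settles: it alternates between 0.5
# (intervention off, so the full risk is observed) and 0.2 (intervention on)
```

The high score suppresses the outcome it predicts, the refit then learns the suppressed rate, the intervention switches off, and the cycle repeats — the stable oscillation the paper warns about.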
- Single-nucleus RNA-seq2 reveals functional crosstalk between liver zonation and ploidy. Nature Communications, Dec 2021
Single-cell RNA-seq reveals the role of pathogenic cell populations in development and progression of chronic diseases. In order to expand our knowledge on cellular heterogeneity, we have developed a single-nucleus RNA-seq2 method tailored for the comprehensive analysis of the nuclear transcriptome from frozen tissues, allowing the dissection of all cell types present in the liver, regardless of cell size or cellular fragility. We use this approach to characterize the transcriptional profile of individual hepatocytes with different levels of ploidy, and have discovered that ploidy states are associated with different metabolic potential, and gene expression in tetraploid mononucleated hepatocytes is conditioned by their position within the hepatic lobule. Our work reveals a remarkable crosstalk between gene dosage and spatial distribution of hepatocytes.
- Development and assessment of a machine learning tool for predicting emergency admission in Scotland. medRxiv, Dec 2021
Avoiding emergency hospital admission (EA) is advantageous to individual health and the healthcare system. We develop a statistical model estimating risk of EA for most of the Scottish population (> 4.8M individuals) using electronic health records, such as hospital episodes and prescribing activity. We demonstrate good predictive accuracy (AUROC 0.80), calibration and temporal stability. We find strong prediction of respiratory and metabolic EA, show a substantial risk contribution from socioeconomic decile, and highlight an important problem in model updating. Our work constitutes a rare example of a population-scale machine learning score to be deployed in a healthcare setting. Analysis code is available at https://github.com/jamesliley/SPARRAv4.
- SCRaPL: hierarchical Bayesian modelling of associations in single cell multi-omics data. bioRxiv, Dec 2021
Single-cell multi-omics assays offer unprecedented opportunities to explore gene regulation at cellular level. However, high levels of technical noise and data sparsity frequently lead to a lack of statistical power in correlative analyses, identifying very few, if any, significant associations between different molecular layers. Here we propose SCRaPL, a novel computational tool that increases power by carefully modelling noise in the experimental systems. We show on real and simulated multi-omics single-cell data sets that SCRaPL achieves higher sensitivity and better robustness in identifying correlations, while maintaining a similar level of false positives as standard analyses based on Pearson correlation.
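The motivating problem — technical noise attenuating correlations between molecular layers — can be shown with a small simulation. This is a toy demonstration of attenuation, not the SCRaPL model itself, and all names are illustrative:

```python
import math, random

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

rng = random.Random(1)
n = 2000
latent = [rng.gauss(0, 1) for _ in range(n)]        # shared biological signal
# two noisy 'layers' measuring the same signal, with heavy technical noise
layer1 = [z + rng.gauss(0, 2) for z in latent]
layer2 = [z + rng.gauss(0, 2) for z in latent]
r = pearson(layer1, layer2)
# the latent correlation is 1, but with noise variance 4 per layer the
# observable correlation is attenuated towards 1 / (1 + 4) = 0.2
```

A noise-aware hierarchical model recovers power precisely because it estimates the latent correlation rather than the attenuated observed one.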
- Eleven grand challenges in single-cell data science. Genome Biology, Dec 2020
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands—or even millions—of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
- High-Sensitivity Cardiac Troponin and the Universal Definition of Myocardial Infarction. Circulation, Dec 2020
Background: The introduction of more sensitive cardiac troponin assays has led to increased recognition of myocardial injury in acute illnesses other than acute coronary syndrome. The Universal Definition of Myocardial Infarction recommends high-sensitivity cardiac troponin testing and classification of patients with myocardial injury based on pathogenesis, but the clinical implications of implementing this guideline are not well understood. Methods: In a stepped-wedge cluster randomized, controlled trial, we implemented a high-sensitivity cardiac troponin assay and the recommendations of the Universal Definition in 48 282 consecutive patients with suspected acute coronary syndrome. In a prespecified secondary analysis, we compared the primary outcome of myocardial infarction or cardiovascular death and secondary outcome of noncardiovascular death at 1 year across diagnostic categories. Results: Implementation increased the diagnosis of type 1 myocardial infarction by 11% (510/4471), type 2 myocardial infarction by 22% (205/916), and acute and chronic myocardial injury by 36% (443/1233) and 43% (389/898), respectively. Compared with those without myocardial injury, the rate of the primary outcome was highest in those with type 1 myocardial infarction (cause-specific hazard ratio [HR] 5.64 [95% CI, 5.12–6.22]), but was similar across diagnostic categories, whereas noncardiovascular deaths were highest in those with acute myocardial injury (cause specific HR 2.65 [95% CI, 2.33–3.01]). Despite modest increases in antiplatelet therapy and coronary revascularization after implementation in patients with type 1 myocardial infarction, the primary outcome was unchanged (cause specific HR 1.00 [95% CI, 0.82–1.21]). Increased recognition of type 2 myocardial infarction and myocardial injury did not lead to changes in investigation, treatment or outcomes. 
Conclusions: Implementation of high-sensitivity cardiac troponin assays and the recommendations of the Universal Definition of Myocardial Infarction identified patients at high risk of cardiovascular and noncardiovascular events but was not associated with consistent increases in treatment or improved outcomes. Trials of secondary prevention are urgently required to determine whether this risk is modifiable in patients without type 1 myocardial infarction.
- Correcting the Mean-Variance Dependency for Differential Variability Testing Using Single-Cell RNA Sequencing Data. Cell Systems, Dec 2018
Cell-to-cell transcriptional variability in otherwise homogeneous cell populations plays an important role in tissue function and development. Single-cell RNA sequencing can characterize this variability in a transcriptome-wide manner. However, technical variation and the confounding between variability and mean expression estimates hinder meaningful comparison of expression variability between cell populations. To address this problem, we introduce an analysis approach that extends the BASiCS statistical framework to derive a residual measure of variability that is not confounded by mean expression. This includes a robust procedure for quantifying technical noise in experiments where technical spike-in molecules are not available. We illustrate how our method provides biological insight into the dynamics of cell-to-cell expression variability, highlighting a synchronization of biosynthetic machinery components in immune cells upon activation. In contrast to the uniform up-regulation of the biosynthetic machinery, CD4+ T cells show heterogeneous up-regulation of immune-related and lineage-defining genes during activation and differentiation.
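The key idea — a measure of variability not confounded by mean expression — can be caricatured with simple moments: fit the log-variance versus log-mean trend across genes and keep the residuals. BASiCS does this within a full Bayesian hierarchical model; the sketch below is only a rough analogue with illustrative toy data:

```python
import math

def residual_variability(counts):
    """Residual of log-variance after a linear fit on log-mean.

    A moment-based caricature of mean-corrected variability: genes with a
    large residual are more variable than their mean alone predicts.
    `counts` is a list of per-gene expression vectors across cells.
    """
    logm, logv = [], []
    for gene in counts:
        n = len(gene)
        m = sum(gene) / n
        v = sum((c - m) ** 2 for c in gene) / (n - 1)
        logm.append(math.log(m))
        logv.append(math.log(v))
    N = len(logm)                               # ordinary least squares fit
    mx, my = sum(logm) / N, sum(logv) / N
    slope = (sum((x - mx) * (y - my) for x, y in zip(logm, logv))
             / sum((x - mx) ** 2 for x in logm))
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(logm, logv)]

genes = [[1, 2, 3], [10, 20, 30], [5, 5, 6], [100, 110, 90]]  # toy data
res = residual_variability(genes)
```

Comparing raw variances across populations conflates variability with mean shifts; comparing residuals removes the trend, which is the spirit of the residual measure the paper introduces.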
- High-sensitivity troponin in the evaluation of patients with suspected acute coronary syndrome: a stepped-wedge, cluster-randomised controlled trial. The Lancet, Dec 2018
Background: High-sensitivity cardiac troponin assays permit use of lower thresholds for the diagnosis of myocardial infarction, but whether this improves clinical outcomes is unknown. We aimed to determine whether the introduction of a high-sensitivity cardiac troponin I (hs-cTnI) assay with a sex-specific 99th centile diagnostic threshold would reduce subsequent myocardial infarction or cardiovascular death in patients with suspected acute coronary syndrome. Methods: In this stepped-wedge, cluster-randomised controlled trial across ten secondary or tertiary care hospitals in Scotland, we evaluated the implementation of an hs-cTnI assay in consecutive patients who had been admitted to the hospitals’ emergency departments with suspected acute coronary syndrome. Patients were eligible for inclusion if they presented with suspected acute coronary syndrome and had paired cardiac troponin measurements from the standard care and trial assays. During a validation phase of 6–12 months, results from the hs-cTnI assay were concealed from the attending clinician, and a contemporary cardiac troponin I (cTnI) assay was used to guide care. Hospitals were randomly allocated to early (n=5 hospitals) or late (n=5 hospitals) implementation, in which the high-sensitivity assay and sex-specific 99th centile diagnostic threshold was introduced immediately after the 6-month validation phase or was deferred for a further 6 months. Patients reclassified by the high-sensitivity assay were defined as those with an increased hs-cTnI concentration in whom cTnI concentrations were below the diagnostic threshold on the contemporary assay. The primary outcome was subsequent myocardial infarction or death from cardiovascular causes at 1 year after initial presentation. Outcomes were compared in patients reclassified by the high-sensitivity assay before and after its implementation by use of an adjusted generalised linear mixed model. 
This trial is registered with ClinicalTrials.gov, number NCT01852123. Findings: Between June 10, 2013, and March 3, 2016, we enrolled 48 282 consecutive patients (61 [SD 17] years, 47% women) of whom 10 360 (21%) patients had cTnI concentrations greater than those of the 99th centile of the normal range of values, who were identified by the contemporary assay or the high-sensitivity assay. The high-sensitivity assay reclassified 1771 (17%) of 10 360 patients with myocardial injury or infarction who were not identified by the contemporary assay. In those reclassified, subsequent myocardial infarction or cardiovascular death within 1 year occurred in 105 (15%) of 720 patients in the validation phase and 131 (12%) of 1051 patients in the implementation phase (adjusted odds ratio for implementation vs validation phase 1·10, 95% CI 0·75 to 1·61; p=0·620). Interpretation: Use of a high-sensitivity assay prompted reclassification of 1771 (17%) of 10 360 patients with myocardial injury or infarction, but was not associated with a lower subsequent incidence of myocardial infarction or cardiovascular death at 1 year. Our findings question whether the diagnostic threshold for myocardial infarction should be based on the 99th centile derived from a normal reference population.
- Normalizing single-cell RNA sequencing data: challenges and opportunities. Nature Methods, Dec 2017
Single-cell transcriptomics is becoming an important component of the molecular biologist’s toolkit. A critical step when analyzing data generated using this technology is normalization. However, normalization is typically performed using methods developed for bulk RNA sequencing or even microarray data, and the suitability of these methods for single-cell transcriptomics has not been assessed. We here discuss commonly used normalization approaches and illustrate how these can produce misleading results. Finally, we present alternative approaches and provide recommendations for single-cell RNA sequencing users.
- Incorporating unobserved heterogeneity in Weibull survival models: A Bayesian approach. Econometrics and Statistics, Dec 2017
Outlying observations and other forms of unobserved heterogeneity can distort inference for survival datasets. The family of Rate Mixtures of Weibull distributions includes subject-level frailty terms as a solution to this issue. With a parametric mixing distribution assigned to the frailties, this family generates flexible hazard functions. Covariates are introduced via an Accelerated Failure Time specification for which the interpretation of the regression coefficients does not depend on the choice of mixing distribution. A weakly informative prior is proposed by combining the structure of the Jeffreys prior with a proper prior on some model parameters. This improper prior is shown to lead to a proper posterior distribution under easily satisfied conditions. By eliciting the proper component of the prior through the coefficient of variation of the survival times, prior information is matched for different mixing distributions. Posterior inference on subject-level frailty terms is exploited as a tool for outlier detection. Finally, the proposed methodology is illustrated using two real datasets, one concerning bone marrow transplants and another on cerebral palsy.
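A member of a rate-mixture-of-Weibulls family can be simulated directly: draw a subject-level gamma frailty with mean one and scale the Weibull rate by it. A hedged Python sketch — the parameterisation is chosen for clarity and does not claim to match the paper's notation:

```python
import math, random

def rate_mixture_weibull(shape, frailty_shape, n, seed=0):
    """Simulate survival times from a Weibull whose rate is scaled by a
    subject-level gamma frailty with mean one (illustrative member of a
    rate-mixture family)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # gamma frailty with E[frailty] = 1; smaller frailty_shape means
        # more unobserved heterogeneity between subjects
        frailty = rng.gammavariate(frailty_shape, 1.0 / frailty_shape)
        u = 1.0 - rng.random()                 # uniform on (0, 1]
        # invert the conditional survival S(t | frailty) = exp(-frailty * t**shape)
        draws.append((-math.log(u) / frailty) ** (1.0 / shape))
    return draws

sample = rate_mixture_weibull(shape=1.5, frailty_shape=2.0, n=100, seed=1)
```

Subjects who draw a small frailty survive much longer than a plain Weibull would predict, which is how the mixture produces the heavier tails and flexible hazards the abstract describes.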
- Aging increases cell-to-cell transcriptional variability upon immune stimulation. Science, Dec 2017
Single-cell sequencing of mouse immune cells reveals how aging destabilizes a conserved transcriptional activation program. How and why the immune system becomes less effective with age are not well understood. Martinez-Jimenez et al. performed single-cell sequencing of CD4+ T cells in old and young mice of two species. In young mice, the gene expression program of early immune activation was tightly regulated and conserved between species. However, as mice aged, the expression of genes involved in pathways responding to immune cell stimulation was not as robust and exhibited increased cell-to-cell variability. Science, this issue p. 1433. Aging is characterized by progressive loss of physiological and cellular functions, but the molecular basis of this decline remains unclear. We explored how aging affects transcriptional dynamics using single-cell RNA sequencing of unstimulated and stimulated naïve and effector memory CD4+ T cells from young and old mice from two divergent species. In young animals, immunological activation drives a conserved transcriptomic switch, resulting in tightly controlled gene expression characterized by a strong up-regulation of a core activation program, coupled with a decrease in cell-to-cell variability. Aging perturbed the activation of this core program and increased expression heterogeneity across populations of cells in both species. These discoveries suggest that increased cell-to-cell transcriptional variability will be a hallmark feature of aging across most, if not all, mammalian tissues.
- Bayesian survival modelling of university outcomes. Journal of the Royal Statistical Society: Series A (Statistics in Society), Jul 2016
Dropouts and delayed graduations are critical issues in higher education systems worldwide. A key task in this context is to identify risk factors associated with these events, providing potential targets for mitigating policies. For this, we employ a discrete time competing risks survival model, dealing simultaneously with university outcomes and their associated temporal component. We define survival times as the duration of the student’s enrolment at university and possible outcomes as graduation or two types of dropout (voluntary and involuntary), exploring the information recorded at admission time (e.g. educational level of the parents) as potential predictors. Although similar strategies have been previously implemented, we extend the previous methods by handling covariate selection within a Bayesian variable selection framework, where model uncertainty is formally addressed through Bayesian model averaging. Our methodology is general; however, here we focus on undergraduate students enrolled in three selected degree programmes of the Pontificia Universidad Católica de Chile during the period 2000–2011. Our analysis reveals interesting insights, highlighting the main covariates that influence students’ risk of dropout and delayed graduation.
- Beyond comparisons of means: understanding changes in gene expression at the single-cell level. Genome Biology, Jul 2016
Traditional differential expression tools are limited to detecting changes in overall expression, and fail to uncover the rich information provided by single-cell level data sets. We present a Bayesian hierarchical model that builds upon BASiCS to study changes that lie beyond comparisons of means, incorporating built-in normalization and quantifying technical artifacts by borrowing information from spike-in genes. Using a probabilistic approach, we highlight genes undergoing changes in cell-to-cell heterogeneity but whose overall expression remains unchanged. Control experiments validate our method’s performance and a case study suggests that novel biological insights can be revealed. Our method is implemented in R and available at https://github.com/catavallejos/BASiCS.
- BASiCS: Bayesian Analysis of Single-Cell Sequencing Data. PLOS Computational Biology, Jun 2015
Single-cell mRNA sequencing can uncover novel cell-to-cell heterogeneity in gene expression levels in seemingly homogeneous populations of cells. However, these experiments are prone to high levels of unexplained technical noise, creating new challenges for identifying genes that show genuine heterogeneous expression within the population of cells under study. BASiCS (Bayesian Analysis of Single-Cell Sequencing data) is an integrated Bayesian hierarchical model where: (i) cell-specific normalisation constants are estimated as part of the model parameters, (ii) technical variability is quantified based on spike-in genes that are artificially introduced to each analysed cell’s lysate and (iii) the total variability of the expression counts is decomposed into technical and biological components. BASiCS also provides an intuitive detection criterion for highly (or lowly) variable genes within the population of cells under study. This is formalised by means of tail posterior probabilities associated with high (or low) biological cell-to-cell variance contributions, quantities that can be easily interpreted by users. We demonstrate our method using gene expression measurements from mouse Embryonic Stem Cells. Cross-validation and meaningful enrichment of gene ontology categories within genes classified as highly (or lowly) variable support the efficacy of our approach.
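The decomposition of total variability into technical and biological components can be caricatured with simple moments: estimate a technical floor for the squared coefficient of variation (CV²) from the spike-ins, then treat each gene's excess as biological. BASiCS does this jointly within a Bayesian hierarchical model; the sketch below, with illustrative toy data, is only a moment-based analogue:

```python
def decompose_cv2(gene_counts, spike_counts):
    """Toy split of squared coefficient of variation (CV^2) into a
    technical floor (average spike-in CV^2) and a biological excess.

    Returns a (total_cv2, biological_cv2) pair per gene.
    """
    def cv2(xs):
        n = len(xs)
        m = sum(xs) / n
        v = sum((x - m) ** 2 for x in xs) / (n - 1)
        return v / m ** 2
    # spike-ins are added at known quantities, so their variability across
    # cells is (to first order) purely technical
    technical = sum(cv2(s) for s in spike_counts) / len(spike_counts)
    return [(cv2(g), max(cv2(g) - technical, 0.0)) for g in gene_counts]

spikes = [[9, 10, 11, 10], [19, 21, 20, 20]]   # mildly noisy spike-ins
genes = [[1, 2, 3, 4], [5, 5, 5, 5]]           # one variable, one constant gene
result = decompose_cv2(genes, spikes)
```

Ranking genes by the biological component rather than total variability avoids flagging genes whose apparent heterogeneity is mostly measurement noise.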
- Objective Bayesian Survival Analysis Using Shape Mixtures of Log-Normal Distributions. Journal of the American Statistical Association, Jun 2015
Survival models such as the Weibull or log-normal lead to inference that is not robust to the presence of outliers. They also assume that all heterogeneity between individuals can be modeled through covariates. This article considers the use of infinite mixtures of lifetime distributions as a solution for these two issues. This can be interpreted as the introduction of a random effect in the survival distribution. We introduce the family of shape mixtures of log-normal distributions, which covers a wide range of density and hazard functions. Bayesian inference under nonsubjective priors based on the Jeffreys’ rule is examined and conditions for posterior propriety are established. The existence of the posterior distribution on the basis of a sample of point observations is not always guaranteed and a solution through set observations is implemented. In addition, we propose a method for outlier detection based on the mixture structure. A simulation study illustrates the performance of our methods under different scenarios and an application to a real dataset is provided. Supplementary materials for the article, which include R code, are available online.