Publications

2025

J. Theor. Biol.

Scalable inference of transcriptional variability with BASiCS

Alan O'Callaghan and Catalina A. Vallejos
Journal of Theoretical Biology 611 : 112157 (2025)
Abstract
BASiCS (Bayesian Analysis of Single-Cell Sequencing data) is an integrated Bayesian hierarchical model for the analysis of single-cell RNA sequencing data. BASiCS performs simultaneous data normalization and quantification of technical noise, and enables analysis of mean expression and expression variability within or across cell populations. We extend BASiCS with a divide and conquer inference scheme to enable scalable Bayesian inference for large datasets. We compare the performance of the divide and conquer approach to standard Markov Chain Monte Carlo (MCMC) and variational inference methods (ADVI) in terms of accuracy and scalability. Our results demonstrate that the divide and conquer approach enables large-scale scRNA-seq analysis, providing accurate and efficient inference while maintaining the interpretability and flexibility of the BASiCS framework.
@article{O_Callaghan_2025,
  title = {Scalable inference of transcriptional variability with BASiCS},
  author = {Alan O'Callaghan and Catalina A. Vallejos},
  journal = {Journal of Theoretical Biology},
  year = {2025},
  volume = {611},
  pages = {112157},
  doi = {10.1016/j.jtbi.2025.112157}} 
medRxiv

Large-scale clustering of longitudinal faecal calprotectin and C-reactive protein profiles in inflammatory bowel disease

Nathan Constantine-Cooke, Marie Vibeke Vestergaard, Nikolas Plevris, Karla Monterrubio-Gomez, Clara Ramos Belinchon, Solomon Ong, Alexander T. Elford, Beatriz Gros, Aleksejs Sazonovs, Gareth-Rhys Jones, Tine Jess, Catalina A. Vallejos and Charlie W. Lees
medRxiv (2025)
Abstract
Crohn's disease (CD) and ulcerative colitis (UC) are highly heterogeneous, dynamic and unpredictable, with a marked disconnect between symptoms and intestinal inflammation. Attempts to classify inflammatory bowel disease (IBD) subphenotypes to inform clinical decision making have been limited. We aimed to describe the latent disease heterogeneity by modelling routinely collected faecal calprotectin (FC) and CRP data, describing dynamic longitudinal inflammatory patterns in IBD. In this longitudinal study, we analysed patient-level post-diagnosis measurements of FC and CRP in two European cohorts. Latent class mixed models were used to cluster individuals with similar longitudinal profiles. Associations between cluster assignment and baseline characteristics were quantified using multinomial logistic regression. Differences in advanced therapy use across clusters were also explored. Finally, we considered uncertainty in cluster assignments with respect to follow-up length and the overlap between FC and CRP clusters. We included 1036 patients in the FC discovery analysis (Lothian) with a total of 10545 FC observations (median 9 per subject, IQR 6-13), and 7880 patients in the replication (Denmark). The CRP discovery analysis consisted of 1838 patients with 49364 measurements (median 20 per subject; IQR 10–36), with 10041 patients in the replication cohort. Eight distinct clusters of inflammatory behaviour over time were identified in the FC and CRP analysis for the Scottish cohort. This model was then applied to the Danish replication cohort, with similar patterns observed in both the Scottish and Danish populations. The clusters, FC1–8 and CRP1–8, were ordered from the lowest cumulative inflammatory burden to the highest. 
The clusters included groups with high diagnostic levels of inflammation which rapidly normalised, groups where high inflammation levels persisted throughout the full seven years of observation, and a series of intermediates including delayed remitters and relapsing remitters. CD and UC patients were unevenly distributed across the clusters. In UC, male sex was associated with the poorest prognostic cluster (FC8). The use and timing of advanced therapy was associated with cluster assignment, with the highest use of early advanced therapy in FC1. Of note, FC8 and CRP8 captured consistently high patterns of inflammation despite a high proportion of patients receiving advanced therapy, particularly for CD individuals. We observed that uncertainty in cluster assignments was higher for individuals with short longitudinal follow-up, particularly between clusters capturing similar earlier inflammation patterns. There was broadly poor agreement between FC and CRP clusters in keeping with the need to monitor both in clinical practice. Interpretation: Distinct patterns of inflammatory behaviour over time are evident in patients with IBD. Cluster assignment is associated with disease type and both the use and timing of advanced therapy. These data pave the way for a deeper understanding of disease heterogeneity in IBD and enhanced patient stratification in the clinic.
@article{Constantine-Cooke2024.11.08.24316916,
  title = {Large-scale clustering of longitudinal faecal calprotectin and C-reactive protein profiles in inflammatory bowel disease},
  author = {Nathan Constantine-Cooke, Marie Vibeke Vestergaard, Nikolas Plevris, Karla Monterrubio-Gomez, Clara Ramos Belinchon, Solomon Ong, Alexander T. Elford, Beatriz Gros, Aleksejs Sazonovs, Gareth-Rhys Jones, Tine Jess, Catalina A. Vallejos and Charlie W. Lees},
  journal = {medRxiv},
  year = {2025},
  doi = {10.1101/2024.11.08.24316916}} 
Clin. Epigenetics

Blood-based epigenome-wide association study and prediction of alcohol consumption

Elena Bernabeu, Aleksandra D. Chybowska, Jacob K. Kresovich, Matthew Suderman, Daniel L. McCartney, Robert F. Hillary, Janie Corley, Maria Del C. Valdés-Hernández, Susana Muñoz Maniega, Mark E. Bastin, Joanna M. Wardlaw, Zongli Xu, Dale P. Sandler, Archie Campbell, Sarah E. Harris, Andrew M. McIntosh, Jack A. Taylor, Paul Yousefi, Simon R. Cox, Kathryn L. Evans, Matthew R. Robinson, Catalina A. Vallejos and Riccardo E. Marioni
Clinical Epigenetics 17 (1) (2025)
Abstract
Alcohol consumption is an important risk factor for multiple diseases. It is typically assessed via self-report, which is open to measurement error through recall bias. Instead, molecular data such as blood-based DNA methylation (DNAm) could be used to derive a more objective measure of alcohol consumption by incorporating information from cytosine-phosphate-guanine (CpG) sites known to be linked to the trait. Here, we explore the epigenetic architecture of self-reported weekly units of alcohol consumption in the Generation Scotland study. We first create a blood-based epigenetic score (EpiScore) of alcohol consumption using elastic net penalized linear regression. We explore the effect of pre-filtering for CpG features ahead of elastic net, as well as differential patterns by sex and by units consumed in the last week relative to an average week. The final EpiScore was trained on 16,717 individuals and tested in four external cohorts: the Lothian Birth Cohorts (LBC) of 1921 and 1936, the Sister Study, and the Avon Longitudinal Study of Parents and Children (total N across studies > 10,000). The maximum Pearson correlation between the EpiScore and self-reported alcohol consumption within cohort ranged from 0.41 to 0.53. In LBC1936, higher EpiScore levels had significant associations with poorer global brain imaging metrics, whereas self-reported alcohol consumption did not. Finally, we identified two novel CpG loci via a Bayesian penalized regression epigenome-wide association study of alcohol consumption. Together, these findings show how DNAm can objectively characterize patterns of alcohol consumption that associate with brain health, unlike self-reported estimates.
@article{Bernabeu_2025,
  title = {Blood-based epigenome-wide association study and prediction of alcohol consumption},
  author = {Elena Bernabeu, Aleksandra D. Chybowska, Jacob K. Kresovich, Matthew Suderman, Daniel L. McCartney, Robert F. Hillary, Janie Corley, Maria Del C. Valdés-Hernández, Susana Muñoz Maniega, Mark E. Bastin, Joanna M. Wardlaw, Zongli Xu, Dale P. Sandler, Archie Campbell, Sarah E. Harris, Andrew M. McIntosh, Jack A. Taylor, Paul Yousefi, Simon R. Cox, Kathryn L. Evans, Matthew R. Robinson, Catalina A. Vallejos and Riccardo E. Marioni},
  journal = {Clinical Epigenetics},
  year = {2025},
  volume = {17},
  number = {1},
  doi = {10.1186/s13148-025-01818-y}} 

2024

PLOS Digit. Health

Differential behaviour of a risk score for emergency hospital admission by demographics in Scotland—A retrospective study

Ioanna Thoma, Simon Rogers, Jillian Ireland, Rachel Porteous, Katie Borland, Catalina A. Vallejos, Louis J. M. Aslett and James Liley
PLOS Digital Health 3 (12) : e0000675 (2024)
@article{Thoma_2024,
  title = {Differential behaviour of a risk score for emergency hospital admission by demographics in Scotland—A retrospective study},
  author = {Ioanna Thoma, Simon Rogers, Jillian Ireland, Rachel Porteous, Katie Borland, Catalina A. Vallejos, Louis J. M. Aslett and James Liley},
  journal = {PLOS Digital Health},
  year = {2024},
  volume = {3},
  number = {12},
  pages = {e0000675},
  doi = {10.1371/journal.pdig.0000675}} 
Biom. J.

A review on statistical and machine learning competing risks methods

Karla Monterrubio-Gómez, Nathan Constantine-Cooke and Catalina A. Vallejos
Biometrical Journal 66 (2) : 2300060 (2024)
Abstract
When modeling competing risks (CR) survival data, several techniques have been proposed in both the statistical and machine learning literature. State-of-the-art methods have extended classical approaches with more flexible assumptions that can improve predictive performance, allow high-dimensional data and missing values, among others. Despite this, modern approaches have not been widely employed in applied settings. This article aims to aid the uptake of such methods by providing a condensed compendium of CR survival methods with a unified notation and interpretation across approaches. We highlight available software and, when possible, demonstrate their usage via reproducible R vignettes. Moreover, we discuss two major concerns that can affect benchmark studies in this context: the choice of performance metrics and reproducibility.
@article{https://doi.org/10.1002/bimj.202300060,
  title = {A review on statistical and machine learning competing risks methods},
  author = {Karla Monterrubio-Gómez, Nathan Constantine-Cooke and Catalina A. Vallejos},
  journal = {Biometrical Journal},
  year = {2024},
  volume = {66},
  number = {2},
  pages = {2300060},
  doi = {10.1002/bimj.202300060}} 
NPJ Digit. Med.

Development and assessment of a machine learning tool for predicting emergency admission in Scotland

James Liley, Gergo Bohner, Samuel R. Emerson, Bilal A. Mateen, Katie Borland, David Carr, Scott Heald, Samuel D. Oduro, Jill Ireland, Keith Moffat, Rachel Porteous, Stephen Riddell, Simon Rogers, Ioanna Thoma, Nathan Cunningham, Chris Holmes, Katrina Payne, Sebastian J. Vollmer, Catalina A. Vallejos and Louis J. M. Aslett
npj Digital Medicine 7 (1) (2024)
Abstract
Emergency admissions (EA), where a patient requires urgent in-hospital care, are a major challenge for healthcare systems. The development of risk prediction models can partly alleviate this problem by supporting primary care interventions and public health planning. Here, we introduce SPARRAv4, a predictive score for EA risk that will be deployed nationwide in Scotland. SPARRAv4 was derived using supervised and unsupervised machine-learning methods applied to routinely collected electronic health records from approximately 4.8M Scottish residents (2013-18). We demonstrate improvements in discrimination and calibration with respect to previous scores deployed in Scotland, as well as stability over a 3-year timeframe. Our analysis also provides insights about the epidemiology of EA risk in Scotland, by studying predictive performance across different population sub-groups and reasons for admission, as well as by quantifying the effect of individual input features. Finally, we discuss broader challenges including reproducibility and how to safely update risk prediction models that are already deployed at population level.
@article{Liley_2024,
  title = {Development and assessment of a machine learning tool for predicting emergency admission in Scotland},
  author = {James Liley, Gergo Bohner, Samuel R. Emerson, Bilal A. Mateen, Katie Borland, David Carr, Scott Heald, Samuel D. Oduro, Jill Ireland, Keith Moffat, Rachel Porteous, Stephen Riddell, Simon Rogers, Ioanna Thoma, Nathan Cunningham, Chris Holmes, Katrina Payne, Sebastian J. Vollmer, Catalina A. Vallejos and Louis J. M. Aslett},
  journal = {npj Digital Medicine},
  year = {2024},
  volume = {7},
  number = {1},
  doi = {10.1038/s41746-024-01250-1}} 
AI Ethics

Ethical considerations of use of hold-out sets in clinical prediction model management

Louis Chislett, Louis J. M. Aslett, Alisha R. Davies, Catalina A. Vallejos and James Liley
AI and Ethics (2024)
Abstract
Clinical prediction models are statistical or machine learning models used to quantify the risk of a certain health outcome using patient data. These can then inform potential interventions on patients, causing an effect called performative prediction: predictions inform interventions which influence the outcome they were trying to predict, leading to a potential underestimation of risk in some patients if a model is updated on this data. One suggested resolution to this is the use of hold-out sets, in which a set of patients do not receive model derived risk scores, such that a model can be safely retrained. We present an overview of clinical and research ethics regarding potential implementation of hold-out sets for clinical prediction models in health settings. We focus on the ethical principles of beneficence, non-maleficence, autonomy and justice. We also discuss informed consent, clinical equipoise, and truth-telling. We present illustrative cases of potential hold-out set implementations and discuss statistical issues arising from different hold-out set sampling methods. We also discuss differences between hold-out sets and randomised control trials, in terms of ethics and statistical issues. Finally, we give practical recommendations for researchers interested in the use of hold-out sets for clinical prediction models.
@article{Chislett_2024,
  title = {Ethical considerations of use of hold-out sets in clinical prediction model management},
  author = {Louis Chislett, Louis J. M. Aslett, Alisha R. Davies, Catalina A. Vallejos and James Liley},
  journal = {AI and Ethics},
  year = {2024},
  doi = {10.1007/s43681-024-00561-z}} 

2023

Clin. Gastroenterol. Hepatol.

Longitudinal Fecal Calprotectin Profiles Characterize Disease Course Heterogeneity in Crohn's Disease

Nathan Constantine-Cooke, Karla Monterrubio-Gomez, Nikolas Plevris, Lauranne A.A.P. Derikx, Beatriz Gros, Gareth-Rhys Jones, Riccardo E. Marioni, Charlie W. Lees and Catalina A. Vallejos
Clinical Gastroenterology and Hepatology 21 (11) : 2918-2927.e6 (2023)
Abstract
The progressive nature of Crohn's disease is highly variable and hard to predict. In addition, symptoms correlate poorly with mucosal inflammation. There is therefore an urgent need to better characterize the heterogeneity of disease trajectories in Crohn's disease by utilizing objective markers of inflammation. We aimed to better understand this heterogeneity by clustering Crohn's disease patients with similar longitudinal fecal calprotectin profiles. We performed a retrospective cohort study at the Edinburgh IBD Unit, a tertiary referral center, and used latent class mixed models to cluster Crohn's disease subjects using fecal calprotectin observed within 5 years of diagnosis. Information criteria, alluvial plots, and cluster trajectories were used to decide the optimal number of clusters. Chi-square test, Fisher's exact test, and analysis of variance were used to test for associations with variables commonly assessed at diagnosis. Our study cohort comprised 356 patients with newly diagnosed Crohn's disease and 2856 fecal calprotectin measurements taken within 5 years of diagnosis (median 7 per subject). Four distinct clusters were identified by characteristic calprotectin profiles: a cluster with consistently high fecal calprotectin and 3 clusters characterized by different downward longitudinal trends. Cluster membership was significantly associated with smoking (P = .015), upper gastrointestinal involvement (P < .001), and early biologic therapy (P < .001). Our analysis demonstrates a novel approach to characterizing the heterogeneity of Crohn's disease by using fecal calprotectin. The group profiles do not simply reflect different treatment regimens and do not mirror classical disease progression endpoints.
@article{Constantine-Cooke2023,
  title = {Longitudinal Fecal Calprotectin Profiles Characterize Disease Course Heterogeneity in Crohn's Disease},
  author = {Nathan Constantine-Cooke, Karla Monterrubio-Gomez, Nikolas Plevris, Lauranne A.A.P. Derikx, Beatriz Gros, Gareth-Rhys Jones, Riccardo E. Marioni, Charlie W. Lees and Catalina A. Vallejos},
  journal = {Clinical Gastroenterology and Hepatology},
  year = {2023},
  volume = {21},
  number = {11},
  pages = {2918-2927.e6},
  doi = {10.1016/j.cgh.2023.03.026}} 
Nat. Aging

Development and validation of DNA methylation scores in two European cohorts augment 10-year risk prediction of type 2 diabetes

Yipeng Cheng, Danni A. Gadd, Christian Gieger, Karla Monterrubio-Gomez, Yufei Zhang, Imrich Berta, Michael J. Stam, Natalia Szlachetka, Evgenii Lobzaev, Nicola Wrobel, Lee Murphy, Archie Campbell, Cliff Nangle, Rosie M. Walker, Chloe Fawns-Ritchie, Annette Peters, Wolfgang Rathmann, David J. Porteous, Kathryn L. Evans, Andrew M. McIntosh, Timothy I. Cannings, Melanie Waldenberger, Andrea Ganna, Daniel L. McCartney, Catalina A. Vallejos and Riccardo E. Marioni
Nature Aging 3 (4) : 450-458 (2023)
Abstract
Type 2 diabetes mellitus (T2D) presents a major health and economic burden that could be alleviated with improved early prediction and intervention. While standard risk factors have shown good predictive performance, we show that the use of blood-based DNA methylation information leads to a significant improvement in the prediction of 10-year T2D incidence risk. Previous studies have been largely constrained by linear assumptions, the use of cytosine–guanine pairs one-at-a-time and binary outcomes. We present a flexible approach (via an R package, MethylPipeR) based on a range of linear and tree-ensemble models that incorporate time-to-event data for prediction. Using the Generation Scotland cohort (training set n_cases = 374, n_controls = 9,461; test set n_cases = 252, n_controls = 4,526) our best-performing model (area under the receiver operating characteristic curve (AUC) = 0.872, area under the precision-recall curve (PRAUC) = 0.302) showed notable improvement in 10-year onset prediction beyond standard risk factors (AUC = 0.839, precision-recall AUC = 0.227). Replication was observed in the German-based KORA study (n = 1,451, n_cases = 142, P = 1.6 × 10⁻⁵).
J. Am. Coll. Cardiol.

Improving Risk Stratification for Patients With Type 2 Myocardial Infarction

Caelan Taggart, Karla Monterrubio-Gómez, Andreas Roos, Jasper Boeddinghaus, Dorien M. Kimenai, Erik Kadesjo, Anda Bularga, Ryan Wereski, Amy Ferry, Matthew Lowry, Atul Anand, Kuan Ken Lee, Dimitrios Doudesis, Ioanna Manolopoulou, Thomas Nestelberger, Luca Koechlin, Pedro Lopez-Ayala, Christian Mueller, Nicholas L. Mills, Catalina A. Vallejos and Andrew R. Chapman
Journal of the American College of Cardiology 81 (2) : 156-168 (2023)
Abstract
Despite poor cardiovascular outcomes, there are no dedicated, validated risk stratification tools to guide investigation or treatment in type 2 myocardial infarction. The goal of this study was to derive and validate a risk stratification tool for the prediction of death or future myocardial infarction in patients with type 2 myocardial infarction. The T2-risk score was developed in a prospective multicenter cohort of consecutive patients with type 2 myocardial infarction. Cox proportional hazards models were constructed for the primary outcome of myocardial infarction or death at 1 year using variables selected a priori based on clinical importance. Discrimination was assessed by area under the receiver-operating characteristic curve (AUC). Calibration was investigated graphically. The tool was validated in a single-center cohort of consecutive patients and in a multicenter cohort study from sites across Europe. There were 1,121, 250, and 253 patients in the derivation, single-center, and multicenter validation cohorts, with the primary outcome occurring in 27% (297 of 1,121), 26% (66 of 250), and 14% (35 of 253) of patients, respectively. The T2-risk score incorporating age, ischemic heart disease, heart failure, diabetes mellitus, myocardial ischemia on electrocardiogram, heart rate, anemia, estimated glomerular filtration rate, and maximal cardiac troponin concentration had good discrimination (AUC: 0.76; 95% CI: 0.73-0.79) for the primary outcome and was well calibrated. Discrimination was similar in the consecutive patient (AUC: 0.83; 95% CI: 0.77-0.88) and multicenter (AUC: 0.74; 95% CI: 0.64-0.83) cohorts. T2-risk provided improved discrimination over the Global Registry of Acute Coronary Events 2.0 risk score in all cohorts. The T2-risk score performed well in different health care settings and could help clinicians to prognosticate, as well as target investigation and preventative therapies more effectively.
(High-Sensitivity Troponin in the Evaluation of Patients With Suspected Acute Coronary Syndrome [High-STEACS]; NCT01852123)

2022

PLoS Comput. Biol.

SCRaPL: A Bayesian hierarchical framework for detecting technical associates in single cell multiomics data

Christos Maniatis, Catalina A. Vallejos and Guido Sanguinetti
PLoS Computational Biology 18 (6) : e1010163 (2022)
Abstract
Single-cell multi-omics assays offer unprecedented opportunities to explore epigenetic regulation at cellular level. However, high levels of technical noise and data sparsity frequently lead to a lack of statistical power in correlative analyses, identifying very few, if any, significant associations between different molecular layers. Here we propose SCRaPL, a novel computational tool that increases power by carefully modelling noise in the experimental systems. We show on real and simulated multi-omics single-cell data sets that SCRaPL achieves higher sensitivity and better robustness in identifying correlations, while maintaining a similar level of false positives as standard analyses based on Pearson and Spearman correlation.

2021

Genome Biol.

scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution

Chantriolnt-Andreas Kapourani, Ricard Argelaguet, Guido Sanguinetti and Catalina A Vallejos
Genome Biology 22 (1) : 114 (2021)
Abstract
High-throughput single-cell measurements of DNA methylomes can quantify methylation heterogeneity and uncover its role in gene regulation. However, technical limitations and sparse coverage can preclude this task. scMET is a hierarchical Bayesian model which overcomes sparsity, sharing information across cells and genomic features to robustly quantify genuine biological heterogeneity. scMET can identify highly variable features that drive epigenetic heterogeneity, and perform differential methylation and variability analyses. We illustrate how scMET facilitates the characterization of epigenetically distinct cell populations and how it enables the formulation of novel hypotheses on the epigenetic regulation of gene expression. scMET is available at https://github.com/andreaskapou/scMET.
arXiv

Model updating after interventions paradoxically introduces bias

James Liley, Samuel R Emerson, Bilal A Mateen, Catalina A Vallejos, Louis J M Aslett and Sebastian J Vollmer
arXiv (2021)
Abstract
Machine learning is increasingly being used to generate prediction models for use in a number of real-world settings, from credit risk assessment to clinical decision support. Recent discussions have highlighted potential problems in the updating of a predictive score for a binary outcome when an existing predictive score forms part of the standard workflow, driving interventions. In this setting, the existing score induces an additional causative pathway which leads to miscalibration when the original score is replaced. We propose a general causal framework to describe and address this problem, and demonstrate an equivalent formulation as a partially observed Markov decision process. We use this model to demonstrate the impact of such 'naive updating' when performed repeatedly. Namely, we show that successive predictive scores may converge to a point where they predict their own effect, or may eventually tend toward a stable oscillation between two values, and we argue that neither outcome is desirable. Furthermore, we demonstrate that even if model-fitting procedures improve, actual performance may worsen. We complement these findings with a discussion of several potential routes to overcome these issues.
Nat. Commun.

Single-nucleus RNA-seq2 reveals functional crosstalk between liver zonation and ploidy

M. L. Richter, I. K. Deligiannis, K. Yin, A. Danese, E. Lleshi, P. Coupland, C. A. Vallejos, K. P. Matchett, N. C. Henderson, M. Colome-Tatche and C. P. Martinez-Jimenez
Nature Communications 12 (1) : 4264 (2021)
Abstract
Single-cell RNA-seq reveals the role of pathogenic cell populations in development and progression of chronic diseases. In order to expand our knowledge on cellular heterogeneity, we have developed a single-nucleus RNA-seq2 method tailored for the comprehensive analysis of the nuclear transcriptome from frozen tissues, allowing the dissection of all cell types present in the liver, regardless of cell size or cellular fragility. We use this approach to characterize the transcriptional profile of individual hepatocytes with different levels of ploidy, and have discovered that ploidy states are associated with different metabolic potential, and gene expression in tetraploid mononucleated hepatocytes is conditioned by their position within the hepatic lobule. Our work reveals a remarkable crosstalk between gene dosage and spatial distribution of hepatocytes.
medRxiv

Development and assessment of a machine learning tool for predicting emergency admission in Scotland

James Liley, Gergo Bohner, Samuel R. Emerson, Bilal A. Mateen, Katie Borland, David Carr, Scott Heald, Samuel D. Oduro, Jill Ireland, Keith Moffat, Rachel Porteous, Stephen Riddell, Nathan Cunningham, Chris Holmes, Katrina Payne, Sebastian J. Vollmer, Catalina A. Vallejos and Louis J. M. Aslett
medRxiv (2021)
Abstract
Avoiding emergency hospital admission (EA) is advantageous to individual health and the healthcare system. We develop a statistical model estimating risk of EA for most of the Scottish population (> 4.8M individuals) using electronic health records, such as hospital episodes and prescribing activity. We demonstrate good predictive accuracy (AUROC 0.80), calibration and temporal stability. We find strong prediction of respiratory and metabolic EA, show a substantial risk contribution from socioeconomic decile, and highlight an important problem in model updating. Our work constitutes a rare example of a population-scale machine learning score to be deployed in a healthcare setting.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: JL, CAV and LJMA were partially supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the "Health" theme within that grant and The Alan Turing Institute; JL, BAM, CAV, LJMA and SJV were partially supported by Health Data Research UK, an initiative funded by UK Research and Innovation, Department of Health and Social Care (England), the devolved administrations, and leading medical research charities; SJV, NC and GB were partially supported by the University of Warwick Impact Fund. SRE is funded by the EPSRC doctoral training partnership (DTP) at Durham University, grant reference EP/R513039/1; LJMA was partially supported by a Health Programme Fellowship at The Alan Turing Institute; CAV was supported by a Chancellor's Fellowship provided by the University of Edinburgh.
Author Declarations: This study and the use of NHS data was approved by the Public Benefit and Privacy Panel for Health and Social Care (study number 1718-0370; approval evidenced in application outcome minutes for 2018/19 at https://www.informationgovernance.scot.nhs.uk/pbpphsc/application-outcomes/ ). In addition, accessing data was approved by the Public Health Scotland National Safe Haven, through the electronic Data Research and Innovation Service (eDRIS) and the Public Benefit and Privacy Panel (PBPP) (study number 1718-0370). All studies have been conducted in accordance with information governance standards; data had no patient identifiers available to the researchers. This work was conducted in accordance with UK data governance regulations under PBPP application number eDRIS 1718-0370.
Raw data for this project are patient-level NHS Scotland health records, and are confidential. Due to the confidential nature of the data used, all analysis took place on remote 'safe havens', without access to internet, software updates or unpublished software. Information Governance training was required for all researchers accessing the analysis environment. Moreover, to avoid the risk of accidental disclosure of sensitive information, an independent team carried out statistical disclosure control checks to all data exports, including the outputs presented in this manuscript. All analysis code and co-ordinates required to reproduce our Figures are available at https://github.com/jamesliley/SPARRAv4

2020

Genome Biol.

Eleven grand challenges in single-cell data science

David Lähnemann, Johannes Köster, Ewa Szczurek, Davis J. McCarthy, Stephanie C. Hicks, Mark D. Robinson, Catalina A. Vallejos, Kieran R. Campbell, Niko Beerenwinkel, Ahmed Mahfouz, Luca Pinello, Pavel Skums, Alexandros Stamatakis, Camille Stephan-Otto Attolini, Samuel Aparicio, Jasmijn Baaijens, Marleen Balvert, Buys de Barbanson, Antonio Cappuccio, Giacomo Corleone, Bas E. Dutilh, Maria Florescu, Victor Guryev, Rens Holmer, Katharina Jahn, Thamar Jessurun Lobo, Emma M. Keizer, Indu Khatri, Szymon M. Kielbasa, Jan O. Korbel, Alexey M. Kozlov, Tzu-Hao Kuo, Boudewijn P. F. Lelieveldt, Ion I. Mandoiu, John C. Marioni, Tobias Marschall, Felix Mölder, Amir Niknejad, Lukasz Raczkowski, Marcel Reinders, Jeroen de Ridder, Antoine-Emmanuel Saliba, Antonios Somarakis, Oliver Stegle, Fabian J. Theis, Huan Yang, Alex Zelikovsky, Alice C. McHardy, Benjamin J. Raphael, Sohrab P. Shah and Alexander Schönhuth
Genome Biology 21 (1) : 31 (2020)
Abstract
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
Circulation

High-Sensitivity Cardiac Troponin and the Universal Definition of Myocardial Infarction

Andrew R. Chapman, Philip D. Adamson, Anoop S.V. Shah, Atul Anand, Fiona E. Strachan, Amy V. Ferry, Kuan Ken Lee, Colin Berry, Iain Findlay, Anne Cruikshank, Alan Reid, Alasdair Gray, Paul O. Collinson, Fred Apple, David A. McAllister, Donogh Maguire, Keith A.A. Fox, Catalina A. Vallejos, Catriona Keerie, Christopher J. Weir, David E. Newby, Nicholas L. Mills, Christopher Tuck, Anda Bularga, Ryan Wereski, Dennis Sandeman, Catherine L. Stables, Athanasios Tsanasis, Lucy Marshall, Stacey D. Stewart, Takeshi Fujisawa, Mischa Hautvast, Jean McPherson, Lynn McKinlay, Simon Walker, Ian Ford, Simon Walker, Shannon Amoils, Jennifer Stevens, John Norrie, Jack Andrews, Phil Adamson, Alastair Moss, Mohamed Anwar, John Hung, Simon Walker, Jonathan Malo, Colin Fischbacher, Bernard Croal, Stephen J. Leslie, Richard Parker, Allan Walker, Ronnie Harkess, Chris Tuck, Tony Wackett, Roma Armstrong, Marion Flood, Laura Stirling, Claire MacDonald, Imran Sadat, Frank Finlay, Heather Charles, Pamela Linksted, Stephen Young, Bill Alexander and Chris Duncan
Circulation 141 (3) : 161-171 (2020)
Abstract
The introduction of more sensitive cardiac troponin assays has led to increased recognition of myocardial injury in acute illnesses other than acute coronary syndrome. The Universal Definition of Myocardial Infarction recommends high-sensitivity cardiac troponin testing and classification of patients with myocardial injury based on pathogenesis, but the clinical implications of implementing this guideline are not well understood. In a stepped-wedge cluster randomized, controlled trial, we implemented a high-sensitivity cardiac troponin assay and the recommendations of the Universal Definition in 48 282 consecutive patients with suspected acute coronary syndrome. In a prespecified secondary analysis, we compared the primary outcome of myocardial infarction or cardiovascular death and secondary outcome of noncardiovascular death at 1 year across diagnostic categories. Implementation increased the diagnosis of type 1 myocardial infarction by 11% (510/4471), type 2 myocardial infarction by 22% (205/916), and acute and chronic myocardial injury by 36% (443/1233) and 43% (389/898), respectively. Compared with those without myocardial injury, the rate of the primary outcome was highest in those with type 1 myocardial infarction (cause-specific hazard ratio [HR] 5.64 [95% CI, 5.12–6.22]), but was similar across diagnostic categories, whereas noncardiovascular deaths were highest in those with acute myocardial injury (cause-specific HR 2.65 [95% CI, 2.33–3.01]). Despite modest increases in antiplatelet therapy and coronary revascularization after implementation in patients with type 1 myocardial infarction, the primary outcome was unchanged (cause-specific HR 1.00 [95% CI, 0.82–1.21]). Increased recognition of type 2 myocardial infarction and myocardial injury did not lead to changes in investigation, treatment or outcomes.
Implementation of high-sensitivity cardiac troponin assays and the recommendations of the Universal Definition of Myocardial Infarction identified patients at high risk of cardiovascular and noncardiovascular events but was not associated with consistent increases in treatment or improved outcomes. Trials of secondary prevention are urgently required to determine whether this risk is modifiable in patients without type 1 myocardial infarction.

2018

Cell Syst.

Correcting the Mean-Variance Dependency for Differential Variability Testing Using Single-Cell RNA Sequencing Data

Nils Eling, Arianne C Richard, Sylvia Richardson, John C Marioni and Catalina A Vallejos
Cell Systems 7 (3) : 284-294.e12 (2018)
Abstract
Cell-to-cell transcriptional variability in otherwise homogeneous cell populations plays an important role in tissue function and development. Single-cell RNA sequencing can characterize this variability in a transcriptome-wide manner. However, technical variation and the confounding between variability and mean expression estimates hinder meaningful comparison of expression variability between cell populations. To address this problem, we introduce an analysis approach that extends the BASiCS statistical framework to derive a residual measure of variability that is not confounded by mean expression. This includes a robust procedure for quantifying technical noise in experiments where technical spike-in molecules are not available. We illustrate how our method provides biological insight into the dynamics of cell-to-cell expression variability, highlighting a synchronization of biosynthetic machinery components in immune cells upon activation. In contrast to the uniform up-regulation of the biosynthetic machinery, CD4+ T cells show heterogeneous up-regulation of immune-related and lineage-defining genes during activation and differentiation.
Lancet

High-sensitivity troponin in the evaluation of patients with suspected acute coronary syndrome: a stepped-wedge, cluster-randomised controlled trial

Anoop S V Shah, Atul Anand, Fiona E Strachan, Amy V Ferry, Kuan Ken Lee, Andrew R Chapman, Dennis Sandeman, Catherine L Stables, Philip D Adamson, Jack P M Andrews, Mohamed S Anwar, John Hung, Alistair J Moss, Rachel O'Brien, Colin Berry, Iain Findlay, Simon Walker, Anne Cruickshank, Alan Reid, Alasdair Gray, Paul O Collinson, Fred S Apple, David A McAllister, Donogh Maguire, Keith A A Fox, David E Newby, Christopher Tuck, Ronald Harkess, Richard A Parker, Catriona Keerie, Christopher J Weir, Nicholas L Mills, Lucy Marshall, Stacey D Stewart, Takeshi Fujisawa, Catalina A Vallejos, Athanasios Tsanas, Mischa Hautvast, Jean McPherson, Lynn McKinlay, Jonathan Malo, Colin M Fischbacher, Bernard L Croal, Stephen J Leslie, Allan Walker, Tony Wackett, Roma Armstrong, Laura Stirling, Claire MacDonald, Imran Sadat, Frank Finlay, Heather Charles, Pamela Linksted, Stephen Young, Bill Alexander and Chris Duncan
The Lancet 392 (10151) : 919-928 (2018)
Abstract
High-sensitivity cardiac troponin assays permit use of lower thresholds for the diagnosis of myocardial infarction, but whether this improves clinical outcomes is unknown. We aimed to determine whether the introduction of a high-sensitivity cardiac troponin I (hs-cTnI) assay with a sex-specific 99th centile diagnostic threshold would reduce subsequent myocardial infarction or cardiovascular death in patients with suspected acute coronary syndrome. In this stepped-wedge, cluster-randomised controlled trial across ten secondary or tertiary care hospitals in Scotland, we evaluated the implementation of an hs-cTnI assay in consecutive patients who had been admitted to the hospitals' emergency departments with suspected acute coronary syndrome. Patients were eligible for inclusion if they presented with suspected acute coronary syndrome and had paired cardiac troponin measurements from the standard care and trial assays. During a validation phase of 6–12 months, results from the hs-cTnI assay were concealed from the attending clinician, and a contemporary cardiac troponin I (cTnI) assay was used to guide care. Hospitals were randomly allocated to early (n=5 hospitals) or late (n=5 hospitals) implementation, in which the high-sensitivity assay and sex-specific 99th centile diagnostic threshold was introduced immediately after the 6-month validation phase or was deferred for a further 6 months. Patients reclassified by the high-sensitivity assay were defined as those with an increased hs-cTnI concentration in whom cTnI concentrations were below the diagnostic threshold on the contemporary assay. The primary outcome was subsequent myocardial infarction or death from cardiovascular causes at 1 year after initial presentation. Outcomes were compared in patients reclassified by the high-sensitivity assay before and after its implementation by use of an adjusted generalised linear mixed model. This trial is registered with ClinicalTrials.gov, number NCT01852123. 
Between June 10, 2013, and March 3, 2016, we enrolled 48,282 consecutive patients (61 [SD 17] years, 47% women) of whom 10,360 (21%) patients had cTnI concentrations greater than those of the 99th centile of the normal range of values, who were identified by the contemporary assay or the high-sensitivity assay. The high-sensitivity assay reclassified 1771 (17%) of 10,360 patients with myocardial injury or infarction who were not identified by the contemporary assay. In those reclassified, subsequent myocardial infarction or cardiovascular death within 1 year occurred in 105 (15%) of 720 patients in the validation phase and 131 (12%) of 1051 patients in the implementation phase (adjusted odds ratio for implementation vs validation phase 1·10, 95% CI 0·75 to 1·61; p=0·620). Use of a high-sensitivity assay prompted reclassification of 1771 (17%) of 10,360 patients with myocardial injury or infarction, but was not associated with a lower subsequent incidence of myocardial infarction or cardiovascular death at 1 year. Our findings question whether the diagnostic threshold for myocardial infarction should be based on the 99th centile derived from a normal reference population.

2017

Nat. Methods

Normalizing single-cell RNA sequencing data: challenges and opportunities

Catalina A Vallejos, Davide Risso, Antonio Scialdone, Sandrine Dudoit and John C Marioni
Nature Methods 14 (6) : 565-571 (2017)
Abstract
Single-cell transcriptomics is becoming an important component of the molecular biologist's toolkit. A critical step when analyzing data generated using this technology is normalization. However, normalization is typically performed using methods developed for bulk RNA sequencing or even microarray data, and the suitability of these methods for single-cell transcriptomics has not been assessed. We here discuss commonly used normalization approaches and illustrate how these can produce misleading results. Finally, we present alternative approaches and provide recommendations for single-cell RNA sequencing users.
Econom. Stat.

Incorporating unobserved heterogeneity in Weibull survival models: A Bayesian approach

Catalina A. Vallejos and Mark F.J. Steel
Econometrics and Statistics 3 : 73-88 (2017)
Abstract
Outlying observations and other forms of unobserved heterogeneity can distort inference for survival datasets. The family of Rate Mixtures of Weibull distributions includes subject-level frailty terms as a solution to this issue. With a parametric mixing distribution assigned to the frailties, this family generates flexible hazard functions. Covariates are introduced via an Accelerated Failure Time specification for which the interpretation of the regression coefficients does not depend on the choice of mixing distribution. A weakly informative prior is proposed by combining the structure of the Jeffreys prior with a proper prior on some model parameters. This improper prior is shown to lead to a proper posterior distribution under easily satisfied conditions. By eliciting the proper component of the prior through the coefficient of variation of the survival times, prior information is matched for different mixing distributions. Posterior inference on subject-level frailty terms is exploited as a tool for outlier detection. Finally, the proposed methodology is illustrated using two real datasets, one concerning bone marrow transplants and another on cerebral palsy.
Science

Aging increases cell-to-cell transcriptional variability upon immune stimulation

Celia Pilar Martinez-Jimenez, Nils Eling, Hung-Chang Chen, Catalina A. Vallejos, Aleksandra A. Kolodziejczyk, Frances Connor, Lovorka Stojic, Timothy F. Rayner, Michael J. T. Stubbington, Sarah A. Teichmann, Maike de la Roche, John C. Marioni and Duncan T. Odom
Science 355 (6332) : 1433-1436 (2017)
Abstract
Single-cell sequencing of mouse immune cells reveals how aging destabilizes a conserved transcriptional activation program. How and why the immune system becomes less effective with age are not well understood. Martinez-Jimenez et al. performed single-cell sequencing of CD4+ T cells in old and young mice of two species. In young mice, the gene expression program of early immune activation was tightly regulated and conserved between species. However, as mice aged, the expression of genes involved in pathways responding to immune cell stimulation was not as robust and exhibited increased cell-to-cell variability.
Aging is characterized by progressive loss of physiological and cellular functions, but the molecular basis of this decline remains unclear. We explored how aging affects transcriptional dynamics using single-cell RNA sequencing of unstimulated and stimulated naïve and effector memory CD4+ T cells from young and old mice from two divergent species. In young animals, immunological activation drives a conserved transcriptomic switch, resulting in tightly controlled gene expression characterized by a strong up-regulation of a core activation program, coupled with a decrease in cell-to-cell variability. Aging perturbed the activation of this core program and increased expression heterogeneity across populations of cells in both species. These discoveries suggest that increased cell-to-cell transcriptional variability will be a hallmark feature of aging across most, if not all, mammalian tissues.

2016

Genome Biol.

Beyond comparisons of means: understanding changes in gene expression at the single-cell level

Catalina A. Vallejos, Sylvia Richardson and John C. Marioni
Genome Biology 17 (1) : 70 (2016)
Abstract
Traditional differential expression tools are limited to detecting changes in overall expression, and fail to uncover the rich information provided by single-cell level data sets. We present a Bayesian hierarchical model that builds upon BASiCS to study changes that lie beyond comparisons of means, incorporating built-in normalization and quantifying technical artifacts by borrowing information from spike-in genes. Using a probabilistic approach, we highlight genes undergoing changes in cell-to-cell heterogeneity but whose overall expression remains unchanged. Control experiments validate our method's performance and a case study suggests that novel biological insights can be revealed. Our method is implemented in R and available at https://github.com/catavallejos/BASiCS.
J. R. Stat. Soc. Ser. A

Bayesian survival modelling of university outcomes

Catalina A. Vallejos and Mark F. J. Steel
Journal of the Royal Statistical Society: Series A (Statistics in Society) 180 (2) : 613-631 (2016)
Abstract
Dropouts and delayed graduations are critical issues in higher education systems worldwide. A key task in this context is to identify risk factors associated with these events, providing potential targets for mitigating policies. For this, we employ a discrete time competing risks survival model, dealing simultaneously with university outcomes and their associated temporal component. We define survival times as the duration of the student's enrolment at university and possible outcomes as graduation or two types of dropout (voluntary and involuntary), exploring the information recorded at admission time (e.g. educational level of the parents) as potential predictors. Although similar strategies have been previously implemented, we extend the previous methods by handling covariate selection within a Bayesian variable selection framework, where model uncertainty is formally addressed through Bayesian model averaging. Our methodology is general; however, here we focus on undergraduate students enrolled in three selected degree programmes of the Pontificia Universidad Católica de Chile during the period 2000–2011. Our analysis reveals interesting insights, highlighting the main covariates that influence students’ risk of dropout and delayed graduation.

2015

PLoS Comput. Biol.

BASiCS: Bayesian Analysis of Single-Cell Sequencing Data

Catalina A Vallejos, John C Marioni and Sylvia Richardson
PLOS Computational Biology 11 (6) : 1-18 (2015)
Abstract
Single-cell mRNA sequencing can uncover novel cell-to-cell heterogeneity in gene expression levels in seemingly homogeneous populations of cells. However, these experiments are prone to high levels of unexplained technical noise, creating new challenges for identifying genes that show genuine heterogeneous expression within the population of cells under study. BASiCS (Bayesian Analysis of Single-Cell Sequencing data) is an integrated Bayesian hierarchical model where: (i) cell-specific normalisation constants are estimated as part of the model parameters, (ii) technical variability is quantified based on spike-in genes that are artificially introduced to each analysed cell’s lysate and (iii) the total variability of the expression counts is decomposed into technical and biological components. BASiCS also provides an intuitive detection criterion for highly (or lowly) variable genes within the population of cells under study. This is formalised by means of tail posterior probabilities associated with high (or low) biological cell-to-cell variance contributions, quantities that can be easily interpreted by users. We demonstrate our method using gene expression measurements from mouse Embryonic Stem Cells. Cross-validation and meaningful enrichment of gene ontology categories within genes classified as highly (or lowly) variable support the efficacy of our approach.
J. Am. Stat. Assoc.

Objective Bayesian Survival Analysis Using Shape Mixtures of Log-Normal Distributions

Catalina A Vallejos and Mark FJ Steel
Journal of the American Statistical Association 110 (510) : 697-710 (2015)
Abstract
Survival models such as the Weibull or log-normal lead to inference that is not robust to the presence of outliers. They also assume that all heterogeneity between individuals can be modeled through covariates. This article considers the use of infinite mixtures of lifetime distributions as a solution for these two issues. This can be interpreted as the introduction of a random effect in the survival distribution. We introduce the family of shape mixtures of log-normal distributions, which covers a wide range of density and hazard functions. Bayesian inference under nonsubjective priors based on the Jeffreys’ rule is examined and conditions for posterior propriety are established. The existence of the posterior distribution on the basis of a sample of point observations is not always guaranteed and a solution through set observations is implemented. In addition, we propose a method for outlier detection based on the mixture structure. A simulation study illustrates the performance of our methods under different scenarios and an application to a real dataset is provided. Supplementary materials for the article, which include R code, are available online.