Publications

2024

  1. NPJ DM
    Development and assessment of a machine learning tool for predicting emergency admission in Scotland
    James Liley, Gergo Bohner, Samuel R. Emerson, Bilal A. Mateen, Katie Borland, David Carr, Scott Heald, Samuel D. Oduro, Jill Ireland, Keith Moffat, Rachel Porteous, Stephen Riddell, Simon Rogers, Ioanna Thoma, Nathan Cunningham, Chris Holmes, Katrina Payne, Sebastian J. Vollmer, Catalina A. Vallejos, and Louis J. M. Aslett
    npj Digital Medicine Oct 2024

    Emergency admissions (EA), where a patient requires urgent in-hospital care, are a major challenge for healthcare systems. The development of risk prediction models can partly alleviate this problem by supporting primary care interventions and public health planning. Here, we introduce SPARRAv4, a predictive score for EA risk that will be deployed nationwide in Scotland. SPARRAv4 was derived using supervised and unsupervised machine-learning methods applied to routinely collected electronic health records from approximately 4.8M Scottish residents (2013-18). We demonstrate improvements in discrimination and calibration with respect to previous scores deployed in Scotland, as well as stability over a 3-year timeframe. Our analysis also provides insights about the epidemiology of EA risk in Scotland, by studying predictive performance across different population sub-groups and reasons for admission, as well as by quantifying the effect of individual input features. Finally, we discuss broader challenges including reproducibility and how to safely update risk prediction models that are already deployed at population level.

    @article{Liley_2024,
      abbr = {NPJ DM},
      bibtex_show = {true},
      selected = {true},
      title = {Development and assessment of a machine learning tool for predicting emergency admission in Scotland},
      volume = {7},
      issn = {2398-6352},
      url = {http://dx.doi.org/10.1038/s41746-024-01250-1},
      doi = {10.1038/s41746-024-01250-1},
      number = {1},
      journal = {npj Digital Medicine},
      publisher = {Springer Science and Business Media LLC},
      author = {Liley, James and Bohner, Gergo and Emerson, Samuel R. and Mateen, Bilal A. and Borland, Katie and Carr, David and Heald, Scott and Oduro, Samuel D. and Ireland, Jill and Moffat, Keith and Porteous, Rachel and Riddell, Stephen and Rogers, Simon and Thoma, Ioanna and Cunningham, Nathan and Holmes, Chris and Payne, Katrina and Vollmer, Sebastian J. and Vallejos, Catalina A. and Aslett, Louis J. M.},
      year = {2024},
      month = oct
    }
  2. AI & Ethics
    Ethical considerations of use of hold-out sets in clinical prediction model management
    Louis Chislett, Louis J. M. Aslett, Alisha R. Davies, Catalina A. Vallejos, and James Liley
    AI and Ethics Sep 2024

    Clinical prediction models are statistical or machine learning models used to quantify the risk of a certain health outcome using patient data. These can then inform potential interventions on patients, causing an effect called performative prediction: predictions inform interventions which influence the outcome they were trying to predict, leading to a potential underestimation of risk in some patients if a model is updated on this data. One suggested resolution to this is the use of hold-out sets, in which a set of patients do not receive model-derived risk scores, such that a model can be safely retrained. We present an overview of clinical and research ethics regarding potential implementation of hold-out sets for clinical prediction models in health settings. We focus on the ethical principles of beneficence, non-maleficence, autonomy and justice. We also discuss informed consent, clinical equipoise, and truth-telling. We present illustrative cases of potential hold-out set implementations and discuss statistical issues arising from different hold-out set sampling methods. We also discuss differences between hold-out sets and randomised controlled trials, in terms of ethics and statistical issues. Finally, we give practical recommendations for researchers interested in the use of hold-out sets for clinical prediction models.

    @article{Chislett_2024,
      abbr = {AI & Ethics},
      bibtex_show = {true},
      selected = {true},
      title = {Ethical considerations of use of hold-out sets in clinical prediction model management},
      issn = {2730-5961},
      url = {http://dx.doi.org/10.1007/s43681-024-00561-z},
      doi = {10.1007/s43681-024-00561-z},
      journal = {AI and Ethics},
      publisher = {Springer Science and Business Media LLC},
      author = {Chislett, Louis and Aslett, Louis J. M. and Davies, Alisha R. and Vallejos, Catalina A. and Liley, James},
      year = {2024},
      month = sep
    }
  3. medRxiv
    Blood-based DNA methylation study of alcohol consumption
    Elena Bernabeu, Aleksandra D Chybowska, Jacob K. Kresovich, Matthew Suderman, Daniel L McCartney, Robert F Hillary, Janie Corley, Maria Del C. Valdés-Hernández, Susana Muñoz Maniega, Mark E. Bastin, Joanna M. Wardlaw, Zongli Xu, Dale P. Sandler, Archie Campbell, Sarah E Harris, Andrew M McIntosh, Jack A. Taylor, Paul Yousefi, Simon R Cox, Kathryn L Evans, Matthew R Robinson, Catalina A Vallejos, and Riccardo E Marioni
    Feb 2024

    Alcohol consumption is an important risk factor for multiple diseases. It is typically assessed via self-report, which is open to measurement error and bias. Instead, molecular data such as blood-based DNA methylation (DNAm) could be used to derive a more objective measure of alcohol consumption by incorporating information from cytosine-phosphate-guanine (CpG) sites known to be linked to the trait. Here, we explore the epigenetic architecture of self-reported weekly units of alcohol consumption in the Generation Scotland study. We first create a blood-based epigenetic score (EpiScore) of alcohol consumption using elastic net penalised linear regression. We explore the effect of pre-filtering for CpG features ahead of elastic net, as well as differential patterns by sex and by units consumed in the last week relative to an average week. The final EpiScore was trained on 16,717 individuals and tested in four external cohorts: the Lothian Birth Cohorts (LBC) of 1921 and 1936, the Sister Study, and the Avon Longitudinal Study of Parents and Children (total N across studies > 10,000). The maximum Pearson correlation between the EpiScore and self-reported alcohol consumption within cohort ranged from 0.41 to 0.53. In LBC1936, higher EpiScore levels had significant associations with poorer global brain imaging metrics, whereas self-reported alcohol consumption did not. Finally, we identified two novel CpG loci via a Bayesian penalized regression epigenome-wide association study (EWAS) of alcohol consumption. Together, these findings show how DNAm can objectively characterize patterns of alcohol consumption that associate with brain health, unlike self-reported estimates.

    @article{Bernabeu_2024,
      abbr = {medRxiv},
      bibtex_show = {true},
      selected = {true},
      title = {Blood-based DNA methylation study of alcohol consumption},
      url = {http://dx.doi.org/10.1101/2024.02.26.24303397},
      doi = {10.1101/2024.02.26.24303397},
      publisher = {Cold Spring Harbor Laboratory},
      author = {Bernabeu, Elena and Chybowska, Aleksandra D and Kresovich, Jacob K. and Suderman, Matthew and McCartney, Daniel L and Hillary, Robert F and Corley, Janie and Valdés-Hernández, Maria Del C. and Muñoz Maniega, Susana and Bastin, Mark E. and Wardlaw, Joanna M. and Xu, Zongli and Sandler, Dale P. and Campbell, Archie and Harris, Sarah E and McIntosh, Andrew M and Taylor, Jack A. and Yousefi, Paul and Cox, Simon R and Evans, Kathryn L and Robinson, Matthew R and Vallejos, Catalina A and Marioni, Riccardo E},
      year = {2024},
      month = feb
    }
  4. medRxiv
    Differential behaviour of a risk score for emergency hospital admission by demographics in Scotland — a retrospective study
    Ioanna Thoma, Simon Rogers, Jill Ireland, Rachel Porteous, Katie Borland, Catalina A. Vallejos, Louis J. M. Aslett, and James Liley
    Feb 2024

    The Scottish Patients at Risk of Re-Admission and Admission (SPARRA) score predicts individual risk of emergency hospital admission for approximately 80% of the Scottish population. It was developed using routinely collected electronic health records, and is used by primary care practitioners to inform anticipatory care, particularly for individuals with high healthcare needs. We comprehensively assess the SPARRA score across population subgroups defined by age, sex, ethnicity, socioeconomic deprivation, and geographic location. For these subgroups, we consider differences in overall performance, score distribution, and false positive and negative rates, using causal methods to identify effects mediated through age, sex, and deprivation. We show that the score is well-calibrated across subgroups, but that rates of false positives and negatives vary widely, mediated by a range of causes. Our work assists practitioners in the application and interpretation of the SPARRA score in population subgroups. Evidence before this study: There is considerable literature on the general topic of differential performance of risk scores across population subgroups and its implications. A shared theme is the importance of identifying and quantifying such differential performance. We performed a MedLine and Google Scholar search with the single term ‘SPARRA’, and consulted colleagues at Public Health Scotland about any previous internal analyses. Several articles assessed the accuracy of SPARRA and discussed its role in the Scottish healthcare system since its introduction in 2006, but none looked in detail at differential performance between specific demographic groups. Added value of this study: We provide a comprehensive assessment of the performance of the SPARRA score across a range of population subgroups in several ways. We systematically examined differences in performance using a range of metrics. We identify notable areas of differential performance associated with age, sex, socioeconomic deprivation, ethnicity and residence location (mainland versus island; urban versus rural). We also examined the pattern of errors in prediction across medical causes of emergency admission, finding that, to variable degrees across groups, cardiac and respiratory admissions are more likely to be correctly predicted from electronic health records. Overall, our work provides an atlas of performance measures for SPARRA and partly explains how between-group performance differences arise. Implications of all the available evidence: The precision by which the SPARRA score can predict emergency hospital admissions differs between population subgroups. These differences are largely driven by variation in performance across age and sex, as well as the predictability of different causes of admission. Awareness of these differences is important when making decisions based on the SPARRA score.

    @article{Thoma_2024,
      abbr = {medRxiv},
      bibtex_show = {true},
      selected = {true},
      title = {Differential behaviour of a risk score for emergency hospital admission by demographics in Scotland — a retrospective study},
      url = {http://dx.doi.org/10.1101/2024.02.13.24302753},
      doi = {10.1101/2024.02.13.24302753},
      publisher = {Cold Spring Harbor Laboratory},
      author = {Thoma, Ioanna and Rogers, Simon and Ireland, Jill and Porteous, Rachel and Borland, Katie and Vallejos, Catalina A. and Aslett, Louis J. M. and Liley, James},
      year = {2024},
      month = feb
    }
  5. Biomet
    A review on statistical and machine learning competing risks methods
    Karla Monterrubio-Gómez, Nathan Constantine-Cooke, and Catalina A. Vallejos
    Biometrical Journal Feb 2024

    When modeling competing risks (CR) survival data, several techniques have been proposed in both the statistical and machine learning literature. State-of-the-art methods have extended classical approaches with more flexible assumptions that can improve predictive performance, allow high-dimensional data and missing values, among others. Despite this, modern approaches have not been widely employed in applied settings. This article aims to aid the uptake of such methods by providing a condensed compendium of CR survival methods with a unified notation and interpretation across approaches. We highlight available software and, when possible, demonstrate their usage via reproducible R vignettes. Moreover, we discuss two major concerns that can affect benchmark studies in this context: the choice of performance metrics and reproducibility.

    @article{https://doi.org/10.1002/bimj.202300060,
      abbr = {Biomet},
      bibtex_show = {true},
      author = {Monterrubio-Gómez, Karla and Constantine-Cooke, Nathan and Vallejos, Catalina A.},
      title = {A review on statistical and machine learning competing risks methods},
      journal = {Biometrical Journal},
      volume = {66},
      number = {2},
      pages = {2300060},
      keywords = {competing risks, risk prediction, survival analysis, time-to-event data},
      doi = {10.1002/bimj.202300060},
      url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/bimj.202300060},
      eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/bimj.202300060},
      year = {2024},
      month = feb,
      selected = {true}
    }

2023

  1. CGH
    Longitudinal Fecal Calprotectin Profiles Characterize Disease Course Heterogeneity in Crohn’s Disease
    Nathan Constantine-Cooke, Karla Monterrubio-Gómez, Nikolas Plevris, Lauranne A.A.P. Derikx, Beatriz Gros, Gareth-Rhys Jones, Riccardo E. Marioni, Charlie W. Lees, and Catalina A. Vallejos
    Clinical Gastroenterology and Hepatology Oct 2023

    Background and Aims: The progressive nature of Crohn’s disease is highly variable and hard to predict. In addition, symptoms correlate poorly with mucosal inflammation. There is therefore an urgent need to better characterize the heterogeneity of disease trajectories in Crohn’s disease by utilizing objective markers of inflammation. We aimed to better understand this heterogeneity by clustering Crohn’s disease patients with similar longitudinal fecal calprotectin profiles. Methods: We performed a retrospective cohort study at the Edinburgh IBD Unit, a tertiary referral center, and used latent class mixed models to cluster Crohn’s disease subjects using fecal calprotectin observed within 5 years of diagnosis. Information criteria, alluvial plots, and cluster trajectories were used to decide the optimal number of clusters. Chi-square test, Fisher’s exact test, and analysis of variance were used to test for associations with variables commonly assessed at diagnosis. Results: Our study cohort comprised 356 patients with newly diagnosed Crohn’s disease and 2856 fecal calprotectin measurements taken within 5 years of diagnosis (median 7 per subject). Four distinct clusters were identified by characteristic calprotectin profiles: a cluster with consistently high fecal calprotectin and 3 clusters characterized by different downward longitudinal trends. Cluster membership was significantly associated with smoking (P = .015), upper gastrointestinal involvement (P < .001), and early biologic therapy (P < .001). Conclusions: Our analysis demonstrates a novel approach to characterizing the heterogeneity of Crohn’s disease by using fecal calprotectin. The group profiles do not simply reflect different treatment regimens and do not mirror classical disease progression endpoints.

    @article{Constantine-Cooke2023,
      abbr = {CGH},
      bibtex_show = {true},
      selected = {false},
      pdf = {Constantine-Cooke2023.pdf},
      title = {Longitudinal Fecal Calprotectin Profiles Characterize Disease Course Heterogeneity in {{Crohn}}'s Disease},
      author = {{Constantine-Cooke}, Nathan and {Monterrubio-G{\'o}mez}, Karla and Plevris, Nikolas and Derikx, Lauranne A.A.P. and Gros, Beatriz and Jones, Gareth-Rhys and Marioni, Riccardo E. and Lees, Charlie W. and Vallejos, Catalina A.},
      journal = {Clinical Gastroenterology and Hepatology},
      volume = {21},
      number = {11},
      pages = {2918-2927.e6},
      publisher = {{Elsevier}},
      issn = {1542-3565},
      doi = {10.1016/j.cgh.2023.03.026},
      urldate = {2023-05-24},
      month = oct,
      year = {2023}
    }
  2. NatAge
    Development and validation of DNA methylation scores in two European cohorts augment 10-year risk prediction of type 2 diabetes
    Yipeng Cheng, Danni A. Gadd, Christian Gieger, Karla Monterrubio-Gómez, Yufei Zhang, Imrich Berta, Michael J. Stam, Natalia Szlachetka, Evgenii Lobzaev, Nicola Wrobel, Lee Murphy, Archie Campbell, Cliff Nangle, Rosie M. Walker, Chloe Fawns-Ritchie, Annette Peters, Wolfgang Rathmann, David J. Porteous, Kathryn L. Evans, Andrew M. McIntosh, Timothy I. Cannings, Melanie Waldenberger, Andrea Ganna, Daniel L. McCartney, Catalina A. Vallejos, and Riccardo E. Marioni
    Nature Aging Apr 2023

    Type 2 diabetes mellitus (T2D) presents a major health and economic burden that could be alleviated with improved early prediction and intervention. While standard risk factors have shown good predictive performance, we show that the use of blood-based DNA methylation information leads to a significant improvement in the prediction of 10-year T2D incidence risk. Previous studies have been largely constrained by linear assumptions, the use of cytosine–guanine pairs one-at-a-time and binary outcomes. We present a flexible approach (via an R package, MethylPipeR) based on a range of linear and tree-ensemble models that incorporate time-to-event data for prediction. Using the Generation Scotland cohort (training set n_cases = 374, n_controls = 9,461; test set n_cases = 252, n_controls = 4,526) our best-performing model (area under the receiver operating characteristic curve (AUC) = 0.872, area under the precision-recall curve (PRAUC) = 0.302) showed notable improvement in 10-year onset prediction beyond standard risk factors (AUC = 0.839, PRAUC = 0.227). Replication was observed in the German-based KORA study (n = 1,451, n_cases = 142, P = 1.6 × 10^-5).

    @article{Cheng_2023,
      doi = {10.1038/s43587-023-00391-4},
      url = {https://doi.org/10.1038/s43587-023-00391-4},
      year = {2023},
      month = apr,
      publisher = {Springer Science and Business Media {LLC}},
      volume = {3},
      number = {4},
      pages = {450--458},
      author = {Cheng, Yipeng and Gadd, Danni A. and Gieger, Christian and Monterrubio-G{\'{o}}mez, Karla and Zhang, Yufei and Berta, Imrich and Stam, Michael J. and Szlachetka, Natalia and Lobzaev, Evgenii and Wrobel, Nicola and Murphy, Lee and Campbell, Archie and Nangle, Cliff and Walker, Rosie M. and Fawns-Ritchie, Chloe and Peters, Annette and Rathmann, Wolfgang and Porteous, David J. and Evans, Kathryn L. and McIntosh, Andrew M. and Cannings, Timothy I. and Waldenberger, Melanie and Ganna, Andrea and McCartney, Daniel L. and Vallejos, Catalina A. and Marioni, Riccardo E.},
      title = {Development and validation of {DNA} methylation scores in two European cohorts augment 10-year risk prediction of type 2 diabetes},
      journal = {Nature Aging},
      abbr = {NatAge},
      pdf = {cheng2021.pdf},
      selected = {false}
    }
  3. JACC
    Improving Risk Stratification for Patients With Type 2 Myocardial Infarction
    Caelan Taggart, Karla Monterrubio-Gómez, Andreas Roos, Jasper Boeddinghaus, Dorien M. Kimenai, Erik Kadesjo, Anda Bularga, Ryan Wereski, Amy Ferry, Matthew Lowry, Atul Anand, Kuan Ken Lee, Dimitrios Doudesis, Ioanna Manolopoulou, Thomas Nestelberger, Luca Koechlin, Pedro Lopez-Ayala, Christian Mueller, Nicholas L. Mills, Catalina A. Vallejos, and Andrew R. Chapman
    Journal of the American College of Cardiology Jan 2023

    Background: Despite poor cardiovascular outcomes, there are no dedicated, validated risk stratification tools to guide investigation or treatment in type 2 myocardial infarction. Objectives: The goal of this study was to derive and validate a risk stratification tool for the prediction of death or future myocardial infarction in patients with type 2 myocardial infarction. Methods: The T2-risk score was developed in a prospective multicenter cohort of consecutive patients with type 2 myocardial infarction. Cox proportional hazards models were constructed for the primary outcome of myocardial infarction or death at 1 year using variables selected a priori based on clinical importance. Discrimination was assessed by area under the receiver-operating characteristic curve (AUC). Calibration was investigated graphically. The tool was validated in a single-center cohort of consecutive patients and in a multicenter cohort study from sites across Europe. Results: There were 1,121, 250, and 253 patients in the derivation, single-center, and multicenter validation cohorts, with the primary outcome occurring in 27% (297 of 1,121), 26% (66 of 250), and 14% (35 of 253) of patients, respectively. The T2-risk score incorporating age, ischemic heart disease, heart failure, diabetes mellitus, myocardial ischemia on electrocardiogram, heart rate, anemia, estimated glomerular filtration rate, and maximal cardiac troponin concentration had good discrimination (AUC: 0.76; 95% CI: 0.73-0.79) for the primary outcome and was well calibrated. Discrimination was similar in the consecutive patient (AUC: 0.83; 95% CI: 0.77-0.88) and multicenter (AUC: 0.74; 95% CI: 0.64-0.83) cohorts. T2-risk provided improved discrimination over the Global Registry of Acute Coronary Events 2.0 risk score in all cohorts. Conclusions: The T2-risk score performed well in different health care settings and could help clinicians to prognosticate, as well as target investigation and preventative therapies more effectively. (High-Sensitivity Troponin in the Evaluation of Patients With Suspected Acute Coronary Syndrome [High-STEACS]; NCT01852123)

    @article{doi:10.1016/j.jacc.2022.10.025,
      author = {Taggart, Caelan and Monterrubio-Gómez, Karla and Roos, Andreas and Boeddinghaus, Jasper and Kimenai, Dorien M. and Kadesjo, Erik and Bularga, Anda and Wereski, Ryan and Ferry, Amy and Lowry, Matthew and Anand, Atul and Lee, Kuan Ken and Doudesis, Dimitrios and Manolopoulou, Ioanna and Nestelberger, Thomas and Koechlin, Luca and Lopez-Ayala, Pedro and Mueller, Christian and Mills, Nicholas L. and Vallejos, Catalina A. and Chapman, Andrew R.},
      title = {Improving Risk Stratification for Patients With Type 2 Myocardial Infarction},
      journal = {Journal of the American College of Cardiology},
      abbr = {JACC},
      volume = {81},
      number = {2},
      pages = {156-168},
      year = {2023},
      month = jan,
      doi = {10.1016/j.jacc.2022.10.025},
      url = {https://www.jacc.org/doi/abs/10.1016/j.jacc.2022.10.025},
      eprint = {https://www.jacc.org/doi/pdf/10.1016/j.jacc.2022.10.025}
    }

2021

  1. GenBio
    scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution
    Chantriolnt-Andreas Kapourani, Ricard Argelaguet, Guido Sanguinetti, and Catalina A Vallejos
    Genome Biology Jan 2021

    High-throughput single-cell measurements of DNA methylomes can quantify methylation heterogeneity and uncover its role in gene regulation. However, technical limitations and sparse coverage can preclude this task. scMET is a hierarchical Bayesian model which overcomes sparsity, sharing information across cells and genomic features to robustly quantify genuine biological heterogeneity. scMET can identify highly variable features that drive epigenetic heterogeneity, and perform differential methylation and variability analyses. We illustrate how scMET facilitates the characterization of epigenetically distinct cell populations and how it enables the formulation of novel hypotheses on the epigenetic regulation of gene expression. scMET is available at https://github.com/andreaskapou/scMET.

    @article{Kapourani2021,
      abbr = {GenBio},
      author = {Kapourani, Chantriolnt-Andreas and Argelaguet, Ricard and Sanguinetti, Guido and Vallejos, Catalina A},
      date = {2021/04/20},
      date-added = {2022-02-26 12:43:46 +0000},
      date-modified = {2022-02-26 12:43:46 +0000},
      doi = {10.1186/s13059-021-02329-8},
      id = {Kapourani2021},
      issn = {1474-760X},
      journal = {Genome Biology},
      number = {1},
      pages = {114},
      title = {scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution},
      url = {https://doi.org/10.1186/s13059-021-02329-8},
      volume = {22},
      year = {2021},
      bdsk-url-1 = {https://doi.org/10.1186/s13059-021-02329-8},
      selected = {false}
    }
  2. arXiv
    Model updating after interventions paradoxically introduces bias
    James Liley, Samuel R Emerson, Bilal A Mateen, Catalina A Vallejos, Louis J M Aslett, and Sebastian J Vollmer
    arXiv Jan 2021

    Machine learning is increasingly being used to generate prediction models for use in a number of real-world settings, from credit risk assessment to clinical decision support. Recent discussions have highlighted potential problems in the updating of a predictive score for a binary outcome when an existing predictive score forms part of the standard workflow, driving interventions. In this setting, the existing score induces an additional causative pathway which leads to miscalibration when the original score is replaced. We propose a general causal framework to describe and address this problem, and demonstrate an equivalent formulation as a partially observed Markov decision process. We use this model to demonstrate the impact of such ‘naive updating’ when performed repeatedly. Namely, we show that successive predictive scores may converge to a point where they predict their own effect, or may eventually tend toward a stable oscillation between two values, and we argue that neither outcome is desirable. Furthermore, we demonstrate that even if model-fitting procedures improve, actual performance may worsen. We complement these findings with a discussion of several potential routes to overcome these issues.

    @article{liley2021,
      title = {Model updating after interventions paradoxically introduces bias},
      pdf = {liley2021.pdf},
      journal = {arXiv},
      author = {Liley, James and Emerson, Samuel R and Mateen, Bilal A and Vallejos, Catalina A and Aslett, Louis J M and Vollmer, Sebastian J},
      year = {2021},
      arxiv = {2010.11530},
      abbr = {arXiv},
      primaryclass = {stat.ML},
      selected = {false}
    }
  3. NatCom
    Single-nucleus RNA-seq2 reveals functional crosstalk between liver zonation and ploidy
    M. L. Richter, I. K. Deligiannis, K. Yin, A. Danese, E. Lleshi, P. Coupland, C. A. Vallejos, K. P. Matchett, N. C. Henderson, M. Colome-Tatche, and C. P. Martinez-Jimenez
    Nature Communications Jan 2021

    Single-cell RNA-seq reveals the role of pathogenic cell populations in development and progression of chronic diseases. In order to expand our knowledge on cellular heterogeneity, we have developed a single-nucleus RNA-seq2 method tailored for the comprehensive analysis of the nuclear transcriptome from frozen tissues, allowing the dissection of all cell types present in the liver, regardless of cell size or cellular fragility. We use this approach to characterize the transcriptional profile of individual hepatocytes with different levels of ploidy, and have discovered that ploidy states are associated with different metabolic potential, and gene expression in tetraploid mononucleated hepatocytes is conditioned by their position within the hepatic lobule. Our work reveals a remarkable crosstalk between gene dosage and spatial distribution of hepatocytes.

    @article{Richter2021,
      abbr = {NatCom},
      author = {Richter, M. L. and Deligiannis, I. K. and Yin, K. and Danese, A. and Lleshi, E. and Coupland, P. and Vallejos, C. A. and Matchett, K. P. and Henderson, N. C. and Colome-Tatche, M. and Martinez-Jimenez, C. P.},
      date = {2021/07/12},
      doi = {10.1038/s41467-021-24543-5},
      id = {Richter2021},
      issn = {2041-1723},
      journal = {Nature Communications},
      number = {1},
      pages = {4264},
      title = {Single-nucleus RNA-seq2 reveals functional crosstalk between liver zonation and ploidy},
      url = {https://doi.org/10.1038/s41467-021-24543-5},
      volume = {12},
      year = {2021}
    }
  4. medRxiv
    Development and assessment of a machine learning tool for predicting emergency admission in Scotland
    James Liley, Gergo Bohner, Samuel R. Emerson, Bilal A. Mateen, Katie Borland, David Carr, Scott Heald, Samuel D. Oduro, Jill Ireland, Keith Moffat, Rachel Porteous, Stephen Riddell, Nathan Cunningham, Chris Holmes, Katrina Payne, Sebastian J. Vollmer, Catalina A. Vallejos, and Louis J. M. Aslett
    medRxiv Jan 2021

    Avoiding emergency hospital admission (EA) is advantageous to individual health and the healthcare system. We develop a statistical model estimating risk of EA for most of the Scottish population (> 4.8M individuals) using electronic health records, such as hospital episodes and prescribing activity. We demonstrate good predictive accuracy (AUROC 0.80), calibration and temporal stability. We find strong prediction of respiratory and metabolic EA, show a substantial risk contribution from socioeconomic decile, and highlight an important problem in model updating. Our work constitutes a rare example of a population-scale machine learning score to be deployed in a healthcare setting.

    Competing Interest Statement: The authors have declared no competing interest.

    Funding Statement: JL, CAV and LJMA were partially supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the "Health" theme within that grant and The Alan Turing Institute; JL, BAM, CAV, LJMA and SJV were partially supported by Health Data Research UK, an initiative funded by UK Research and Innovation, Department of Health and Social Care (England), the devolved administrations, and leading medical research charities; SJV, NC and GB were partially supported by the University of Warwick Impact Fund. SRE is funded by the EPSRC doctoral training partnership (DTP) at Durham University, grant reference EP/R513039/1; LJMA was partially supported by a Health Programme Fellowship at The Alan Turing Institute; CAV was supported by a Chancellor’s Fellowship provided by the University of Edinburgh.

    Author Declarations: This study and the use of NHS data were approved by the Public Benefit and Privacy Panel for Health and Social Care (study number 1718-0370; approval evidenced in application outcome minutes for 2018/19 at https://www.informationgovernance.scot.nhs.uk/pbpphsc/application-outcomes/ ). In addition, accessing data was approved by the Public Health Scotland National Safe Haven, through the electronic Data Research and Innovation Service (eDRIS) and the Public Benefit and Privacy Panel (PBPP) (study number 1718-0370). All studies have been conducted in accordance with information governance standards; data had no patient identifiers available to the researchers. This work was conducted in accordance with UK data governance regulations under PBPP application number eDRIS 1718-0370.

    Raw data for this project are patient-level NHS Scotland health records, and are confidential. Due to the confidential nature of the data used, all analysis took place on remote ‘safe havens’, without access to internet, software updates or unpublished software. Information Governance training was required for all researchers accessing the analysis environment. Moreover, to avoid the risk of accidental disclosure of sensitive information, an independent team carried out statistical disclosure control checks to all data exports, including the outputs presented in this manuscript. All analysis code and co-ordinates required to reproduce our Figures are available at https://github.com/jamesliley/SPARRAv4

    @article{Liley2021.08.06.21261593,
      abbr = {medRxiv},
      author = {Liley, James and Bohner, Gergo and Emerson, Samuel R. and Mateen, Bilal A. and Borland, Katie and Carr, David and Heald, Scott and Oduro, Samuel D. and Ireland, Jill and Moffat, Keith and Porteous, Rachel and Riddell, Stephen and Cunningham, Nathan and Holmes, Chris and Payne, Katrina and Vollmer, Sebastian J. and Vallejos, Catalina A. and Aslett, Louis J. M.},
      title = {Development and assessment of a machine learning tool for predicting emergency admission in Scotland},
      elocation-id = {2021.08.06.21261593},
      year = {2021},
      doi = {10.1101/2021.08.06.21261593},
      publisher = {Cold Spring Harbor Laboratory Press},
      url = {https://www.medrxiv.org/content/early/2021/08/10/2021.08.06.21261593},
      eprint = {https://www.medrxiv.org/content/early/2021/08/10/2021.08.06.21261593.full.pdf},
      journal = {medRxiv}
    }
  5. bioRxiv
    SCRaPL: hierarchical Bayesian modelling of associations in single cell multi-omics data
    Christos Maniatis, Catalina A Vallejos, and Guido Sanguinetti
    bioRxiv May 2021

    Single-cell multi-omics assays offer unprecedented opportunities to explore gene regulation at cellular level. However, high levels of technical noise and data sparsity frequently lead to a lack of statistical power in correlative analyses, identifying very few, if any, significant associations between different molecular layers. Here we propose SCRaPL, a novel computational tool that increases power by carefully modelling noise in the experimental systems. We show on real and simulated multi-omics single-cell data sets that SCRaPL achieves higher sensitivity and better robustness in identifying correlations, while maintaining a similar level of false positives as standard analyses based on Pearson correlation.

    Competing Interest Statement: The authors have declared no competing interest.

    @article{Maniatis2021.05.13.443959,
      abbr = {bioRxiv},
      author = {Maniatis, Christos and Vallejos, Catalina A and Sanguinetti, Guido},
      title = {SCRaPL: hierarchical Bayesian modelling of associations in single cell multi-omics data},
      elocation-id = {2021.05.13.443959},
      year = {2021},
      doi = {10.1101/2021.05.13.443959},
      publisher = {Cold Spring Harbor Laboratory},
      url = {https://www.biorxiv.org/content/early/2021/05/14/2021.05.13.443959},
      eprint = {https://www.biorxiv.org/content/early/2021/05/14/2021.05.13.443959.full.pdf},
      journal = {bioRxiv}
    }

2020

  1. GenBio
    Eleven grand challenges in single-cell data science
    David Lähnemann, Johannes Köster, Ewa Szczurek, Davis J. McCarthy, Stephanie C. Hicks, Mark D. Robinson,  Catalina A. Vallejos, Kieran R. Campbell, Niko Beerenwinkel, Ahmed Mahfouz, Luca Pinello, Pavel Skums, Alexandros Stamatakis, Camille Stephan-Otto Attolini, Samuel Aparicio, Jasmijn Baaijens, Marleen Balvert, Buys de Barbanson, Antonio Cappuccio, Giacomo Corleone, Bas E. Dutilh, Maria Florescu, Victor Guryev, Rens Holmer, Katharina Jahn, Thamar Jessurun Lobo, Emma M. Keizer, Indu Khatri, Szymon M. Kielbasa, Jan O. Korbel, Alexey M. Kozlov, Tzu-Hao Kuo, Boudewijn P. F. Lelieveldt, Ion I. Mandoiu, John C. Marioni, Tobias Marschall, Felix Mölder, Amir Niknejad, Lukasz Raczkowski, Marcel Reinders, Jeroen de Ridder, Antoine-Emmanuel Saliba, Antonios Somarakis, Oliver Stegle, Fabian J. Theis, Huan Yang, Alex Zelikovsky, Alice C. McHardy, Benjamin J. Raphael, Sohrab P. Shah, and Alexander Schönhuth
    Genome Biology Feb 2020

    The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands—or even millions—of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.

    @article{Lahnemann2020,
      abbr = {GenBio},
      author = {L{\"a}hnemann, David and K{\"o}ster, Johannes and Szczurek, Ewa and McCarthy, Davis J. and Hicks, Stephanie C. and Robinson, Mark D. and Vallejos, Catalina A. and Campbell, Kieran R. and Beerenwinkel, Niko and Mahfouz, Ahmed and Pinello, Luca and Skums, Pavel and Stamatakis, Alexandros and Attolini, Camille Stephan-Otto and Aparicio, Samuel and Baaijens, Jasmijn and Balvert, Marleen and Barbanson, Buys de and Cappuccio, Antonio and Corleone, Giacomo and Dutilh, Bas E. and Florescu, Maria and Guryev, Victor and Holmer, Rens and Jahn, Katharina and Lobo, Thamar Jessurun and Keizer, Emma M. and Khatri, Indu and Kielbasa, Szymon M. and Korbel, Jan O. and Kozlov, Alexey M. and Kuo, Tzu-Hao and Lelieveldt, Boudewijn P. F. and Mandoiu, Ion I. and Marioni, John C. and Marschall, Tobias and M{\"o}lder, Felix and Niknejad, Amir and Raczkowski, Lukasz and Reinders, Marcel and Ridder, Jeroen de and Saliba, Antoine-Emmanuel and Somarakis, Antonios and Stegle, Oliver and Theis, Fabian J. and Yang, Huan and Zelikovsky, Alex and McHardy, Alice C. and Raphael, Benjamin J. and Shah, Sohrab P. and Sch{\"o}nhuth, Alexander},
      date = {2020/02/07},
      doi = {10.1186/s13059-020-1926-6},
      id = {L{\"a}hnemann2020},
      issn = {1474-760X},
      journal = {Genome Biology},
      number = {1},
      pages = {31},
      title = {Eleven grand challenges in single-cell data science},
      url = {https://doi.org/10.1186/s13059-020-1926-6},
      volume = {21},
      year = {2020},
      bdsk-url-1 = {https://doi.org/10.1186/s13059-020-1926-6}
    }
  2. Circ
    High-Sensitivity Cardiac Troponin and the Universal Definition of Myocardial Infarction
    Andrew R. Chapman, Philip D. Adamson, Anoop S.V. Shah, Atul Anand, Fiona E. Strachan, Amy V. Ferry, Kuan Ken Lee, Colin Berry, Iain Findlay, Anne Cruikshank, Alan Reid, Alasdair Gray, Paul O. Collinson, Fred Apple, David A. McAllister, Donogh Maguire, Keith A.A. Fox,  Catalina A. Vallejos, Catriona Keerie, Christopher J. Weir, David E. Newby, Nicholas L. Mills, Christopher Tuck, Anda Bularga, Ryan Wereski, Dennis Sandeman, Catherine L. Stables, Athanasios Tsanasis, Lucy Marshall, Stacey D. Stewart, Takeshi Fujisawa, Mischa Hautvast, Jean McPherson, Lynn McKinlay, Simon Walker, Ian Ford, Simon Walker, Shannon Amoils, Jennifer Stevens, John Norrie, Jack Andrews, Phil Adamson, Alastair Moss, Mohamed Anwar, John Hung, Simon Walker, Jonathan Malo, Colin Fischbacher, Bernard Croal, Stephen J. Leslie, Richard Parker, Allan Walker, Ronnie Harkess, Chris Tuck, Tony Wackett, Roma Armstrong, Marion Flood, Laura Stirling, Claire MacDonald, Imran Sadat, Frank Finlay, Heather Charles, Pamela Linksted, Stephen Young, Bill Alexander, and Chris Duncan
    Circulation Jan 2020

    Background: The introduction of more sensitive cardiac troponin assays has led to increased recognition of myocardial injury in acute illnesses other than acute coronary syndrome. The Universal Definition of Myocardial Infarction recommends high-sensitivity cardiac troponin testing and classification of patients with myocardial injury based on pathogenesis, but the clinical implications of implementing this guideline are not well understood. Methods: In a stepped-wedge cluster randomized, controlled trial, we implemented a high-sensitivity cardiac troponin assay and the recommendations of the Universal Definition in 48 282 consecutive patients with suspected acute coronary syndrome. In a prespecified secondary analysis, we compared the primary outcome of myocardial infarction or cardiovascular death and secondary outcome of noncardiovascular death at 1 year across diagnostic categories. Results: Implementation increased the diagnosis of type 1 myocardial infarction by 11% (510/4471), type 2 myocardial infarction by 22% (205/916), and acute and chronic myocardial injury by 36% (443/1233) and 43% (389/898), respectively. Compared with those without myocardial injury, the rate of the primary outcome was highest in those with type 1 myocardial infarction (cause-specific hazard ratio [HR] 5.64 [95% CI, 5.12–6.22]), but was similar across diagnostic categories, whereas noncardiovascular deaths were highest in those with acute myocardial injury (cause specific HR 2.65 [95% CI, 2.33–3.01]). Despite modest increases in antiplatelet therapy and coronary revascularization after implementation in patients with type 1 myocardial infarction, the primary outcome was unchanged (cause specific HR 1.00 [95% CI, 0.82–1.21]). Increased recognition of type 2 myocardial infarction and myocardial injury did not lead to changes in investigation, treatment or outcomes. 
Conclusions: Implementation of high-sensitivity cardiac troponin assays and the recommendations of the Universal Definition of Myocardial Infarction identified patients at high-risk of cardiovascular and noncardiovascular events but was not associated with consistent increases in treatment or improved outcomes. Trials of secondary prevention are urgently required to determine whether this risk is modifiable in patients without type 1 myocardial infarction.

    @article{doi:10.1161/CIRCULATIONAHA.119.042960,
      abbr = {Circ},
      author = {Chapman, Andrew R. and Adamson, Philip D. and Shah, Anoop S.V. and Anand, Atul and Strachan, Fiona E. and Ferry, Amy V. and Lee, Kuan Ken and Berry, Colin and Findlay, Iain and Cruikshank, Anne and Reid, Alan and Gray, Alasdair and Collinson, Paul O. and Apple, Fred and McAllister, David A. and Maguire, Donogh and Fox, Keith A.A. and Vallejos, Catalina A. and Keerie, Catriona and Weir, Christopher J. and Newby, David E. and Mills, Nicholas L. and Tuck, Christopher and Bularga, Anda and Wereski, Ryan and Sandeman, Dennis and Stables, Catherine L. and Tsanasis, Athanasios and Marshall, Lucy and Stewart, Stacey D. and Fujisawa, Takeshi and Hautvast, Mischa and McPherson, Jean and McKinlay, Lynn and Walker, Simon and Ford, Ian and Walker, Simon and Amoils, Shannon and Stevens, Jennifer and Norrie, John and Andrews, Jack and Adamson, Phil and Moss, Alastair and Anwar, Mohamed and Hung, John and Walker, Simon and Malo, Jonathan and Fischbacher, Colin and Croal, Bernard and Leslie, Stephen J. and Parker, Richard and Walker, Allan and Harkess, Ronnie and Tuck, Chris and Wackett, Tony and Armstrong, Roma and Flood, Marion and Stirling, Laura and MacDonald, Claire and Sadat, Imran and Finlay, Frank and Charles, Heather and Linksted, Pamela and Young, Stephen and Alexander, Bill and Duncan, Chris},
      title = {High-Sensitivity Cardiac Troponin and the Universal Definition of Myocardial Infarction},
      journal = {Circulation},
      volume = {141},
      number = {3},
      pages = {161-171},
      year = {2020},
      doi = {10.1161/CIRCULATIONAHA.119.042960},
      url = {https://www.ahajournals.org/doi/abs/10.1161/CIRCULATIONAHA.119.042960},
      eprint = {https://www.ahajournals.org/doi/pdf/10.1161/CIRCULATIONAHA.119.042960}
    }

2018

  1. CellSys
    Correcting the Mean-Variance Dependency for Differential Variability Testing Using Single-Cell RNA Sequencing Data
    Nils Eling, Arianne C Richard, Sylvia Richardson, John C Marioni, and Catalina A Vallejos
    Cell Systems Jan 2018

    Cell-to-cell transcriptional variability in otherwise homogeneous cell populations plays an important role in tissue function and development. Single-cell RNA sequencing can characterize this variability in a transcriptome-wide manner. However, technical variation and the confounding between variability and mean expression estimates hinder meaningful comparison of expression variability between cell populations. To address this problem, we introduce an analysis approach that extends the BASiCS statistical framework to derive a residual measure of variability that is not confounded by mean expression. This includes a robust procedure for quantifying technical noise in experiments where technical spike-in molecules are not available. We illustrate how our method provides biological insight into the dynamics of cell-to-cell expression variability, highlighting a synchronization of biosynthetic machinery components in immune cells upon activation. In contrast to the uniform up-regulation of the biosynthetic machinery, CD4+ T cells show heterogeneous up-regulation of immune-related and lineage-defining genes during activation and differentiation.

    @article{Eling2018,
      abbr = {CellSys},
      title = {Correcting the Mean-Variance Dependency for Differential Variability Testing Using Single-Cell RNA Sequencing Data},
      journal = {Cell Systems},
      volume = {7},
      number = {3},
      pages = {284-294.e12},
      year = {2018},
      issn = {2405-4712},
      doi = {10.1016/j.cels.2018.06.011},
      url = {https://www.sciencedirect.com/science/article/pii/S2405471218302783},
      author = {Eling, Nils and Richard, Arianne C and Richardson, Sylvia and Marioni, John C and Vallejos, Catalina A},
      keywords = {single-cell RNA sequencing, transcriptional noise, variability, immune activation, statistics, Bayesian}
    }
  2. Lancet
    High-sensitivity troponin in the evaluation of patients with suspected acute coronary syndrome: a stepped-wedge, cluster-randomised controlled trial
    Anoop S V Shah, Atul Anand, Fiona E Strachan, Amy V Ferry, Kuan Ken Lee, Andrew R Chapman, Dennis Sandeman, Catherine L Stables, Philip D Adamson, Jack P M Andrews, Mohamed S Anwar, John Hung, Alistair J Moss, Rachel O’Brien, Colin Berry, Iain Findlay, Simon Walker, Anne Cruickshank, Alan Reid, Alasdair Gray, Paul O Collinson, Fred S Apple, David A McAllister, Donogh Maguire, Keith A A Fox, David E Newby, Christopher Tuck, Ronald Harkess, Richard A Parker, Catriona Keerie, Christopher J Weir, Nicholas L Mills, Lucy Marshall, Stacey D Stewart, Takeshi Fujisawa,  Catalina A Vallejos, Athanasios Tsanas, Mischa Hautvast, Jean McPherson, Lynn McKinlay, Jonathan Malo, Colin M Fischbacher, Bernard L Croal, Stephen J Leslie, Allan Walker, Tony Wackett, Roma Armstrong, Laura Stirling, Claire MacDonald, Imran Sadat, Frank Finlay, Heather Charles, Pamela Linksted, Stephen Young, Bill Alexander, and Chris Duncan
    The Lancet Jan 2018

    Background: High-sensitivity cardiac troponin assays permit use of lower thresholds for the diagnosis of myocardial infarction, but whether this improves clinical outcomes is unknown. We aimed to determine whether the introduction of a high-sensitivity cardiac troponin I (hs-cTnI) assay with a sex-specific 99th centile diagnostic threshold would reduce subsequent myocardial infarction or cardiovascular death in patients with suspected acute coronary syndrome. Methods: In this stepped-wedge, cluster-randomised controlled trial across ten secondary or tertiary care hospitals in Scotland, we evaluated the implementation of an hs-cTnI assay in consecutive patients who had been admitted to the hospitals’ emergency departments with suspected acute coronary syndrome. Patients were eligible for inclusion if they presented with suspected acute coronary syndrome and had paired cardiac troponin measurements from the standard care and trial assays. During a validation phase of 6–12 months, results from the hs-cTnI assay were concealed from the attending clinician, and a contemporary cardiac troponin I (cTnI) assay was used to guide care. Hospitals were randomly allocated to early (n=5 hospitals) or late (n=5 hospitals) implementation, in which the high-sensitivity assay and sex-specific 99th centile diagnostic threshold was introduced immediately after the 6-month validation phase or was deferred for a further 6 months. Patients reclassified by the high-sensitivity assay were defined as those with an increased hs-cTnI concentration in whom cTnI concentrations were below the diagnostic threshold on the contemporary assay. The primary outcome was subsequent myocardial infarction or death from cardiovascular causes at 1 year after initial presentation. Outcomes were compared in patients reclassified by the high-sensitivity assay before and after its implementation by use of an adjusted generalised linear mixed model. 
This trial is registered with ClinicalTrials.gov, number NCT01852123. Findings: Between June 10, 2013, and March 3, 2016, we enrolled 48 282 consecutive patients (61 [SD 17] years, 47% women) of whom 10 360 (21%) patients had cTnI concentrations greater than those of the 99th centile of the normal range of values, who were identified by the contemporary assay or the high-sensitivity assay. The high-sensitivity assay reclassified 1771 (17%) of 10 360 patients with myocardial injury or infarction who were not identified by the contemporary assay. In those reclassified, subsequent myocardial infarction or cardiovascular death within 1 year occurred in 105 (15%) of 720 patients in the validation phase and 131 (12%) of 1051 patients in the implementation phase (adjusted odds ratio for implementation vs validation phase 1·10, 95% CI 0·75 to 1·61; p=0·620). Interpretation: Use of a high-sensitivity assay prompted reclassification of 1771 (17%) of 10 360 patients with myocardial injury or infarction, but was not associated with a lower subsequent incidence of myocardial infarction or cardiovascular death at 1 year. Our findings question whether the diagnostic threshold for myocardial infarction should be based on the 99th centile derived from a normal reference population.

    @article{SHAH2018919,
      abbr = {Lancet},
      title = {High-sensitivity troponin in the evaluation of patients with suspected acute coronary syndrome: a stepped-wedge, cluster-randomised controlled trial},
      journal = {The Lancet},
      volume = {392},
      number = {10151},
      pages = {919-928},
      year = {2018},
      issn = {0140-6736},
      doi = {10.1016/S0140-6736(18)31923-8},
      url = {https://www.sciencedirect.com/science/article/pii/S0140673618319238},
      author = {Shah, Anoop S V and Anand, Atul and Strachan, Fiona E and Ferry, Amy V and Lee, Kuan Ken and Chapman, Andrew R and Sandeman, Dennis and Stables, Catherine L and Adamson, Philip D and Andrews, Jack P M and Anwar, Mohamed S and Hung, John and Moss, Alistair J and O'Brien, Rachel and Berry, Colin and Findlay, Iain and Walker, Simon and Cruickshank, Anne and Reid, Alan and Gray, Alasdair and Collinson, Paul O and Apple, Fred S and McAllister, David A and Maguire, Donogh and Fox, Keith A A and Newby, David E and Tuck, Christopher and Harkess, Ronald and Parker, Richard A and Keerie, Catriona and Weir, Christopher J and Mills, Nicholas L and Marshall, Lucy and Stewart, Stacey D and Fujisawa, Takeshi and Vallejos, Catalina A and Tsanas, Athanasios and Hautvast, Mischa and McPherson, Jean and McKinlay, Lynn and Malo, Jonathan and Fischbacher, Colin M and Croal, Bernard L and Leslie, Stephen J and Walker, Allan and Wackett, Tony and Armstrong, Roma and Stirling, Laura and MacDonald, Claire and Sadat, Imran and Finlay, Frank and Charles, Heather and Linksted, Pamela and Young, Stephen and Alexander, Bill and Duncan, Chris}
    }

2017

  1. NatMet
    Normalizing single-cell RNA sequencing data: challenges and opportunities
    Catalina A Vallejos, Davide Risso, Antonio Scialdone, Sandrine Dudoit, and John C Marioni
    Nature Methods Jun 2017

    Single-cell transcriptomics is becoming an important component of the molecular biologist’s toolkit. A critical step when analyzing data generated using this technology is normalization. However, normalization is typically performed using methods developed for bulk RNA sequencing or even microarray data, and the suitability of these methods for single-cell transcriptomics has not been assessed. We here discuss commonly used normalization approaches and illustrate how these can produce misleading results. Finally, we present alternative approaches and provide recommendations for single-cell RNA sequencing users.

    @article{Vallejos2017,
      abbr = {NatMet},
      title = {Normalizing single-cell RNA sequencing data: challenges and opportunities},
      author = {Vallejos, Catalina A and Risso, Davide and Scialdone, Antonio and Dudoit, Sandrine and Marioni, John C},
      date = {2017/06/01},
      date-added = {2022-02-26 11:24:03 +0000},
      date-modified = {2022-02-26 11:24:03 +0000},
      doi = {10.1038/nmeth.4292},
      issn = {1548-7105},
      journal = {Nature Methods},
      number = {6},
      pages = {565--571},
      url = {https://doi.org/10.1038/nmeth.4292},
      volume = {14},
      year = {2017}
    }
  2. EconStat
    Incorporating unobserved heterogeneity in Weibull survival models: A Bayesian approach
    Catalina A. Vallejos and Mark F.J. Steel
    Econometrics and Statistics Jan 2017

    Outlying observations and other forms of unobserved heterogeneity can distort inference for survival datasets. The family of Rate Mixtures of Weibull distributions includes subject-level frailty terms as a solution to this issue. With a parametric mixing distribution assigned to the frailties, this family generates flexible hazard functions. Covariates are introduced via an Accelerated Failure Time specification for which the interpretation of the regression coefficients does not depend on the choice of mixing distribution. A weakly informative prior is proposed by combining the structure of the Jeffreys prior with a proper prior on some model parameters. This improper prior is shown to lead to a proper posterior distribution under easily satisfied conditions. By eliciting the proper component of the prior through the coefficient of variation of the survival times, prior information is matched for different mixing distributions. Posterior inference on subject-level frailty terms is exploited as a tool for outlier detection. Finally, the proposed methodology is illustrated using two real datasets, one concerning bone marrow transplants and another on cerebral palsy.

    @article{VALLEJOS201773,
      abbr = {EconStat},
      title = {Incorporating unobserved heterogeneity in Weibull survival models: A Bayesian approach},
      journal = {Econometrics and Statistics},
      volume = {3},
      pages = {73-88},
      year = {2017},
      issn = {2452-3062},
      doi = {10.1016/j.ecosta.2017.01.005},
      url = {https://www.sciencedirect.com/science/article/pii/S2452306217300072},
      author = {Vallejos, Catalina A. and Steel, Mark F.J.},
      keywords = {Survival analysis, Frailty model, Robust modelling, Outlier detection, Posterior existence}
    }
  3. Science
    Aging increases cell-to-cell transcriptional variability upon immune stimulation
    Celia Pilar Martinez-Jimenez, Nils Eling, Hung-Chang Chen, Catalina A. Vallejos, Aleksandra A. Kolodziejczyk, Frances Connor, Lovorka Stojic, Timothy F. Rayner, Michael J. T. Stubbington, Sarah A. Teichmann, Maike de la Roche, John C. Marioni, and Duncan T. Odom
    Science Jan 2017

    Single-cell sequencing of mouse immune cells reveals how aging destabilizes a conserved transcriptional activation program. How and why the immune system becomes less effective with age are not well understood. Martinez-Jimenez et al. performed single-cell sequencing of CD4+ T cells in old and young mice of two species. In young mice, the gene expression program of early immune activation was tightly regulated and conserved between species. However, as mice aged, the expression of genes involved in pathways responding to immune cell stimulation was not as robust and exhibited increased cell-to-cell variability.

    Aging is characterized by progressive loss of physiological and cellular functions, but the molecular basis of this decline remains unclear. We explored how aging affects transcriptional dynamics using single-cell RNA sequencing of unstimulated and stimulated naïve and effector memory CD4+ T cells from young and old mice from two divergent species. In young animals, immunological activation drives a conserved transcriptomic switch, resulting in tightly controlled gene expression characterized by a strong up-regulation of a core activation program, coupled with a decrease in cell-to-cell variability. Aging perturbed the activation of this core program and increased expression heterogeneity across populations of cells in both species. These discoveries suggest that increased cell-to-cell transcriptional variability will be a hallmark feature of aging across most, if not all, mammalian tissues.

    @article{doi10.1126/science.aah4115,
      abbr = {Science},
      author = {Martinez-Jimenez, Celia Pilar and Eling, Nils and Chen, Hung-Chang and Vallejos, Catalina A. and Kolodziejczyk, Aleksandra A. and Connor, Frances and Stojic, Lovorka and Rayner, Timothy F. and Stubbington, Michael J. T. and Teichmann, Sarah A. and de la Roche, Maike and Marioni, John C. and Odom, Duncan T.},
      title = {Aging increases cell-to-cell transcriptional variability upon immune stimulation},
      journal = {Science},
      volume = {355},
      number = {6332},
      pages = {1433-1436},
      year = {2017},
      doi = {10.1126/science.aah4115},
      url = {https://www.science.org/doi/abs/10.1126/science.aah4115},
      eprint = {https://www.science.org/doi/pdf/10.1126/science.aah4115}
    }

2016

  1. RSS A
    Bayesian survival modelling of university outcomes
    Catalina A. Vallejos and Mark F. J. Steel
    Journal of the Royal Statistical Society: Series A (Statistics in Society) Jul 2016

    Dropouts and delayed graduations are critical issues in higher education systems world wide. A key task in this context is to identify risk factors associated with these events, providing potential targets for mitigating policies. For this, we employ a discrete time competing risks survival model, dealing simultaneously with university outcomes and its associated temporal component. We define survival times as the duration of the student’s enrolment at university and possible outcomes as graduation or two types of dropout (voluntary and involuntary), exploring the information recorded at admission time (e.g. educational level of the parents) as potential predictors. Although similar strategies have been previously implemented, we extend the previous methods by handling covariate selection within a Bayesian variable selection framework, where model uncertainty is formally addressed through Bayesian model averaging. Our methodology is general; however, here we focus on undergraduate students enrolled in three selected degree programmes of the Pontificia Universidad Católica de Chile during the period 2000–2011. Our analysis reveals interesting insights, highlighting the main covariates that influence students’ risk of dropout and delayed graduation.

    @article{Vallejos_2016,
      abbr = {RSS A},
      author = {Vallejos, Catalina A. and Steel, Mark F. J.},
      journal = {Journal of the Royal Statistical Society: Series A (Statistics in Society)},
      title = {Bayesian survival modelling of university outcomes},
      year = {2016},
      month = jul,
      number = {2},
      pages = {613--631},
      volume = {180},
      doi = {10.1111/rssa.12211},
      url = {https://doi.org/10.1111/rssa.12211},
      publisher = {Wiley}
    }
  2. GenBio
    Beyond comparisons of means: understanding changes in gene expression at the single-cell level
    Catalina A. Vallejos, Sylvia Richardson, and John C. Marioni
    Genome Biology Apr 2016

    Traditional differential expression tools are limited to detecting changes in overall expression, and fail to uncover the rich information provided by single-cell level data sets. We present a Bayesian hierarchical model that builds upon BASiCS to study changes that lie beyond comparisons of means, incorporating built-in normalization and quantifying technical artifacts by borrowing information from spike-in genes. Using a probabilistic approach, we highlight genes undergoing changes in cell-to-cell heterogeneity but whose overall expression remains unchanged. Control experiments validate our method’s performance and a case study suggests that novel biological insights can be revealed. Our method is implemented in R and available at https://github.com/catavallejos/BASiCS.

    @article{Vallejos2018,
      abbr = {GenBio},
      author = {Vallejos, Catalina A. and Richardson, Sylvia and Marioni, John C.},
      date = {2016/04/15},
      doi = {10.1186/s13059-016-0930-3},
      id = {Vallejos2016},
      issn = {1474-760X},
      journal = {Genome Biology},
      number = {1},
      pages = {70},
      title = {Beyond comparisons of means: understanding changes in gene expression at the single-cell level},
      url = {https://doi.org/10.1186/s13059-016-0930-3},
      volume = {17},
      year = {2016},
      bdsk-url-1 = {https://doi.org/10.1186/s13059-016-0930-3}
    }

2015

  1. PLOS
    BASiCS: Bayesian Analysis of Single-Cell Sequencing Data
    Catalina A Vallejos, John C Marioni, and Sylvia Richardson
    PLOS Computational Biology Jun 2015

    Single-cell mRNA sequencing can uncover novel cell-to-cell heterogeneity in gene expression levels in seemingly homogeneous populations of cells. However, these experiments are prone to high levels of unexplained technical noise, creating new challenges for identifying genes that show genuine heterogeneous expression within the population of cells under study. BASiCS (Bayesian Analysis of Single-Cell Sequencing data) is an integrated Bayesian hierarchical model where: (i) cell-specific normalisation constants are estimated as part of the model parameters, (ii) technical variability is quantified based on spike-in genes that are artificially introduced to each analysed cell’s lysate and (iii) the total variability of the expression counts is decomposed into technical and biological components. BASiCS also provides an intuitive detection criterion for highly (or lowly) variable genes within the population of cells under study. This is formalised by means of tail posterior probabilities associated to high (or low) biological cell-to-cell variance contributions, quantities that can be easily interpreted by users. We demonstrate our method using gene expression measurements from mouse Embryonic Stem Cells. Cross-validation and meaningful enrichment of gene ontology categories within genes classified as highly (or lowly) variable supports the efficacy of our approach.

    @article{Vallejos2015,
      abbr = {PLOS},
      doi = {10.1371/journal.pcbi.1004333},
      author = {Vallejos, Catalina A. and Marioni, John C. and Richardson, Sylvia},
      journal = {PLOS Computational Biology},
      publisher = {Public Library of Science},
      title = {BASiCS: Bayesian Analysis of Single-Cell Sequencing Data},
      year = {2015},
      month = jun,
      volume = {11},
      url = {https://doi.org/10.1371/journal.pcbi.1004333},
      pages = {1--18},
      selected = {false},
      number = {6}
    }
  2. JASS
    Objective Bayesian Survival Analysis Using Shape Mixtures of Log-Normal Distributions
    Catalina A. Vallejos and Mark F. J. Steel
    Journal of the American Statistical Association Jun 2015

    Survival models such as the Weibull or log-normal lead to inference that is not robust to the presence of outliers. They also assume that all heterogeneity between individuals can be modeled through covariates. This article considers the use of infinite mixtures of lifetime distributions as a solution for these two issues. This can be interpreted as the introduction of a random effect in the survival distribution. We introduce the family of shape mixtures of log-normal distributions, which covers a wide range of density and hazard functions. Bayesian inference under nonsubjective priors based on the Jeffreys’ rule is examined and conditions for posterior propriety are established. The existence of the posterior distribution on the basis of a sample of point observations is not always guaranteed and a solution through set observations is implemented. In addition, we propose a method for outlier detection based on the mixture structure. A simulation study illustrates the performance of our methods under different scenarios and an application to a real dataset is provided. Supplementary materials for the article, which include R code, are available online.

    @article{Vallejos2016,
      abbr = {JASS},
      author = {Vallejos, Catalina A and Steel, Mark FJ},
      title = {Objective Bayesian Survival Analysis Using Shape Mixtures of Log-Normal Distributions},
      journal = {Journal of the American Statistical Association},
      volume = {110},
      number = {510},
      pages = {697--710},
      year = {2015},
      publisher = {Taylor \& Francis},
      doi = {10.1080/01621459.2014.923316},
      url = {https://doi.org/10.1080/01621459.2014.923316},
      eprint = {https://doi.org/10.1080/01621459.2014.923316}
    }