Vallejos Group

Understanding heterogeneity in complex biomedical data


About

While biomedical data sometimes qualifies as “big data” (where the number of samples and/or variables is large), complexity is its most prominent feature. This arises from a combination of different sources of heterogeneity: heterogeneity across individuals in a population (e.g. response to treatment), heterogeneity in the type of data we collect (e.g. health records and genomics) and heterogeneity that is introduced by the data collection process (e.g. measurement error).

We focus on the development of novel statistical methodology to address and study these sources of heterogeneity. This is a highly multidisciplinary task: from the understanding of complex biomedical problems and technologies, to the development of new methodology and the implementation of open-source analysis tools. Our current research focuses on two areas of application. The first is single-cell RNA-sequencing, a cutting-edge experimental technique that allows genome-wide quantification of gene expression on a cell-by-cell basis. The second is electronic health records research, where we develop predictive models based on observational data that is routinely collected by health providers (e.g. NHS). Developing computational tools that take full advantage of the rich information provided by these data sources ought to improve our understanding of health and disease, playing an important role in precision medicine initiatives.


Selected Publications

  1. NPJ DM
    Development and assessment of a machine learning tool for predicting emergency admission in Scotland
    James Liley, Gergo Bohner, Samuel R. Emerson, Bilal A. Mateen, Katie Borland, David Carr, Scott Heald, Samuel D. Oduro, Jill Ireland, Keith Moffat, Rachel Porteous, Stephen Riddell, Simon Rogers, Ioanna Thoma, Nathan Cunningham, Chris Holmes, Katrina Payne, Sebastian J. Vollmer, Catalina A. Vallejos, and Louis J. M. Aslett
    npj Digital Medicine Oct 2024

    Emergency admissions (EA), where a patient requires urgent in-hospital care, are a major challenge for healthcare systems. The development of risk prediction models can partly alleviate this problem by supporting primary care interventions and public health planning. Here, we introduce SPARRAv4, a predictive score for EA risk that will be deployed nationwide in Scotland. SPARRAv4 was derived using supervised and unsupervised machine-learning methods applied to routinely collected electronic health records from approximately 4.8M Scottish residents (2013-18). We demonstrate improvements in discrimination and calibration with respect to previous scores deployed in Scotland, as well as stability over a 3-year timeframe. Our analysis also provides insights about the epidemiology of EA risk in Scotland, by studying predictive performance across different population sub-groups and reasons for admission, as well as by quantifying the effect of individual input features. Finally, we discuss broader challenges including reproducibility and how to safely update risk prediction models that are already deployed at population level.

    @article{Liley_2024,
      abbr = {NPJ DM},
      bibtex_show = {true},
      selected = {true},
      title = {Development and assessment of a machine learning tool for predicting emergency admission in Scotland},
      volume = {7},
      issn = {2398-6352},
      url = {http://dx.doi.org/10.1038/s41746-024-01250-1},
      doi = {10.1038/s41746-024-01250-1},
      number = {1},
      journal = {npj Digital Medicine},
      publisher = {Springer Science and Business Media LLC},
      author = {Liley, James and Bohner, Gergo and Emerson, Samuel R. and Mateen, Bilal A. and Borland, Katie and Carr, David and Heald, Scott and Oduro, Samuel D. and Ireland, Jill and Moffat, Keith and Porteous, Rachel and Riddell, Stephen and Rogers, Simon and Thoma, Ioanna and Cunningham, Nathan and Holmes, Chris and Payne, Katrina and Vollmer, Sebastian J. and Vallejos, Catalina A. and Aslett, Louis J. M.},
      year = {2024},
      month = oct
    }
  2. AI & Ethics
    Ethical considerations of use of hold-out sets in clinical prediction model management
    Louis Chislett, Louis J. M. Aslett, Alisha R. Davies, Catalina A. Vallejos, and James Liley
    AI and Ethics Sep 2024

    Clinical prediction models are statistical or machine learning models used to quantify the risk of a certain health outcome using patient data. These can then inform potential interventions on patients, causing an effect called performative prediction: predictions inform interventions which influence the outcome they were trying to predict, leading to a potential underestimation of risk in some patients if a model is updated on this data. One suggested resolution to this is the use of hold-out sets, in which a set of patients do not receive model derived risk scores, such that a model can be safely retrained. We present an overview of clinical and research ethics regarding potential implementation of hold-out sets for clinical prediction models in health settings. We focus on the ethical principles of beneficence, non-maleficence, autonomy and justice. We also discuss informed consent, clinical equipoise, and truth-telling. We present illustrative cases of potential hold-out set implementations and discuss statistical issues arising from different hold-out set sampling methods. We also discuss differences between hold-out sets and randomised control trials, in terms of ethics and statistical issues. Finally, we give practical recommendations for researchers interested in the use of hold-out sets for clinical prediction models.

    @article{Chislett_2024,
      abbr = {AI & Ethics},
      bibtex_show = {true},
      selected = {true},
      title = {Ethical considerations of use of hold-out sets in clinical prediction model management},
      issn = {2730-5961},
      url = {http://dx.doi.org/10.1007/s43681-024-00561-z},
      doi = {10.1007/s43681-024-00561-z},
      journal = {AI and Ethics},
      publisher = {Springer Science and Business Media LLC},
      author = {Chislett, Louis and Aslett, Louis J. M. and Davies, Alisha R. and Vallejos, Catalina A. and Liley, James},
      year = {2024},
      month = sep
    }
  3. medRxiv
    Blood-based DNA methylation study of alcohol consumption
    Elena Bernabeu, Aleksandra D Chybowska, Jacob K. Kresovich, Matthew Suderman, Daniel L McCartney, Robert F Hillary, Janie Corley, Maria Del C. Valdés-Hernández, Susana Muñoz Maniega, Mark E. Bastin, Joanna M. Wardlaw, Zongli Xu, Dale P. Sandler, Archie Campbell, Sarah E Harris, Andrew M McIntosh, Jack A. Taylor, Paul Yousefi, Simon R Cox, Kathryn L Evans, Matthew R Robinson, Catalina A Vallejos, and Riccardo E Marioni
    Feb 2024

    Alcohol consumption is an important risk factor for multiple diseases. It is typically assessed via self-report, which is open to measurement error and bias. Instead, molecular data such as blood-based DNA methylation (DNAm) could be used to derive a more objective measure of alcohol consumption by incorporating information from cytosine-phosphate-guanine (CpG) sites known to be linked to the trait. Here, we explore the epigenetic architecture of self-reported weekly units of alcohol consumption in the Generation Scotland study. We first create a blood-based epigenetic score (EpiScore) of alcohol consumption using elastic net penalised linear regression. We explore the effect of pre-filtering for CpG features ahead of elastic net, as well as differential patterns by sex and by units consumed in the last week relative to an average week. The final EpiScore was trained on 16,717 individuals and tested in four external cohorts: the Lothian Birth Cohorts (LBC) of 1921 and 1936, the Sister Study, and the Avon Longitudinal Study of Parents and Children (total N across studies > 10,000). The maximum Pearson correlation between the EpiScore and self-reported alcohol consumption within cohort ranged from 0.41 to 0.53. In LBC1936, higher EpiScore levels had significant associations with poorer global brain imaging metrics, whereas self-reported alcohol consumption did not. Finally, we identified two novel CpG loci via a Bayesian penalized regression epigenome-wide association study (EWAS) of alcohol consumption. Together, these findings show how DNAm can objectively characterize patterns of alcohol consumption that associate with brain health, unlike self-reported estimates.

    @article{Bernabeu_2024,
      abbr = {medRxiv},
      bibtex_show = {true},
      selected = {true},
      title = {Blood-based DNA methylation study of alcohol consumption},
      url = {http://dx.doi.org/10.1101/2024.02.26.24303397},
      doi = {10.1101/2024.02.26.24303397},
      publisher = {Cold Spring Harbor Laboratory},
      author = {Bernabeu, Elena and Chybowska, Aleksandra D and Kresovich, Jacob K. and Suderman, Matthew and McCartney, Daniel L and Hillary, Robert F and Corley, Janie and Valdés-Hernández, Maria Del C. and Muñoz Maniega, Susana and Bastin, Mark E. and Wardlaw, Joanna M. and Xu, Zongli and Sandler, Dale P. and Campbell, Archie and Harris, Sarah E and McIntosh, Andrew M and Taylor, Jack A. and Yousefi, Paul and Cox, Simon R and Evans, Kathryn L and Robinson, Matthew R and Vallejos, Catalina A and Marioni, Riccardo E},
      year = {2024},
      month = feb
    }
  4. medRxiv
    Differential behaviour of a risk score for emergency hospital admission by demographics in Scotland — a retrospective study
    Ioanna Thoma, Simon Rogers, Jill Ireland, Rachel Porteous, Katie Borland, Catalina A. Vallejos, Louis J. M. Aslett, and James Liley
    Feb 2024

    The Scottish Patients at Risk of Re-Admission and Admission (SPARRA) score predicts individual risk of emergency hospital admission for approximately 80% of the Scottish population. It was developed using routinely collected electronic health records, and is used by primary care practitioners to inform anticipatory care, particularly for individuals with high healthcare needs. We comprehensively assess the SPARRA score across population subgroups defined by age, sex, ethnicity, socioeconomic deprivation, and geographic location. For these subgroups, we consider differences in overall performance, score distribution, and false positive and negative rates, using causal methods to identify effects mediated through age, sex, and deprivation. We show that the score is well-calibrated across subgroups, but that rates of false positives and negatives vary widely, mediated by a range of causes. Our work assists practitioners in the application and interpretation of the SPARRA score in population subgroups.

    Evidence before this study: There is considerable literature on the general topic of differential performance of risk scores across population subgroups and its implications. A shared theme is the importance of identifying and quantifying such differential performance. We performed a MedLine and Google Scholar search with the single term ‘SPARRA’, and consulted colleagues at Public Health Scotland about any previous internal analyses. Several articles assessed the accuracy of SPARRA and discussed its role in the Scottish healthcare system since its introduction in 2006, but none looked in detail at differential performance between specific demographic groups.

    Added value of this study: We provide a comprehensive assessment of the performance of the SPARRA score across a range of population subgroups in several ways. We systematically examined differences in performance using a range of metrics. We identify notable areas of differential performance associated with age, sex, socioeconomic deprivation, ethnicity and residence location (mainland versus island; urban versus rural). We also examined the pattern of errors in prediction across medical causes of emergency admission, finding that, to variable degrees across groups, cardiac and respiratory admissions are more likely to be correctly predicted from electronic health records. Overall, our work provides an atlas of performance measures for SPARRA and partly explains how between-group performance differences arise.

    Implications of all the available evidence: The precision by which the SPARRA score can predict emergency hospital admissions differs between population subgroups. These differences are largely driven by variation in performance across age and sex, as well as the predictability of different causes of admission. Awareness of these differences is important when making decisions based on the SPARRA score.

    @article{Thoma_2024,
      abbr = {medRxiv},
      bibtex_show = {true},
      selected = {true},
      title = {Differential behaviour of a risk score for emergency hospital admission by demographics in Scotland — a retrospective study},
      url = {http://dx.doi.org/10.1101/2024.02.13.24302753},
      doi = {10.1101/2024.02.13.24302753},
      publisher = {Cold Spring Harbor Laboratory},
      author = {Thoma, Ioanna and Rogers, Simon and Ireland, Jill and Porteous, Rachel and Borland, Katie and Vallejos, Catalina A. and Aslett, Louis J. M. and Liley, James},
      year = {2024},
      month = feb
    }
  5. Biomet
    A review on statistical and machine learning competing risks methods
    Karla Monterrubio-Gómez, Nathan Constantine-Cooke, and Catalina A. Vallejos
    Biometrical Journal Feb 2024

    When modeling competing risks (CR) survival data, several techniques have been proposed in both the statistical and machine learning literature. State-of-the-art methods have extended classical approaches with more flexible assumptions that can improve predictive performance, allow high-dimensional data and missing values, among others. Despite this, modern approaches have not been widely employed in applied settings. This article aims to aid the uptake of such methods by providing a condensed compendium of CR survival methods with a unified notation and interpretation across approaches. We highlight available software and, when possible, demonstrate their usage via reproducible R vignettes. Moreover, we discuss two major concerns that can affect benchmark studies in this context: the choice of performance metrics and reproducibility.

    @article{https://doi.org/10.1002/bimj.202300060,
      abbr = {Biomet},
      bibtex_show = {true},
      author = {Monterrubio-Gómez, Karla and Constantine-Cooke, Nathan and Vallejos, Catalina A.},
      title = {A review on statistical and machine learning competing risks methods},
      journal = {Biometrical Journal},
      volume = {66},
      number = {2},
      pages = {2300060},
      keywords = {competing risks, risk prediction, survival analysis, time-to-event data},
      doi = {10.1002/bimj.202300060},
      url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/bimj.202300060},
      eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/bimj.202300060},
      year = {2024},
      month = feb,
      selected = {true}
    }
© Copyright 2024