While biomedical data sometimes qualifies as “big data” (where the number of samples and/or variables is large), complexity is its most prominent feature. This complexity arises from a combination of different sources of heterogeneity: heterogeneity across individuals in a population (e.g. response to treatment), heterogeneity in the types of data we collect (e.g. health records and genomics), and heterogeneity introduced by the data collection process (e.g. measurement error).
We focus on the development of novel statistical methodology to address and study these sources of heterogeneity. This is a highly multidisciplinary task, ranging from the understanding of complex biomedical problems and technologies to the development of new methodology and the implementation of open-source analysis tools. Our current research focuses on two areas of application. The first is single-cell RNA-sequencing, a cutting-edge experimental technique that allows genome-wide quantification of gene expression on a cell-by-cell basis. The second is electronic health records research, where we develop predictive models based on observational data that is routinely collected by health providers (e.g. the NHS). Developing computational tools that can take full advantage of the rich information provided by these data sources should improve our understanding of health and disease and play an important role in precision medicine initiatives.
- scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution. Genome Biology, 2021
High-throughput single-cell measurements of DNA methylomes can quantify methylation heterogeneity and uncover its role in gene regulation. However, technical limitations and sparse coverage can preclude this task. scMET is a hierarchical Bayesian model which overcomes sparsity, sharing information across cells and genomic features to robustly quantify genuine biological heterogeneity. scMET can identify highly variable features that drive epigenetic heterogeneity, and perform differential methylation and variability analyses. We illustrate how scMET facilitates the characterization of epigenetically distinct cell populations and how it enables the formulation of novel hypotheses on the epigenetic regulation of gene expression. scMET is available at https://github.com/andreaskapou/scMET.
- DNA Methylation scores augment 10-year risk prediction of diabetes. medRxiv, 2021
Type 2 diabetes mellitus (T2D) is one of the most prevalent diseases in the world and presents a major health and economic burden, a notable proportion of which could be alleviated with improved early prediction and intervention. While standard risk factors including age, obesity, and hypertension have shown good predictive performance, we show that the use of CpG DNA methylation information leads to a significant improvement in the prediction of 10-year T2D incidence risk. Whilst previous studies have been largely constrained by linear assumptions and the use of CpGs one at a time, we have adopted a more flexible approach based on a range of linear and tree-ensemble models for classification and time-to-event prediction. Using the Generation Scotland cohort (n=9,537), our best performing model (Area Under the Curve (AUC)=0.880, Precision Recall AUC (PRAUC)=0.539, McFadden’s R²=0.316) used a LASSO Cox proportional-hazards predictor and showed notable improvement in onset prediction, above and beyond standard risk factors (AUC=0.860, PRAUC=0.444, R²=0.261). The main finding was replicated in an external test dataset (the German-based KORA study, p=3.7×10⁻⁴). Tree-ensemble methods provided comparable performance, and future improvements to these models are discussed. Finally, we introduce MethylPipeR, an R package with accompanying user interface, for systematic and reproducible development of complex trait and incident disease predictors. While MethylPipeR was applied to incident T2D prediction with DNA methylation in our experiments, the package is designed for generalised development of predictive models and is applicable to a wide range of omics data and target traits.
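As a rough illustration of two of the metrics quoted above, the following minimal sketch computes the AUC and McFadden’s R² for a binary outcome. It is not the MethylPipeR implementation: the study fits time-to-event models, whereas this sketch assumes a plain binary outcome, and the outcome vector and both sets of predicted risks (`y`, `p_base`, `p_full`) are hypothetical toy values chosen only to show how the two quantities are calculated and compared.

```python
import math


def roc_auc(y_true, y_score):
    """AUC as the probability that a random case outscores a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def log_likelihood(y_true, probs):
    """Bernoulli log-likelihood of the outcomes under the predicted risks."""
    return sum(math.log(p if y == 1 else 1.0 - p) for y, p in zip(y_true, probs))


def mcfadden_r2(y_true, probs):
    """McFadden's pseudo-R2: 1 - LL(model) / LL(null).

    The null model predicts the observed prevalence for everyone.
    """
    prevalence = sum(y_true) / len(y_true)
    ll_model = log_likelihood(y_true, probs)
    ll_null = log_likelihood(y_true, [prevalence] * len(y_true))
    return 1.0 - ll_model / ll_null


# Hypothetical toy data (not from the study): 1 = developed the disease.
y = [0, 0, 0, 1, 0, 1, 1, 0]
p_base = [0.1, 0.3, 0.2, 0.4, 0.5, 0.6, 0.3, 0.2]  # risk factors only
p_full = [0.1, 0.2, 0.1, 0.6, 0.5, 0.8, 0.5, 0.1]  # risk factors + extra score

for name, p in [("base", p_base), ("full", p_full)]:
    print(name, round(roc_auc(y, p), 3), round(mcfadden_r2(y, p), 3))
```

Both metrics compare a model against a reference: AUC against random ranking (0.5), and McFadden’s R² against a prevalence-only model (0).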
- Model updating after interventions paradoxically introduces bias. arXiv, 2021
Machine learning is increasingly being used to generate prediction models for use in a number of real-world settings, from credit risk assessment to clinical decision support. Recent discussions have highlighted potential problems in the updating of a predictive score for a binary outcome when an existing predictive score forms part of the standard workflow, driving interventions. In this setting, the existing score induces an additional causative pathway which leads to miscalibration when the original score is replaced. We propose a general causal framework to describe and address this problem, and demonstrate an equivalent formulation as a partially observed Markov decision process. We use this model to demonstrate the impact of such ‘naive updating’ when performed repeatedly. Namely, we show that successive predictive scores may converge to a point where they predict their own effect, or may eventually tend toward a stable oscillation between two values, and we argue that neither outcome is desirable. Furthermore, we demonstrate that even if model-fitting procedures improve, actual performance may worsen. We complement these findings with a discussion of several potential routes to overcome these issues.
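The two long-run behaviours described above can be reproduced in a deliberately simplified toy model. The sketch below is not the causal framework or the Markov decision process from the paper. It assumes a single individual with a fixed untreated risk, an intervention whose strength depends only on the current score, and a naive update that adopts the risk observed under the old score as the new score. The functions `graded` and `threshold` and all numeric values are hypothetical.

```python
def naive_update(score, baseline_risk, response):
    """Risk observed while `score` drives the intervention.

    `response(score)` is the fraction of risk removed by the intervention.
    Naive updating refits on this observed risk and uses it as the next score.
    """
    return baseline_risk * (1.0 - response(score))


def iterate(baseline_risk, response, n_rounds=50):
    """Apply naive updating repeatedly, starting from the untreated risk."""
    score = baseline_risk
    history = [score]
    for _ in range(n_rounds):
        score = naive_update(score, baseline_risk, response)
        history.append(score)
    return history


def graded(score):
    """Assumed response: intervention strength proportional to the score."""
    return 0.5 * score


def threshold(score):
    """Assumed response: intervene (halving risk) only above a 0.3 cut-off."""
    return 0.5 if score > 0.3 else 0.0


converging = iterate(0.4, graded)      # settles at a self-consistent score
oscillating = iterate(0.4, threshold)  # alternates between two values

print(converging[-3:])
print(oscillating[-4:])
```

Under the graded response, the score settles at the value that equals the risk it induces, that is, it predicts its own effect rather than the untreated risk of 0.4. Under the threshold response, a high score triggers the intervention, which lowers the observed risk, which lowers the next score below the cut-off, which withdraws the intervention, and so on.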
- Development and assessment of a machine learning tool for predicting emergency admission in Scotland. medRxiv, 2021
Avoiding emergency hospital admission (EA) is advantageous to individual health and the healthcare system. We develop a statistical model estimating risk of EA for most of the Scottish population (> 4.8M individuals) using electronic health records, such as hospital episodes and prescribing activity. We demonstrate good predictive accuracy (AUROC 0.80), calibration and temporal stability. We find strong prediction of respiratory and metabolic EA, show a substantial risk contribution from socioeconomic decile, and highlight an important problem in model updating. Our work constitutes a rare example of a population-scale machine learning score to be deployed in a healthcare setting. All analysis code and co-ordinates required to reproduce our figures are available at https://github.com/jamesliley/SPARRAv4.
- Latent Crohn’s Disease Subgroups are Identified by Longitudinal Faecal Calprotectin Profiles. medRxiv, Aug 2022
Background: High faecal calprotectin is associated with poor outcomes in Crohn’s disease. Monitoring of faecal calprotectin trajectories could characterise disease progression before severe complications occur. Aims: We undertook an unbiased assessment of a retrospective incident Crohn’s disease cohort to assess for inter-individual variability in faecal calprotectin levels over time. We aimed to explore whether latent classes of such profiles are associated with a composite endpoint consisting of surgery, hospitalisation, or Montreal behaviour progression and other clinical information. Methods: Latent class mixed models were used to model faecal calprotectin trajectories within five years of diagnosis. Akaike information criterion, Bayesian information criterion, alluvial plots, and class-specific trajectories were used to decide the optimal number of classes. Log-rank tests of Kaplan-Meier estimators were used to test for associations between class membership and outcomes. Results: Our study cohort comprised 365 subjects and 2856 faecal calprotectin measurements (median 7 per subject). Four latent classes were found and broadly described as a class with consistently high faecal calprotectin and three classes characterised by downward trends for calprotectin. Class membership was significantly associated with the composite endpoint, and separately, hospitalisation and Montreal disease progression, but not surgery. Early biologic therapy was strongly associated with class membership. Conclusions: Our analysis provides a novel stratification approach for Crohn’s disease patients based on faecal calprotectin trajectories. Characterising this heterogeneity helps to better understand different patterns of disease progression and to identify those with a higher risk of worse outcomes. Ultimately, this information will assist the design of more targeted interventions.
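The Kaplan-Meier estimator mentioned in the methods can be sketched in a few lines. This is a generic product-limit estimator and not the study’s analysis code. The follow-up times, event indicators, and class labels below are hypothetical, and the log-rank test used to compare classes is not shown.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the event-free probability.

    times  : follow-up time for each subject
    events : 1 if the endpoint was observed at that time, 0 if censored
    Returns a list of (time, survival) pairs, one per distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after t.
        at_risk -= n_leaving
    return curve


# Hypothetical follow-up (years) for two illustrative latent classes.
class_a = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
class_b = kaplan_meier([2, 3, 4, 5], [0, 1, 0, 0])

print(class_a)
print(class_b)
```

Censored subjects contribute to the risk set up to their last follow-up but never lower the curve themselves, which is why the estimator suits retrospective cohorts with unequal follow-up.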