Vallejos Group
Understanding heterogeneity in complex biomedical data
About
While biomedical data sometimes qualifies as “big data” (where the number of samples and/or variables is large), complexity is its most prominent feature. This complexity arises from a combination of different sources of heterogeneity: heterogeneity across individuals in a population (e.g. response to treatment), heterogeneity in the types of data we collect (e.g. health records and genomics), and heterogeneity introduced by the data collection process (e.g. measurement error).
We focus on the development of novel statistical methodology to address and study these sources of heterogeneity. This is a highly multidisciplinary task: from the understanding of complex biomedical problems and technologies, to the development of new methodology and the implementation of open-source analysis tools. Our current research focuses on two areas of application. The first is single-cell RNA-sequencing, a cutting-edge experimental technique that allows genome-wide quantification of gene expression on a cell-by-cell basis. The second is electronic health records research, where we develop predictive models based on observational data that is routinely collected by health providers (e.g. NHS). Developing computational tools that take full advantage of the rich information provided by these data sources ought to improve our understanding of health and disease, and to play an important role in precision medicine initiatives.
News
Date | News |
---|---|
Jan 31, 2024 | Nathan has passed his viva! |
May 26, 2022 | Cata got tenure after a successful ESAT review. |
May 6, 2022 | Alan has passed his viva! |
Oct 1, 2021 | Linda and Elena have joined the group! |
Selected Publications
- A review on statistical and machine learning competing risks methods. Biometrical Journal, Feb 2024
When modeling competing risks (CR) survival data, several techniques have been proposed in both the statistical and machine learning literature. State-of-the-art methods have extended classical approaches with more flexible assumptions that can improve predictive performance, allow high-dimensional data and missing values, among others. Despite this, modern approaches have not been widely employed in applied settings. This article aims to aid the uptake of such methods by providing a condensed compendium of CR survival methods with a unified notation and interpretation across approaches. We highlight available software and, when possible, demonstrate their usage via reproducible R vignettes. Moreover, we discuss two major concerns that can affect benchmark studies in this context: the choice of performance metrics and reproducibility.
- Longitudinal Fecal Calprotectin Profiles Characterize Disease Course Heterogeneity in Crohn’s Disease. Clinical Gastroenterology and Hepatology, Oct 2023
Background and Aims: The progressive nature of Crohn’s disease is highly variable and hard to predict. In addition, symptoms correlate poorly with mucosal inflammation. There is therefore an urgent need to better characterize the heterogeneity of disease trajectories in Crohn’s disease by utilizing objective markers of inflammation. We aimed to better understand this heterogeneity by clustering Crohn’s disease patients with similar longitudinal fecal calprotectin profiles. Methods: We performed a retrospective cohort study at the Edinburgh IBD Unit, a tertiary referral center, and used latent class mixed models to cluster Crohn’s disease subjects using fecal calprotectin observed within 5 years of diagnosis. Information criteria, alluvial plots, and cluster trajectories were used to decide the optimal number of clusters. Chi-square test, Fisher’s exact test, and analysis of variance were used to test for associations with variables commonly assessed at diagnosis. Results: Our study cohort comprised 356 patients with newly diagnosed Crohn’s disease and 2856 fecal calprotectin measurements taken within 5 years of diagnosis (median 7 per subject). Four distinct clusters were identified by characteristic calprotectin profiles: a cluster with consistently high fecal calprotectin and 3 clusters characterized by different downward longitudinal trends. Cluster membership was significantly associated with smoking (P = .015), upper gastrointestinal involvement (P < .001), and early biologic therapy (P < .001). Conclusions: Our analysis demonstrates a novel approach to characterizing the heterogeneity of Crohn’s disease by using fecal calprotectin. The group profiles do not simply reflect different treatment regimens and do not mirror classical disease progression endpoints.
- Development and validation of DNA methylation scores in two European cohorts augment 10-year risk prediction of type 2 diabetes. Nature Aging, Apr 2023
Type 2 diabetes mellitus (T2D) presents a major health and economic burden that could be alleviated with improved early prediction and intervention. While standard risk factors have shown good predictive performance, we show that the use of blood-based DNA methylation information leads to a significant improvement in the prediction of 10-year T2D incidence risk. Previous studies have been largely constrained by linear assumptions, the use of cytosine–guanine pairs one-at-a-time and binary outcomes. We present a flexible approach (via an R package, MethylPipeR) based on a range of linear and tree-ensemble models that incorporate time-to-event data for prediction. Using the Generation Scotland cohort (training set n_cases = 374, n_controls = 9,461; test set n_cases = 252, n_controls = 4,526) our best-performing model (area under the receiver operating characteristic curve (AUC) = 0.872, area under the precision-recall curve (PRAUC) = 0.302) showed notable improvement in 10-year onset prediction beyond standard risk factors (AUC = 0.839, precision-recall AUC = 0.227). Replication was observed in the German-based KORA study (n = 1,451, n_cases = 142, P = 1.6 × 10⁻⁵).
- scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution. Genome Biology, Apr 2021
High-throughput single-cell measurements of DNA methylomes can quantify methylation heterogeneity and uncover its role in gene regulation. However, technical limitations and sparse coverage can preclude this task. scMET is a hierarchical Bayesian model which overcomes sparsity, sharing information across cells and genomic features to robustly quantify genuine biological heterogeneity. scMET can identify highly variable features that drive epigenetic heterogeneity, and perform differential methylation and variability analyses. We illustrate how scMET facilitates the characterization of epigenetically distinct cell populations and how it enables the formulation of novel hypotheses on the epigenetic regulation of gene expression. scMET is available at https://github.com/andreaskapou/scMET.