Vallejos Group

Understanding heterogeneity in complex biomedical data


About

While biomedical data sometimes qualifies as “big data” (where the number of samples and/or variables is large), complexity is its most prominent feature. This arises from a combination of different sources of heterogeneity: heterogeneity across individuals in a population (e.g. response to treatment), heterogeneity in terms of the type of data we collect (e.g. health records & genomics) and heterogeneity that is introduced by the data collection process (e.g. measurement error).

We focus on the development of novel statistical methodology to address and study these sources of heterogeneity. This is a highly multidisciplinary task: from the understanding of complex biomedical problems and technologies, to the development of new methodology and the implementation of open-source analysis tools. Our current research focuses on two areas of application. The first is single-cell RNA sequencing, a cutting-edge experimental technique that allows genome-wide quantification of gene expression on a cell-by-cell basis. The second is electronic health records research, where we develop predictive models based on observational data that is routinely collected by health providers (e.g. the NHS). Developing computational tools that can take full advantage of the rich information provided by these data sources ought to improve our understanding of health and disease and play an important role in precision medicine initiatives.


Selected Publications

  1. Biomet
    A review on statistical and machine learning competing risks methods
    Karla Monterrubio-Gómez, Nathan Constantine-Cooke, and Catalina A. Vallejos
    Biometrical Journal Feb 2024

    When modeling competing risks (CR) survival data, several techniques have been proposed in both the statistical and machine learning literature. State-of-the-art methods have extended classical approaches with more flexible assumptions that can improve predictive performance, allow high-dimensional data and missing values, among others. Despite this, modern approaches have not been widely employed in applied settings. This article aims to aid the uptake of such methods by providing a condensed compendium of CR survival methods with a unified notation and interpretation across approaches. We highlight available software and, when possible, demonstrate their usage via reproducible R vignettes. Moreover, we discuss two major concerns that can affect benchmark studies in this context: the choice of performance metrics and reproducibility.

    @article{https://doi.org/10.1002/bimj.202300060,
      abbr = {Biomet},
      author = {Monterrubio-Gómez, Karla and Constantine-Cooke, Nathan and Vallejos, Catalina A.},
      title = {A review on statistical and machine learning competing risks methods},
      journal = {Biometrical Journal},
      volume = {66},
      number = {2},
      pages = {2300060},
      keywords = {competing risks, risk prediction, survival analysis, time-to-event data},
      doi = {10.1002/bimj.202300060},
      url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/bimj.202300060},
      eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/bimj.202300060},
      year = {2024},
      month = feb,
      selected = {true}
    }
  2. CGH
    Longitudinal Fecal Calprotectin Profiles Characterize Disease Course Heterogeneity in Crohn’s Disease
    Nathan Constantine-Cooke, Karla Monterrubio-Gómez, Nikolas Plevris, Lauranne A.A.P. Derikx, Beatriz Gros, Gareth-Rhys Jones, Riccardo E. Marioni, Charlie W. Lees, and Catalina A. Vallejos
    Clinical Gastroenterology and Hepatology Oct 2023

    Background and Aims: The progressive nature of Crohn’s disease is highly variable and hard to predict. In addition, symptoms correlate poorly with mucosal inflammation. There is therefore an urgent need to better characterize the heterogeneity of disease trajectories in Crohn’s disease by utilizing objective markers of inflammation. We aimed to better understand this heterogeneity by clustering Crohn’s disease patients with similar longitudinal fecal calprotectin profiles. Methods: We performed a retrospective cohort study at the Edinburgh IBD Unit, a tertiary referral center, and used latent class mixed models to cluster Crohn’s disease subjects using fecal calprotectin observed within 5 years of diagnosis. Information criteria, alluvial plots, and cluster trajectories were used to decide the optimal number of clusters. Chi-square test, Fisher’s exact test, and analysis of variance were used to test for associations with variables commonly assessed at diagnosis. Results: Our study cohort comprised 356 patients with newly diagnosed Crohn’s disease and 2856 fecal calprotectin measurements taken within 5 years of diagnosis (median 7 per subject). Four distinct clusters were identified by characteristic calprotectin profiles: a cluster with consistently high fecal calprotectin and 3 clusters characterized by different downward longitudinal trends. Cluster membership was significantly associated with smoking (P = .015), upper gastrointestinal involvement (P < .001), and early biologic therapy (P < .001). Conclusions: Our analysis demonstrates a novel approach to characterizing the heterogeneity of Crohn’s disease by using fecal calprotectin. The group profiles do not simply reflect different treatment regimens and do not mirror classical disease progression endpoints.

    @article{Constantine-Cooke2023,
      abbr = {CGH},
      bibtex_show = {true},
      selected = {true},
      pdf = {Constantine-Cooke2023.pdf},
      title = {Longitudinal Fecal Calprotectin Profiles Characterize Disease Course Heterogeneity in {{Crohn}}'s Disease},
      author = {{Constantine-Cooke}, Nathan and {Monterrubio-G{\'o}mez}, Karla and Plevris, Nikolas and Derikx, Lauranne A.A.P. and Gros, Beatriz and Jones, Gareth-Rhys and Marioni, Riccardo E. and Lees, Charlie W. and Vallejos, Catalina A.},
      journal = {Clinical Gastroenterology and Hepatology},
      volume = {21},
      number = {11},
      pages = {2918-2927.e6},
      publisher = {{Elsevier}},
      issn = {1542-3565},
      doi = {10.1016/j.cgh.2023.03.026},
      urldate = {2023-05-24},
      month = oct,
      year = {2023}
    }
  3. NatAge
    Development and validation of DNA methylation scores in two European cohorts augment 10-year risk prediction of type 2 diabetes
    Yipeng Cheng, Danni A. Gadd, Christian Gieger, Karla Monterrubio-Gómez, Yufei Zhang, Imrich Berta, Michael J. Stam, Natalia Szlachetka, Evgenii Lobzaev, Nicola Wrobel, Lee Murphy, Archie Campbell, Cliff Nangle, Rosie M. Walker, Chloe Fawns-Ritchie, Annette Peters, Wolfgang Rathmann, David J. Porteous, Kathryn L. Evans, Andrew M. McIntosh, Timothy I. Cannings, Melanie Waldenberger, Andrea Ganna, Daniel L. McCartney, Catalina A. Vallejos, and Riccardo E. Marioni
    Nature Aging Apr 2023

    Type 2 diabetes mellitus (T2D) presents a major health and economic burden that could be alleviated with improved early prediction and intervention. While standard risk factors have shown good predictive performance, we show that the use of blood-based DNA methylation information leads to a significant improvement in the prediction of 10-year T2D incidence risk. Previous studies have been largely constrained by linear assumptions, the use of cytosine–guanine pairs one-at-a-time and binary outcomes. We present a flexible approach (via an R package, MethylPipeR) based on a range of linear and tree-ensemble models that incorporate time-to-event data for prediction. Using the Generation Scotland cohort (training set n_cases = 374, n_controls = 9,461; test set n_cases = 252, n_controls = 4,526) our best-performing model (area under the receiver operating characteristic curve (AUC) = 0.872, area under the precision-recall curve (PRAUC) = 0.302) showed notable improvement in 10-year onset prediction beyond standard risk factors (AUC = 0.839, precision-recall AUC = 0.227). Replication was observed in the German-based KORA study (n = 1,451, n_cases = 142, P = 1.6 × 10^-5).

    @article{Cheng_2023,
      doi = {10.1038/s43587-023-00391-4},
      url = {https://doi.org/10.1038%2Fs43587-023-00391-4},
      year = {2023},
      month = apr,
      publisher = {Springer Science and Business Media {LLC}},
      volume = {3},
      number = {4},
      pages = {450--458},
      author = {Cheng, Yipeng and Gadd, Danni A. and Gieger, Christian and Monterrubio-G{\'{o}}mez, Karla and Zhang, Yufei and Berta, Imrich and Stam, Michael J. and Szlachetka, Natalia and Lobzaev, Evgenii and Wrobel, Nicola and Murphy, Lee and Campbell, Archie and Nangle, Cliff and Walker, Rosie M. and Fawns-Ritchie, Chloe and Peters, Annette and Rathmann, Wolfgang and Porteous, David J. and Evans, Kathryn L. and McIntosh, Andrew M. and Cannings, Timothy I. and Waldenberger, Melanie and Ganna, Andrea and McCartney, Daniel L. and Vallejos, Catalina A. and Marioni, Riccardo E.},
      title = {Development and validation of {DNA} methylation scores in two European cohorts augment 10-year risk prediction of type 2 diabetes},
      journal = {Nature Aging},
      abbr = {NatAge},
      pdf = {cheng2021.pdf},
      selected = {true}
    }
  4. GenBio
    scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution
    Chantriolnt-Andreas Kapourani, Ricard Argelaguet, Guido Sanguinetti, and Catalina A Vallejos
    Genome Biology Apr 2021

    High-throughput single-cell measurements of DNA methylomes can quantify methylation heterogeneity and uncover its role in gene regulation. However, technical limitations and sparse coverage can preclude this task. scMET is a hierarchical Bayesian model which overcomes sparsity, sharing information across cells and genomic features to robustly quantify genuine biological heterogeneity. scMET can identify highly variable features that drive epigenetic heterogeneity, and perform differential methylation and variability analyses. We illustrate how scMET facilitates the characterization of epigenetically distinct cell populations and how it enables the formulation of novel hypotheses on the epigenetic regulation of gene expression. scMET is available at https://github.com/andreaskapou/scMET.

    @article{Kapourani2021,
      abbr = {GenBio},
      author = {Kapourani, Chantriolnt-Andreas and Argelaguet, Ricard and Sanguinetti, Guido and Vallejos, Catalina A},
      date = {2021/04/20},
      date-added = {2022-02-26 12:43:46 +0000},
      date-modified = {2022-02-26 12:43:46 +0000},
      doi = {10.1186/s13059-021-02329-8},
      id = {Kapourani2021},
      isbn = {1474-760X},
      journal = {Genome Biology},
      number = {1},
      pages = {114},
      title = {scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution},
      url = {https://doi.org/10.1186/s13059-021-02329-8},
      volume = {22},
      year = {2021},
      bdsk-url-1 = {https://doi.org/10.1186/s13059-021-02329-8},
      selected = {true}
    }