Adi Laurentiu Tarca

Wayne State University

[intermediate] Machine Learning for Cross-Sectional and Longitudinal Omics Studies

Summary

Applying machine learning (ML) algorithms to patient genomics data (e.g. gene expression and proteomics) obtained with microarray and sequencing technologies is challenging because of platform- and batch-specific biases, and because the sample size available in most research settings is moderate relative to the number of features (p>>n). This course will review practical applications of ML for predicting phenotypes from genomics data generated in cross-sectional and longitudinal (short time-series) studies. Using real datasets and R/Bioconductor packages, we will illustrate the use of ML for prediction of continuous (regression) and binary (classification) outcomes with methods such as linear models, elastic net, random forest, neural networks and other approaches. Issues specific to microarray and sequencing technologies, such as data normalization and transformation, and feature filtering via shrinkage methods implemented in the limma and DESeq2 packages of Bioconductor, will be illustrated. Genomics-based model development pipelines that received top awards in ML competitions, including DREAM and sbv IMPROVER, will be particularly emphasized. Finally, we will discuss how to deal with predictive feature dilution, which is typical in multi-omics studies, and how to improve generalization by using prior domain knowledge and multi-source data integration.
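
As a rough illustration of why feature filtering must happen inside the evaluation loop when p>>n, the sketch below simulates a small dataset and repeats the filtering step within every cross-validation fold. It is written in plain Python only to stay self-contained; the course itself uses R/Bioconductor, and this is not the instructor's pipeline. Every function name, the simulated data, the plain t-like score (a stand-in for the moderated statistics of limma or DESeq2) and the nearest-centroid classifier (a stand-in for elastic net, random forest, etc.) are assumptions made for this example.

```python
import random
import statistics


def simulate(n=40, p=500, n_informative=5, shift=2.0, seed=0):
    """Simulate n samples x p features (p >> n) with alternating 0/1 labels.

    Only the first n_informative features differ between the two classes.
    """
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = []
    for label in y:
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        if label == 1:
            for j in range(n_informative):
                row[j] += shift
        X.append(row)
    return X, y


def top_features(X, y, k):
    """Return indices of the k features with the largest absolute t-like score."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        se = (statistics.variance(a) / len(a)
              + statistics.variance(b) / len(b)) ** 0.5
        scores.append((abs(statistics.mean(a) - statistics.mean(b)) / se, j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def fit_centroids(X, y, feats):
    """Compute the per-class mean of each selected feature."""
    centroids = {}
    for lab in (0, 1):
        rows = [row for row, l in zip(X, y) if l == lab]
        centroids[lab] = [statistics.mean(r[j] for r in rows) for j in feats]
    return centroids


def predict(centroids, row, feats):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(lab):
        return sum((row[j] - c) ** 2 for j, c in zip(feats, centroids[lab]))
    return min((0, 1), key=dist)


def cv_accuracy(X, y, k_features=10, n_folds=5):
    """K-fold cross-validation with feature filtering redone in every fold.

    Test samples never influence which features are selected, which avoids
    the optimistic bias of filtering on the full dataset first.
    """
    n = len(y)
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        feats = top_features(X_train, y_train, k_features)  # training data only
        centroids = fit_centroids(X_train, y_train, feats)
        correct += sum(predict(centroids, X[i], feats) == y[i] for i in test_idx)
    return correct / n


if __name__ == "__main__":
    X, y = simulate()
    print("Cross-validated accuracy:", cv_accuracy(X, y))
```

Selecting features on all samples before splitting would let information from the test folds leak into the model and inflate the reported accuracy; keeping the filter inside the loop gives an estimate closer to what would be seen on new samples.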

Syllabus

Overview of a typical genomics study design for predictive modeling
A typical R session for prediction model development and evaluation
Shrinkage-based methods for univariate feature filtering
Preprocessing of genomic data:
- Data normalization and variance stabilization
- Batch effect removal
Repositories of genomics data
ML applications to longitudinal omics data
Meta-models for multi-omics datasets
Incorporating prior knowledge in ML pipelines

References

Tarca AL, et al. Crowdsourcing assessment of maternal blood multi-omics for predicting gestational age and preterm birth. Cell Reports Medicine. 2021;2(6):100323. PMID: 34195686.

Ritchie ME, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47. PMID: 25605792.

Love MI, et al. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. PMID: 25516281.

Tarca AL, et al. Strengths and limitations of microarray-based phenotype prediction: lessons learned from the IMPROVER Diagnostic Signature Challenge. Bioinformatics. 2013;29(22):2892-9. PMID: 23966112.

Tarca AL, et al. Machine learning and its applications to biology. PLoS Computational Biology. 2007;3(6):e116. PMID: 17604446.

Pre-requisites

Familiarity with linear algebra, probability, machine learning, and the R statistical language.

Short bio

Dr. Adi L. Tarca’s research over the past two decades has been at the interface of computational biology, machine learning and maternal-fetal medicine. He is currently a tenured professor in the School of Medicine at Wayne State University, and founding Head of the Bioinformatics and Computational Biology Unit of the Perinatology Research Branch (NICHD/NIH). After PhD work at Laval University, Quebec, focused on embedding qualitative prior knowledge in the training of neural networks, he transitioned to bioinformatics and developed several methods and R/Bioconductor packages for omics data analysis tasks such as preprocessing, pathway analysis, and predictive model development. His machine learning pipelines for genomics data were ranked at the top in multiple machine learning competitions, including the sbv IMPROVER Diagnostic Signature Challenge (2012), Species Translation Challenge (2013), Systems Toxicology Challenge (2016) and the DREAM Single-cell Transcriptomics Challenge (2018). More recently, he led the crowdsourcing initiative DREAM Preterm Birth Prediction Challenge: Transcriptomics (2019), which attracted >500 participants. He has co-authored >200 articles and patents, work that has received about 13,000 citations to date (h-index 57).
