
dc.contributor.advisor: Pan, Chongle
dc.contributor.author: Badré, Adrien
dc.date.accessioned: 2023-04-21T16:18:16Z
dc.date.available: 2023-04-21T16:18:16Z
dc.date.issued: 2023-05
dc.identifier.uri: https://hdl.handle.net/11244/337451
dc.description.abstract: Genome-wide association studies (GWAS) and predictive genomics have become increasingly important in genetics research over the past decade. GWAS involves the analysis of the entire genome of a large group of individuals to identify genetic variants associated with a particular trait or disease. Predictive genomics combines information from multiple genetic variants into a polygenic risk score (PRS) that estimates an individual's risk of developing a disease. Machine learning is a branch of artificial intelligence that has revolutionized various fields of study, including computer vision, natural language processing, and robotics. Machine learning focuses on developing algorithms and models that enable computers to learn from data and make predictions or decisions without being explicitly programmed. Deep learning is a subset of machine learning that uses deep neural networks to recognize patterns and relationships. In this dissertation, we first compared various machine learning and statistical models for estimating breast cancer PRS. A deep neural network (DNN) was found to be the most effective, outperforming other techniques such as BLUP, BayesA, and LDpred. In the test cohort with 50% prevalence, the areas under the receiver operating characteristic curve (ROC AUCs) were 67.4% for DNN, 64.2% for BLUP, 64.5% for BayesA, and 62.4% for LDpred. While BLUP, BayesA, and LDpred generated PRS that followed a normal distribution in the case population, the PRS generated by DNN followed a bimodal distribution. This allowed DNN to achieve a recall of 18.8% at 90% precision in the test cohort, which extrapolates to 65.4% recall at 20% precision in a general population. Interpretation of the DNN model identified significant variants that were previously overlooked by GWAS, highlighting their importance in predicting breast cancer risk.
We then developed a linearizing neural network architecture (LINA) that provided first-order and second-order interpretations on both the instance-wise and model-wise levels, addressing the challenge of interpretability in neural networks. LINA outperformed other algorithms in providing accurate and versatile model interpretation, as demonstrated in synthetic datasets and real-world predictive genomics applications, by identifying salient features and feature interactions used for predictions. Finally, it has been observed that many complex diseases are related to each other through common genetic factors, such as pleiotropy or shared etiology. We hypothesized that this genetic overlap can be used to improve the accuracy of PRS for multiple diseases simultaneously. To test this hypothesis, we proposed an interpretable multi-task learning (MTL) approach based on the LINA architecture. We found that the parallel estimation of PRS for 17 prevalent cancers using a pan-cancer MTL model was generally more accurate than independent estimations for individual cancers using comparable single-task learning models. Similar performance improvements were observed for 60 prevalent non-cancer diseases in a pan-disease MTL model. Interpretation of the MTL models revealed significant genetic correlations between important sets of single nucleotide polymorphisms, suggesting that there is a well-connected network of diseases with a shared genetic basis. [en_US]
dc.language: en_US [en_US]
dc.rights: Attribution-ShareAlike 4.0 International [*]
dc.rights.uri: https://creativecommons.org/licenses/by-sa/4.0/ [*]
dc.subject: Machine Learning [en_US]
dc.subject: Genomics [en_US]
dc.subject: Interpretable AI [en_US]
dc.subject: GWAS [en_US]
dc.subject: PRS [en_US]
dc.subject: Deep Learning [en_US]
dc.title: Interpretable deep neural networks for more accurate predictive genomics and genome-wide association studies [en_US]
dc.contributor.committeeMember: Hougen, Dean
dc.contributor.committeeMember: Neeman, Henry
dc.contributor.committeeMember: Cheng, Qi
dc.contributor.committeeMember: Sankaranarayanan, Krithivasan
dc.date.manuscript: 2023-04-21
dc.thesis.degree: Ph.D. [en_US]
ou.group: Gallogly College of Engineering::School of Computer Science [en_US]




Except where otherwise noted, this item's license is described as Attribution-ShareAlike 4.0 International