B.I.G. Summer 2019 – Institute for Quantitative and Computational Biosciences

2019 Bruins-In-Genomics Summer Undergraduate Research Program

2019 B.I.G. Summer Participants

Lab PIs	Mentors	Students
VALERIE ARBOLEDA	Leroy Bondhus	Carmelle Catamura, University of California, Santa Cruz
	Leroy Bondhus	Katherine Sanchez, University of Michigan, Ann Arbor
SIOBHAN BRAYBROOK	Lauren Dedow	Alejandro Espinoza, University of La Verne
HILARY COLLER	Mithun Mitra	Huiling Huang, University of California, Los Angeles
	Mithun Mitra	Daniel Jason Tan, University of California, Los Angeles
ERIC DEEDS	Shamus Cooley	Sandy Kim, University of California, Los Angeles
	Shamus Cooley	Yankai (Mark) Xiang, University of Massachusetts Amherst
JASON ERNST	Soo Bin Kwon	Grace Casarez, University of California, Santa Barbara
	Soo Bin Kwon	Trevor Ridgley, University of California, Santa Cruz
	Shan Sabri	Rebecca (Becca) Castillo, New Mexico Institute of Mining and Technology
	Shan Sabri	Jeremy Wang, Brown University
ELEAZAR ESKIN	Serghei Mangul	Sei Chang, University of California, Los Angeles
	Serghei Mangul	Nicholas Darci-Maher, University of California, Los Angeles
	Serghei Mangul	Aaron Karlsberg, University of California, Los Angeles
	Serghei Mangul	Neha Rajkumar, University of California, Los Angeles
	Lisa Gai	Jingyuan Fu, University of California, Los Angeles
	Lisa Gai	Camille Huang, University of California, Los Angeles
	Kodi Collins and Nathan LaPierre	Rosemary He, University of California, Los Angeles
	Kodi Collins and Nathan LaPierre	Xin (Helen) Huang, University of California, Los Angeles
NANDITA GARUD	Nandita Garud	Daisy Chen, University of California, Los Angeles
	Nandita Garud	Sara Thornburgh, University of California, Los Angeles
ALEXANDER HOFFMANN	Katherine Sheu	Aditya Pimplaskar, University of California, Los Angeles
LEONID KRUGLYAK	Longhua Guo	Ryan Carney, Johns Hopkins University
	Longhua Guo	Isimeme Udu, Spelman College
JAMIE LLOYD-SMITH	Amandine Gamble	Natashia Benjamin, University of the District of Columbia
	Amandine Gamble	Jessica Kasamoto, Johns Hopkins University
KIRK LOHMUELLER	Jesse Garcia	Miguel Alberto Guardado, San Francisco State University
	Jesse Garcia	Jonathan Mah, University of Washington, Seattle
HANNA MIKKOLA	Sandra Capellera Garcia	Sophia Ekstrand, Harvard-Westlake High School
ROEL OPHOFF	Toni Boltz	Rachel Elting, University of Kansas
	Toni Boltz	Nicole Zeltser, California Polytechnic State University, San Luis Obispo
JENNY PAPP	Ben Chu	Francis Adusei, Jackson State University
	Ben Chu	Vivian Garcia, University of Florida
BOGDAN PASANIUC	Ruthie Johnson	Gary Hu, Duke University
	Ruthie Johnson	Hugo Mainguy, Stony Brook University
SRIRAM SANKARARAMAN	Rob Brown	Saurav Mathur, University of Wisconsin-Madison
	Rob Brown	Tiffany Phan, University of Colorado at Boulder
VAN SAVAGE	Mauricio Cruz Loya	Alhaji Foray, Fisk University
	Mauricio Cruz Loya	Eric Yeh, University of California, San Diego
BILL SPEIER	William Speier	Osita Keluo-Udeke, University of Arkansas at Pine Bluff
	William Speier	James Soetedjo, University of Washington, Seattle
TOM VALLIM	Jenny Link	Vivian Iloabuchi, Fisk University
	Jenny Link	Tyler Antonio Laws, North Carolina State University
WEI WANG	Yunsheng Bai and Chelsea Ju	James Zhang, Carnegie Mellon University
ROY WOLLMAN	Alon Oyler-Yaniv and Evan Maltz	Maxim Ermoshkin, University of Richmond
XINSHU (GRACE) XIAO	Mudra Choudhury	Peter Nekrasov, Yale University
NOAH ZAITLEN	Christa Caggiano	Subhanik Purkayastha, Brown University
	Mike Thompson	Anchit Tandon, Indian Institute of Technology, Delhi
XIANGHONG (JASMINE) ZHOU	Jim Liu	Amanda Sun, Vanderbilt University
	Jim Liu	Tianna Truby, University of California, Santa Barbara

2019 B.I.G. Summer Poster Abstracts

ADUSEI: Estimating the nuisance parameter for negative binomial regression

FRANCIS ADUSEI¹, VIVIAN GARCIA¹, Benjamin Chu², Jeanette Papp¹,³, Kenneth Lange²,³,⁴,⁵

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Biomathematics, David Geffen School of Medicine, UCLA
³ Dept of Human Genetics, David Geffen School of Medicine, UCLA
⁴ Dept of Biostatistics, David Geffen School of Medicine, UCLA
⁵ Bioinformatics Interdepartmental PhD Program, UCLA

Most genome-wide association studies (GWAS) perform univariate linear regressions rather than modeling all predictors simultaneously. Iterative Hard Thresholding (IHT) is an algorithm for multivariate regression that provides a way to model all covariates in unison. However, IHT currently does not estimate nuisance parameters for generalized linear models. Therefore, this project extends IHT by estimating the nuisance parameter for Negative Binomial models using maximum likelihood estimation (MLE). Using the Julia programming language, we conducted a systematic analysis comparing Majorization-Minimization (MM) algorithms and Newton’s method for estimating the nuisance parameter. Our results indicate that more regression coefficient estimates are recovered when using MLEs for the nuisance parameter.
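
A minimal sketch of the Newton's-method option for the negative binomial nuisance (dispersion) parameter, written in Python rather than the Julia used in the project. It assumes the fitted means are held fixed and only the dispersion r is updated; the function names, starting value, and toy counts are illustrative assumptions, not the authors' implementation.

```python
import math

def nb_loglik(y, mu, r):
    """Negative binomial log-likelihood with means mu and dispersion r."""
    return sum(
        math.lgamma(yi + r) - math.lgamma(r) - math.lgamma(yi + 1)
        + r * math.log(r / (r + mi)) + yi * math.log(mi / (r + mi))
        for yi, mi in zip(y, mu)
    )

def score_and_hessian(y, mu, r):
    """First and second derivatives of the log-likelihood with respect to r.

    For integer counts, digamma(y + r) - digamma(r) = sum_{k<y} 1 / (r + k),
    so no special-function library is needed.
    """
    g = h = 0.0
    for yi, mi in zip(y, mu):
        g += sum(1.0 / (r + k) for k in range(yi))
        g += math.log(r / (r + mi)) + 1.0 - (r + yi) / (r + mi)
        h -= sum(1.0 / (r + k) ** 2 for k in range(yi))
        h += 1.0 / r - 1.0 / (r + mi) - (mi - yi) / (r + mi) ** 2
    return g, h

def newton_dispersion(y, mu, r0=1.0, tol=1e-10, max_iter=100):
    """Newton iteration on r, with a simple guard that keeps r positive."""
    r = r0
    for _ in range(max_iter):
        g, h = score_and_hessian(y, mu, r)
        if abs(g) < tol:
            break
        step = -g / h if h < 0 else g   # fall back to an ascent step
        while r + step <= 0:            # keep r strictly positive
            step *= 0.5
        r += step
    return r

# Toy, overdispersed counts (illustrative only) with a common fitted mean.
counts = [0, 1, 3, 7, 2, 12, 0, 5, 4, 9]
means = [sum(counts) / len(counts)] * len(counts)
r_hat = newton_dispersion(counts, means)
```

In a regression setting the means would come from the current coefficient estimates, and the dispersion update would alternate with the coefficient update.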

BENJAMIN, KASAMOTO: Using Mathematical Models to Investigate the Infection Dynamics of Henipaviruses in Cell Culture

NATASHIA J. BENJAMIN¹, JESSICA Y. KASAMOTO¹, Amandine Gamble², Christian T. Mason², James O. Lloyd-Smith²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Ecology and Evolutionary Biology, UCLA

Henipaviruses are emerging pathogens that cause a range of neurological and respiratory disorders in humans. The two best-known henipaviruses, Hendra and Nipah viruses, have a lethality rate of 60% or higher. They are transmitted to humans from bats, their main wildlife reservoir, often via horses or pigs. However, the risks posed by other recently-discovered henipaviruses are unclear. Here we adapt a within-host compartmental model to incorporate unique aspects of henipavirus biology in order to understand infection patterns in cell culture experiments. We analyze how the population dynamics of viruses and infected cells depend on key biological rates, and also use the model as a platform to assess how accurately these rates can be estimated given different experimental designs. Our model will inform the design and analysis of future laboratory experiments studying henipavirus infection across host species and tissue types, and hence contribute to a new evidence-based framework for risk assessment.
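
As a rough illustration of a within-host compartmental model, the sketch below integrates the standard target-cell-limited model with forward Euler. This is a generic textbook model, not the henipavirus-specific model described above, and every parameter value is an arbitrary assumption chosen only so that the infection takes off.

```python
def simulate_infection(T0=1e5, V0=10.0, beta=1e-6, delta=0.5, p=50.0, c=3.0,
                       t_end=120.0, dt=0.01):
    """Forward-Euler integration of a target-cell-limited model.

    dT/dt = -beta*T*V            susceptible cells become infected
    dI/dt = beta*T*V - delta*I   infected cells die at rate delta
    dV/dt = p*I - c*V            virions are produced and cleared
    """
    T, I, V = T0, 0.0, V0
    trajectory = [(0.0, T, I, V)]
    steps = int(round(t_end / dt))
    for n in range(1, steps + 1):
        new_infections = beta * T * V
        dT = -new_infections
        dI = new_infections - delta * I
        dV = p * I - c * V
        T, I, V = T + dt * dT, I + dt * dI, V + dt * dV
        trajectory.append((n * dt, T, I, V))
    return trajectory

trajectory = simulate_infection()
# Time and height of the viral peak in the simulated culture.
peak_time, peak_virus = max(((t, V) for t, _, _, V in trajectory),
                            key=lambda item: item[1])
```

Re-running the simulation with different values of `beta`, `delta`, `p`, and `c` shows how the viral and infected-cell curves depend on each rate.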

CARNEY, UDU: Genetics of white color and tumor formation in “lemon frost” leopard geckos

RYAN CARNEY¹, ISIMEME UDU¹, Longhua Guo², Joshua Bloom², Elise Pham², Zain Kashif², Katarina Ho², Sandra Duarte-Vogel², Ana Alcaraz³, Leonid Kruglyak²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Human Genetics, David Geffen School of Medicine, UCLA
³ Dept of Anatomic Pathology, College of Veterinary Medicine, Western University

Animal pigmentation serves crucial functions in protection (e.g., camouflage) and signaling (e.g., mate recognition). The genetic basis of variation in color and pattern has been described in only a few cases, leaving a big gap in our knowledge. We studied pigmentation variation in the leopard gecko, Eublepharis macularius. This lizard species has been bred in captivity for over 50 years, and dozens of color and pattern morphs exist. We focused our efforts on a semi-dominant mutation that results in extensive white and lemon color in the skin. This morph is known as “lemon frost.” Individuals carrying this mutation also develop tumors of white color that metastasize to internal organs, suggesting that the mutation leads to increased proliferation of white cells. We used genetic linkage analysis to localize the mutation to a 20 kb region of the leopard gecko genome that is syntenic with human chromosome 15 and green anole chromosome 1. We are analyzing RNA sequencing data from tumor and normal skin with the aim of validating the candidate genes in the causal region. In summary, our work identified a genetic locus that leads to white coloration and tumor formation.

CASAREZ, RIDGLEY: Learning a Human-Rat Functional Genomics Conservation Score

GRACE CASAREZ¹, TREVOR RIDGLEY¹, Soo Bin Kwon²,³, Jason Ernst²,³,⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Bioinformatics Interdepartmental PhD Program, UCLA
³ Dept of Biological Chemistry, David Geffen School of Medicine, UCLA
⁴ Dept of Computer Science, UCLA

Our understanding of mammalian genomes remains largely limited to protein-coding regions. Recently, a comparative functional genomic approach between human and mouse generated a genome-wide score that estimated the strength of evidence of functional genomics conservation based on predictive genomic signals. Here we apply the same approach to the human and rat genomes by training an ensemble of pseudo-Siamese neural networks (EPSNN) on publicly-available DNase-Seq and ChIP-Seq experiments curated by ChIP-Atlas, as well as FANTOM5 Cap Analysis Gene Expression (CAGE) data. The human-rat Functional Genomics Conservation (FGC) score highlights the locations of transcription start sites (TSS), promoters, insulators and enhancers in the human genome. The score is also indicative of pairs of human-rat regions with similar regulatory activity. With additional data from the rat genome, we foresee that the score could be used to better understand cross-species differences in cis-acting elements.

CASTILLO, WANG: Machine Learning Approaches for Deconvolution of Bulk ATAC-Seq Signals from scATAC-Seq Atlases

REBECCA CASTILLO¹, JEREMY WANG¹, Shan Sabri², Jason Ernst²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Biological Chemistry, David Geffen School of Medicine, UCLA

Current chromatin mapping technologies, such as ATAC-seq, yield averaged chromatin profiles that are insensitive to cellular heterogeneity in composite populations. Recent technical advancements have led to the feasibility of single-cell ATAC-seq, but bulk ATAC-seq is still preferred due to cost and accuracy considerations. To address this problem, we propose ATAC-DelFi, a method that accurately predicts chromatin accessibility at the single cell-type level from population-level ATAC-seq data. ATAC-DelFi leverages an ATAC-seq cell-type atlas as a feature set to infer cell-type proportions using machine learning. We demonstrate ATAC-DelFi’s ability to detect prominent cell types in various mouse tissues across different developmental stages and derive interesting mechanistic insights from analyses of unexplainable variance in bulk signals. Our study suggests that ATAC-DelFi has the potential to accurately deconvolve heterogeneous ATAC-seq signals and can assist in the characterization of unknown cell types, which provides a powerful approach to understanding the mechanisms underlying cell identity.

CATAMURA, SANCHEZ: Epigenomic Analysis of KAT6A Patients Reveals Enrichment of Differential Methylation at Epigenetic Features and Transcription Factor Binding Sites

CARMELLE CATAMURA¹, KATHERINE SANCHEZ¹, Leroy Bondhus², Valerie Arboleda²,³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Human Genetics, David Geffen School of Medicine, UCLA
³ Dept of Pathology and Laboratory Medicine, David Geffen School of Medicine, UCLA

KAT6A is a lysine acetyltransferase. Mutations in KAT6A result in a developmental syndrome characterized by symptoms such as cardiac defects, intellectual disability, and speech delay. KAT6A mutation may disrupt histone acetylation, which could cascade to affect other epigenetic features. We hypothesized that epigenetic dysregulation contributes to KAT6A syndrome. We used ChIP-seq data from ENCODE for epigenetic and regulatory features to associate these and higher-order epigenetic features with sites differentially methylated between KAT6A syndrome patient fibroblasts (n=12) and healthy control fibroblasts (n=13). Our results suggest that sites of differential methylation are enriched at specific epigenetic features (e.g., H2AFZ, H3K9me3). Additionally, we found hypermethylated sites in KAT6A mutation samples to be enriched in binding sites of transcription factors EZH2, MAX, and IKZF1, and hypomethylated sites enriched in binding sites of EZH2 and RNF2, members of PRC2 and PRC1, respectively, suggesting a possible connection between KAT6A and the PRC complexes.

CHANG: Benchmarking precision and sensitivity of structural variant detection methods across multiple sequencing coverages

SEI CHANG¹, Varuni Sarwal², Ram Ayyala³, Nicholas Darci-Maher¹, Samantha Jensen⁴, Eleazar Eskin⁵, Jonathan Flint⁶, Serghei Mangul⁷

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Indian Institute of Technology Delhi, Hauz Khas, New Delhi, Delhi 110016, India
³ Undergraduate Interdepartmental Program for Neuroscience, UCLA
⁴ Genetics & Genomics BioSciences Program, UCLA
⁵ Dept of Computational Medicine, David Geffen School of Medicine, UCLA
⁶ Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, UCLA
⁷ Dept of Clinical Pharmacy, School of Pharmacy, University of Southern California

Discovery of structural variants (SVs), regions of alteration in the genome resulting from structural differences in chromosomes, promises insight into human diversity and disease susceptibility. Due to advances in whole genome sequencing, a plethora of methods have been developed in pursuit of accurate and comprehensive SV detection. Currently, there is no rigorous standard that investigators can use to select the most appropriate SV-detection tools. In contrast to previous benchmarking studies, our gold standard dataset includes a complete set of SVs, which allows accurate reporting of both the precision and sensitivity of SV-detection methods. To provide an optimistic estimate of detection accuracy, our study examines the tools’ ability to detect deletions, a less complex type of SV. We found wide variation in performance among tools, and only a few methods provide the desired balance between sensitivity and precision. Upon further analysis, we determined optimal SV callers for low and ultra-low coverage sequencing data.
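
To make the two reported metrics concrete, here is a small sketch of how precision and sensitivity could be computed for deletion calls against a complete gold standard. The matching rule (same chromosome, both breakpoints within a fixed tolerance), the tolerance value, and the toy coordinates are assumptions for illustration; the abstract does not state the criteria the study used.

```python
def matches(call, truth, tolerance):
    """A call matches a true deletion when the chromosome agrees and both
    breakpoints lie within `tolerance` base pairs."""
    return (call[0] == truth[0]
            and abs(call[1] - truth[1]) <= tolerance
            and abs(call[2] - truth[2]) <= tolerance)

def precision_sensitivity(calls, gold, tolerance=100):
    """Return (precision, sensitivity) for (chrom, start, end) deletions."""
    unmatched = list(gold)
    true_pos = 0
    for call in calls:
        for truth in unmatched:
            if matches(call, truth, tolerance):
                unmatched.remove(truth)   # each true SV is counted once
                true_pos += 1
                break
    false_pos = len(calls) - true_pos
    false_neg = len(gold) - true_pos
    precision = true_pos / (true_pos + false_pos) if calls else 0.0
    sensitivity = true_pos / (true_pos + false_neg) if gold else 0.0
    return precision, sensitivity

# Toy coordinates, invented for the example.
gold = [("chr1", 1000, 2000), ("chr1", 5000, 5600), ("chr2", 300, 900)]
calls = [("chr1", 1010, 1990), ("chr2", 310, 1050),
         ("chr3", 100, 200), ("chr1", 5050, 5580)]
precision, sensitivity = precision_sensitivity(calls, gold)
```

Here two of the four calls match a true deletion, so precision is 2/4, and two of the three true deletions are found, so sensitivity is 2/3.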

CHEN, THORNBURGH: Genetic diversity and ecological-evolutionary dynamics in mother-infant gut microbiomes

DAISY CHEN¹, SARA THORNBURGH¹, Nandita Garud²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Ecology and Evolutionary Biology, UCLA

Recent work by Garud and Good et al. (2019) showed that human adult gut microbiota can evolve on 6-month timescales, whereas replacement of microbial strains dominates on longer timescales. Here, we apply a similar analytical framework to infant gut microbiomes, for which early evolutionary dynamics are poorly understood. We obtained longitudinal shotgun metagenomic data for mother-infant stool samples from previous studies (Backhed et al. 2015, Ferretti et al. 2018, and Yassour et al. 2018) and used the MIDAS pipeline to estimate strain-level genomic variation. We find temporal trends in population structure; notably, one dominant strain at birth typically precedes the appearance of multiple strains days later. While signatures of strain replacement predominate within the first days postpartum, we find evidence that evolution of dominant strains occurs over the following months. Additionally, we explore the significance of hypermutability in bacterial genomes, which may reflect unique selective pressures in the infant gut.

DARCI-MAHER: Challenges and opportunities in reuse of public omics data

NICHOLAS DARCI-MAHER¹, Dat Duong², Richard J. Abdill³, Eleazar Eskin², Serghei Mangul⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Computer Science, UCLA
³ Dept of Genetics, Cell Biology, and Development, University of Minnesota
⁴ Dept of Clinical Pharmacy, School of Pharmacy, University of Southern California

The constant decrease in the cost of sequencing has resulted in the creation and exponential growth of online sequence repositories such as the Sequence Read Archive. Reuse of this public omics data has been demonstrated to provide key insights into complex biological systems. However, a significant amount of data in these repositories is deposited by the lab that generated it and never reanalyzed. We have conducted an analysis of over two million full texts and preprints to investigate the reuse patterns of omics data. There are far more papers generating their own data than papers reusing data, resulting in a shallow depth of analysis per sample. We aim to illuminate the barriers causing scientists to shy away from reusing data, including missing metadata, confusing online systems, and stigma. We provide a comprehensive picture of the current landscape of genetic repositories, as well as a quantitative analysis of genetic data reusability.

EKSTRAND: A Single-cell Resolution Map of Developmental Human Hematopoiesis

SOPHIA EKSTRAND¹, Sandra Capellera Garcia², Feiyang Ma², Hanna Mikkola²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Molecular, Cell and Developmental Biology, UCLA

Developmental hematopoiesis evolves from lineage-primed progenitors to self-renewing hematopoietic stem cells (HSCs). Despite extensive studies in mice, lack of access to tissues and methods to identify emerging HSCs limit our understanding of human hematopoietic development. We created a single-cell transcriptome map of hemato-vascular cells from first and second trimester human hematopoietic tissues. By utilizing a molecular signature of self-renewing HSCs, we identified CD34+Thy1+RUNX1+HOXA+MLLT3+HLF+ HSCs emerging from HOXA-patterned hemogenic endothelium in 5-week embryos. We also discovered SPINK2 as a novel marker of HSCs throughout development. In early fetal liver and yolk sac, SPINK2 also marks lympho-myeloid progenitors, which lack HOXA expression, suggesting HSC-independent origin. Additionally, we found unexpected macrophage populations with endothelial signature and tissue-specific HOXA code in various organs. This data set provides an unparalleled resource to investigate human developmental hematopoiesis, and a reference for generating distinct hematopoietic cells in vitro for therapeutic purposes.

ELTING, ZELTSER: Genome-wide quantitative trait loci (QTL) mapping of metabolites in human cerebrospinal fluid

RACHEL ELTING¹, NICOLE ZELTSER¹, Toni Boltz², Loes Olde Loohuis³, Roel Ophoff²,³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Human Genetics, David Geffen School of Medicine, UCLA
³ Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, UCLA

Genome-wide association studies (GWAS) are used to identify genetic loci that are associated with a trait of interest in order to decipher underlying biological mechanisms. Metabolite measures in cerebrospinal fluid (CSF), the fluid that surrounds the brain, provide insight into brain function that may be relevant for neurobehavioral traits and neuropsychiatric disorders. We performed a GWAS of 600 metabolites in a sample of 500 human subjects, the largest set of CSF data used in a GWAS to date. We applied standard quality control of genetic data including missingness, minor allele frequency cut-off, and population stratification. Phenotype data were checked for outliers, and non-normal data were transformed using inverse rank normalization. A linear association analysis, including age and sex as additional covariates, was performed using the PLINK toolset. Preliminary results show significant associations with a number of metabolites. Future work includes quality control and functional annotations of these results.
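
The inverse rank normalization step can be sketched as follows. This version uses the common Blom offset (c = 3/8) and average ranks for ties; both choices, and the toy metabolite values, are assumptions, since the abstract does not specify the exact variant applied.

```python
from statistics import NormalDist

def inverse_rank_normalize(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform (Blom offset c = 3/8 by default).

    Tied values receive the average of the ranks they span.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0      # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]

# A skewed toy phenotype: the outlier 15.0 is pulled in to a normal quantile.
levels = [0.2, 15.0, 1.1, 0.9, 3.4]
z_scores = inverse_rank_normalize(levels)
```

The transform keeps the ordering of the subjects but replaces each value with the normal quantile of its rank, which limits the influence of outliers on the linear association test.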

ERMOSHKIN: Building an Image Analysis Pipeline to Measure the Effect of TNF-α on Viral Spread

MAXIM ERMOSHKIN¹, Evan Maltz², Alon Oyler-Yaniv², Jennifer Oyler-Yaniv², Roy Wollman²,³,⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Molecular Biology Institute, UCLA
³ Dept of Integrative Biology and Physiology, UCLA
⁴ Dept of Chemistry and Biochemistry, UCLA

Computer vision is a set of tools for extracting quantitative information from images in a systematic, reproducible, and unbiased manner. In the context of biological imaging, a typical image analysis pipeline often consists of binary masking, single-cell segmentation, feature extraction, and tracking over time. Here, we implement such a pipeline and apply it to a dataset of HSV-1 (Herpes Simplex Virus-1) infected NIH3T3 fibroblasts. The experimental setup included wells with HSV-1 either present or absent and treated with different TNF (Tumor Necrosis Factor) concentrations. The pipeline was used to identify each cell in each experimental setup and track its properties throughout the course of the time-lapse experiment. The processed data was then used to fit a death-proliferation model that predicts if infected cells will undergo apoptosis based on TNF concentration and infection with HSV-1. We see that TNF dramatically increases the rate of death in HSV-1 infected cells.
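
The first three pipeline stages (binary masking, segmentation, feature extraction) can be illustrated on a tiny intensity grid. The sketch below uses a fixed threshold and 4-connected component labeling, which is far simpler than what real microscopy data needs; the function name, threshold, and toy frame are assumptions, and tracking over time is not covered.

```python
from collections import deque

def label_cells(image, threshold):
    """Threshold a 2-D intensity grid, then label 4-connected foreground regions.

    Returns (labels, features): labels[r][c] is 0 for background or a positive
    region id, and features maps region id -> (area, centroid).
    """
    rows, cols = len(image), len(image[0])
    mask = [[pixel > threshold for pixel in row] for row in image]
    labels = [[0] * cols for _ in range(rows)]
    features = {}
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labels[r][c]:
                continue
            next_id += 1
            labels[r][c] = next_id
            queue, pixels = deque([(r, c)]), []
            while queue:                      # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = next_id
                        queue.append((ny, nx))
            area = len(pixels)
            centroid = (sum(p[0] for p in pixels) / area,
                        sum(p[1] for p in pixels) / area)
            features[next_id] = (area, centroid)
    return labels, features

frame = [
    [0, 9, 9, 0, 0],
    [0, 9, 0, 0, 7],
    [0, 0, 0, 0, 7],
    [8, 0, 0, 0, 0],
]
labels, features = label_cells(frame, threshold=5)
```

Applying the same function to every frame of a time-lapse and linking nearby centroids between frames would be one simple way to approach the tracking stage.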

ESPINOZA: Identification of differentially expressed genes in fast and slow elongating regions of the dark-grown Arabidopsis thaliana hypocotyl

ALEJANDRO ESPINOZA¹, Lauren Dedow², Joanna Landymore³, Firas Bou Daher²,³, Siobhan Braybrook²,³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Molecular, Cell and Developmental Biology, UCLA
³ Sainsbury Laboratory, University of Cambridge, UK

Growth of the dark-grown Arabidopsis thaliana hypocotyl is largely due to cell elongation, which occurs at a faster rate in lower cells than in upper cells. Cell elongation is influenced by factors such as light, gravity, and hormones; however, little is known about gene expression with respect to the elongation wave seen in dark-grown hypocotyls. Dark-grown hypocotyls were harvested at 24, 36, and 48 hours and dissected into fast- and slow-growing regions. RNA-seq was performed to evaluate the transcriptome and identify differentially expressed genes (DEGs). We compared two common aligners, STAR and TopHat2, paired with two differential expression analyzers, DESeq2 and edgeR, to determine an efficient and accurate methodology. STAR and TopHat2 identified 55% of the same DEGs. Furthermore, DESeq2 identified 32% more unique DEGs than edgeR. The combination of the STAR and edgeR packages will be used for analysis of gene ontology and cell wall modifying genes.

FORAY, YEH: Modeling Allosterism Through Protein Interaction Networks

ALHAJI FORAY¹, ERIC YEH¹, Mauricio Cruz¹,², Eric Deeds³, Van Savage⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Computational Medicine, David Geffen School of Medicine, UCLA
³ Dept of Integrative Biology and Physiology, UCLA
⁴ Dept of Ecology and Evolutionary Biology, UCLA

The activity of an enzyme can be regulated by a molecule binding to a different site than the active site, despite it being far away in the protein structure. This phenomenon is known as allosterism. We perform network analysis of the three-dimensional structure of human liver pyruvate kinase (hL-PYK) to predict residues important for allosterism. Our method is based on variants of betweenness centrality, which measure how often residues lie in short paths between the allosteric and catalytic sites. We compare our predictions to an experimental dataset where every non-alanine and non-glycine residue in hL-PYK was mutated to alanine. We find that communicability betweenness, which considers weighted paths of different lengths, outperforms shortest-path betweenness in predicting residues important to allostery. Our predictions improve substantially when analyzing the hL-PYK tetramer compared to the monomer, which suggests that communication between different polypeptide chains is important for hL-PYK allostery.
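
For readers unfamiliar with betweenness centrality, the sketch below computes plain shortest-path betweenness on an unweighted graph using Brandes' algorithm. It scores all node pairs, whereas the method described above uses variants focused on paths between the allosteric and catalytic sites, including communicability betweenness; the toy contact graph and node names are hypothetical.

```python
from collections import deque

def shortest_path_betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph.

    `graph` maps each node to an iterable of neighbours. A node's score is the
    sum, over unordered pairs of other nodes, of the fraction of shortest paths
    between the pair that pass through it.
    """
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        dist = {source: 0}
        n_paths = {v: 0 for v in graph}
        n_paths[source] = 1
        parents = {v: [] for v in graph}
        visit_order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visit_order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    parents[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        dependency = {v: 0.0 for v in graph}
        for w in reversed(visit_order):
            for v in parents[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: s / 2.0 for v, s in score.items()}

# Hypothetical residue contact graph: a chain from an allosteric site to a
# catalytic site, plus one side branch.
contacts = {
    "allosteric": ["res1"],
    "res1": ["allosteric", "res2", "res3"],
    "res2": ["res1", "catalytic"],
    "res3": ["res1"],
    "catalytic": ["res2"],
}
centrality = shortest_path_betweenness(contacts)
```

In this toy graph `res1` scores highest because every path leaving the allosteric site or the side branch must cross it.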

FU, C. HUANG: A Multi-trait approach to improving polygenic risk score prediction

JINGYUAN FU¹, CAMILLE HUANG¹, Lisa Gai², Eleazar Eskin²,³,⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Computer Science, UCLA
³ Dept of Computational Medicine, David Geffen School of Medicine, UCLA
⁴ Dept of Human Genetics, David Geffen School of Medicine, UCLA

Multi-trait methods for analyzing genome-wide association studies (GWASs) boost the predictive power of polygenic risk scores (PRS) by harnessing information in summary statistics of genetically related traits, but they are not yet as well developed as single-trait approaches. One such method is Turley et al.’s MTAG, which relies on an often-broken assumption of a homogeneous variance-covariance matrix of SNP effects across the genome. We present MT, an extension of MTAG which allows SNP effects to be drawn from a two-component mixture model. We set up a pipeline to run MT and generate PRS and applied it to UK Biobank data on four sets of anthropometric and blood pressure measurements with varying degrees of correlation between traits: high positive correlation, high negative correlation, low correlation, and different measures of the same trait. In future work, we will stratify SNPs by allele frequency or degree of linkage disequilibrium when re-estimating SNP effects.

GUARDADO, MAH: Quantifying the Statistical Power in the Inference of the Evolution of the Distribution of Fitness Effects in Canine Lineages

MIGUEL GUARDADO¹, JONATHAN MAH¹, Jesse Garcia², Eduardo Amorim³, Kirk Lohmueller²,³,⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Bioinformatics Interdepartmental PhD Program, UCLA
³ Dept of Ecology and Evolutionary Biology, UCLA
⁴ Dept of Human Genetics, David Geffen School of Medicine, UCLA

Previous work on inferring the distribution of fitness effects (DFE), or the amount of deleterious, neutral, or adaptive mutations entering a population, has shown that distantly related species have distinct DFEs. However, an analysis of genomic resequencing data from arctic wolves and breed dogs found no detectable difference in their inferred DFEs. Here, we sought to determine if current state-of-the-art methods for DFE inference had sufficient statistical power to detect a change in the DFE between canine populations. We performed forward population genetics simulations modeling canine evolution, and compared the inferred DFE and demographic parameters of simulated and empirical data. We have modeled ancestral wolf DFEs and demographic histories and are awaiting the results of our dog DFE simulations for comparison. Understanding if we can detect a difference in DFE will provide insight into the impact of domestication on the DFE and help confirm the results found with reported empirical data.

HE, H. HUANG: Identifying Causal Variants by Fine Mapping Across Multiple Studies

ROSEMARY HE¹, HELEN HUANG¹, Kodi Collins², Nathan LaPierre², Eleazar Eskin²,³,⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Computer Science, UCLA
³ Dept of Human Genetics, David Geffen School of Medicine, UCLA
⁴ Dept of Computational Medicine, David Geffen School of Medicine, UCLA

Genome-Wide Association Studies (GWAS) have successfully discovered tens of thousands of associations between genetic variants and human traits; however, many of these variants have no true genetic effect on the complex trait. Fine mapping refines GWAS results by selecting a small subset of associated SNPs with high probability of containing the causal variant(s). In 2014, Hormozdiari et al. introduced CAVIAR, a Bayesian method that predicts a minimal set containing the causal variant(s). CAVIAR does this by selecting the SNPs with the highest probability and adding them to the minimal set until a posterior probability threshold is reached. Here we extend the CAVIAR framework to multiple studies and introduce MCAVIAR. While identical to CAVIAR in a single study setting, MCAVIAR utilizes random effects meta-analysis to account for heterogeneity across studies and leverages varying Linkage Disequilibrium (LD) structure across populations to increase power to identify causal variants.
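
The greedy selection step can be sketched as below. This is a deliberate simplification: it assumes each SNP already has a posterior probability of being causal and that these sum to one, whereas CAVIAR itself derives posteriors over sets of causal variants from summary statistics and LD. The SNP identifiers, probabilities, and threshold are hypothetical.

```python
def minimal_causal_set(posteriors, threshold=0.95):
    """Greedily add SNPs in decreasing posterior order until their
    cumulative probability reaches `threshold`."""
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    selected, total = [], 0.0
    for snp, prob in ranked:
        if total >= threshold:
            break
        selected.append(snp)
        total += prob
    return selected, total

# Hypothetical per-SNP posterior probabilities at one locus.
locus = {"rs_a": 0.55, "rs_b": 0.25, "rs_c": 0.12, "rs_d": 0.05, "rs_e": 0.03}
causal_set, coverage = minimal_causal_set(locus, threshold=0.9)
```

With these values the set stops at three SNPs, because their combined probability (0.92) is the first running total to reach the 0.9 threshold.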

HU: Estimating cross-population causal SNPs and genetic correlation under the UNITY framework

GARY HU¹, Hugo Mainguy¹, Ruth Johnson², Kathryn Burch³, Bogdan Pasaniuc³,⁴,⁵,⁶

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Computer Science, UCLA
³ Bioinformatics Interdepartmental PhD Program, UCLA
⁴ Dept of Human Genetics, David Geffen School of Medicine, UCLA
⁵ Dept of Pathology and Laboratory Medicine, David Geffen School of Medicine, UCLA
⁶ Dept of Computational Medicine, David Geffen School of Medicine, UCLA

Estimating genetic overlap between traits is an important and ongoing problem in statistical genetics. The application of these methods has mostly been limited to within one population. Here, we apply UNITY (Unifying Non-Infinitesimal Trait analYsis) to genome-wide association data from Biobank-Japan and the UK-Biobank. UNITY is a Bayesian method that estimates the genetic correlation and the proportion of shared causal SNPs between two traits. We utilize this method in a multi-ethnic setting where we estimate these quantities for the same trait across two populations. We find that the proportion of causal SNPs for BMI is 6.5%, with 3.8% of SNPs being causal and shared between populations. For mean corpuscular volume, 2.5% of SNPs are causal, with 1.8% of SNPs being causal and shared. Further analyses show that this method is highly sensitive to different patterns of linkage disequilibrium, motivating the need for more principled methods that explicitly model distinct LD patterns from different populations.

H. HUANG: Analyzing the Chromosome Conformation Differences between Proliferating and Quiescent Cells in Relation to Changes in Gene Expression

HUILING HUANG¹, Mithun Mitra²,³, Hilary A. Coller²,³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Molecular, Cell, and Developmental Biology, UCLA
³ Dept of Biological Chemistry, UCLA

Quiescent cells have reversibly exited the cell cycle and show differential gene expression compared to proliferating cells. How chromosome conformation regulates gene expression in these two cellular states is not clearly known. We compared chromosome conformation between quiescent and proliferating human dermal fibroblasts by analyzing Hi-C contact maps derived from publicly available Hi-C datasets by running a HiC-Pro pipeline. Examining the genome-level cis (within chromosomes) and trans (between chromosomes) interactions in the Hi-C contact maps showed a significantly higher cis/trans ratio in quiescent cells compared to proliferating cells. Eleven percent of genes that were previously shown to switch genomic compartments (A-to-B or B-to-A) during quiescence also displayed a significant change (downregulated or upregulated) in our RNA-seq data between quiescent and proliferating fibroblasts. These genes were found to be enriched in cell cycle genes. Our results, therefore, suggest a possible link between chromosome architecture and gene expression during quiescence.
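
The cis/trans ratio is simply the number of within-chromosome contacts divided by the number of between-chromosome contacts. A minimal sketch, assuming contacts are available as (chromosome, chromosome, count) records; this record layout and the toy counts are assumptions and do not reflect HiC-Pro's actual output format.

```python
def cis_trans_ratio(contacts):
    """`contacts` is an iterable of (chrom_a, chrom_b, count) tuples."""
    cis = trans = 0
    for chrom_a, chrom_b, count in contacts:
        if chrom_a == chrom_b:
            cis += count      # both ends on the same chromosome
        else:
            trans += count    # ends on different chromosomes
    if trans == 0:
        raise ValueError("no trans contacts; ratio is undefined")
    return cis / trans

# Toy contact counts, invented for the example.
toy_contacts = [
    ("chr1", "chr1", 900),
    ("chr1", "chr2", 100),
    ("chr2", "chr2", 600),
]
ratio = cis_trans_ratio(toy_contacts)
```

Computing this ratio separately for each condition gives the genome-level comparison described in the abstract.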

ILOABUCHI, LAWS: Transcriptome analysis of mice treated with anti-miR-144

VIVIAN C. ILOABUCHI¹, TYLER LAWS¹, Jenny Link², Thomas Vallim², Elizabeth Tarling²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Medicine/Cardiology, David Geffen School of Medicine, UCLA

Heart disease is a leading cause of death each year in the United States. In this study, we investigated the regulation of reverse cholesterol transport, a process shown to attenuate heart disease. Our lab has previously shown that mice treated with anti-miR-144 for 4 weeks showed increased ABCA1, which is involved in raising levels of high-density lipoprotein and reducing fat in the heart. We hypothesized that long-term treatment with anti-miR-144 would reveal other key genes in reverse cholesterol transport. Mice were placed on a high fat, high cholesterol diet for 16 weeks, and treated with either saline, control anti-miRNA, or anti-miR-144. Liver tissues were collected and processed for RNA sequencing. Using differential gene expression analysis, we found very few differences in gene expression among the treatment groups. It is likely that long-term treatment with anti-miR-144 suppressed acute effects at the mRNA transcript level.

KARLSBERG: Assembling a Comprehensive Microbial DNA Reference Sequence Database

AARON KARLSBERG¹, Caitlin Loeffler², Eleazar Eskin², David Koslicki³, Serghei Mangul⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Computer Science, University of California Los Angeles, USA
³ Dept of Mathematics, Oregon State University, USA
⁴ Dept of Clinical Pharmacy, School of Pharmacy, University of Southern California

Reference genomes are essential for metagenomics studies, which require comparing metagenomic reads with available reference genomes to identify organisms and their functions within a larger sample. However, existing databases have failed to properly integrate new microbial reference sequences. Here we report the development of Djoin, a novel computational method for the rapid and accurate merging of individual microbial reference databases. On average, the number of contigs representing each strain was reduced by 82.61% while the length per contig increased by 84.60%. Additionally, the overall length of genomes increased by roughly 2.01%. Using Djoin, we created a systematic library of reference genomes called SLRG, which spans various domains of the microbiome, including bacterial, viral, fungal, and protozoan species. SLRG has increased microbial representation by a minimum of 20% and is the largest collection of reference genomes to date. Djoin and SLRG are freely available at https://github.com/smangul1/djoin and https://github.com/smangul1/SLRG, respectively.

KELUO-UDEKE: Incorporating a Character Encoding That Works for the Greek Language (Non-ASCII Characters) in the P300 Speller

SEAN KELUO-UDEKE¹, William Speier²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Medical Imaging Informatics, David Geffen School of Medicine, UCLA

The P300 speller is a common brain–computer interface application designed to communicate language by detecting event-related potentials in a subject’s electroencephalogram signal. Incorporating language models, which exploit existing knowledge of the linguistic domain, has improved the speed and accuracy of the P300 speller, so expanding the range of human languages it covers would make it useful to non-English-speaking subjects. Previously, the system used the American Standard Code for Information Interchange (ASCII) as its character encoding, and ASCII has no non-English character sets. To overcome this, a mapping between the ASCII-based language model characters and Unicode (UTF-8) characters was incorporated into the system. With this mapping, we were able to adapt the system to include a Greek language model. Future work will include online experiments to validate the Greek language model and the accommodation of more languages in the system.
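
The idea of mapping ASCII-based model characters to Unicode characters can be sketched as follows. The abstract does not give the actual table used in the system, so the five-letter table and the function name below are hypothetical, chosen only to show the mechanism.

```python
# Hypothetical partial table: ASCII symbols used by a language model
# mapped to the Greek letters they stand for (not the system's real table).
ASCII_TO_GREEK = {
    "a": "α",
    "b": "β",
    "g": "γ",
    "d": "δ",
    "e": "ε",
}


def to_greek(text):
    """Replace each mapped ASCII character; leave all others unchanged."""
    return "".join(ASCII_TO_GREEK.get(ch, ch) for ch in text)


greek = to_greek("abg")          # "αβγ"
utf8_bytes = greek.encode("utf-8")  # each of these Greek letters takes two bytes in UTF-8
```

Keeping the language model in ASCII and translating only at the display boundary is one possible design; whether the system works this way is an assumption here.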

KIM, XIANG: Generating and Simulating Large Random Gene Regulatory Networks

SANDY KIM¹, MARK XIANG¹, Shamus Cooley², Eric J Deeds³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Bioinformatics Interdepartmental PhD Program, UCLA
³ Dept of Integrative Biology and Physiology, UCLA

Gene expression levels are often regulated by large and complex regulatory networks. The global structures of these networks are often modeled using graphs with nodes and edges, but such models lack dynamic information. Dynamic models of specific Gene Regulatory Networks (GRNs) based on differential equations capture dynamics, but are much smaller than the GRNs found in most organisms. As such, little is currently known about the dynamic properties of GRNs at the whole-genome scale. Here we generate Random GRNs (RGRNs) by randomly sampling both the interactions between genes in the network and the logic governing each regulatory interaction. We have demonstrated that current hardware can simulate the long time-scale behavior of networks of similar size and complexity to those that govern expression dynamics within the human genome. This software will be applied in future work to better understand the general dynamic properties of GRNs at this scale.
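
Sampling both the wiring and the regulatory logic at random can be illustrated with a random Boolean network. This is a simplified stand-in, not the authors' software: the abstract does not specify the modeling formalism, and the function names, the fixed number of regulators per gene, and the synchronous update rule are all assumptions made for this sketch.

```python
import random


def make_random_network(n_genes, n_inputs, seed=0):
    """Sample, for every gene, its regulators and a random truth table."""
    rng = random.Random(seed)
    network = []
    for _ in range(n_genes):
        # Random interactions: which genes regulate this one.
        regulators = rng.sample(range(n_genes), n_inputs)
        # Random logic: one output bit per combination of regulator states.
        truth_table = [rng.randint(0, 1) for _ in range(2 ** n_inputs)]
        network.append((regulators, truth_table))
    return network


def step(network, state):
    """Update all genes synchronously from the current on/off state."""
    new_state = []
    for regulators, truth_table in network:
        index = 0
        for r in regulators:
            index = (index << 1) | state[r]
        new_state.append(truth_table[index])
    return new_state


def simulate(network, state, n_steps):
    """Return the list of states visited, including the initial one."""
    trajectory = [list(state)]
    for _ in range(n_steps):
        state = step(network, state)
        trajectory.append(state)
    return trajectory


net = make_random_network(n_genes=50, n_inputs=3, seed=1)
trajectory = simulate(net, [0] * 50, n_steps=10)
```

Fixing the seed makes a sampled network reproducible, which helps when comparing dynamics across many random networks.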

MAINGUY: Assessing the overlap of complex traits through the shared proportion of causal SNPs and genetic correlation

HUGO MAINGUY¹, Gary Hu¹, Ruth Johnson², Bogdan Pasaniuc³,⁴,⁵

Recent large-scale genome-wide association studies (GWAS) have produced a rich resource of GWAS summary statistics that can be used to systematically assess the genetic overlap across numerous complex traits. Assessing the shared genetic component across traits can aid in investigating potential treatments across multiple diseases. Here, we use UNITY to quantify the proportion of shared causal SNPs and the genetic correlation between pairs of traits from the UK Biobank. The genetic correlations estimated by UNITY for pairs of traits are similar to those estimated by LDSC. For example, LDSC reports a genetic correlation of -0.01 for BMI and sitting height, and UNITY reports a correlation of 0.035. Additionally, we found that 8.3% of the shared SNPs have a non-zero effect for both traits. We use UNITY to create an atlas of genetic correlation and shared causal SNPs across pairs of traits.

MATHUR, PHAN: Inference of the environmental factors to increase power in genome-wide association studies

SAURAV MATHUR¹, TIFFANY PHAN¹, Robert Brown², Sriram Sankararaman²,³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Computer Science, UCLA
³ Dept of Human Genetics, David Geffen School of Medicine, UCLA

Complex human phenotypes, such as diabetes or height, are affected by both genetics and the environment. Although genome-wide association studies (GWAS) have been successful in finding genetic variants that affect phenotypes and increase disease risks, the effect of environmental exposures on phenotypes is difficult to ascertain accurately, resulting in a reduction in statistical power. Previous work has failed to address this problem because diverse environmental factors are difficult to measure accurately and are therefore hard to account for when making disease risk predictions. In this study, we show how to infer the environmental factors and use the inferred factors to improve our understanding of how genetic variants affect traits and disease risk. By better understanding the genetic factors affecting disease, our method will facilitate improvements in personalized precision healthcare and medicine, where treatments are targeted to the genetic and environmental risk factors that are specific to an individual.

NEKRASOV: Investigating RNA Editing Across Different Environmental Conditions

PETER NEKRASOV¹, Mudra Choudhury², Grace Xiao³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Bioinformatics Interdepartmental PhD Program, UCLA
³ Dept of Integrative Biology and Physiology, UCLA

RNA editing is the modification of single nucleotides by RNA binding proteins, leading to substitutions, insertions, and deletions in RNA transcripts. The predominant form of editing converts adenosine to inosine (A-to-I), which is recognized as guanosine by subsequent cellular machinery. Catalyzed by adenosine deaminase (ADAR) enzymes, A-to-I editing affects RNA structure and function, making it an important regulatory mechanism. Previous studies have shown that RNA editing is a highly adaptive mechanism that regulates the innate immune system. Many environmental perturbations require adaptive responses and have been shown to affect posttranscriptional processing, such as alternative splicing. However, the impact of these environmental stimuli on RNA editing remains unclear. In this study, we used publicly available RNA-seq data to investigate RNA editing in various human cell types and chemical treatments. We used an in-house pipeline to detect differential RNA editing between treatments. This study aims to profile the effects of various environmental perturbations on RNA editing.

PIMPLASKAR: Predicting drug sensitivity profiles in cancer through multivariate analysis of genomic features

ADITYA PIMPLASKAR¹, Katherine Sheu², Van M. Savage³, Pamela J. Yeh³, Thomas G. Graeber⁴, Alexander Hoffmann³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Microbiology, Immunology, and Molecular Genetics, UCLA
³ Dept of Ecology and Evolutionary Biology, UCLA
⁴ Dept of Molecular and Medical Pharmacology, David Geffen School of Medicine, UCLA

Genomic alterations can confer sensitivity to drugs, but questions remain about how combinations of genomic alterations influence drug sensitivity. Here we examine exome sequencing and SNP array data from over 150 blood cancer cell lines to identify mutational and copy-number alterations that may correlate with drug sensitivity. Unsupervised analysis of drug screening data across these cell lines suggests that blood cancer subtype is a poor predictor of drug sensitivity. However, certain genomic features decompose cancers by subtype. Our approach relates genomic features to drug sensitivities, leveraging mutational profiles as predictors of drug sensitivity. We analyze mutational patterns to find candidate epistatic interactions, and use a multivariate approach to find correlated drug-mutation pairs. We consider pairwise mutational epistasis to build preliminary drug sensitivity models. Using drug sensitivity data and statistical clustering, we aim to model how gene interactions influence drug sensitivities, and to infer drug mechanisms and pathways by clustering interaction profiles.

PURKAYASTHA: Exploring Effects of Metabolic Dysfunction in Cell-Free DNA as Potential Biomarkers for ALS

SUBHANIK PURKAYASTHA¹, Christa Caggiano², Barbara Celona³, Fleur Garton⁴, Brian Black³, Naomi Wray⁴, Catherine Lomen-Hoerth⁵, Noah Zaitlen⁶

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Bioinformatics Interdepartmental PhD Program, UCLA
³ Cardiovascular Research Institute, UCSF
⁴ The University of Queensland, Institute for Molecular Bioscience, Queensland, Australia
⁵ Dept of Neurology, UCSF
⁶ Dept of Neurology, David Geffen School of Medicine, UCLA

Amyotrophic Lateral Sclerosis (ALS) is a debilitating disease, with a three-year mean survival [1], affecting nearly 30,000 Americans. Current diagnosis involves a lengthy exclusion process, and measures of progression are subjective. Therefore, discovering a reliable biomarker for ALS is imperative for patient care and drug development. Metabolic dysfunction and alteration of mitochondrial DNA (mtDNA) levels have been linked to neurodegeneration [2], but sampling degenerative tissue is complex. Cell-free DNA (cfDNA) is found in bodily fluids after cellular decay [3], and therefore cell-free mitochondrial DNA (cf-mtDNA) may be a potential biomarker for disease monitoring. We sequenced cfDNA from 20 ALS patients and 20 age-matched controls to examine differences in abundance and presence of structural variation in cf-mtDNA. We observed a suggestive trend of diminished mtDNA levels in ALS patients, potentially driven by metabolic dysfunction in mtDNA. Going forward, we will examine cf-mtDNA in a 10,000-patient cohort to determine its suitability as a clinical biomarker.

RAJKUMAR: Packaging and Containerizing of Bioinformatics Software: Advances, Challenges and Opportunities

NEHA RAJKUMAR¹, Ram Ayyala², Qiyang Hu³, Richard J. Abdill⁴, Eleazar Eskin², Serghei Mangul⁵

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Computer Science, UCLA
³ Institute for Digital Research and Education, UCLA
⁴ Dept of Genetics, Cell Biology, and Development, University of Minnesota
⁵ Dept of Clinical Pharmacy, School of Pharmacy, University of Southern California

The efficiency of bioinformatics software is crucial to the advancement of the field; researchers depend on these state-of-the-art technologies to perform analyses on large amounts of genomic data. These technologies, however, require substantial computational skills that the existing biological sciences curriculum is not expected to provide. Package managers and containers provide a solution, offering an interface where the installation, configuration, updates, and removals of tools are all handled. However, the implementation has not met the community’s needs, due to lack of information. This paper provides an overview of the challenges, advantages, and limitations of the current software from user and developer perspectives. By observing trends in both industry and academia and analyzing how installation time and run time are affected by the software and method of installation, we can find which methods work best and use them to generate recommendations for future software updates.

SOETEDJO: Analysis of Machine Learning Models for the P300 Speller

JAMES SOETEDJO¹, William Speier²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Depts of Radiological Sciences and Bioinformatics, David Geffen School of Medicine, UCLA

Amyotrophic lateral sclerosis (ALS) is a rare neurodegenerative disease. In late stages of this disease, patients are cognitively aware, but cannot move or speak. To restore communication, brain-computer interfaces such as the P300 speller have been developed, which produce written language based on electroencephalogram (EEG) signals recorded from the user. This system flashes rows and columns of a 6×6 character grid and decodes attended characters via a machine learning model to output the target text. We compared five different machine learning models created in MATLAB across 14 healthy subjects. Each subject typed a 30-character phrase, which was decoded in a three-fold cross-validation analysis. The average number of stimuli and the average error rate were recorded for each model, and information transfer rates (ITR) were computed. Overall, the best models were the step-wise linear discriminant analysis and the multilayer perceptron, with average ITR values of 52 and 44 bits/minute, respectively.
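
The abstract does not state how ITR was computed. A common choice in the brain-computer interface literature is Wolpaw's definition, which the following sketch assumes; the function names are hypothetical, and the 36 classes correspond to the 6×6 grid mentioned above.

```python
import math


def bits_per_selection(n_classes, accuracy):
    """Wolpaw information per selection, in bits.

    B = log2(N) + P*log2(P) + (1 - P)*log2((1 - P) / (N - 1))
    """
    if n_classes < 2:
        raise ValueError("need at least two classes")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    p = accuracy
    bits = math.log2(n_classes)
    # Terms with a zero coefficient are skipped to avoid log2(0).
    if p > 0.0:
        bits += p * math.log2(p)
    if p < 1.0:
        bits += (1.0 - p) * math.log2((1.0 - p) / (n_classes - 1))
    return bits


def itr_bits_per_minute(n_classes, accuracy, selections_per_minute):
    """Scale the per-selection information by the selection rate."""
    return bits_per_selection(n_classes, accuracy) * selections_per_minute


perfect = bits_per_selection(36, 1.0)      # log2(36), about 5.17 bits
chance = bits_per_selection(36, 1.0 / 36)  # chance-level accuracy carries about 0 bits
```

Under this definition, ITR rewards both higher accuracy and fewer stimuli per selection, which is why the two quantities recorded in the study are sufficient to compute it.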

SUN, TRUBY: Using Neoantigen Analysis Pipeline to Predict Cancer Immunotherapy Response

AMANDA SUN¹, TIANNA TRUBY¹, Jim Liu², Jasmine Zhou²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Pathology and Laboratory Medicine, David Geffen School of Medicine, UCLA

Weaponizing the body’s own immune system to fight against cancer, immunotherapy is a recent development revolutionizing the landscape of cancer therapies. However, dramatic variations in response necessitate an ability to accurately identify biomarkers predictive of treatment success. One promising class of biomarkers is neoantigens, mutated peptides generated by tumor cells and recognized by the immune system. Here, we compare various bioinformatic methods, including variant discovery, somatic SNP calling, and MHC binding affinity prediction, to develop an optimal pipeline to predict tumor neoantigen burden (TNB). Using publicly available data from immunotherapy-treated lung, skin, and rectal cancer patients (n=5; n=73; n=14, respectively), we analyze the correlation between treatment success and TNB to determine the most accurate pipeline. Although TNB shows promise as a predictive biomarker in some cancers, even the best pipeline did not precisely indicate immunotherapy efficacy. However, further optimization has the potential to greatly increase the capabilities of TNB as an effective biomarker.

TAN: Intron retention is a robust marker of intertumoral heterogeneity in pancreatic ductal adenocarcinoma

DANIEL J. TAN¹, Mithun Mitra²,³, Alec M. Chiu⁴, Hilary A. Coller²,³,⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Molecular, Cell, and Developmental Biology, UCLA
³ Dept of Biological Chemistry, David Geffen School of Medicine, UCLA
⁴ Bioinformatics Interdepartmental PhD Program, UCLA

Pancreatic Ductal Adenocarcinoma (PDAC) is a highly heterogeneous cancer of the pancreatic exocrine gland. It is the fourth leading cause of cancer-related deaths in the US, with a 5-year survival rate of 8%. This study seeks to explain PDAC intertumoral heterogeneity on the basis of different types of alternative splicing (AS) events. Unsupervised clustering of 76 PDAC patients was performed based on AS events derived from TCGA RNA-Seq data. Intron retention (IR) was the most robust AS event, and patterns of intron retention separated patients into two clusters with different survival outcomes. Genes undergoing differential IR between the two clusters were enriched for splicing factors. In addition, IR events in the cluster with worse survival were enriched in tumor-suppressor genes. Taken together, our study shows that differences in IR among PDACs could be a strong determinant of PDAC heterogeneity.

TANDON: Tissue-specific gene expression prediction using penalized linear regression models

ANCHIT TANDON¹, Mike Thompson²,³, Noah Zaitlen⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Bioinformatics Interdepartmental PhD Program, UCLA
³ Dept of Computer Science, UCLA
⁴ Dept of Neurology, David Geffen School of Medicine, UCLA

Large-scale datasets such as the Genotype-Tissue Expression (GTEx) project have assisted researchers in characterizing the relationship between gene expression, genotypes, and complex traits. Nonetheless, current genotype-expression imputation methods either learn separate models for each tissue or do not focus on the homogeneous effect of genetics that is shared across tissues. Here, we implement prediction models for imputing gene expression in various tissues from simulated expression-genotype data using a method that considers both heterogeneous and homogeneous components of genetic effects across all tissues simultaneously. In simulations, we compare our joint model with two other models: (i) an entirely heterogeneous Tissue-by-tissue model, and (ii) an entirely Homogeneous-in-all-tissues model. For each model, we tried the following penalized regression methods: ridge regression, least absolute shrinkage and selection operator (LASSO), and Elastic Net. Across the entire range of simulation parameters, our joint model more accurately deconvolved the homogeneous and heterogeneous components of the genetic effects on expression.

ZHANG: Dense vector representations of proteins based on Gene Ontology annotations using word and sentence embedding tools

JAMES ZHANG¹, Chelsea Ju², Dat Duong², Yunsheng Bai², Wei Wang²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Dept of Computer Science, UCLA

Recent advances in machine learning algorithms make more accurate predictions of protein-protein interactions possible; however, proteins must first be embedded into numerical vectors. Recently, the Gene Ontology (GO), a database of biological terms arranged in a hierarchical graph structure, has been considered a potential source for extracting vector representations of proteins. Here, we use two natural language processing methods to embed proteins into dense vector representations. In the first method, the structure of the GO graph, with accompanying protein annotations, is described in a series of sentences. A word embedding of each protein is generated using Word2Vec. In the second method, sentence embeddings of GO term definitions are input into a graph convolutional network trained on entailment relationships of GO terms. Protein embeddings are calculated from GO term embeddings taken from the embedding layer. We find that the dense vectors perform well in binary protein-protein interaction and multiclass enzymatic function classification tasks.
