2019 Bruins-In-Genomics Summer Undergraduate Research Program

2019 B.I.G. Summer Participants

Lab PIs | Mentors | Students
VALERIE ARBOLEDA | Leroy Bondhus | Ma. Carmelle Catamura, UC Santa Cruz
 |  | Katherine Sanchez, University of Michigan, Ann Arbor
SIOBHAN BRAYBROOK | Lauren Dedow | Alejandro Espinoza, University of La Verne
HILARY COLLER | Mithun Mitra | Huiling Huang, UCLA
 |  | Daniel Jason Tan, UCLA
ERIC DEEDS | Shamus Cooley | Sandy Kim, UCLA
 |  | Yankai (Mark) Xiang, University of Massachusetts Amherst
JASON ERNST | Soo Bin Kwon | Grace Casarez, UC Santa Barbara
 |  | Trevor Ridgley, UC Santa Cruz
 | Shan Sabri | Rebecca (Becca) Castillo, New Mexico Institute of Mining and Technology
 |  | Jeremy Wang, Brown University
ELEAZAR ESKIN | Kodi Collins and Nathan LaPierre | Rosemary (Rose) He, UCLA
 |  | Xin (Helen) Huang, UCLA
 | Lisa Gai | Camille Huang, UCLA
 |  | Jingyuan Fu, UCLA
 | Serghei Mangul | Sei Chang, UCLA
 |  | Nicholas (Nico) Darci-Maher, UCLA
 |  | Aaron Karlsberg, UCLA
 |  | Neha Rajkumar, UCLA
NANDITA GARUD |  | Daisy Chen, UCLA
 |  | Sara Thornburgh, UCLA
ALEXANDER HOFFMANN | Katherine Sheu | Aditya Pimplaskar, UCLA
LEONID KRUGLYAK | Longhua Guo | Ryan Carney, Johns Hopkins University
 |  | Isimeme (Naomi) Udu, Spelman College
JAMIE LLOYD-SMITH | Amandine Gamble | Natashia Benjamin, University of the District of Columbia
 |  | Jessica Kasamoto, Johns Hopkins University
KIRK LOHMUELLER | Jesse Garcia | Miguel Guardado, San Francisco State University
 |  | Jonathan (Jon) Mah, University of Washington, Seattle
HANNA MIKKOLA | Sandra Capellera Garcia | Sophia Ekstrand, Harvard-Westlake High School
ROEL OPHOFF | Toni Boltz | Rachel Elting, University of Kansas
 |  | Nicole Zeltser, California Polytechnic State University, San Luis Obispo
JENNY PAPP | Benjamin Chu | Francis Adusei, Jackson State University
 |  | Vivian Garcia, University of Florida
BOGDAN PASANIUC | Ruthie Johnson | Gary Hu, Duke University
 |  | Hugo Mainguy, Stony Brook University
SRIRAM SANKARARAMAN | Rob Brown | Saurav Mathur, University of Wisconsin-Madison
 |  | Tiffany Phan, University of Colorado at Boulder
VAN SAVAGE | Mauricio Cruz Loya | Alhaji Foray, Fisk University
 |  | Eric Yeh, UC San Diego
BILL SPEIER |  | Osita (Sean) Keluo-Udeke, University of Arkansas at Pine Bluff
 |  | James Soetedjo, University of Washington, Seattle
TOM VALLIM | Jenny Link | Vivian (Veev) Iloabuchi, Fisk University
 |  | Tyler Laws, North Carolina State University
WEI WANG | Yunsheng Bai and Chelsea Ju | James Zhang, Carnegie Mellon University
ROY WOLLMAN | Alon Oyler-Yaniv and Evan Maltz | Maxim Ermoshkin, University of Richmond
XINSHU (GRACE) XIAO | Mudra Choudhury | Peter Nekrasov, Yale University
NOAH ZAITLEN | Christa Caggiano | Subhanik Purkayastha, Brown University
 | Mike Thompson | Anchit Tandon, Indian Institute of Technology, Delhi
XIANGHONG (JASMINE) ZHOU | Jim Liu | Amanda Sun, Vanderbilt University
 |  | Tianna Truby, UC Santa Barbara

2019 B.I.G. Summer Poster Abstracts

FRANCIS ADUSEI1, VIVIAN GARCIA1, Benjamin Chu2, Jeanette Papp1,3, Kenneth Lange2,3,4,5

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Biomathematics, David Geffen School of Medicine, UCLA
3 Dept of Human Genetics, David Geffen School of Medicine, UCLA
4 Dept of Biostatistics, David Geffen School of Medicine, UCLA
5 Bioinformatics Interdepartmental PhD Program, UCLA

Most genome-wide association studies (GWAS) perform univariate linear regressions rather than modeling all predictors simultaneously. Iterative Hard Thresholding (IHT) is an algorithm for multivariate regression that provides a way to model all covariates in unison. However, IHT currently does not estimate nuisance parameters for generalized linear models. Therefore, this project extends IHT by estimating the nuisance parameter for Negative Binomial models using maximum likelihood estimation (MLE). Using the Julia programming language, we conducted a systematic analysis comparing Majorization-Minimization (MM) algorithms and Newton’s method for estimating the nuisance parameter. Our results indicate that more regression coefficient estimates are recovered when the nuisance parameter is estimated by maximum likelihood.
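
As an illustration of the Newton's method step mentioned above, the following is a minimal Python sketch, not the project's Julia implementation. It assumes an intercept-only Negative Binomial model whose mean is fixed at the sample mean, so that only the dispersion parameter r is estimated. The function names, the method-of-moments starting value, and the step-halving safeguard are all assumptions made for this example.

```python
import math


def nb_score_and_hessian(r, counts, mu):
    """First and second derivatives of the NB log-likelihood with respect to r.

    Uses the parameterization with mean mu and dispersion r (variance mu + mu^2 / r).
    """
    n = len(counts)
    score = n * (math.log(r / (r + mu)) + 1.0)
    hess = n * (1.0 / r - 1.0 / (r + mu))
    for y in counts:
        # For integer y, digamma(y + r) - digamma(r) equals sum_{j<y} 1 / (r + j),
        # so no special functions are needed.
        score += sum(1.0 / (r + j) for j in range(y))
        hess -= sum(1.0 / (r + j) ** 2 for j in range(y))
        score -= (r + y) / (r + mu)
        hess -= (mu - y) / (r + mu) ** 2
    return score, hess


def newton_dispersion(counts, tol=1e-10, max_iter=100):
    """Estimate the dispersion r by Newton's method (hypothetical helper)."""
    n = len(counts)
    mu = sum(counts) / n
    var = sum((y - mu) ** 2 for y in counts) / n
    if var <= mu:
        raise ValueError("expected overdispersed counts (variance > mean)")
    r = mu * mu / (var - mu)  # method-of-moments starting value (assumption)
    for _ in range(max_iter):
        score, hess = nb_score_and_hessian(r, counts, mu)
        new_r = r - score / hess
        if new_r <= 0:  # safeguard: keep the iterate inside the valid region
            new_r = r / 2.0
        if abs(new_r - r) < tol:
            return new_r
        r = new_r
    return r


counts = [0, 1, 2, 3, 10, 0, 5, 1, 8, 0]  # made-up toy data
r_hat = newton_dispersion(counts)
```

At the returned value the score (the first derivative) should be close to zero, which is a simple check that the iteration has reached a stationary point of the likelihood.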

NATASHIA J. BENJAMIN1, JESSICA Y. KASAMOTO1, Amandine Gamble2, Christian T. Mason2, James O. Lloyd-Smith2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Ecology and Evolutionary Biology, UCLA

Henipaviruses are emerging pathogens that cause a range of neurological and respiratory disorders in humans. The two best-known henipaviruses, Hendra and Nipah viruses, have a lethality rate of 60% or higher. They are transmitted to humans from bats, their main wildlife reservoir, often via horses or pigs. However, the risks posed by other recently-discovered henipaviruses are unclear. Here we adapt a within-host compartmental model to incorporate unique aspects of henipavirus biology in order to understand infection patterns in cell culture experiments. We analyze how the population dynamics of viruses and infected cells depend on key biological rates, and also use the model as a platform to assess how accurately these rates can be estimated given different experimental designs. Our model will inform the design and analysis of future laboratory experiments studying henipavirus infection across host species and tissue types, and hence contribute to a new evidence-based framework for risk assessment.
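
For readers unfamiliar with within-host compartmental models, the sketch below integrates the generic target-cell-limited model (uninfected target cells T, infected cells I, free virus V) with a simple Euler scheme. It is not the henipavirus-specific model described above; the function name and all parameter values are hypothetical and chosen only so that the example runs.

```python
def simulate_tiv(beta, delta, p, c, target0, virus0, dt=0.01, t_end=30.0):
    """Euler integration of a generic target-cell-limited model.

    dT/dt = -beta * T * V
    dI/dt =  beta * T * V - delta * I
    dV/dt =  p * I - c * V
    """
    target, infected, virus = float(target0), 0.0, float(virus0)
    trajectory = [(0.0, target, infected, virus)]
    steps = int(round(t_end / dt))
    for k in range(1, steps + 1):
        new_infections = beta * target * virus
        d_target = -new_infections
        d_infected = new_infections - delta * infected
        d_virus = p * infected - c * virus
        target += dt * d_target
        infected += dt * d_infected
        virus += dt * d_virus
        trajectory.append((k * dt, target, infected, virus))
    return trajectory


# Hypothetical rates: infection, infected-cell death, virus production, clearance.
trajectory = simulate_tiv(beta=1e-6, delta=1.0, p=100.0, c=5.0,
                          target0=1e5, virus0=10.0)
```

Varying one rate at a time and re-running the simulation is one simple way to see how the virus and infected-cell curves depend on each parameter.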

RYAN CARNEY1, ISIMEME UDU1, Longhua Guo2, Joshua Bloom2, Elise Pham2, Zain Kashif2, Katarina Ho2, Sandra Duarte-Vogel2, Ana Alcaraz3, Leonid Kruglyak2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Human Genetics, David Geffen School of Medicine, UCLA
3 Dept of Anatomic Pathology, College of Veterinary Medicine, Western University

Animal pigmentation serves crucial functions in protection (e.g., camouflage) and signaling (e.g., mate recognition). The genetic basis of variation in color and pattern has been described in only a few cases, leaving a big gap in our knowledge. We studied pigmentation variation in the leopard gecko, Eublepharis macularius. This lizard species has been bred in captivity for over 50 years, and dozens of color and pattern morphs exist. We focused our efforts on a semi-dominant mutation that results in extensive white and lemon color in the skin. This morph is known as “lemon frost.” Individuals carrying this mutation also develop tumors of white color that metastasize to internal organs, suggesting that the mutation leads to increased proliferation of white cells. We used genetic linkage analysis to localize the mutation to a 20 kb region of the leopard gecko genome that is syntenic with human chromosome 15 and green anole chromosome 1. We are analyzing RNA sequencing data from tumor and normal skin with the aim of validating the candidate genes in the causal region. In summary, our work identified a genetic locus that leads to white coloration and tumor formation.

GRACE CASAREZ1, TREVOR RIDGLEY1, Soo Bin Kwon2,3, Jason Ernst2,3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Bioinformatics Interdepartmental PhD Program, UCLA
3 Dept of Biological Chemistry, David Geffen School of Medicine, UCLA
4 Dept of Computer Science, UCLA

Our understanding of mammalian genomes remains largely limited to protein-coding regions. Recently, a comparative functional genomic approach between human and mouse generated a genome-wide score that estimated the strength of evidence of functional genomics conservation based on predictive genomic signals. Here we apply the same approach to the human and rat genomes by training an ensemble of pseudo-Siamese neural networks (EPSNN) on publicly-available DNase-Seq and ChIP-Seq experiments curated by ChIP-Atlas, as well as FANTOM5 Cap Analysis Gene Expression (CAGE) data. The human-rat Functional Genomics Conservation (FGC) score highlights the locations of transcription start sites (TSS), promoters, insulators and enhancers in the human genome. The score is also indicative of pairs of human-rat regions with similar regulatory activity. With additional data from the rat genome, we foresee that the score could be used to better understand cross-species differences in cis-acting elements.

REBECCA CASTILLO1, JEREMY WANG1, Shan Sabri2, Jason Ernst2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Biological Chemistry, David Geffen School of Medicine, UCLA

Current chromatin mapping technologies, such as ATAC-seq, yield averaged chromatin profiles that are insensitive to cellular heterogeneity in composite populations. Recent technical advancements have led to the feasibility of single-cell ATAC-seq, but bulk ATAC-seq is still preferred due to cost and accuracy considerations. To address this problem, we propose ATAC-DelFi, a method that accurately predicts chromatin accessibility at the single cell-type level from population-level ATAC-seq data. ATAC-DelFi leverages an ATAC-seq cell-type atlas as a feature set to infer cell-type proportions using machine learning. We demonstrate ATAC-DelFi’s ability to detect prominent cell types in various mouse tissues across different developmental stages and derive interesting mechanistic insights from analyses of unexplainable variance in bulk signals. Our study suggests that ATAC-DelFi has the potential to accurately deconvolve heterogeneous ATAC-seq signals and can assist in the characterization of unknown cell types, which provides a powerful approach to understanding the mechanisms underlying cell identity.

CARMELLE CATAMURA1, KATHERINE SANCHEZ1, Leroy Bondhus2, Valerie Arboleda2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Human Genetics, David Geffen School of Medicine, UCLA
3 Dept of Pathology and Laboratory Medicine, David Geffen School of Medicine, UCLA

KAT6A is a lysine acetyltransferase. Mutations in KAT6A result in a developmental syndrome characterized by symptoms such as cardiac defects, intellectual disability, and speech delay. KAT6A mutation may disrupt histone acetylation, which could cascade to affect other epigenetic features. We hypothesized that epigenetic dysregulation contributes to KAT6A syndrome. We used ChIP-seq data from ENCODE for epigenetic and regulatory features to associate these and higher-order epigenetic features with sites differentially methylated between KAT6A syndrome patient fibroblasts (n=12) and healthy control fibroblasts (n=13). Our results suggest that sites of differential methylation are enriched at specific epigenetic features (e.g., H2AFZ, H3K9me3). Additionally, we found hypermethylated sites in KAT6A mutation samples to be enriched in binding sites of transcription factors EZH2, MAX, and IKZF1, and hypomethylated sites enriched in binding sites of EZH2 and RNF2, members of PRC2 and PRC1, respectively, suggesting a possible connection between KAT6A and the PRC complexes.

SEI CHANG1, Varuni Sarwal2, Ram Ayyala3, Nicholas Darci-Maher1, Samantha Jensen4, Eleazar Eskin5, Jonathan Flint6, Serghei Mangul7

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Indian Institute of Technology Delhi, Hauz Khas, New Delhi, Delhi 110016, India
3 Undergraduate Interdepartmental Program for Neuroscience, UCLA
4 Genetics & Genomics BioSciences Program, UCLA
5 Dept of Computational Medicine, David Geffen School of Medicine, UCLA
6 Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, UCLA
7 Dept of Clinical Pharmacy, School of Pharmacy, University of Southern California

Discovery of structural variants (SVs), regions of alterations in the genome resulting from structural differences in chromosomes, promises insight into human diversity and disease susceptibility. Due to advances in whole genome sequencing, a plethora of methods have been developed in pursuit of accurate and comprehensive SV-detection. Currently, there is no rigorous standard that investigators can use to select the most appropriate SV-detection tools. In contrast to previous benchmarking studies, our gold standard dataset includes a complete set of SVs that allows accurate reporting of both precision and sensitivity of SV-detection methods. To provide an optimistic estimate of detection accuracy, our study examines the tools’ ability to detect deletions, a less complex type of SV. We found a wide variation of performance among tools, and only a few methods provide the desired balance between sensitivity and precision. Upon further analysis, we determined optimal SV callers for low and ultra-low coverage sequencing data.
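
To make the two reported metrics concrete, the sketch below computes precision and sensitivity for a set of called deletions against a gold standard. The matching rule (same chromosome, both breakpoints within a fixed tolerance, each gold-standard deletion matched at most once) is an assumption for illustration and may differ from the criterion used in the study. The function name and toy coordinates are hypothetical.

```python
def evaluate_deletions(calls, truth, tolerance=100):
    """Return (precision, sensitivity) for deletion calls.

    Each deletion is a (chromosome, start, end) tuple. A call counts as a true
    positive if an unmatched gold-standard deletion lies on the same chromosome
    with both breakpoints within `tolerance` base pairs.
    """
    unmatched = list(truth)
    true_positives = 0
    for chrom, start, end in calls:
        for candidate in unmatched:
            t_chrom, t_start, t_end = candidate
            if (chrom == t_chrom
                    and abs(start - t_start) <= tolerance
                    and abs(end - t_end) <= tolerance):
                unmatched.remove(candidate)  # each true deletion is used once
                true_positives += 1
                break
    precision = true_positives / len(calls) if calls else 0.0
    sensitivity = true_positives / len(truth) if truth else 0.0
    return precision, sensitivity


gold = [("chr1", 1000, 2000), ("chr1", 5000, 5600), ("chr2", 300, 900)]
called = [("chr1", 1010, 1990), ("chr2", 300, 2000),
          ("chr1", 5050, 5580), ("chr3", 1, 100)]
precision, sensitivity = evaluate_deletions(called, gold)
```

In this toy example two of the four calls match, so precision is 2/4 and sensitivity is 2/3.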

DAISY CHEN1, SARA THORNBURGH1, Nandita Garud2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Ecology and Evolutionary Biology, UCLA

Recent work by Garud and Good et al. (2019) showed that human adult gut microbiota can evolve on 6-month timescales, whereas replacement of microbial strains dominates on longer timescales. Here, we apply a similar analytical framework to infant gut microbiomes, for which early evolutionary dynamics are poorly understood. We obtained longitudinal shotgun metagenomic data for mother-infant stool samples from previous studies (Backhed et al. 2015, Ferretti et al. 2018, and Yassour et al. 2018) and used the MIDAS pipeline to estimate strain-level genomic variation. We find temporal trends in population structure; notably, one dominant strain at birth typically precedes the appearance of multiple strains days later. While signatures of strain replacement predominate within the first days postpartum, we find evidence that evolution of dominant strains occurs over the following months. Additionally, we explore the significance of hypermutability in bacterial genomes, which may reflect unique selective pressures in the infant gut.

NICHOLAS DARCI-MAHER1, Dat Duong2, Richard J. Abdill3, Eleazar Eskin2, Serghei Mangul4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Computer Science, UCLA
3 Dept of Genetics, Cell Biology, and Development, University of Minnesota
4 Dept of Clinical Pharmacy, School of Pharmacy, University of Southern California

The constant decrease in the cost of sequencing has resulted in the creation and exponential growth of online sequence repositories such as the Sequence Read Archive. Reuse of this public omics data has been demonstrated to provide key insights into complex biological systems. However, a significant amount of data in these repositories is deposited by the lab that generated it and never reanalyzed. We have conducted an analysis of over two million full texts and preprints to investigate the reuse patterns of omics data. There are far more papers generating their own data than papers reusing data, resulting in a shallow depth of analysis per sample. We aim to illuminate the barriers causing scientists to shy away from reusing data, including missing metadata, confusing online systems, and stigma. We provide a comprehensive picture of the current landscape of genetic repositories, as well as a quantitative analysis of genetic data reusability.

SOPHIA EKSTRAND1, Sandra Capellera Garcia2, Feiyang Ma2, Hanna Mikkola2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Molecular, Cell and Developmental Biology, UCLA

Developmental hematopoiesis evolves from lineage-primed progenitors to self-renewing hematopoietic stem cells (HSCs). Despite extensive studies in mice, lack of access to tissues and methods to identify emerging HSCs limit our understanding of human hematopoietic development. We created a single-cell transcriptome map of hemato-vascular cells from first and second trimester human hematopoietic tissues. By utilizing a molecular signature of self-renewing HSCs, we identified CD34+Thy1+RUNX1+HOXA+MLLT3+HLF+ HSCs emerging from HOXA-patterned hemogenic endothelium in 5-week embryos. We also discovered SPINK2 as a novel marker of HSCs throughout development. In early fetal liver and yolk sac, SPINK2 also marks lympho-myeloid progenitors, which lack HOXA expression, suggesting an HSC-independent origin. Additionally, we found unexpected macrophage populations with endothelial signature and tissue-specific HOXA code in various organs. This data set provides an unparalleled resource to investigate human developmental hematopoiesis, and a reference for generating distinct hematopoietic cells in vitro for therapeutic purposes.

RACHEL ELTING1, NICOLE ZELTSER1, Toni Boltz2, Loes Olde Loohuis3, Roel Ophoff2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Human Genetics, David Geffen School of Medicine, UCLA
3 Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, UCLA

Genome-wide association studies (GWAS) are used to identify genetic loci that are associated with a trait of interest in order to decipher underlying biological mechanisms. Metabolite measures in cerebrospinal fluid (CSF), the fluid that surrounds the brain, provide insight into brain function that may be relevant for neurobehavioral traits and neuropsychiatric disorders. We performed a GWAS of 600 metabolites in a sample of 500 human subjects, the largest set of CSF data used in a GWAS to date. We applied standard quality control of genetic data including missingness, minor allele frequency cut-off, and population stratification. Phenotype data were checked for outliers, and non-normal data were transformed using inverse rank normalization. A linear association, including age and sex as additional covariates, was performed using the PLINK toolset. Preliminary results show significant associations with a number of metabolites. Future work includes quality control and functional annotations of these results.
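
The inverse rank normalization step can be sketched as follows: each value is replaced by the standard normal quantile of its rank. The Blom offset of 3/8 and the averaging of tied ranks are common conventions but are assumptions here, since the abstract does not state which variant was used. The function name and example values are hypothetical.

```python
from statistics import NormalDist


def inverse_rank_normalize(values, offset=0.375):
    """Rank-based inverse normal transform with average ranks for ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0  # average 1-based rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - offset) / (n - 2 * offset + 1))
            for r in ranks]


metabolite_levels = [3.2, 100.0, 0.5, 7.7, 1.1]  # made-up, skewed toy values
normalized = inverse_rank_normalize(metabolite_levels)
```

The transform preserves the ordering of the samples while removing the influence of extreme values such as the 100.0 above, which is why it is often applied before linear association testing.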

MAXIM ERMOSHKIN1, Evan Maltz2, Alon Oyler-Yaniv2, Jennifer Oyler-Yaniv2, Roy Wollman2,3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Molecular Biology Institute, UCLA
3 Dept of Integrative Biology and Physiology, UCLA
4 Dept of Chemistry and Biochemistry, UCLA

Computer vision is a set of tools for extracting quantitative information from images in a systematic, reproducible, and unbiased manner. In the context of biological imaging, a typical image analysis pipeline often consists of binary masking, single-cell segmentation, feature extraction, and tracking over time. Here, we implement such a pipeline and apply it to a dataset of HSV-1 (Herpes Simplex Virus-1) infected NIH3T3 fibroblasts. The experimental setup included wells with HSV-1 either present or absent and treated with different TNF (Tumor Necrosis Factor) concentrations. The pipeline was used to identify each cell in each experimental setup and track its properties throughout the course of the time-lapse experiment. The processed data was then used to fit a death-proliferation model that predicts if infected cells will undergo apoptosis based on TNF concentration and infection with HSV-1. We see that TNF dramatically increases the rate of death in HSV-1 infected cells.
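
The first stages of such a pipeline (binary masking, labeling of connected regions, and extraction of one feature) can be sketched in plain Python as below. This is a simplified stand-in, not the pipeline used in the project: it assumes a fixed intensity threshold and 4-connectivity, and it does not separate touching cells or track them over time. The function name and toy image are hypothetical.

```python
from collections import deque


def label_cells(image, threshold):
    """Threshold an image and label 4-connected foreground regions.

    Returns (labels, areas): a label image (0 = background) and a dict
    mapping each label to its pixel area.
    """
    rows, cols = len(image), len(image[0])
    mask = [[pixel > threshold for pixel in row] for row in image]
    labels = [[0] * cols for _ in range(rows)]
    areas = {}
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labels[r][c]:
                continue
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            area = 0
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                area += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            areas[next_label] = area
    return labels, areas


toy_image = [
    [0, 9, 9, 0, 0],
    [0, 9, 0, 0, 8],
    [0, 0, 0, 0, 8],
    [7, 0, 0, 0, 0],
]
labels, areas = label_cells(toy_image, threshold=5)
```

Per-region features such as area can then be recorded frame by frame, which is the kind of per-cell measurement a tracking step would link over time.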

ALEJANDRO ESPINOZA1, Lauren Dedow2, Joanna Landymore3, Firas Bou Daher2,3, Siobhan Braybrook2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Molecular, Cell and Developmental Biology, UCLA
3 Sainsbury Laboratory, University of Cambridge, UK

Growth of the Arabidopsis thaliana dark-grown hypocotyl is largely due to cell elongation, which occurs at a faster rate in lower cells compared to upper cells. Cell elongation is influenced by factors such as light, gravity, and hormones; however, little is known about gene expression with respect to the elongation wave seen in dark-grown hypocotyls. Dark-grown hypocotyls were harvested at 24, 36, and 48 hours and dissected into fast- and slow-growing regions. RNA-seq was performed to evaluate the transcriptome and identify differentially expressed genes (DEGs). We compared two common aligners, STAR and TopHat2, paired with two differential expression analyzers, DESeq2 and edgeR, to determine an efficient and accurate methodology. STAR and TopHat2 identified 55% of the same DEGs. Furthermore, DESeq2 identified 32% more unique DEGs compared to edgeR. The combination of the STAR and edgeR packages will be used for analysis of gene ontology and cell wall modifying genes.

ALHAJI FORAY1, ERIC YEH1, Mauricio Cruz1,2, Eric Deeds3, Van Savage4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Computational Medicine, David Geffen School of Medicine, UCLA
3 Dept of Integrative Biology and Physiology, UCLA
4 Dept of Ecology and Evolutionary Biology, UCLA

The activity of an enzyme can be regulated by a molecule binding to a different site than the active site, despite it being far away in the protein structure. This phenomenon is known as allosterism. We perform network analysis of the three-dimensional structure of human liver pyruvate kinase (hL-PYK) to predict residues important for allosterism. Our method is based on variants of betweenness centrality, which measure how often residues lie in short paths between the allosteric and catalytic sites. We compare our predictions to an experimental dataset where every non-alanine and non-glycine residue in hL-PYK was mutated to alanine. We find that communicability betweenness, which considers weighted paths of different lengths, outperforms shortest-path betweenness in predicting residues important to allostery. Our predictions improve substantially when analyzing the hL-PYK tetramer compared to the monomer, which suggests that communication between different polypeptide chains is important for hL-PYK allostery.
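
The shortest-path variant of this idea can be sketched as follows: for every pair of an allosteric-site residue and a catalytic-site residue, each other residue is credited with the fraction of shortest paths between the pair that pass through it. The sketch assumes an unweighted, undirected residue contact graph given as adjacency lists; it does not implement communicability betweenness, and the function names and toy graph are hypothetical.

```python
from collections import deque


def bfs_counts(graph, source):
    """Breadth-first search returning distances and shortest-path counts."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def site_to_site_betweenness(graph, sources, targets):
    """Fraction of shortest source-target paths through each node, summed over pairs."""
    score = {node: 0.0 for node in graph}
    for s in sources:
        dist_s, sigma_s = bfs_counts(graph, s)
        for t in targets:
            if t == s or t not in dist_s:
                continue
            dist_t, sigma_t = bfs_counts(graph, t)
            for v in graph:
                if v in (s, t) or v not in dist_s or v not in dist_t:
                    continue
                # v lies on a shortest s-t path exactly when the distances add up.
                if dist_s[v] + dist_t[v] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return score


# Toy contact graph: A is the "allosteric" node, F the "catalytic" node.
contacts = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "E"],
    "D": ["B", "E"],
    "E": ["C", "D", "F"],
    "F": ["E"],
}
scores = site_to_site_betweenness(contacts, sources=["A"], targets=["F"])
```

In the toy graph, B and E lie on both shortest paths from A to F and score 1.0, while C and D each lie on one of the two and score 0.5.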

JINGYUAN FU1, CAMILLE HUANG1, Lisa Gai2, Eleazar Eskin2,3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Computer Science, UCLA
3 Dept of Computational Medicine, David Geffen School of Medicine, UCLA
4 Dept of Human Genetics, David Geffen School of Medicine, UCLA

Multi-trait methods for analyzing genome-wide association studies (GWASs) boost the predictive power of polygenic risk scores (PRS) by harnessing information in summary statistics of genetically related traits, but are not yet as developed as single-trait approaches. One such method is Turley et al.’s MTAG, which relies on an often-broken assumption of a homogeneous variance-covariance matrix of SNP effects across the genome. We present MT, an extension of MTAG which allows SNP effects to be drawn from a two-component mixture model. We set up a pipeline to run MT and generate PRS and applied it to UK Biobank data on four sets of anthropometric and blood pressure measurements with varying degrees of correlation between traits: high positive correlation, high negative correlation, low correlation, and different measures of the same trait. In future work, we will stratify SNPs by allele frequency or degree of linkage disequilibrium when re-estimating SNP effects.

MIGUEL GUARDADO1, JONATHAN MAH1, Jesse Garcia2, Eduardo Amorim3, Kirk Lohmueller2,3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Bioinformatics Interdepartmental PhD Program, UCLA
3 Dept of Ecology and Evolutionary Biology, UCLA
4 Dept of Human Genetics, David Geffen School of Medicine, UCLA

Previous work on inferring the distribution of fitness effects (DFE), or the amount of deleterious, neutral, or adaptive mutations entering a population, has shown that distantly related species have distinct DFEs. However, in genomic resequencing data from arctic wolves and breed dogs, there was no detectable difference in the inferred DFEs. Here, we sought to determine if current state-of-the-art methods for DFE inference had sufficient statistical power to detect a change in the DFE between canine populations. We performed forward population genetics simulations modeling canine evolution, and compared the inferred DFE and demographic parameters of simulated and empirical data. We have modeled ancestral wolf DFEs and demographic histories and are awaiting the results of our dog DFE simulations for comparison. Understanding if we can detect a difference in DFE will provide insight into the impact of domestication on the DFE and help confirm the results found with reported empirical data.

ROSEMARY HE1, HELEN HUANG1, Kodi Collins2, Nathan LaPierre2, Eleazar Eskin2,3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Computer Science, UCLA
3 Dept of Human Genetics, David Geffen School of Medicine, UCLA
4 Dept of Computational Medicine, David Geffen School of Medicine, UCLA

Genome-Wide Association Studies (GWAS) have successfully discovered tens of thousands of associations between genetic variants and human traits; however, many of these variants have no true genetic effect on the complex trait. Fine mapping refines GWAS results by selecting a small subset of associated SNPs with high probability of containing the causal variant(s). In 2014, Hormozdiari et al. introduced CAVIAR, a Bayesian method that predicts a minimal set containing the causal variant(s). CAVIAR does this by selecting SNPs with the highest probability and adding them to the minimal set until a posterior probability threshold is reached. Here we extend the CAVIAR framework to multiple studies and introduce MCAVIAR. While identical to CAVIAR in a single study setting, MCAVIAR utilizes random effects meta-analysis to account for heterogeneity across studies and leverages varying Linkage Disequilibrium (LD) structure across populations to increase power to identify causal variants.
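
The greedy selection described above can be illustrated with a deliberately simplified sketch. It assumes a single causal variant, so that per-SNP posterior probabilities sum to one and the coverage of a set is the sum of its members' probabilities. CAVIAR itself allows multiple causal variants and scores candidate sets accordingly, which this example does not attempt. The function name, SNP identifiers, and probabilities are hypothetical.

```python
def minimal_causal_set(posteriors, threshold=0.95):
    """Greedily add the most probable SNPs until `threshold` coverage is reached.

    `posteriors` maps SNP identifiers to per-SNP posterior probabilities under
    a single-causal-variant assumption (a simplification for illustration).
    """
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    chosen, total = [], 0.0
    for snp, prob in ranked:
        if total >= threshold:
            break
        chosen.append(snp)
        total += prob
    return chosen, total


# Made-up posterior probabilities for five SNPs at one locus.
toy_posteriors = {"rs1": 0.06, "rs2": 0.60, "rs3": 0.30, "rs4": 0.03, "rs5": 0.01}
causal_set, coverage = minimal_causal_set(toy_posteriors, threshold=0.95)
```

With these toy values the set contains rs2, rs3, and rs1, because the first two SNPs together fall just short of the 0.95 threshold.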

GARY HU1, Hugo Mainguy1, Ruth Johnson2, Kathryn Burch3, Bogdan Pasaniuc3,4,5,6

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Computer Science, UCLA
3 Bioinformatics Interdepartmental PhD Program, UCLA
4 Dept of Human Genetics, David Geffen School of Medicine, UCLA
5 Dept of Pathology and Laboratory Medicine, David Geffen School of Medicine, UCLA
6 Dept of Computational Medicine, David Geffen School of Medicine, UCLA

Estimating genetic overlap between traits is an important and ongoing problem in statistical genetics. The application of these methods has mostly been limited to within one population. Here, we apply UNITY (Unifying Non-Infinitesimal Trait analYsis) to genome-wide association data from Biobank-Japan and the UK Biobank. UNITY is a Bayesian method that estimates the genetic correlation and the proportion of shared causal SNPs between two traits. We utilize this method in a multi-ethnic setting where we estimate these quantities for the same trait across two populations. We find that the proportion of causal SNPs for BMI is 6.5%, with 3.8% of SNPs being causal and shared between populations. For mean corpuscular volume, 2.5% of SNPs are causal, with 1.8% of SNPs being causal and shared. Further analyses show that this method is highly sensitive to different patterns of linkage disequilibrium, motivating the need for more principled methods that explicitly model distinct LD patterns from different populations.

HUILING HUANG1, Mithun Mitra2,3, Hilary A. Coller2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Molecular, Cell, and Developmental Biology, UCLA
3 Dept of Biological Chemistry, UCLA

Quiescent cells have reversibly exited the cell cycle and show differential gene expression compared to proliferating cells. How chromosome conformation regulates gene expression in these two cellular states is not clearly known. We compared chromosome conformation between quiescent and proliferating human dermal fibroblasts by analyzing Hi-C contact maps derived from publicly available Hi-C datasets by running a HiC-Pro pipeline. Examining the genome-level cis (within chromosomes) and trans (between chromosomes) interactions in the Hi-C contact maps showed a significantly higher cis/trans ratio in quiescent cells compared to proliferating cells. Eleven percent of genes that were previously shown to switch genomic compartments (A-to-B or B-to-A) during quiescence also displayed a significant change (downregulated or upregulated) in our RNA-seq data between quiescent and proliferating fibroblasts. These genes were found to be enriched in cell cycle genes. Our results, therefore, suggest a possible link between chromosome architecture and gene expression during quiescence.
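
The cis/trans ratio itself is a simple summary of a contact map: the total number of within-chromosome contacts divided by the total number of between-chromosome contacts. The sketch below assumes contacts are already available as (chromosome, chromosome, count) records; that input layout, the function name, and the toy counts are assumptions for illustration and do not correspond to any particular HiC-Pro output file.

```python
def cis_trans_ratio(contacts):
    """Ratio of within-chromosome (cis) to between-chromosome (trans) contacts."""
    cis = 0
    trans = 0
    for chrom_a, chrom_b, count in contacts:
        if chrom_a == chrom_b:
            cis += count
        else:
            trans += count
    if trans == 0:
        raise ValueError("no trans contacts; the ratio is undefined")
    return cis / trans


# Made-up contact counts for illustration only.
toy_contacts = [
    ("chr1", "chr1", 120),
    ("chr1", "chr2", 15),
    ("chr2", "chr2", 80),
    ("chr2", "chr3", 5),
]
ratio = cis_trans_ratio(toy_contacts)
```

Here 200 cis contacts and 20 trans contacts give a ratio of 10.0; comparing this number between conditions is the kind of genome-level summary referred to above.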

VIVIAN C. ILOABUCHI1, TYLER LAWS1, Jenny Link2, Thomas Vallim2, Elizabeth Tarling2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Medicine/Cardiology, David Geffen School of Medicine, UCLA

Heart disease is a leading cause of death each year in the United States. In this study, we investigated the regulation of reverse cholesterol transport, a process shown to attenuate heart disease. Our lab has previously shown that mice treated with anti-miR-144 for 4 weeks showed increased ABCA1, which is involved in increased levels of high-density lipoprotein and less fat in the heart. We hypothesized that long-term treatment with anti-miR-144 would reveal other key genes in reverse cholesterol transport. Mice were placed on a high-fat, high-cholesterol diet for 16 weeks, and treated with either saline, control anti-miRNA, or anti-miR-144. Liver tissues were collected and processed for RNA sequencing. Using differential gene expression analysis, we found very few differences in gene expression among the treatment groups. It is likely that long-term treatment with anti-miR-144 suppressed acute effects at the mRNA transcript level.

AARON KARLSBERG1, Caitlin Loeffler2, Eleazar Eskin2, David Koslicki3, Serghei Mangul4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Computer Science, University of California Los Angeles, USA
3 Dept of Mathematics, Oregon State University, USA
4 Dept of Clinical Pharmacy, School of Pharmacy, University of Southern California

Reference genomes are essential for metagenomics studies, which require comparing metagenomic reads with available reference genomes to identify organisms and their functions within a larger sample. However, existing databases have failed to properly integrate new microbial reference sequences. Here we report the development of Djoin, a novel computational method for the rapid and accurate merging of individual microbial reference databases. On average, the number of contigs represented by each strain was reduced by 82.61% while the length per contig increased by 84.60%. Additionally, the overall length of genomes increased by roughly 2.01%. Using Djoin, we created a systematic library of reference genomes called SLRG, which extends across various domains of the microbiome including bacterial, viral, fungal, and protozoan species. SLRG has increased microbial representation by a minimum of 20% and is the largest collection of reference genomes to date. Djoin and SLRG are freely available at https://github.com/smangul1/SLRG and https://github.com/smangul1/djoin.

SEAN KELUO-UDEKE1, William Speier2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Medical Imaging Informatics, David Geffen School of Medicine, UCLA

The P300 speller is a common brain–computer interface application designed to communicate language by detecting event-related potentials in a subject’s electroencephalogram signal. Language models have been incorporated successfully to improve the speed and accuracy of the P300 speller by exploiting existing knowledge of the linguistic domain, so expanding the range of human languages covered by the P300 speller would benefit non-English-speaking subjects. Previously, the American Standard Code for Information Interchange (ASCII) was the character encoding used in the system. ASCII lacks non-English character sets. To overcome this, a mapping between the ASCII-based language model characters and Unicode (UTF-8) characters was incorporated into the system. With this, we were able to adapt the system to include a Greek language model. Future work will include online experiments to validate the Greek language model and support for more languages in the system.
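
One way such a mapping could look is sketched below: the language model keeps working with ASCII symbols, and a translation table converts them to Greek characters for display (and back). The table contents and function names are hypothetical and cover only a handful of letters; the abstract does not describe the actual mapping used in the system.

```python
# Hypothetical partial mapping from ASCII model symbols to Greek letters.
ASCII_TO_GREEK = {
    "a": "α", "b": "β", "g": "γ", "d": "δ",
    "e": "ε", "l": "λ", "o": "ο", "s": "σ",
}
GREEK_TO_ASCII = {greek: latin for latin, greek in ASCII_TO_GREEK.items()}

_TO_GREEK_TABLE = str.maketrans(ASCII_TO_GREEK)
_TO_ASCII_TABLE = str.maketrans(GREEK_TO_ASCII)


def to_display_text(model_text):
    """Convert ASCII language-model symbols to Greek characters for display."""
    return model_text.translate(_TO_GREEK_TABLE)


def to_model_text(display_text):
    """Convert displayed Greek characters back to ASCII model symbols."""
    return display_text.translate(_TO_ASCII_TABLE)


shown = to_display_text("logos")
encoded = shown.encode("utf-8")  # each Greek letter takes two bytes in UTF-8
```

Keeping the model in ASCII and translating only at the display boundary means the existing language-model code does not need to change when a new alphabet is added; only a new table is required.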

SANDY KIM1, MARK XIANG1, Shamus Cooley2, Eric J Deeds3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Bioinformatics Interdepartmental PhD Program, UCLA
3 Dept of Integrative Biology and Physiology, UCLA

Gene expression levels are often regulated by large and complex regulatory networks. The global structures of these networks are often modeled using graphs with nodes and edges, but such models lack dynamic information. Dynamic models of specific Gene Regulatory Networks (GRNs) based on differential equations capture dynamics, but are much smaller than the GRNs found in most organisms. As such, little is currently known about the dynamic properties of GRNs at the whole-genome scale. Here we generate Random GRNs (RGRNs) by randomly sampling both the interactions between genes in the network and the logic governing each regulatory interaction. We have demonstrated that current hardware can simulate the long time-scale behavior of networks of similar size and complexity to those that govern expression dynamics within the human genome. This software will be applied in future work to better understand the general dynamic properties of GRNs at this scale.
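As a rough illustration of sampling both the wiring and the logic of a network, the sketch below builds a random Boolean network and updates it synchronously. This is an assumption made for brevity: the abstract does not state that the RGRNs are Boolean or how they are simulated, and the names `make_random_grn`, `step`, and `simulate`, along with the parameters `n_genes` and `k_inputs`, are hypothetical.

```python
import random


def make_random_grn(n_genes, k_inputs, rng):
    """Sample, for every gene, its regulators and a random Boolean truth table."""
    # Interactions: each gene is regulated by k_inputs randomly chosen genes.
    regulators = [rng.sample(range(n_genes), k_inputs) for _ in range(n_genes)]
    # Logic: one output bit per combination of regulator states.
    tables = [
        [rng.random() < 0.5 for _ in range(2 ** k_inputs)]
        for _ in range(n_genes)
    ]
    return regulators, tables


def step(state, regulators, tables):
    """Advance every gene by one synchronous update."""
    new_state = []
    for regs, table in zip(regulators, tables):
        # Pack the regulator states into an index into the truth table.
        index = 0
        for r in regs:
            index = (index << 1) | int(state[r])
        new_state.append(table[index])
    return tuple(new_state)


def simulate(initial_state, regulators, tables, n_steps):
    """Return the trajectory of network states, including the initial one."""
    trajectory = [tuple(initial_state)]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1], regulators, tables))
    return trajectory


rng = random.Random(0)  # fixed seed so the sampled network is reproducible
regulators, tables = make_random_grn(n_genes=50, k_inputs=2, rng=rng)
initial = tuple(rng.random() < 0.5 for _ in range(50))
trajectory = simulate(initial, regulators, tables, n_steps=100)
print(len(trajectory), sum(trajectory[-1]))  # number of states, genes "on" at the end
```

A differential-equation model would replace `step` with a numerical integrator, but the sampling of regulators and per-gene logic would follow the same pattern.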

HUGO MAINGUY1, Gary Hu1, Ruth Johnson2, Bogdan Pasaniuc3,4,5

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Computer Science, UCLA
3 Dept of Human Genetics, David Geffen School of Medicine, UCLA
4 Dept of Computational Medicine, David Geffen School of Medicine, UCLA
5 Dept of Pathology and Laboratory Medicine, David Geffen School of Medicine, UCLA

Recent large-scale genome-wide association studies (GWAS) have produced a rich resource of GWAS summary statistics that can be used to systematically assess the genetic overlap across numerous complex traits. Assessing the shared genetic component across traits can aid in investigating potential treatments across multiple diseases. Here, we use UNITY to quantify the proportion of shared causal SNPs and the genetic correlation between pairs of traits from the UK Biobank. The genetic correlations estimated by UNITY are similar to those reported by LDSC. For example, LDSC reports a genetic correlation of -0.01 for BMI and sitting height, and UNITY reports a correlation of 0.035. Additionally, we found that 8.3% of the shared SNPs have a non-zero effect for both traits. We use UNITY to create an atlas of genetic correlation and shared causal SNPs across pairs of traits.

SAURAV MATHUR1, TIFFANY PHAN1, Robert Brown2, Sriram Sankararaman2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Computer Science, UCLA
3 Dept of Human Genetics, David Geffen School of Medicine, UCLA

Complex human phenotypes, such as diabetes or height, are affected by both genetics and the environment. Although genome-wide association studies (GWAS) have been successful in finding genetic variants that affect phenotypes and increase disease risks, the effect of environmental exposures on phenotypes is difficult to ascertain accurately, resulting in a reduction in statistical power. Previous work has failed to address this problem because diverse environmental factors are difficult to measure accurately and therefore hard to account for when making disease risk predictions. In this study, we show how to infer the environmental factors and use the inferred factors to improve our understanding of how genetic variants affect traits and disease risk. By better understanding the genetic factors affecting disease, our method will facilitate improvements in personalized precision healthcare and medicine where treatments are targeted to the genetic and environmental risk factors that are specific to an individual.

PETER NEKRASOV1, Mudra Choudhury2, Grace Xiao3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Bioinformatics Interdepartmental PhD Program, UCLA
3 Dept of Integrative Biology and Physiology, UCLA

RNA editing is the modification of single nucleotides by RNA binding proteins, leading to substitutions, insertions, and deletions in RNA transcripts. The predominant form of editing converts adenosine to inosine (A-to-I), which is recognized as guanosine by subsequent cellular machinery. Catalyzed by adenosine deaminase (ADAR) enzymes, A-to-I editing affects RNA structure and function, making it an important regulatory mechanism. Previous studies have shown that RNA editing is a highly adaptive mechanism that regulates the innate immune system. Many environmental perturbations require adaptive responses and have been shown to affect posttranscriptional processing, such as alternative splicing. However, the impact of these environmental stimuli on RNA editing remains unclear. In this study, we used publicly available RNA-seq data to investigate RNA editing in various human cell types and chemical treatments. We used an in-house pipeline to detect differential RNA editing between treatments. This study aims to profile the effects of various environmental perturbations on RNA editing.

ADITYA PIMPLASKAR1, Katherine Sheu2, Van M. Savage3, Pamela J. Yeh3, Thomas G. Graeber4, Alexander Hoffmann3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Microbiology, Immunology, and Molecular Genetics, UCLA
3 Dept of Ecology and Evolutionary Biology, UCLA
4 Dept of Molecular and Medical Pharmacology, David Geffen School of Medicine, UCLA

Genomic alterations can confer sensitivity to drugs, but questions remain about how combinations of genomic alterations influence drug sensitivity. Here we examine exome sequencing and SNP array data to identify mutational and copy-number alterations, for over 150 blood cancer cell lines, that may correlate with drug sensitivity. Unsupervised analysis of drug screening data across these cell lines suggests that blood cancer subtype is a poor predictor of drug sensitivity. However, certain genomic features decompose cancers by subtype. Our approach relates genomic features to drug sensitivities, leveraging mutational profiles as predictors of drug sensitivity. We analyze mutational patterns to find candidate epistatic interactions, and utilize a multivariate approach to find correlated drug-mutation pairs. We consider pairwise mutational epistasis to build preliminary drug sensitivity models. Using drug sensitivity data and statistical clustering, we aim to model how gene interactions influence drug sensitivities, and infer drug mechanisms and pathways by clustering interaction profiles.

SUBHANIK PURKAYASTHA1, Christa Caggiano2, Barbara Celona3, Fleur Garton4, Brian Black3, Naomi Wray4, Catherine Lomen-Hoerth5, Noah Zaitlen6

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Bioinformatics Interdepartmental PhD Program, UCLA
3 Cardiovascular Research Institute, UCSF
4 The University of Queensland, Institute for Molecular Bioscience, Queensland, Australia
5 Dept of Neurology, UCSF
6 Dept of Neurology, David Geffen School of Medicine, UCLA

Amyotrophic Lateral Sclerosis (ALS) is a debilitating disease, with a three-year mean survival1, affecting nearly 30,000 Americans. Current diagnosis involves a lengthy exclusion process and measures of progression are subjective. Therefore, discovering a reliable biomarker for ALS is imperative for patient care and drug development. Metabolic dysfunction and alteration of mitochondrial DNA (mtDNA) levels have been linked to neurodegeneration2, but sampling degenerative tissue is complex. Cell-free DNA (cfDNA) is found in bodily fluids after cellular decay3, and therefore, cell-free mitochondrial DNA (cf-mtDNA) may be a potential biomarker for disease monitoring. We sequenced cfDNA from 20 ALS patients and 20 age-matched controls to examine differences in abundance and presence of structural variation in cf-mtDNA. We observed a suggestive trend of diminished mtDNA levels in ALS patients, potentially driven by metabolic dysfunction in mtDNA. Going forward, we will examine cf-mtDNA in a 10,000-patient cohort to determine its suitability for a clinical biomarker.

NEHA RAJKUMAR1, Ram Ayyala2, Qiyang Hu3, Richard J. Abdill4, Eleazar Eskin2, Serghei Mangul5

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Computer Science, UCLA
3 Institute for Digital Research and Education, UCLA
4 Dept of Genetics, Cell Biology, and Development, University of Minnesota
5 Dept of Clinical Pharmacy, School of Pharmacy, University of Southern California

Efficiency of bioinformatics software is crucial to the advancement of the field: researchers depend on these state-of-the-art technologies to perform analyses on large amounts of genomic data. These technologies, however, require substantial computational skills, which the existing biological sciences curriculum does not provide. Package managers and containers provide a solution, offering an interface that handles the installation, configuration, updating, and removal of tools. However, their implementation has not met the community’s needs, due to a lack of information. This paper provides an overview of the challenges, advantages, and limitations of the current software from user and developer perspectives. By observing trends in both industry and academia and analyzing how installation time and run time are affected by the software and the method of installation, we can identify which methods work best and use those findings to generate recommendations for future software updates.

JAMES SOETEDJO1, William Speier2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Depts of Radiological Sciences and Bioinformatics, David Geffen School of Medicine, UCLA

Amyotrophic lateral sclerosis (ALS) is a rare neurodegenerative disease. In late stages of this disease, patients are cognitively aware, but cannot move or speak. To restore their ability to communicate, brain-computer interfaces such as the P300 speller have been developed, which produce written language based on electroencephalogram (EEG) signals recorded from the user. This system flashes rows and columns of a 6×6 character grid and decodes attended characters via a machine learning model to output the target text. We compared five different machine learning models created in MATLAB across 14 healthy subjects. Each subject typed a 30-character phrase, which was decoded in a three-fold cross-validation analysis. The average number of stimuli and the average error rates were recorded for each model, and information transfer rates (ITR) were computed. Overall, the best models were the step-wise linear discriminant analysis and the multilayer perceptron, with average ITR values of 52 and 44 bits/minute, respectively.
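For readers unfamiliar with ITR, the sketch below computes it with the widely used Wolpaw definition, B = log2(N) + P·log2(P) + (1 − P)·log2((1 − P)/(N − 1)) bits per selection, for N = 36 characters. The abstract does not say which ITR definition or timing was used, so the formula choice, the function names, and the `seconds_per_selection` parameter are assumptions for illustration only.

```python
import math


def bits_per_selection(n_choices, accuracy):
    """Wolpaw information per selection for n_choices symbols at a given accuracy."""
    if n_choices < 2:
        raise ValueError("n_choices must be at least 2")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    bits = math.log2(n_choices)
    # The terms p*log2(p) and (1-p)*log2(...) are taken as 0 when their
    # leading factor is 0, which is the usual limiting convention.
    if accuracy > 0.0:
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_choices - 1))
    return bits


def itr_bits_per_minute(n_choices, accuracy, seconds_per_selection):
    """Convert bits per selection into bits per minute."""
    return bits_per_selection(n_choices, accuracy) * 60.0 / seconds_per_selection


# A 6x6 grid gives 36 possible characters; the accuracy and timing values
# below are made-up inputs, not results from the study.
print(round(itr_bits_per_minute(36, 0.90, 6.0), 2))
```

Under this definition, ITR rises with accuracy and falls as each selection takes longer, which is why the number of stimuli per character and the error rate both enter the comparison between models.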

AMANDA SUN1, TIANNA TRUBY1, Jim Liu2, Jasmine Zhou2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Pathology and Laboratory Medicine, David Geffen School of Medicine, UCLA

Weaponizing the body’s own immune system to fight against cancer, immunotherapy is a recent development revolutionizing the landscape of cancer therapies. However, dramatic variations in response necessitate an ability to accurately identify biomarkers predictive of treatment success. One promising class of biomarkers is neoantigens: tumor-cell-generated peptide mutations recognized by the immune system. Here, we compare various bioinformatic methods (including variant discovery, somatic SNP calling, and MHC affinity binding predictions) to develop an optimal pipeline to predict tumor neoantigen burden (TNB). Using publicly available data from immunotherapy-treated lung, skin, and rectal cancer patients (n=5; n=73; n=14, respectively), we analyze the correlation between treatment success and TNB to determine the most accurate pipeline. Although TNB shows promise as a predictive biomarker in some cancers, even the best pipeline did not precisely indicate immunotherapy efficacy. However, further optimization has the potential to greatly increase the capabilities of TNB as an effective biomarker.

DANIEL J. TAN1, Mithun Mitra2,3, Alec M. Chiu4, Hilary A. Coller2,3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Molecular, Cell, and Developmental Biology, UCLA
3 Dept of Biological Chemistry, David Geffen School of Medicine, UCLA
4 Bioinformatics Interdepartmental PhD Program, UCLA

Pancreatic Ductal Adenocarcinoma (PDAC) is a highly heterogeneous cancer of the pancreatic exocrine gland. It is the fourth leading cause of cancer-related deaths in the US, with a 5-year survival rate of 8%. This study seeks to explain PDAC intertumoral heterogeneity on the basis of different types of alternative splicing (AS) events. Unsupervised clustering of 76 PDAC patients was performed based on AS events derived using RNA-Seq data from TCGA. Intron retention (IR) was the most robust AS event, and patterns of intron retention separated patients into two clusters with different survival outcomes. Splicing factors were overrepresented among the genes undergoing differential IR between the two clusters. In addition, IR events in the cluster with worse survival were enriched in tumor-suppressor genes. Taken together, our study shows that differences in IR among PDACs could be a strong determinant of PDAC heterogeneity.

ANCHIT TANDON1, Mike Thompson2,3, Noah Zaitlen4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Bioinformatics Interdepartmental PhD Program, UCLA
3 Dept of Computer Science, UCLA
4 Dept of Neurology, David Geffen School of Medicine, UCLA

Large-scale datasets such as the Genotype-Tissue Expression (GTEx) project have assisted researchers in characterizing the relationship between gene expression, genotypes, and complex traits. Nonetheless, current genotype-expression imputation methods either learn separate models for each tissue or do not focus on the homogeneous effect of genetics that is shared across tissues. Here, we implement prediction models for imputing gene expression in various tissues from simulated expression-genotype data using a method that considers both heterogeneous and homogeneous components of genetic effects across all tissues simultaneously. In simulations, we compare our joint model with two other models: (i) an entirely heterogeneous Tissue-by-tissue model, and (ii) an entirely Homogeneous-in-all-tissues model. For each model, we tried the following penalized regression methods: ridge regression, least absolute shrinkage and selection operator (LASSO), and Elastic Net. Across the entire range of simulation parameters, our joint model more accurately deconvolved the homogeneous and heterogeneous components of the genetic effects on expression.

JAMES ZHANG1, Chelsea Ju2, Dat Duong2, Yunsheng Bai2, Wei Wang2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Dept of Computer Science, UCLA

Recent advances in machine learning algorithms make possible more accurate predictions of protein-protein interactions; however, proteins must first be embedded into numerical vectors. Recently, the Gene Ontology (GO), a database of biological terms arranged in a hierarchical graph structure, has been considered a potential source for extracting vector representations of proteins. Here, we utilize two natural language processing methods to embed proteins into dense vector representations. In the first method, the structure of the GO graph, with accompanying protein annotations, is described in a series of sentences. Word embeddings of each protein are generated using Word2Vec. In the second method, sentence embeddings of GO term definitions are input into a graph convolutional network trained on entailment relationships of GO terms. Protein embeddings are calculated from GO term embeddings taken from the embedding layer. We find that the dense vectors perform well in binary protein-protein interaction and multiclass enzymatic function classification tasks.