B.I.G. Summer 2017 – Institute for Quantitative and Computational Biosciences

2017 Bruins-In-Genomics Summer Undergraduate Research Program

2017 B.I.G. Summer Best Poster Award Winners

Congrats Astrid Manuel & Natalie Dong!
Congrats Toni Boltz & Kensei Kishimoto!
Congrats Heather Han!
Congrats Colleah Gilbert & Kate Abe-Ridgway!
Congrats D’Andrea Mitchell & Kelly Cochran!
Congrats Norris Khoo!

2017 B.I.G. Summer Participants

Lab PIs	Mentors	Students
HILARY COLLER	Adriana Corvalan	Roshni Bhatt, Case Western Reserve Univ.
		Jigar Patel, UC San Diego
JASON ERNST	Tevfik Umut Dincer	Timothy B. Fisher, Morehouse School of Medicine
		Heather Han, Johns Hopkins Univ.
ELEAZAR ESKIN	Serghei Mangul	Linus Chen, UCLA
		Garrett Parker, Santa Monica College
ALEXANDER HOFFMANN	Kim Ngo	Kensei Kishimoto, UCLA
		Toni Boltz, Univ. of Miami
	Diane Lefaudeux	Nick Miller, Cornell University, Ithaca
		Guillermo Sanchez-Arriola, Fisk Univ.
LUISA IRUELA-ARISPE	Julia Mack & Milagros Romay	Colleah Gilbert, Florida A&M Univ.
		Kate Abe-Ridgway, UC Davis
STEVE JACOBSEN	Wanlu Lui	Renee Haserjian, CSU Los Angeles
		Brieana Hollis, Florida A&M Univ.
TRACY JOHNSON		Frank Gutierrez, CSU Los Angeles
		Olayemi Oladapo, Florida A&M Univ.
HUIYING LI	Baochen Shi	Jerry Trinh, UCLA
		Jeliah Jones, Florida A&M Univ.
JESSICA LI	Yiling Chen & Yidan Sun	Tiffany Tu, George Washington Univ.
		Mayra Varillas, UCLA
KIRK LOHMUELLER	Tanya N. Phung	Norris C. Khoo, UCLA
BOGDAN PASANIUC	Claudia Giambartolomei & Huwenbo Shi	Natalie Dong, Boston Univ.
		Astrid Manuel, Florida International Univ.
		Anthony Fernandes, Cornell Univ.
		Christian Torres, UCLA
MATTEO PELLEGRINI	Brian Nadel	Hannah Waddel, Univ. of Utah
		Misha Mubasher Khan, Swarthmore College
JESSICA REXACH		Anamika Ghosh, UCLA
JAE HOON SUL		Regina Lee, UCLA
		Sang Ji Lyu, UCLA
THOMAS A. VALLIM	Jennifer M. Lang	Elizabeth Vanderwall, UCLA
	Jenny C. Link	Adam Weiner, UCLA
WEI WANG	Chelsea J.-T. Ju	Kelly Cochran, Duke Univ.
		D'andrea N. Mitchell, Albany State Univ.
YI XING	Zijun Zhang	Runjia Li, UCLA
		Zanchen Li, UCLA
JASMINE ZHOU	Mary Same	Brooke Garland, Pepperdine Univ.
		Jerome Morris, Fisk Univ.

2017 B.I.G. Summer Poster Abstracts

BHATT, PATEL: Differential Analysis of H4K20 Methylation Marks in Proliferating and Contact Inhibited Human Dermal Fibroblasts

ROSHNI BHATT¹, JIGAR PATEL¹, Adriana Corvalan^1,2, Adam Evertts³, Hilary Coller^1,2,3

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA,
² Molecular Biology Interdepartmental Program, UCLA,
³ Molecular Biology Department, Princeton University

The ability to transition between a proliferative state and quiescence, a reversible resting state outside the cell cycle, is essential for proper development. This shift between proliferation and quiescence is linked to differential gene expression, governed in part by chromatin organization and histone modifications. Recently, our lab showed that histone H4K20 is the most differentially methylated histone modification in quiescent fibroblasts, exhibiting an eightfold increase of the trimethyl mark (H4K20me3) in quiescent cells compared to proliferating cells. Knockdown of Suv4-20h2, a methyltransferase that generates H4K20me3, resulted in increased proliferation. We sought to further investigate the changes in genomewide localization of this methylation mark in proliferating versus quiescent human dermal fibroblasts. We performed chromatin immunoprecipitation for different forms of H4K20, and used Model-based Analysis for ChIP-seq (MACS) to identify differential binding sites of H4K20 mono-, di-, and trimethylation in proliferating and contact inhibited cells. Trimethylated H4K20 showed the greatest difference in binding sites between proliferating and quiescent cells, consistent with our previous data. Genes that gained H4K20me3 in quiescence showed higher gene expression in quiescent than proliferating cells, and were mostly zinc finger genes. Previous studies reported that H3K9me3 is required for H4K20me3 deposition. We compared the H4K20me3 patterns with genomewide localization profiles for trimethylation modifications of H3K9 and H3K27. A majority of differential sites of H3K9me3 in proliferating cells were located on chromosome 19. Interestingly, in quiescent cells H4K20me3 was also enriched on chromosome 19, near genes coding for zinc finger proteins. Our findings raise the possibility that the role of Suv4-20h2 in proliferation is mediated through effects on the expression of zinc finger proteins, a model that we will test experimentally.

CHEN: Comprehensive benchmarking of error correction methods for next generation sequencing via unique molecular identifiers

LINUS CHEN^1,2, Serghei Mangul², Brian L. Hill², Igor Mandric³, Russell Littman², Douglas Yao², Harry Yang², Kevin Hsieh², Parth Ingle², Arvin Nguyen², S. Gill², Nicholas Wu⁴, Ren Sun², Jan Schroeder⁵, Pavel Skums³, Alexander Zelikovsky³, Eleazar Eskin²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² University of California Los Angeles, Los Angeles, CA,
³ Georgia State University, Atlanta, GA,
⁴ The Scripps Research Institute, La Jolla, CA,
⁵ Walter and Eliza Hall Institute of Medical Research, Parkville, Melbourne, Victoria, Australia

Error correction is an important computational technique that promises to deliver highly accurate sequencing calls and improve the results of next-generation sequencing (NGS) analysis. While errors in data sets are a concern for any NGS-based application, they mostly affect applications that use variants at frequencies similar to error frequencies. Currently, a wide array of error correction methods based on different computational approaches is available, but the optimal choice of method is often unclear. We provide the first comprehensive assessment of error-correction algorithms based on high-quality sequencing data derived from heterogeneous populations. Such heterogeneous populations of distinct, but closely related, clonotypes pose a serious challenge to error-correction algorithms because the level of artificial sequencing noise (i.e., errors) is comparable to the true low-frequency genetic intra-population diversity. We use two case studies: one consists of a population of viral genomes, and the other consists of a population of T cell receptor clonotypes. We used the UMI-based high-fidelity sequencing protocol (safe-SeqS) to eliminate errors from the sequencing data. Our approach provides an accurate and robust baseline for performing realistic evaluation of error correction on sequenced genomes. We apply different methods to reads from such communities in order to assess whether error correction affects the ability of the assembly method to identify low-frequency variants. We present the assessment criteria we used in the study to allow the user to make an informed choice of the most suitable software for specific NGS projects.
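
The UMI-based baseline described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the majority-vote consensus rule, and the toy reads are assumptions made for illustration. Reads sharing a UMI are collapsed into a consensus that stands in for the error-free sequence, and an error-corrected read is then scored base by base against it.

```python
from collections import Counter, defaultdict

def umi_consensus(tagged_reads):
    """Group (umi, read) pairs by UMI and take a per-position majority vote."""
    groups = defaultdict(list)
    for umi, read in tagged_reads:
        groups[umi].append(read)
    consensus = {}
    for umi, reads in groups.items():
        # zip(*reads) walks the reads column by column (equal lengths assumed).
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*reads)
        )
    return consensus

def score_correction(raw, corrected, truth):
    """Count errors fixed (TP), errors introduced (FP), and errors missed (FN)."""
    tp = fp = fn = 0
    for r, c, t in zip(raw, corrected, truth):
        if r != t and c == t:
            tp += 1
        elif r == t and c != t:
            fp += 1
        elif r != t and c != t:
            fn += 1
    return {"TP": tp, "FP": fp, "FN": fn}

# Toy data (hypothetical): three reads share UMI "AAT", one carries an error.
tagged_reads = [
    ("AAT", "ACGTACGT"),
    ("AAT", "ACGTACGT"),
    ("AAT", "ACGAACGT"),
    ("GGC", "ACGTTCGT"),
    ("GGC", "ACGTTCGT"),
]
truth = umi_consensus(tagged_reads)

# A hypothetical corrector fixed the real error but introduced a new one.
result = score_correction(raw="ACGAACGT", corrected="ACGTACGA", truth=truth["AAT"])
print(result)
```

Keeping fixed, introduced, and missed errors as separate counts makes it visible when a tool removes noise at the cost of erasing true low-frequency variants.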

CHEN: Highly Accurate Age Prediction Method Based on Whole-Blood Methylation and Genotyping Data

ZEYUAN CHEN^1,2, Elior Rahmani³, Nadav Rakocz⁴, and Eran Halperin^2,5

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA,
² Department of Computer Science, UCLA,
³ Blavatnik School of Computer Science, Tel-Aviv University,
⁴ School of Electrical Engineering, Tel-Aviv University,
⁵ Department of Anesthesiology and Perioperative Medicine, UCLA

Several studies have already shown that genome-wide DNA methylation changes with age, making it a perfect candidate for a highly accurate age predictor. In this study, we analyzed the longitudinal KORA dataset (Cooperative Health Research in the Region of Augsburg), which consists of 1799 samples, and correlated each individual’s DNA methylation levels with his or her chronological age. Unlike Steve Horvath’s study, which built a cross-tissue predictor, we focused only on whole blood and expected to achieve a further gain in the accuracy and performance of the model. Each time, 40% of the individuals were chosen as the test set, from which we collected the predicted ages. We then applied an elastic net regression model with 10-fold cross-validation to the methylation levels and chronological ages of the remaining 60% of the individuals. The ratio between the L1-norm and L2-norm penalties and their weights were selected automatically by a thorough grid search, as were the significant CpG sites, and the resulting model performed astonishingly well. We observed a correlation of over 0.95 on the training set and 0.9 on the test set, with a mean absolute difference of merely 2.6 years. The significant CpG sites selected by the regression varied considerably each time, indicating that there may be more CpG sites closely related to age than we previously thought. Further analysis of methylation levels at different timestamps in the KORA dataset will allow us to calculate an individual-specific aging rate across time and help us better understand human aging.
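
The elastic net fit mentioned in this abstract can be sketched in Python as a plain coordinate-descent loop. This is an illustration under stated assumptions, not the authors' code: the objective is written in one common parameterization (a penalty weight `alpha` and a mixing ratio `l1_ratio`), the inputs are assumed to be centered so no intercept is fitted, and the toy matrix is invented.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 part of the update)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net(X, y, alpha, l1_ratio, sweeps=100):
    """Coordinate descent for the objective
    (1/2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    X and y are assumed centered; alpha*(1-l1_ratio) > 0 or non-zero columns
    are assumed so that the denominator below is never zero."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0
            z = 0.0
            for row, target in zip(X, y):
                prediction = sum(x * wk for x, wk in zip(row, w))
                # Residual with feature j's current contribution added back.
                partial_residual = target - prediction + row[j] * w[j]
                rho += row[j] * partial_residual
                z += row[j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
    return w

# Toy data (hypothetical): rows are individuals, columns are two centered CpG
# features, and y is centered age. Only the first feature carries signal.
X = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y = [2.0, -2.0, 0.0, 0.0]

weights = elastic_net(X, y, alpha=1.0, l1_ratio=0.5)
print(weights)
```

A grid search like the one the abstract describes would call `elastic_net` for each candidate pair of `alpha` and `l1_ratio` and keep the pair with the lowest cross-validated error. The L1 term is what sets uninformative CpG weights exactly to zero.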

COCHRAN, MITCHELL: A Genome-Independent and Alignment-Free Approach for Disease Prediction by Kmer Counts from Metagenomic Reads

Kelly Cochran¹, D’andrea N Mitchell¹, Chelsea J.-T. Ju², and Wei Wang²

¹B.I.G. Summer Program, Institute of Quantitative and Computational Biosciences,
² Department of Computer Science, UCLA

Machine learning techniques have previously been applied to sequencing data of the human gut microbiome for prediction of metabolic disorders such as type II diabetes. The conventional metagenomic approach uses the presence of biomarkers and species abundance through alignment of short sequencing reads to a large number of microbial genomes. This method requires that all genomes of gut microbiome species be sequenced for reference, and cannot properly account for problems such as non-unique reads which align to multiple genomes and genetically near-identical bacterial strains which cause drastically different health phenotypes. Our approach bypasses alignment by counting kmers from reads, circumventing issues with genome references and genetic similarities between species or strains. We count the frequencies of both 12-mers and 15-mers using Jellyfish and KMC for each individual. Using multiple statistical tests, we determine the kmers with significant differential presence, which can then be used as features for classification of healthy/non-healthy subjects. By applying this method using two different sets of kmers, we can analyze the different predictive capabilities and effects of different ks. In addition, we will be able to explore the substring and superstring relationship between different sets of significant kmers. We demonstrate the potential advantages of kmer counting as a viable, computationally practical, and potentially more accurate alternative for disease prediction models. Better models of how the gut microbiome impacts disease phenotypes may improve our understanding of the mechanisms of the disease as well as diagnostic capabilities when applied in clinical settings.
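
The alignment-free counting step can be illustrated with a few lines of Python. This is only a sketch of the idea: the study used Jellyfish and KMC on 12-mers and 15-mers, whereas the function name, the toy reads, and the small `k` below are assumptions chosen to keep the example readable.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads, with no reference genome."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts

def relative_frequencies(counts):
    """Normalize counts so that samples with different read depths are comparable."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

# Toy reads (hypothetical) standing in for one individual's metagenomic sample.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, k=3)
features = relative_frequencies(counts)
print(sorted(features.items()))
```

Each individual's frequency table can then serve as a feature vector, with a statistical test per k-mer deciding which features differ between healthy and non-healthy subjects.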

DONG, MANUEL: Comparing Genetic Architecture of Complex Traits in Different Ethnic Populations Using Causal Effect Covariance from GWAS Summary Statistics

NATALIE DONG¹, ASTRID MANUEL¹, Claudia Giambartolomei², Huwenbo Shi³, Bogdan Pasaniuc^2,3,4

¹BIG Summer Program, Institute for Quantitative and Computational Biosciences,
²Department of Pathology and Laboratory Medicine,
³Bioinformatics Interdepartmental Program,
⁴Department of Human Genetics, UCLA

Genome-wide association studies (GWAS) identify single-nucleotide polymorphisms (SNPs) associated with a phenotype of interest. Although recent years have seen much progress, the vast majority of individuals assayed by GWAS are still European. The purpose of this study is to investigate whether the results of GWAS performed on Europeans can be applied to other ethnicities. We developed an estimator of causal effect covariance from GWAS summary statistics to measure both genome-wide and locus-specific genetic similarity between two populations. This estimator of causal effect covariance identifies shared causal loci, which are specific regions of the genome that exhibit similarities in causal effects for both populations. It accounts for linkage disequilibrium (LD) and does not require individual-level data, making it applicable to publicly available GWAS summaries. We investigated psychiatric, autoimmune, and kidney-related disorders across African, Asian, and European populations, with sample sizes ranging from 9,000 to 100,000. Of all the complex traits studied, rheumatoid arthritis (RA) and two chronic kidney disorders exhibited statistically significant genome-wide covariance. RA also yielded 286 shared causal loci between Asians and Europeans, 21 of which are on chromosome 6, which harbors the human leukocyte antigen (HLA), known to be associated with RA. Major depressive disorder (MDD) had 91 shared causal loci, post-traumatic stress disorder (PTSD) displayed 6, and four of the five kidney disorders each had 1. Future work is needed to interpret these shared causal loci thoroughly.

FERNANDES, TORRES: Comparing Genetic Architecture of Multiple Complex Traits Using Causal Effect Covariance from GWAS Summary Statistics

ANTHONY FERNANDES¹, CHRISTIAN TORRES¹, Claudia Giambartolomei², Huwenbo Shi³, Bogdan Pasaniuc^2,3,4

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Pathology and Laboratory Medicine,
³ Bioinformatics Interdepartmental Program,
⁴ Department of Human Genetics, UCLA

Genome-wide association studies (GWAS) have been used to identify genetic variants associated with a trait and have created a rich resource of GWAS summary statistics, which allow us to compare the genetic architecture between pairs of traits. One measure of genetic similarity is the genetic covariance between traits, which captures the similarity between the causal effects of single nucleotide polymorphisms (SNPs) on two traits, aggregated across the whole genome. The purpose of this study is to quantify the covariance of the causal effects of SNPs on the traits at each independent region in the genome. The method estimates both local and genome-wide covariance from GWAS summary statistics and accounts for linkage disequilibrium (LD). We analyzed GWAS summary data of 5 psychiatric disorders: attention deficit hyperactivity (ADHD), bipolar (BIP), major depressive (MDD), schizophrenia (SCZ), and autism (AUT), and four lipid traits: LDL, HDL, triglyceride (TG), and total cholesterol (TC) as well as coronary artery disease (CAD) and BMI, with sample sizes ranging from 18,000 to more than 150,000. The genome-wide covariances between CAD and all lipid traits and psychiatric disorders were in the expected direction. For example, BMI and TG show high covariance genome-wide (0.012, SE = 0.0005), as well as BIP and SCZ (0.08, SE = 0.001), as previously reported. Future directions for this study are to infer causality between pairs of traits using causal effect covariance and to focus on loci that contribute disproportionately to genome-wide covariance.

FISHER: Comparing MPRA and CERES to Predict Chromatin States Activity in CRISPR Screens

TIMOTHY B. FISHER^1-2, Tevfik Umut Dincer³, and Jason Ernst³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
²Department of Biological Chemistry, University of California Los Angeles, Los Angeles, CA 90095, USA
³Cancer Biology Program, Department of OB/GYN, Morehouse School of Medicine, Georgia Cancer Center for Excellence, Grady Health System, Atlanta, USA

CRISPR-Cas9 based epigenetic regulatory element screening (CERES) is an assay that measures regulatory element activity within the original genomic context, as opposed to ectopic reporter assays like the Massively Parallel Reporter Assay (MPRA), which capture activation and repression outside the original genomic context. CERES utilizes the enzyme Cas9 and a piece of RNA, called guide RNA, to introduce a targeted modification to the genome and quantify its effect. MPRA and CERES both produce a quantitative readout of some functional activity of cis-regulatory elements, but their correlation in terms of chromatin-state annotations remains unclear. We used chromatin state annotations from the 25-state extended ChromHMM model to quantify the activity of CERES-assayed regions for each chromatin state. We hypothesized that if there are intersecting segments with the same MaxPos Sharpr-MPRA (MPS-MPRA) activating scores, then the CERES segmented region can be used to find the most activating state(s). Using a custom analysis pipeline, we intersected the chromatin states with their respective scores, analyzed the overlapping regions, and plotted MPS-MPRA scores to determine the most activating CERES scores. We observed moderate agreement between high activity in CERES and high activity in MPRA. This analysis puts the CERES data into the perspective of chromatin states and provides a better understanding of the gene regulatory landscape in its native genomic context.

GARLAND, MORRIS: Study of Normalization Methods and Binning Definitions for the Joint Analysis of Illumina 450k Microarrays and Whole Genome Bisulfite Sequencing Data

BROOKE GARLAND¹, JEROME MORRIS¹, Mary Same², Xianghong Jasmine Zhou³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Bioinformatics Interdepartmental Graduate Program,
³ Department of Pathology and Laboratory Medicine, UCLA

In whole genome bisulfite sequencing (WGBS), methylation information is obtained for every cytosine that gets mapped to the genome, including over 28 million CpG sites. The 450k microarray provides methylation information for only ~450,000 CpG sites, but is relatively inexpensive. Many models that make use of methylation status are trained using the massive amounts of 450k data found on public repositories, but these models often need to be applied to WGBS data. Direct comparison between the 450k data and WGBS data neglects many of the sites present in the WGBS data and is complicated by differences in platform. The problem this study confronts is how to best process 450k and WGBS data so that they are directly comparable. Three WGBS samples and 59 microarray samples from matched age groups, disease status, and tissues were used to determine the best microarray processing procedures. Several microarray normalization methods and binning techniques were used, and changes to the correlation and mean absolute difference (MAD) between the WGBS and 450k data were observed. Quantile normalization followed by beta-mixture quantile normalization (QN.BMIQ) was found to be the best normalization method, and the binning procedure from CancerLocator [Kang S. et al. Genome Biology. 2017] was found to yield the highest correlation and lowest MAD. In the future, models involving methylation status can be constructed from microarray data by first using QN.BMIQ normalization and then collapsing to the CancerLocator bins to improve results on sequencing platforms.
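
The binning and mean absolute difference (MAD) comparison described above can be sketched in Python. The fixed-width bins, function names, and beta values below are assumptions for illustration only. They do not reproduce the CancerLocator bin definitions or the QN.BMIQ normalization, which would be applied before this step.

```python
from collections import defaultdict

def bin_means(sites, bin_size):
    """Average the beta values of CpG sites that fall in the same fixed-width bin."""
    grouped = defaultdict(list)
    for chrom, position, beta in sites:
        grouped[(chrom, position // bin_size)].append(beta)
    return {key: sum(values) / len(values) for key, values in grouped.items()}

def mean_absolute_difference(bins_a, bins_b):
    """MAD over the bins that both platforms cover."""
    shared = sorted(set(bins_a) & set(bins_b))
    return sum(abs(bins_a[key] - bins_b[key]) for key in shared) / len(shared)

# Toy data (hypothetical): (chromosome, position, beta value) per CpG site.
array_sites = [("chr1", 100, 0.25), ("chr1", 300, 0.75), ("chr1", 1200, 0.5)]
wgbs_sites = [
    ("chr1", 150, 0.5),
    ("chr1", 450, 0.5),
    ("chr1", 1100, 0.25),
    ("chr2", 50, 1.0),
]

array_bins = bin_means(array_sites, bin_size=500)
wgbs_bins = bin_means(wgbs_sites, bin_size=500)
mad = mean_absolute_difference(array_bins, wgbs_bins)
print(mad)
```

Collapsing to bins lets WGBS sites that have no matching microarray probe still contribute, which is the problem with direct site-by-site comparison that the abstract points out.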

GHOSH: Identifying Astrocyte-Specific Changes Related to Tau-Associated Neurodegeneration using a Ribotag Approach in a Mouse Model of Tauopathy

ANAMIKA GHOSH^1,2 and Jessica Rexach¹

¹ Department of Neurology,
² B.I.G. Summer, Institute for Quantitative and Computational Biosciences, UCLA

Emerging evidence points to a contribution of glial cells to neurodegenerative disease. To identify changes in astrocytes that are associated with tau-associated neurodegeneration, as a first step toward characterizing astrocyte disease-associated pathways, we captured astrocyte-specific ribosome-bound RNA in a widely used mouse model of tauopathy (rTg4510) that expresses human mutant P301L tau in forebrain excitatory neurons downstream of a tetracycline transactivator (CamK2-TTA). Our initial analysis identified a large overlap between the differential gene expression found in astrocytes from mice expressing P301L tau and that found in control mice expressing CamK2-TTA only. Recent evidence points to a significant toxicity of TTA in neurons, which we now suspect contributes to this unexpected result. We have corrected the data processing to extract as many of the remaining P301L tau-specific changes as possible. We have optimized numerous differential gene expression strategies, including applying linear regression to limit the effects of particular sample covariates. Next, we will compare the results to complementary transcriptomic data from mouse models of tau-associated neurodegeneration.

GILBERT, ABE-RIDGWAY: Transcriptomic analyses validate a novel method for enriching vascular cell type-specific data from in vivo samples

COLLEAH GILBERT¹, KATE ABE-RIDGWAY¹, Julia J. Mack², Aditya S. Shirali³, Milagros C. Romay², M. Luisa Iruela-Arispe^2,4

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Molecular, Cell, and Developmental Biology,
³Department of Surgery, David Geffen School of Medicine,
⁴Molecular Biology Institute, UCLA

The vasculature is a highly complex, intricately structured tissue composed of numerous cell types interacting together to promote cellular homeostasis. Blood vessels can be divided into three major classifications based on cellular composition and structure: arterial, venous and capillary. However, given the complex organizational structure of the vasculature, it is extremely difficult to assess transcriptional changes in specific vascular cell types in vivo. In this study, we utilized RNA-seq analysis to assess a novel experimental protocol to quickly isolate vascular cell type-enriched RNA in vivo from mouse aorta. Using both FPKM-based (Cufflinks) and raw count-based (DESeq2) transcriptional analysis, we found that the vessel flush technique produced high-quality, enriched endothelial cell (EC) and vascular smooth muscle cell (vSMC) RNA. The transcriptional profiles of these enriched RNA samples were substantially different from that of the whole mouse aorta, with hundreds of genes significantly differentially expressed (p < 1.0 x 10^-4) between the vascular cell type-enriched samples and the whole aorta. Detailed analysis of the transcriptomic profiles of EC-enriched RNA showed on average a 4-fold increase in EC-specific gene expression, while vSMC-enriched RNA showed a 2-fold increase in vSMC-specific gene expression. Taken together, these results strongly suggest that this isolation approach could provide a significant increase in power to detect transcriptional changes in specific vascular cell types of interest. Our findings strongly support the use of this novel method for rapidly isolating vascular cell type-enriched RNA for assessing changes in gene expression in the vascular wall cells in normal and pathological conditions.

GUTIERREZ, OLADAPO: Identifying the regulatory function of histone acetyltransferase Gcn5 in the S. cerevisiae pre-mRNA splicing pathway

FRANK GUTIERREZ^*1,2, OLAYEMI OLADAPO^*1,2, Shawntel Okonkwo², Calvin Leung², Stephen Douglass², Tracy Johnson²

¹BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
²Department of Molecular, Cellular, and Developmental Biology, UCLA
* Co-authors

Gcn5, a major yeast histone acetyltransferase of the SAGA complex, plays a novel role in regulating pre-mRNA splicing. Surprisingly, Gcn5 HAT activity is required for co-transcriptional recruitment of two major U2 snRNP components onto the branchpoints of DBP2 and ECM33 pre-mRNA. Additionally, the regulated dynamics of histone acetylation/deacetylation are critical for proper spliceosome rearrangements on the aforementioned genes. It is, however, unclear how Gcn5 HAT activity affects co-transcriptional splicing of intron-containing genes (ICGs) genome-wide. To determine the role of Gcn5 in co-transcriptional splicing, we analyzed RNA-seq data generated from wild type, gcn5Δ and H3KΔ9-16 samples in Saccharomyces cerevisiae cells, where H3KΔ9-16 represents a deletion of the major histone acetylation targets of Gcn5. The RNA-seq data also yield quantifiable changes in splicing efficiency for each intron-containing gene through a calculated ratio of spliced to total (spliced and unspliced) normalized counts. We hypothesize that, relative to the wild type condition, ICGs in the gcn5Δ and H3KΔ9-16 mutant backgrounds will show decreased expression. If Gcn5 HAT activity is required for co-transcriptional spliceosome assembly of two major U2 snRNP components, there will be a decrease in splicing efficiency genome-wide. Since spliceosome assembly occurs in a stepwise manner, and gcn5Δ as well as H3KΔ9-16 have been previously shown to affect co-transcriptional splicing, we predict these mutants will result in defective spliceosomes and poor splicing overall. Using a bioinformatic approach, results of these RNA-seq analyses will help clarify the relationship between a major yeast HAT and the spliceosome, and ultimately provide deeper insight into how chromatin can influence splicing in eukaryotes.
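
The splicing efficiency ratio described in this abstract (spliced counts over spliced plus unspliced counts) can be written out as a short Python sketch. The gene names come from the abstract, but the count values, the dictionary layout, and the function name are hypothetical and chosen only to show the arithmetic.

```python
def splicing_efficiency(spliced, unspliced):
    """Ratio of spliced to total (spliced + unspliced) normalized counts."""
    total = spliced + unspliced
    if total == 0:
        return None  # no reads over the intron, so efficiency is undefined
    return spliced / total

# Hypothetical normalized counts as (spliced, unspliced); not real data.
counts = {
    "wild type": {"DBP2": (90.0, 10.0), "ECM33": (80.0, 20.0)},
    "gcn5Δ": {"DBP2": (60.0, 40.0), "ECM33": (50.0, 50.0)},
}

efficiencies = {
    sample: {gene: splicing_efficiency(*pair) for gene, pair in genes.items()}
    for sample, genes in counts.items()
}

for gene in counts["wild type"]:
    change = efficiencies["gcn5Δ"][gene] - efficiencies["wild type"][gene]
    print(gene, round(change, 3))
```

A negative change for a gene corresponds to the prediction stated in the abstract, namely that splicing efficiency drops in the mutant background.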

HAN: Prediction of regulatory potential at DNase I hypersensitive sites across human cells using CRISPR-Cas9 based regulatory element screening assays

HEATHER HAN^1,2, Tevfik Umut Dincer^3,4, Jason Ernst^3,4,5

¹ BIG Summer Program, Institute of Quantitative and Computational Biosciences, University of California Los Angeles, CA 90095
² Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218
³ Bioinformatics IDP, University of California Los Angeles, CA 90095
⁴ Department of Biological Chemistry, David Geffen School of Medicine, University of California Los Angeles, CA 90095
⁵ Computer Science Department, University of California Los Angeles, CA 90095

The human genome contains over a million candidate cis-regulatory elements in non-coding regions of the DNA, which are responsible for the activation and repression of specific genes. However, prediction of regulatory potential based on the identification of these regulatory elements remains a challenge. Recent developments in CRISPR-Cas9-based genome editing tools have allowed for high-specificity targeting of DNA sequences of interest, furthering our understanding of the roles of various regulatory factors. To leverage this advancement, we sought to develop a novel method to predict regulatory potential genome-wide. We utilized CRISPR-Cas9-based epigenomic regulatory element screening (CERES) data, which targeted DNase I hypersensitive sites (DHSs), from the recent Klann et al. (2017) study, in conjunction with transcription factor binding sites and chromatin state annotations from ENCODE, to train our model. The supervised learning model was trained on a small subset of 281 DHSs near the HBE1 gene in K562 cells, and then used to predict regulatory potential across all 112,025 DHSs in K562 cells. We validated our model through cross-validation and comparisons with massively parallel reporter assays and other CRISPR-Cas9 datasets. We further utilized the method to determine the correlation between sites predicted to have high regulatory function and disease-associated SNPs. We believe that our model serves as a starting point for understanding cis-regulation using CRISPR-Cas9-based tools, and has the potential to improve as more datasets become publicly available.

HASERJIAN, HOLLIS: Analysis of Off Targets of an Artificial Zinc Finger Epigenetic Editor through RNA directed DNA methylation pathways

RENEE HASERJIAN^1,2, BRIEANA HOLLIS^1,3, Wanlu Lui⁴, Zhenhui Zhong⁴, and Steve Jacobsen⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Department of Biological Sciences, California State University, Los Angeles, CA 90032
³ College of Science and Technology, Florida Agricultural and Mechanical University, FL 32307
⁴ Molecular, Cellular and Developmental Biology, University of California, Los Angeles, CA 90024

DNA methylation is a critical epigenetic process that is involved in gene and transposon silencing in both plants and animals. In plants, DNA methylation is established by the RNA-directed DNA methylation (RdDM) pathway. RdDM involves the synthesis of small interfering RNAs (siRNAs) as well as long non-coding RNAs (lncRNAs) via Polymerase V (Pol V), which recruit the DNA methyltransferase DRM2 to methylate DNA. Epigenome editing is used to precisely modify the epigenetic landscape, which can be applied to stably silence genes. Previously, the Jacobsen lab demonstrated that artificial Zinc Finger Proteins (ZFs) recognize the promoter of the FWA gene while fused to accessory proteins such as SUVH9 or DMS3, which then recruit Pol V. This fusion efficiently methylated the targeted DNA and silenced FWA. ChIP-seq analysis of ZF-DMS3 identified thousands of binding sites, mostly in promoter regions. De novo motif analyses suggest that the lack of binding specificity is due to subsets of the finger domains interacting with the genome. ChIP-seq analysis of Pol V recruitment in ZF-DMS3 plants reveals that most of the off targets are able to recruit Pol V. Although ZF-DMS3 displays widespread binding and Pol V recruitment, we only observe limited de novo methylation in Whole Genome Bisulfite Sequencing (WGBS) data. Further analysis of the features of methylated vs. non-methylated off targets might shed light on the mechanism that establishes de novo methylation.

KHOO: Comparing the Impact of Natural Selection on Linked Neutral Sites in Dogs Versus Wolves

NORRIS C. KHOO^1,2,3, Tanya N. Phung³, Christian D. Huber⁴, Kirk E. Lohmueller^3,4,5

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Earth, Planetary, and Space Sciences,
³ Interdepartmental Program in Bioinformatics,
⁴ Department of Ecology and Evolutionary Biology,
⁵ Department of Human Genetics, David Geffen School of Medicine, UCLA

Natural selection, a key mechanism driving evolution, affects not only sites under direct selection (i.e., coding regions and functional sites); diversity at neutral sites can also be reduced through linkage to selected sites. Two mechanisms of linked selection are selective sweeps (reduction in neutral diversity surrounding beneficial fixations) and background selection (reduction in neutral diversity when nearby deleterious variants are removed by selection). Although the impact of linked selection has been observed in a variety of species, the mechanism responsible is less well understood. Here, we aim to determine the relative strength of each mechanism by using the dog-wolf system. This system is ideal for studying linked selection because, following the dog-wolf split approximately 15,000 years ago, dogs are hypothesized to have experienced more selective sweeps than wolves due to intense artificial selection during domestication and breed formation. Thus, we hypothesized that dogs would show greater effects of linked selection than wolves. Using 13 dog and 6 wolf whole-genome sequences, we determined the strength of linked selection for each species by correlating neutral genetic diversity with the amount of functional content in regions of the genome and with recombination rate. Regression analysis indicates that wolves experience a greater reduction in genetic diversity with increasing functional content and decreasing recombination rate. Contrary to our hypothesis, wolves experience more linked selection than dogs, arguing that background selection plays a greater role in shaping genetic diversity than selective sweeps.

KISHIMOTO, BOLTZ: Differential gene expression analysis of an NF-κB/RelA Transactivation Domain mutant

KENSEI KISHIMOTO¹, TONI BOLTZ^1,2, Kim Ngo³, Alexander Hoffmann³

¹BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA,
²Department of Computer Science, University of Miami, Coral Gables, FL,
³Signaling Systems Laboratory, Department of Microbiology, Immunology, and Molecular Genetics, and Institute for Quantitative and Computational Biosciences, UCLA

The Nuclear Factor kappaB (NF-κB) family of transcription factors plays a vital role in the regulation of inflammatory and immune responses, cell proliferation, and development. Dysregulation of NF-κB activity has been implicated in many types of cancer and inflammatory diseases. To characterize the role of NF-κB in immune response gene expression, we have generated a knock-in mouse of a Transactivation Domain (TAD) mutant in RelA/p65, the predominant NF-κB family member. RelA has two TADs at its C-terminal end, and our RelA-TADmut knock-in mouse carries a deletion of TAD1 and two point mutations, L449 and F473A, in TAD2. In this study, using RNA-seq analysis, we compared gene expression between Mouse Embryonic Fibroblast (MEF) and Bone Marrow-Derived Macrophage (BMDM) cells from Wild-Type (WT) and RelA-TADmut knock-in mice stimulated with TNF and LPS at 0, 0.5, 1, 3, and 8 hours. Reads were aligned to the mouse mm10 genome and RefSeq genes, and the data were analyzed for differential gene expression with the R package edgeR. In wild-type MEF cells, we identified 189 genes induced more than four-fold at one or more time points by TNF stimulation, and 840 genes by LPS stimulation. Similarly, in BMDM cells, we identified 361 genes induced by TNF and 949 genes induced by LPS. Many of these genes are known NF-κB targets. We analyzed those genes further using heatmap visualization and principal component analysis to identify which genes are down-regulated in the RelA-TADmut compared to WT. Overall, the mutant had a more severe effect on the response to TNF than to LPS, which is consistent with our understanding that NF-κB is the primary transcription factor during the TNF response, but only one of several during the LPS response. Our detailed results will be discussed during the poster presentation.
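The induction criterion described above (more than four-fold over the unstimulated level at one or more time points) can be illustrated with a minimal Python sketch. This is not the edgeR analysis used in the study; the function name, the pseudocount, the gene names, and the expression values are all hypothetical and chosen only for illustration.

```python
def induced_genes(expr, baseline="0", min_fold=4.0, pseudocount=1.0):
    """Return genes exceeding min_fold over baseline at any later time point.

    expr maps gene -> {time label -> normalized expression}.
    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    hits = []
    for gene, series in expr.items():
        base = series[baseline] + pseudocount
        folds = [
            (value + pseudocount) / base
            for time, value in series.items()
            if time != baseline
        ]
        if any(fold > min_fold for fold in folds):
            hits.append(gene)
    return sorted(hits)


# Made-up expression values at 0, 0.5, 1, 3, and 8 hours.
expr = {
    "GeneA": {"0": 10.0, "0.5": 15.0, "1": 60.0, "3": 30.0, "8": 12.0},
    "GeneB": {"0": 20.0, "0.5": 25.0, "1": 30.0, "3": 22.0, "8": 18.0},
}
print(induced_genes(expr))
```

A real analysis would also test statistical significance across replicates, which this sketch leaves out.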

LEE: SV discovery in large families with bipolar disorder

REGINA LEE¹, Jae Hoon Sul²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Psychiatry and Biobehavioral Sciences at UCLA

Structural variations (SVs) are genetic variations that involve changes in the structure of a chromosome. The most common types of SVs include deletions, duplications, insertions, and inversions, which may affect traits of the individual. In this project, we aim to identify novel SVs using the LUMPY software (Layer et al., 2014) and improve the quality of the previous SV calls. LUMPY utilizes multiple SV signals and their positions across samples to enable more sensitive SV discovery compared to other actively maintained SV discovery packages. Using a genome sequencing dataset of large families with bipolar disorder (454 individuals), we filtered out monomorphic SVs and SVs with a high missing rate from the merged VCF file. We then checked for Mendelian errors to measure the accuracy of the SV calls. We also re-ran the LUMPY pipeline on the cleaned data and compared the results to the previous analysis. A future plan is to refine the LUMPY pipeline to improve data quality.

LI, LI: Performance Evaluation of Peak Detection Algorithms for N6-Methyladenosine (m6A) Site Identification in the Human Transcriptome

RUNJIA LI^1,2, ZANCHEN LI^1,3, Zijun Zhang⁴, Zhicheng Pan⁴, Yi Xing⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Computer Science,
³ Department of Chemistry and Biochemistry,
⁴ Department of Bioinformatics, UCLA

N6-Methyladenosine (m6A) is a widespread base modification in eukaryotic mRNA and plays a key role in translation regulation. Understanding m6A’s potential regulatory role in gene expression and alternative splicing requires knowledge of its topology in the transcriptome. Next-generation, massively parallel Methylated RNA Immunoprecipitation Sequencing (MeRIP-seq) technology has produced abundant sequence read data. A suitable peak detection (calling) algorithm is a crucial prerequisite for transcriptome-wide profiling of m6A sites from these data, as it must account for the statistical bias and background noise inherent to the data. Here, we designed a set of metrics to evaluate the m6A detection performance of three published peak callers: MACS2, MeTPeak, and RIPSeeker. m6A MeRIP-seq and control reads for the human GM12878 cell line were aligned to the hg19 reference genome, and peaks were generated from the alignment using each of the three peak callers. The called peaks were evaluated by comparing the number of peaks called, overlap and intersections among peaks, enrichment at different transcript regions, and motif search results. From the analysis, we found that MACS2 outperforms MeTPeak in specificity and RIPSeeker in total peaks called. This result provides valuable data for training future peak-calling algorithms on nanopore sequencing.

LYU: Analysis of INDEL mutations within reference genomes using Scalpel

SANG JI LYU¹, Jae Hoon Sul²

¹Department of Computational and Systems Biology,
²Department of Psychiatry and Biobehavioral Sciences, UCLA

Insertions and deletions (INDELs) are known to be linked to many diseases. A frameshift (for example, a single-base-pair INDEL in the coding part of an mRNA) can result in a premature stop codon in a different reading frame. Our main aim was to acquaint ourselves with a new software package called Scalpel, a genetic variant discovery tool for detecting INDELs. More specifically, Scalpel was built to perform localized microassembly of specific regions of interest. We wanted to see whether Scalpel was able to perform with “high accuracy and increased power” (Narzisi, 2013) in detecting INDEL mutations. Our focus was on “whole genome versus whole exome studies” (Narzisi, 2013). Scalpel has predominantly been tested on exome capture data; however, it can also detect mutations in whole-genome data. To speed up processing and lower the memory requirement, we ran each chromosome separately. Our future plans are to learn more about Scalpel and to use it in de novo mode and somatic mode. Scalpel is said to be best suited to detecting de novo INDELs in a quad family and somatic INDELs in sequencing data from matched tumor and normal samples.

MILLER, SANCHEZ: Estimating mRNA Half-life in Naive and Tolerized Macrophages using Actinomycin D RNA-seq Data

NICK MILLER^1,2, GUILLERMO SANCHEZ^1,3, Diane Lefaudeux⁴, Alexander Hoffmann⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Department of Biological and Environmental Engineering, Cornell University, Ithaca, NY 14850
³ Department of Life and Physical Sciences, Fisk University, Nashville, TN 37208
⁴ Department of Microbiology, Immunology, & Molecular Genetics, UCLA, Los Angeles, CA 90095

Gene expression dynamics are regulated by mRNA transcription and degradation. Together they determine how long mRNA transcripts of a given gene can actively be translated, producing the proteins that determine biological function. While many studies have elucidated mechanisms of transcriptional control, far fewer studies have addressed mRNA degradation, in part because there is no accepted method for measuring mRNA half-lives. Here, we analyzed RNA-seq datasets produced following a block of transcriptional elongation with Actinomycin D. In this study, bone marrow-derived macrophages (BMDM) were cultured and either tolerized by LipidA (LPA) pre-stimulation or kept naive. The cells were then stimulated with LPA (time 0) to induce an immune response. Transcription was then blocked by adding Actinomycin D (ActD) at 0, 60, and 180 minutes after LPA treatment. RNA was sequenced for all conditions at 7 different time points after ActD. Read counts were normalized using spike-in RNAs (external controls) to account for the diminishing amount of mRNA over time. We generated counts for each mRNA and assessed the quality of the RNA-seq datasets. A linear model was used to fit a decay rate for each gene, and the corresponding half-life was derived for a given experiment. Half-lives of 235 LPA-induced genes were calculated. For example, Tnf, Nfkbia, and Mmp13 yielded basal mRNA half-lives of 20, 17.8, and 86.3 minutes, respectively. We also derived confidence intervals and assessed the statistical significance of changes in the derived half-lives across conditions.
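As an illustration of deriving a half-life from a fitted decay rate, the following Python sketch assumes first-order (exponential) decay, so that the natural log of the normalized count falls linearly with time after ActD addition and the half-life equals ln(2) divided by the decay rate. The abstract does not specify the exact model, so the log transform, the function name, and the synthetic data are assumptions of this sketch.

```python
import math


def half_life(times, counts):
    """Estimate half-life from a least-squares line through ln(count) vs. time.

    Assumes exponential decay: count(t) = count(0) * exp(-k * t),
    so the slope of ln(count) is -k and the half-life is ln(2) / k.
    """
    logs = [math.log(c) for c in counts]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    sxx = sum((t - mean_t) ** 2 for t in times)
    decay_rate = -(sxy / sxx)
    if decay_rate <= 0:
        raise ValueError("counts do not decay over time")
    return math.log(2) / decay_rate


# Synthetic, noise-free counts generated with a 20-minute half-life.
times = [0, 10, 20, 40, 80]
counts = [1000 * 0.5 ** (t / 20) for t in times]
print(round(half_life(times, counts), 3))
```

With real data, the residuals of the fit would also feed into the confidence intervals mentioned above.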

PARKER: UMI-Reducer: Collapsing duplicate sequencing reads via Unique Molecular Identifiers

GARRETT PARKER², Serghei Mangul^1,2, Sarah Van Driesche³, Lana S. Martin¹, Kelsey C. Martin^3,4,5, Eleazar Eskin^1,6

¹Department of Computer Science, UCLA
²Institute for Quantitative and Computational Biosciences, UCLA
³Department of Biological Chemistry, UCLA
⁴David Geffen School of Medicine, UCLA
⁵Department of Psychiatry and Biobehavioral Sciences, UCLA
⁶ Department of Human Genetics, UCLA

Every sequencing library contains duplicate reads. While many duplicates arise during polymerase chain reaction (PCR), some duplicates derive from multiple identical fragments of mRNA present in the original lysate (termed “biological duplicates”). Unique Molecular Identifiers (UMIs) are random oligonucleotide sequences that allow differentiation between technical and biological duplicates. Here we introduce a computational method, UMI-Reducer, that processes mapped sequencing reads to differentiate PCR duplicates from biological duplicates. UMI-Reducer uses UMIs and the mapping position of reads to identify and collapse technical duplicates. The remaining true biological reads are then used for a bias-free estimate of mRNA abundance in the original lysate. This strategy is of particular use for libraries made from low amounts of starting material, which typically require additional cycles of PCR and are therefore most prone to PCR duplicate bias. UMI-Reducer provides an additional, novel functionality for processing reads that are assigned to more than one locus (multi-mapped reads): it stochastically assigns multi-mapped reads based on transcript abundance. The result of UMI-Reducer is a less biased sequencing output. UMI-Reducer is open-source Python software and is freely available for non-commercial use (GPL-3.0) at https://sergheimangul.wordpress.com/umi-reducer/.
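The collapsing step can be sketched in a few lines of Python: reads that share both a mapping position and a UMI are treated as technical (PCR) duplicates and reduced to one representative, while reads at the same position with different UMIs are kept as biological duplicates. This is a simplified illustration, not UMI-Reducer's actual code; the `Read` record, its fields, and the example reads are hypothetical.

```python
from collections import namedtuple

# Hypothetical minimal read record; real tools would parse these from BAM/SAM.
Read = namedtuple("Read", "name chrom pos strand umi")


def collapse_duplicates(reads):
    """Keep the first read seen for each (chrom, pos, strand, UMI) combination."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.umi)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    Read("r1", "chr1", 100, "+", "AACG"),
    Read("r2", "chr1", 100, "+", "AACG"),  # same position and UMI: PCR duplicate
    Read("r3", "chr1", 100, "+", "TTGC"),  # same position, new UMI: biological duplicate
    Read("r4", "chr1", 250, "+", "AACG"),  # different position
]
print([read.name for read in collapse_duplicates(reads)])
```

The sketch ignores sequencing errors within UMIs and the stochastic handling of multi-mapped reads described above.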

TRINH, JONES: Computational Analysis of Propionibacterium acnes Strain Populations in the Skin Microbiome Using Pathoscope

JERRY TRINH^1,2,3, JELIAH JONES^1,4, Baochen Shi², Huiying Li²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA,
² Department of Molecular & Medical Pharmacology, David Geffen School of Medicine, Crump Institute for Molecular Imaging, UCLA,
³ Department of Chemistry and Biochemistry, UCLA,
⁴ Department of Biological Science, Florida A&M University

Propionibacterium acnes is a major commensal of the skin microbiome and a key contributor to skin health. However, P. acnes strains from different lineages have been linked to various diseases, suggesting that strain-level differences may be essential in disease pathogenesis. Therefore, the ability to detect strain differences in microbiome data will be crucial to investigations of the skin microbiome and its role in disease. Our objective for this study was to determine whether Pathoscope 2.0, a metagenomic data analysis tool, is able to distinguish strains of P. acnes from different phylogenetic lineages in metagenomic shotgun sequencing data. We generated simulated P. acnes population datasets by sampling the sequence reads of multiple P. acnes strains and combining them at defined ratios. Pathoscope 2.0 was able to distinguish strains from distant lineages, but performed poorly when the strains were phylogenetically closer. We then tested Pathoscope on 26 clinical datasets. The estimated relative abundances of 5 ribotype groups (RT1, RT2/6, RT3, RT4/5, RT8) in each sample correlated with 16S ribosomal RNA sequencing data with a mean Pearson’s correlation coefficient of 0.712 ± 0.339, but Pathoscope overestimated the relative abundances of RT4/5 and RT8. Our simulated microbiome data demonstrated that Pathoscope 2.0 was effective in identifying P. acnes strains from distant phylogenetic groups but could not distinguish those with high phylogenetic similarity. Furthermore, Pathoscope was able to estimate ribotype abundance from clinical sample data.
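For readers unfamiliar with the per-sample comparison, the sketch below computes Pearson's correlation coefficient between two sets of relative abundances for the five ribotype groups of a single sample. The abundance values are made up for illustration and are not data from this study.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical relative abundances for RT1, RT2/6, RT3, RT4/5, RT8 in one sample.
metagenomic_estimate = [0.40, 0.25, 0.15, 0.12, 0.08]
ribotyping_16s = [0.45, 0.28, 0.14, 0.08, 0.05]
print(pearson(metagenomic_estimate, ribotyping_16s))
```

Averaging this coefficient over all samples gives a summary of the kind reported above.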

TU: Analysis of Human and Mouse Homologs with Tight Spectral Clustering Algorithm for Bipartite Networks

TIFFANY TU¹, Yidan Sun², and Jessica (Jingyi) Li²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Statistics, UCLA

Homology is an important area for gene expression studies, as it confirms similarities that exist in both biological structure and genetic function between two species. Often, homologs differ in mutations that occurred in a common ancestor, and such information can be crucial for biomedical research in drug discovery and development. This study focuses on developing a new algorithm for gene expression studies using tight spectral clustering with bipartite node covariates. Given gene expression (FPKM) data and a homolog bipartite network between two species, say human and mouse, we hope to identify tight gene clusters. Understanding tight gene clusters can help infer unknown genetic functions between the two species. That is, given a pre-specified positive integer k, our goal is to identify k “tight” clusters of nodes simultaneously on both sides of the bipartite network, producing tight and stable clusters without forcing all points into clusters. Edge and covariate information is incorporated into the tight clustering algorithm for bipartite networks. RNA-seq gene quantification data for human and mouse tissues such as liver, heart, and stomach are extracted from the Encyclopedia of DNA Elements (ENCODE) and aggregated to fit our model. A dataset of corresponding human and mouse homolog gene IDs is used as our bipartite node covariates.

VANDERWALL: Understanding the role of cohousing wild-type and ZFP36L1 knockout mice on the gut microbiome

ELIZABETH VANDERWALL^1,2, Jennifer M. Lang³, Jenny C. Link³, Elizabeth J. Tarling³, Thomas A. Vallim^3,4

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Microbiology, Immunology & Molecular Genetics,
³ Division of Cardiology, David Geffen School of Medicine, UCLA

The gene ZFP36L1, which we recently discovered as a novel regulator of bile acid synthesis and lipid metabolism, is associated with reduced adiposity, steatosis, and lipid absorption and with increased bile acid levels. In addition, alterations in bile acid levels are known to both affect and be affected by the microbiome. We used liver-specific ZFP36L1 knockouts to investigate the effect of genotype on the microbiome, and we expected to see microbiome differences caused by genotype. But because mice ingest each other’s microbiome, cohousing genotypically different mice that harbor unique communities should lead to microbiome equilibration. This would then allow us to determine whether genotype-mediated phenotypic traits can be transferred through the microbiome. We hypothesized that cohousing wild-type and ZFP36L1 knockout mice may result in broader metabolic outcomes, at least in part due to equalizing microbiomes. Three groups based on genotype and cohousing condition (genotype groups) were studied: ZFP36L1 liver-specific knockouts, wild-type mice cohoused with knockouts, and wild-type non-cohoused mice. DNA from the cecum contents of these mice was extracted, and the 16S rRNA gene was sequenced using Illumina sequencing to describe the microbiome community. Data were then analyzed in Quantitative Insights Into Microbial Ecology (QIIME) and phyloseq. Principal coordinate analysis (PCoA) ordination plots show that fat mass, body weight, and genotype were significantly associated with bacterial composition. These observations were especially apparent when comparing the ZFP36L1 knockout, wild-type non-cohoused, and wild-type cohoused individuals. Further analysis with DESeq2 demonstrated that bacterial taxa levels were not significantly different between the knockout and cohoused wild-type mice. However, wild-type non-cohoused mice had significantly different levels of bacterial taxa from both the knockouts and their cohoused wild-type counterparts.
Since the only difference between the two genetically identical wild-type groups was whether they were cohoused, we conclude that the loss of ZFP36L1 causes a shift in the microbiome, which the cohoused wild-type mice can also acquire. We will now determine whether the differences between the two wild-type microbiomes also correlate with other metabolic parameters, such as differences in fat mass, body weight, and bile acid levels. Together, we hope to determine which pathways ZFP36L1 controls that are then associated with changes in the microbiome. Analysis of these data will dictate how future mouse studies comparing loss of ZFP36L1 will be conducted and will enhance our understanding of how ZFP36L1 affects metabolism and the microbiome.

VARILLAS, TU: Neyman Pearson Classification Algorithm Applied on Clinical Data for Breast Cancer Diagnosis

MAYRA VARILLAS¹, TIFFANY TU¹, Yiling Chen², Jessica Jingyi Li²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Statistics, UCLA

Classification methods are used to predict labels from a set of attributes and are widely applied to biomedical data. In particular, binary classification restricts the situation to two cases, coded as 0 or 1. Traditional classification methods aim to minimize the overall misclassification error, whereas the Neyman-Pearson (NP) paradigm minimizes the false negative rate while keeping the false positive rate under a threshold. When analyzing clinical data, the consequences of false positives and false negatives are highly asymmetrical. To address this issue, we apply NP classification to DNA methylation profiles of healthy and diseased patients. The main objective is to develop a variable selection algorithm to select a subset of variables. A statistical model is developed so that disease diagnosis misclassification can be kept under 0.1%. Variable selection benefits biomedical research because it makes better use of computational resources and creates fewer problems for data analysis. NP classification keeps the false positive rate (or false negative rate) under a pre-specified threshold with high probability. Variable selection is a crucial preprocessing step in data analysis used for predictive and inferential purposes. Data preprocessing includes finding an appropriate dataset to fit our framework. Our contribution was to search for data for method development. Once cleaned in R, the data can be supplied to the algorithm.
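One way to obtain the high-probability control described above is to set the decision threshold at an order statistic of classifier scores from held-out class-0 samples. The Python sketch below assumes that approach, using a binomial bound on the probability that the false positive rate exceeds alpha; it is an illustration under that assumption, not the method developed in this project, and the function names and default values are hypothetical.

```python
from math import comb


def np_threshold_rank(n, alpha, delta):
    """Smallest 1-based rank k such that thresholding at the k-th smallest of n
    held-out class-0 scores keeps P(false positive rate > alpha) <= delta.

    Returns None when n is too small for any rank to satisfy the bound.
    """
    for k in range(1, n + 1):
        violation = sum(
            comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
            for j in range(k, n + 1)
        )
        if violation <= delta:
            return k
    return None


def np_threshold(class0_scores, alpha=0.05, delta=0.05):
    """Score threshold; predict class 1 only when a score exceeds it."""
    scores = sorted(class0_scores)
    k = np_threshold_rank(len(scores), alpha, delta)
    if k is None:
        raise ValueError("too few class-0 samples for this alpha and delta")
    return scores[k - 1]


# With alpha = delta = 0.05, at least 59 held-out class-0 scores are needed.
print(np_threshold_rank(58, 0.05, 0.05), np_threshold_rank(59, 0.05, 0.05))
```

Choosing the smallest admissible rank keeps the threshold as low as the constraint allows, which limits the false negative rate.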

WADDEL, MUBASHER KHAN: Evaluating the efficiency of single cell data in cell type deconvolution

HANNAH WADDEL^1,2, MISHA MUBASHER KHAN^1,2, Brian Nadel², Matteo Pellegrini²

¹BIG Summer Program, Institute for Quantitative and Computational Biosciences,
²Department of Molecular, Cell and Developmental Biology, UCLA

Cell type deconvolution has been used to determine the relative fractions of diverse cell type subsets in heterogeneous tissue samples. Existing deconvolution tools require a reference data set. We wanted to test whether single-cell RNA-seq data could be used as a reference set for cell type deconvolution tools, and we asked how few cells were needed to generate an accurate reference profile. We employed publicly available single-cell RNA-seq datasets from 10X Genomics to create a reference data set. Accuracy was tested by comparing predicted and actual cell type fractions in samples where the cell type abundances are known a priori. We found that the accuracy of prediction increased with the number of cells and reads per cell used to create the reference cell type profiles. We conclude that 1000 cells can accurately model cell type profiles, that additional cells and reads generate only incremental improvements, and that single-cell RNA-seq generates reliable cell type profiles for deconvolution of bulk samples.

WEINER: Integrating chromatin immunoprecipitation sequencing data to identify binding patterns of the small Maf family of transcription factors

ADAM WEINER¹, Jenny C. Link³, Elizabeth Tarling³, Thomas de Aguiar Vallim³

¹ Department of Bioengineering, Henry Samueli School of Engineering,
² BIG Summer Program, Institute for Quantitative and Computational Biosciences,
³ Department of Medicine, Division of Cardiology, David Geffen School of Medicine, University of California, Los Angeles

Previous studies have identified small Maf (sMaf) transcription factors as regulators of bile acid genes. The three members of the sMaf family, MafG, MafF, and MafK, form heterodimers and homodimers with themselves and other transcription factors such as Bach and NRF proteins; these complexes can either activate or repress transcription. The goal of our research was to find binding motifs of sMaf, NRF, and Bach proteins in order to draw conclusions about their dimerization behavior and thus regulatory function. We used publicly available chromatin immunoprecipitation sequencing (ChIP-seq) data along with a pipeline of modeling, motif enrichment, and genomic regions enrichment tools, including MACS, HOMER, and GREAT, to discover the binding motifs, regulated genes, and distance from the transcription start site (TSS) of different sMaf, Bach, and NRF response elements in a human hepatoma cell line (HepG2). Our results indicate that there are strong binding site similarities between MafF and MafK and between Bach1 and NRF2, suggesting that these two sets of proteins were likely the most common dimers in the cellular context we examined. We will now investigate how the binding of the transcription factors to motifs regulates gene expression, focusing on how the regulatory functions of each transcription factor can change upon dimerization. Understanding the regulation and preferential binding sites will allow us to determine mechanistically the role of sMaf, Bach, and NRF proteins in regulating specific genes, particularly those of lipid and bile acid metabolism, in future studies.
