2017 Bruins-In-Genomics Summer Undergraduate Research Program

2017 B.I.G. Summer Best Poster Award Winners

2017 B.I.G. Summer Participants

Lab PI | Mentor(s) | Student(s)
HILARY COLLER | Adriana Corvalan | Roshni Bhatt, Case Western Reserve Univ.; Jigar Patel, UC San Diego
JASON ERNST | Tevfik Umut Dincer | Timothy B. Fisher, Morehouse School of Medicine; Heather Han, Johns Hopkins Univ.
ELEAZAR ESKIN | Serghei Mangul | Linus Chen, UCLA; Garrett Parker, Santa Monica College
ALEXANDER HOFFMANN | Kim Ngo | Kensei Kishimoto, UCLA; Toni Boltz, Univ. of Miami
ALEXANDER HOFFMANN | Diane Lefaudeux | Nick Miller, Cornell University, Ithaca; Guillermo Sanchez-Arriola, Fisk Univ.
LUISA IRUELA-ARISPE | Julia Mack & Milagros Romay | Colleah Gilbert, Florida A&M Univ.; Kate Abe-Ridgway, UC Davis
STEVE JACOBSEN | Wanlu Lui | Renee Haserjian, CSU Los Angeles; Brieana Hollis, Florida A&M Univ.
TRACY JOHNSON | (none listed) | Frank Gutierrez, CSU Los Angeles; Olayemi Oladapo, Florida A&M Univ.
HUIYING LI | Baochen Shi | Jerry Trinh, UCLA; Jeliah Jones, Florida A&M Univ.
JESSICA LI | Yiling Chen & Yidan Sun | Tiffany Tu, George Washington Univ.; Mayra Varillas, UCLA
KIRK LOHMUELLER | Tanya N. Phung | Norris C. Khoo, UCLA
BOGDAN PASANIUC | Claudia Giambartolomei & Huwenbo Shi | Natalie Dong, Boston Univ.; Astrid Manuel, Florida International Univ.; Anthony Fernandes, Cornell Univ.; Christian Torres, UCLA
MATTEO PELLEGRINI | Brian Nadel | Hannah Waddel, Univ. of Utah; Misha Mubasher Khan, Swarthmore College
JESSICA REXACH | (none listed) | Anamika Ghosh, UCLA
JAE HOON SUL | (none listed) | Regina Lee, UCLA; Sang Ji Lyu, UCLA
THOMAS A. VALLIM | Jennifer M. Lang | Elizabeth Vanderwall, UCLA
THOMAS A. VALLIM | Jenny C. Link | Adam Weiner, UCLA
WEI WANG | Chelsea J.-T. Ju | Kelly Cochran, Duke Univ.; D'andrea N. Mitchell, Albany State Univ.
YI XING | Zijun Zhang | Runjia Li, UCLA; Zanchen Li, UCLA
JASMINE ZHOU | Mary Same | Brooke Garland, Pepperdine Univ.; Jermone Morris, Fisk Univ.

2017 B.I.G. Summer Poster Abstracts

ROSHNI BHATT1, JIGAR PATEL1, Adriana Corvalan1,2, Adam Evertts3, Hilary Coller1,2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA,
2 Molecular Biology Interdepartmental Program, UCLA,
3 Molecular Biology Department, Princeton University

The ability to transition between a proliferative state and quiescence, a reversible resting state outside the cell cycle, is essential for proper development. This shift between proliferation and quiescence is linked to differential gene expression, governed in part by chromatin organization and histone modifications. Recently, our lab showed that histone H4K20 is the most differentially methylated histone modification in quiescent fibroblasts, exhibiting an eightfold increase of the trimethyl mark (H4K20me3) in quiescent cells compared to proliferating cells. Knockdown of Suv4-20h2, a methyltransferase that generates H4K20me3, resulted in increased proliferation. We sought to further investigate the changes in genomewide localization of this methylation mark in proliferating versus quiescent human dermal fibroblasts. We performed chromatin immunoprecipitation for different forms of H4K20, and used Model-based Analysis for ChIP-seq (MACS) to identify differential binding sites of H4K20 mono-, di-, and trimethylation in proliferating and contact-inhibited cells. Trimethylated H4K20 showed the greatest difference in binding sites between proliferating and quiescent cells, consistent with our previous data. Genes that gained H4K20me3 in quiescence showed higher gene expression in quiescent than proliferating cells, and were mostly zinc finger genes. Previous studies reported that H3K9me3 is required for H4K20me3 deposition. We compared the H4K20me3 patterns with genomewide localization profiles for trimethylation modifications of H3K9 and H3K27. A majority of differential sites of H3K9me3 in proliferating cells were located on chromosome 19. Interestingly, in quiescent cells H4K20me3 was also enriched on chromosome 19, near genes coding for zinc finger proteins. Our findings raise the possibility that the role of Suv4-20h2 in proliferation is mediated through effects on the expression of zinc finger proteins, a model that we will test experimentally.

LINUS CHEN1,2, Serghei Mangul2, Brian L. Hill2, Igor Mandric3, Russell Littman2, Douglas Yao2, Harry Yang2, Kevin Hsieh2, Parth Ingle2, Arvin Nguyen2, S. Gill2, Nicholas Wu4, Ren Sun2, Jan Schroeder5, Pavel Skums3, Alexander Zelikovsky3, Eleazar Eskin2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 University of California Los Angeles, Los Angeles, CA,
3 Georgia State University, Atlanta, GA,
4 The Scripps Research Institute, La Jolla, CA,
5 Walter and Eliza Hall Institute of Medical Research, Parkville, Melbourne, Victoria, Australia

Error correction is an important computational technique that promises to deliver highly accurate sequencing calls and improve the results of next-generation sequencing (NGS) analysis. While errors in data sets are a concern for any NGS-based application, they mostly affect applications that use variants at frequencies similar to error frequencies. Currently a wide array of error correction methods based on different computational approaches is available, but the optimal choice of error correction is often unclear. We provide the first comprehensive assessment of error-correction algorithms based on high-quality sequencing data derived from heterogeneous populations. Such heterogeneous populations of distinct, but closely related, clonotypes pose a serious challenge to error-correction algorithms due to the comparable level of artificial sequencing noise (i.e., errors) and true low-frequency genetic intra-population diversity. We use two case studies: one consists of a population of viral genomes, and the other consists of a population of T cell receptor clonotypes. We used the UMI-based high-fidelity sequencing protocol (Safe-SeqS) to eliminate errors from the sequencing data. Our approach provides an accurate and robust baseline for performing realistic evaluation of error correction on sequenced genomes. We apply different methods to reads from such communities in order to assess whether error correction affects the ability of the assembly method to identify low-frequency variants. We present the assessment criteria we used in the study to allow the user to make an informed choice of the most suitable software for specific NGS projects.

ZEYUAN CHEN1,2, Elior Rahmani3, Nadav Rakocz4, and Eran Halperin2,5

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA,
2 Department of Computer Science, UCLA,
3 Blavatnik School of Computer Science, Tel-Aviv University,
4 School of Electrical Engineering, Tel-Aviv University,
5 Department of Anesthesiology and Perioperative Medicine, UCLA

Several studies have already shown that genome-wide DNA methylation changes with age, making it a strong candidate for a highly accurate age predictor. In this study, we analyzed the longitudinal KORA dataset (Cooperative Health Research in the Region of Augsburg), which consists of 1799 samples, and correlated each individual’s DNA methylation levels with his or her chronological age. Unlike Steve Horvath’s study, which built a cross-tissue predictor, we focused only on whole blood and expected to achieve further gains in the accuracy and performance of the model. Each time, 40% of the individuals were chosen as the test set, from which we collected the predicted ages. We then applied an elastic net regression model with 10-fold cross-validation to the methylation levels and chronological ages of the remaining 60% of the individuals. The ratio between the L1-norm and L2-norm penalties and their weights were selected automatically by a thorough grid search, as were the significant CpG sites, and the resulting model performed remarkably well. We observed a correlation of over 0.95 on the training set and 0.9 on the test set, with a mean absolute difference of only 2.6 years. The significant CpG sites selected by the regression varied considerably between runs, indicating that more CpG sites may be closely related to age than previously thought. Further analysis of methylation levels at different time points in the KORA dataset will allow us to calculate individual-specific aging rates across time and help us better understand human aging.
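
To make the elastic net step concrete, the sketch below fits an elastic net by plain coordinate descent and scores it with the mean absolute difference. It is not the authors' code: the function names, the toy data, and the fixed `alpha` and `l1_ratio` values are assumptions for illustration, the objective is written in one common parameterisation, and the grid search and 10-fold cross-validation described in the abstract are omitted for brevity.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, alpha, l1_ratio, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||^2 by coordinate descent.
    No intercept is fitted, so X and y are assumed to be centred."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant-zero feature carries no information
            # Covariance of feature j with the residual that ignores feature j.
            rho = sum(v * (r + v * w[j]) for v, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(col, residual)]
                w[j] = new_w
    return w


def mean_abs_error(X, y, w):
    """Mean absolute difference between predicted and observed values."""
    preds = [sum(a * b for a, b in zip(row, w)) for row in X]
    return sum(abs(p - t) for p, t in zip(preds, y)) / len(y)


# Toy, invented data: rows are individuals, columns are centred methylation
# values at two hypothetical CpG sites; y holds centred ages.
X_train = [[-0.2, 0.1], [0.0, -0.1], [0.2, 0.0]]
y_train = [-10.0, 0.0, 10.0]
weights = elastic_net(X_train, y_train, alpha=0.001, l1_ratio=0.5)
print(weights, mean_abs_error(X_train, y_train, weights))
```

CpG sites whose weight ends at exactly zero are the ones the L1 part of the penalty has dropped, which is how an elastic net doubles as a site selector.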

Kelly Cochran1, D’andrea N Mitchell1, Chelsea J.-T. Ju2, and Wei Wang2

1 B.I.G. Summer Program, Institute of Quantitative and Computational Biosciences,
2 Department of Computer Science, UCLA

Machine learning techniques have previously been applied to sequencing data of the human gut microbiome for prediction of metabolic disorders such as type II diabetes. The conventional metagenomic approach uses the presence of biomarkers and species abundance through alignment of short sequencing reads to a large number of microbial genomes. This method requires that all genomes of gut microbiome species be sequenced for reference, and cannot properly account for problems such as non-unique reads which align to multiple genomes and genetically near-identical bacterial strains which cause drastically different health phenotypes. Our approach bypasses alignment by counting kmers from reads, circumventing issues with genome references and genetic similarities between species or strains. We count the frequencies of both 12-mers and 15-mers using Jellyfish and KMC for each individual. Using multiple statistical tests, we determine the kmers with significant differential presence, which can then be used as features for classification of healthy/non-healthy subjects. By applying this method using two different sets of kmers, we can analyze the different predictive capabilities and effects of different ks. In addition, we will be able to explore the substring and superstring relationship between different sets of significant kmers. We demonstrate the potential advantages of kmer counting as a viable, computationally practical, and potentially more accurate alternative for disease prediction models. Better models of how the gut microbiome impacts disease phenotypes may improve our understanding of the mechanisms of the disease as well as diagnostic capabilities when applied in clinical settings.
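
As an illustration of alignment-free k-mer counting, here is a minimal pure-Python sketch. The abstract names Jellyfish and KMC as the counters actually used; this stand-in only shows the idea on tiny inputs, and the function names and example reads are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across a collection of reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_frequencies(counts):
    """Turn raw counts into relative frequencies so that samples
    sequenced to different depths can be compared."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Invented reads, for illustration only.
reads = ["ACGTACGT", "CGTAN"]
counts = count_kmers(reads, 3)
freqs = kmer_frequencies(counts)
print(counts.most_common(2))
```

The per-individual frequency dictionaries produced this way are the kind of feature vectors that a statistical test or classifier could then consume.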

NATALIE DONG1, ASTRID MANUEL1, Claudia Giambartolomei2, Huwenbo Shi3, Bogdan Pasaniuc2,3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Pathology and Laboratory Medicine,
3 Bioinformatics Interdepartmental Program,
4 Department of Human Genetics, UCLA

Genome-wide association studies (GWAS) identify single-nucleotide polymorphisms (SNPs) associated with a phenotype of interest. Although recent years have seen much progress, it remains the case that the vast majority of individuals assayed by GWAS are European. The purpose of this study is to investigate whether the results of GWAS done on Europeans can be applied to other ethnicities. We developed an estimator of causal effect covariance from GWAS summary statistics to measure both genome-wide and locus-specific genetic similarity between two populations. This estimator of causal effect covariance identifies shared causal loci, which are specific regions of the genome that exhibit similarities in causal effects for both populations. It accounts for linkage disequilibrium (LD) and does not require individual-level data, making it applicable to publicly available GWAS summaries. We investigated psychiatric, autoimmune, and kidney-related disorders across African, Asian, and European populations, with sample sizes ranging from 9,000 to 100,000. Of all the complex traits studied, rheumatoid arthritis (RA) and two chronic kidney disorders exhibited statistically significant genome-wide covariance. RA also yielded 286 shared causal loci between Asians and Europeans, 21 of which are on chromosome 6, which harbors the human leukocyte antigen (HLA), known to be associated with RA. Major depressive disorder (MDD) had 91 shared causal loci, post-traumatic stress disorder (PTSD) displayed 6, and four of the five kidney disorders each had 1. Future work will focus on a thorough interpretation of these shared causal loci.

ANTHONY FERNANDES1, CHRISTIAN TORRES1, Claudia Giambartolomei2, Huwenbo Shi3, Bogdan Pasaniuc2,3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Pathology and Laboratory Medicine,
3 Bioinformatics Interdepartmental Program,
4 Department of Human Genetics, UCLA

Genome-wide association studies (GWAS) have been used to identify genetic variants associated with a trait and have created a rich resource of GWAS summary statistics, which allow us to compare the genetic architecture between pairs of traits. One measure of genetic similarity is the genetic covariance between traits, a measure of the similarity between the causal effects of single nucleotide polymorphisms (SNPs) on two traits, taken collectively across the whole genome. The purpose of this study is to quantify the covariance of the causal effects of SNPs on the traits at each independent region in the genome. The method estimates both local and genome-wide covariance from GWAS summary statistics and accounts for linkage disequilibrium (LD). We analyzed GWAS summary data for five psychiatric disorders: attention deficit hyperactivity (ADHD), bipolar (BIP), major depressive (MDD), schizophrenia (SCZ), and autism (AUT), and four lipid traits: LDL, HDL, triglyceride (TG), and total cholesterol (TC), as well as coronary artery disease (CAD) and BMI, with sample sizes ranging from 18,000 to more than 150,000. The genome-wide covariances between CAD and all lipid traits and psychiatric disorders were in the expected direction. For example, BMI and TG show high covariance genome-wide (0.012, SE = 0.0005), as do BIP and SCZ (0.08, SE = 0.001), as previously reported. Future directions for this study are to infer causality between pairs of traits using causal effect covariance and to focus on loci that contribute disproportionately to genome-wide covariance.

TIMOTHY B. FISHER1-2, Tevfik Umut Dincer3, and Jason Ernst3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Biological Chemistry, University of California Los Angeles, Los Angeles, CA 90095, USA
3 Cancer Biology Program, Department of OB/GYN, Morehouse School of Medicine, Georgia Cancer Center for Excellence, Grady Health System, Atlanta, USA

CRISPR-Cas9 based epigenetic regulatory element screening (CERES) is an assay that measures regulatory element activity within the original genomic context, as opposed to ectopic reporter assays such as the Massively Parallel Reporter Assay (MPRA), which capture activation and repression outside the original genomic context. CERES uses the enzyme Cas9 and a piece of RNA, called a guide RNA, to introduce a targeted modification to the genome and quantify its effect. MPRA and CERES both produce a quantitative readout of some functional activity of cis-regulatory elements, but how they correlate across chromatin-state annotations remains unclear. We used chromatin state annotations from the 25-state extended ChromHMM model to quantify the activity of CERES-assayed regions for each chromatin state. We hypothesized that if there are intersecting segments with the same MaxPos Sharpr-MPRA (MPS-MPRA) activating scores, then the CERES segmented region can be used to find the most activating state(s). Using a custom analysis pipeline, we intersected the chromatin states with their respective scores, analyzed the overlapping regions, and plotted MPS-MPRA scores to determine the most activating CERES scores. We observed moderate agreement between high activity in CERES and high activity in MPRA. This analysis puts the CERES data into the perspective of chromatin states and provides a better understanding of the gene regulatory landscape in its native genomic context.

BROOKE GARLAND1, JEROME MORRIS1, Mary Same2, Xianghong Jasmine Zhou3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Bioinformatics Interdepartmental Graduate Program,
3 Department of Pathology and Laboratory Medicine, UCLA

In whole genome bisulfite sequencing (WGBS), methylation information is obtained for every cytosine that gets mapped to the genome, including over 28 million CpG sites. The 450k microarray provides methylation information for only ~450,000 CpG sites, but is relatively inexpensive. Many models that make use of methylation status are trained using the massive amounts of 450k data found on public repositories, but these models often need to be applied to WGBS data. Direct comparison between the 450k data and WGBS data neglects many of the sites present in the WGBS data and is complicated by differences in platform. The problem this study confronts is how to best process 450k and WGBS data so that they are directly comparable. Three WGBS samples and 59 microarray samples from matched age groups, disease status, and tissues were used to determine the best microarray processing procedures. Several microarray normalization methods and binning techniques were used, and changes to the correlation and mean absolute difference (MAD) between the WGBS and 450k data were observed. Quantile normalization followed by beta-mixture quantile normalization (QN.BMIQ) was found to be the best normalization method, and the binning procedure from CancerLocator [Kang S. et al. Genome Biology. 2017] was found to yield the highest correlation and lowest MAD. In the future, models involving methylation status can be constructed from microarray data by first using QN.BMIQ normalization and then collapsing to the CancerLocator bins to improve results on sequencing platforms.
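
The comparison step can be sketched as follows: collapse per-site beta values into bins, then compute the correlation and mean absolute difference (MAD) between platforms over the shared bins. The site IDs, bin assignments, and beta values are invented, and the real CancerLocator binning and QN.BMIQ normalization are not reproduced here.

```python
import math


def bin_means(beta_by_site, site_to_bin):
    """Average the beta values of all CpG sites that fall in each bin."""
    sums, counts = {}, {}
    for site, beta in beta_by_site.items():
        bin_id = site_to_bin.get(site)
        if bin_id is None:
            continue  # site is not assigned to any bin
        sums[bin_id] = sums.get(bin_id, 0.0) + beta
        counts[bin_id] = counts.get(bin_id, 0) + 1
    return {b: sums[b] / counts[b] for b in sums}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_abs_diff(x, y):
    """Mean absolute difference between paired values."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical site-to-bin map and beta values, for illustration only.
site_to_bin = {"cg1": "binA", "cg2": "binA", "cg3": "binB", "cg4": "binB"}
array_beta = {"cg1": 0.2, "cg2": 0.4, "cg3": 0.9}
wgbs_beta = {"cg1": 0.1, "cg2": 0.5, "cg3": 0.8, "cg4": 0.7}

array_bins = bin_means(array_beta, site_to_bin)
wgbs_bins = bin_means(wgbs_beta, site_to_bin)
shared = sorted(set(array_bins) & set(wgbs_bins))
x = [array_bins[b] for b in shared]
y = [wgbs_bins[b] for b in shared]
print(pearson(x, y), mean_abs_diff(x, y))
```

Binning lets WGBS-only sites such as the hypothetical `cg4` contribute to the comparison instead of being discarded.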

ANAMIKA GHOSH1,2 and Jessica Rexach1

1 Department of Neurology,
2 B.I.G. Summer, Institute of Quantitative and Computational Biology, UCLA

Emerging evidence points to a contribution of glial cells to neurodegenerative disease. To identify changes in astrocytes that are associated with tau-associated neurodegeneration as a first step toward characterizing astrocyte disease-associated pathways, we have captured astrocyte-specific ribosome-bound RNA in a highly utilized mouse model of tauopathy (rTg4510) that expresses human mutant P301L tau in forebrain excitatory neurons downstream of a tetracycline transactivator (CamK2-TTA). Our initial analysis has identified a large amount of overlap between the differential gene expression found in astrocytes expressing P301L tau compared to the control mice that express CamK2-TTA only. Recent evidence points to a significant toxicity of TTA in neurons, which we now suspect contributes to this unexpected result. We have corrected the data processing to extract as much of the remaining P301L tau-specific changes as possible. We have optimized numerous differential gene expression strategies, including applying linear regression to limit the effects of particular sample covariates. Next, we will compare the results to complementary transcriptomic data from mouse models of tau-associated neurodegeneration.

COLLEAH GILBERT1, KATE ABE-RIDGWAY1, Julia J. Mack2, Aditya S. Shirali3, Milagros C. Romay2, M. Luisa Iruela-Arispe2,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Molecular, Cell, and Developmental Biology,
3 Department of Surgery, David Geffen School of Medicine,
4 Molecular Biology Institute, UCLA

The vasculature is a highly complex, intricately structured tissue composed of numerous cell types interacting together to promote cellular homeostasis. Blood vessels can be divided into three major classifications based on cellular composition and structure: arterial, venous and capillary. However, given the complex organizational structure of the vasculature, it is extremely difficult to assess transcriptional changes in specific vascular cell types in vivo. In this study, we utilized RNA-seq analysis to assess a novel experimental protocol to quickly isolate vascular cell-type enriched RNA in vivo from mouse aorta. Using both FPKM-based (Cufflinks) and raw count-based (DESeq2) transcriptional analysis, we found that the vessel flush technique produced high-quality, enriched endothelial cell (EC) and vascular smooth muscle cell (vSMC) RNA. The transcriptional profiles of these enriched RNA samples were substantially different from that of the whole mouse aorta, with hundreds of genes significantly differentially expressed (p < 1.0 × 10^-4) between the vascular cell type enriched samples and the whole aorta. Detailed analysis of the transcriptomic profiles of EC-enriched RNA showed on average a 4-fold increase in EC-specific gene expression, while vSMC-enriched RNA showed a 2-fold increase in vSMC-specific gene expression. Taken together, these results strongly suggest that this isolation approach could provide a significant increase in power to detect transcriptional changes in specific vascular cell types of interest. Our findings support the use of this novel method for rapidly isolating vascular cell type enriched RNA for assessing changes in gene expression in the vascular wall cells in normal and pathological conditions.

FRANK GUTIERREZ*1,2, OLAYEMI OLADAPO*1,2, Shawntel Okonkwo2, Calvin Leung2, Stephen Douglass2, Tracy Johnson2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Molecular, Cellular, and Developmental Biology, UCLA
* Co-authors

Gcn5, a major yeast histone acetyltransferase of the SAGA complex, plays a novel role in regulating pre-mRNA splicing. Surprisingly, Gcn5 HAT activity is required for co-transcriptional recruitment of two major U2 snRNP components onto the branchpoints of DBP2 and ECM33 pre-mRNA. Additionally, the regulated dynamics of histone acetylation/deacetylation are critical for proper spliceosome rearrangements on the aforementioned genes. It is, however, unclear how Gcn5 HAT activity affects co-transcriptional splicing of intron-containing genes (ICGs) genome-wide. To determine the role of Gcn5 in co-transcriptional splicing, we analyzed RNA-seq data generated from wild-type, gcn5Δ and H3KΔ9-16 samples in Saccharomyces cerevisiae cells, where H3KΔ9-16 represents a deletion of the major histone acetylation targets of Gcn5. The RNA-seq data also yield quantifiable changes in splicing efficiency for each intron-containing gene through a calculated ratio of spliced to total (spliced and unspliced) normalized counts. We hypothesize that, relative to the wild-type condition, ICGs in the gcn5Δ and H3KΔ9-16 mutant backgrounds will show decreased expression. If Gcn5 HAT activity is required for co-transcriptional spliceosome assembly of two major U2 snRNP components, there will be a decrease in splicing efficiency genome-wide. Since spliceosome assembly occurs in a stepwise manner and gcn5Δ as well as H3KΔ9-16 have been previously shown to affect co-transcriptional splicing, we predict these mutants will result in defective spliceosomes and poor splicing overall. Using a bioinformatic approach, results of these RNA-seq analyses will help clarify the relationship between a major yeast HAT and the spliceosome, and ultimately provide deeper insight into how chromatin can influence splicing in eukaryotes.
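
The splicing efficiency ratio described above (spliced over spliced plus unspliced normalized counts) can be sketched as below. The counts are invented placeholders rather than results, and the function names are hypothetical.

```python
def splicing_efficiency(spliced, unspliced):
    """Ratio of spliced to total (spliced + unspliced) normalized counts.
    Returns None when a gene has no reads, so the ratio is undefined."""
    total = spliced + unspliced
    if total <= 0:
        return None
    return spliced / total


def efficiency_change(wild_type, mutant):
    """Per-gene splicing efficiency in the mutant minus the wild type.
    Both inputs map gene name -> (spliced, unspliced) normalized counts."""
    changes = {}
    for gene, (wt_spliced, wt_unspliced) in wild_type.items():
        if gene not in mutant:
            continue
        mut_spliced, mut_unspliced = mutant[gene]
        wt_eff = splicing_efficiency(wt_spliced, wt_unspliced)
        mut_eff = splicing_efficiency(mut_spliced, mut_unspliced)
        if wt_eff is not None and mut_eff is not None:
            changes[gene] = mut_eff - wt_eff
    return changes


# Invented normalized counts, used only to exercise the functions.
wild_type = {"DBP2": (90.0, 10.0), "ECM33": (80.0, 20.0)}
mutant = {"DBP2": (60.0, 40.0), "ECM33": (80.0, 20.0)}
print(efficiency_change(wild_type, mutant))
```

A negative change for a gene would correspond to the decrease in splicing efficiency that the hypothesis predicts for the mutant backgrounds.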

HEATHER HAN1,2, Tevfik Umut Dincer3,4, Jason Ernst3,4,5

1 BIG Summer Program, Institute of Quantitative and Computational Biosciences, University of California Los Angeles, CA 90095
2 Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218
3 Bioinformatics IDP, University of California Los Angeles, CA 90095
4 Department of Biological Chemistry, David Geffen School of Medicine, University of California Los Angeles, CA 90095
5 Computer Science Department, University of California Los Angeles, CA 90095

The human genome contains over a million candidate cis-regulatory elements in non-coding regions of the DNA, which are responsible for the activation and repression of specific genes. However, prediction of regulatory potential based on the identification of these regulatory elements remains a challenge. Recent developments in CRISPR-Cas9-based genome editing tools have allowed for high-specificity targeting of DNA sequences of interest, furthering our understanding of the roles of various regulatory factors. To leverage this advancement, we sought to develop a novel method to predict regulatory potential genome-wide. We utilized CRISPR–Cas9-based epigenomic regulatory element screening (CERES) data, which targeted DNase I hypersensitive sites (DHSs), from the recent Klann et al. (2017) study, in conjunction with transcription factor binding sites and chromatin state annotations from ENCODE, to train our model. The supervised learning model was trained on a small subset of 281 DHSs near the HBE1 gene in K562 cells, and then used to predict regulatory potential across all 112,025 DHSs in K562 cells. We validated our model through cross-validation and comparisons with massively parallel reporter assays and other CRISPR-Cas9 datasets. We further utilized the method to determine the correlation between sites predicted to have high regulatory function and disease-associated SNPs. We believe that our model serves as a starting point for understanding cis-regulation using CRISPR-Cas9-based tools, and has the potential to improve as more datasets become publicly available.

RENEE HASERJIAN1,2, BRIEANA HOLLIS1,3, Wanlu Lui4, Zhenhui Zhong4, and Steve Jacobsen4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Biological Sciences, California State University, Los Angeles, CA 90032
3 College of Science and Technology, Florida Agricultural and Mechanical University, FL 32307
4 Molecular, Cellular and Developmental Biology, University of California, Los Angeles, CA 90024

DNA methylation is a critical epigenetic process that is involved in gene and transposon silencing in both plants and animals. In plants, DNA methylation is established by the RNA-directed DNA methylation (RdDM) pathway. RdDM involves the synthesis of small interfering RNAs (siRNAs) as well as long non-coding RNAs (lncRNAs) via Polymerase V (Pol V), which recruit the DNA methyltransferase DRM2 to methylate DNA. Epigenome editing is used to precisely modify the epigenetic landscape, which can be applied to stably silence genes. Previously, the Jacobsen lab demonstrated that artificial Zinc Finger Proteins (ZFs) recognize the promoter of the FWA gene while fused to accessory proteins such as SUVH9 or DMS3, which then recruit Pol V. This fusion efficiently methylated the targeted DNA and silenced FWA. ChIP-seq analysis of ZF-DMS3 identified thousands of binding sites, mostly in promoter regions. De novo motif analyses suggest that the lack of binding specificity is due to subsets of the finger domains interacting with the genome. ChIP-seq analysis of Pol V recruitment in ZF-DMS3 plants reveals that most of the off-target sites are able to recruit Pol V. Although ZF-DMS3 displays widespread binding and Pol V recruitment, we observe only limited de novo methylation in Whole Genome Bisulfite Sequencing (WGBS) data. Further analysis of the features of methylated vs. non-methylated off-target sites might shed light on the mechanism that establishes de novo methylation.

NORRIS C. KHOO 1,2,3, Tanya N. Phung 3, Christian D. Huber 4, Kirk E. Lohmueller 3,4,5

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Earth, Planetary, and Space Sciences,
3 Interdepartmental Program in Bioinformatics,
4 Department of Ecology and Evolutionary Biology,
5 Department of Human Genetics, David Geffen School of Medicine, UCLA

Natural selection, a key mechanism driving evolution, not only affects sites under direct selection (i.e. coding regions and functional sites). Diversity at neutral sites can also be reduced through linkage to selected sites. Two mechanisms of linked selection include selective sweeps (reduction in neutral diversity surrounding beneficial fixations) and background selection (reduction in neutral diversity when nearby deleterious variants are removed by selection). Although the impact of linked selection has been observed in a variety of species, the mechanism responsible is less understood. Here, we aim to determine the relative strength of each mechanism by using the dog-wolf system. This system is ideal for studying linked selection because following the dog-wolf split approximately 15,000 years ago, it is hypothesized that dogs experienced more selective sweeps than wolves due to intense artificial selection during domestication and breed formation. Thus, we hypothesized that dogs would show greater effects of linked selection than wolves. Using 13 dog and 6 wolf whole-genome sequences, we determined the strength of linked selection for each species by correlating neutral genetic diversity with the amount of functional content in regions of the genome and recombination rate. Regression analysis indicates that wolves experience a greater reduction in genetic diversity with increasing functional content and decreasing recombination rate. Contrary to our hypothesis, wolves experience more linked selection than dogs, arguing that background selection plays a greater role in shaping genetic diversity compared to selective sweeps.

KENSEI KISHIMOTO1, TONI BOLTZ1,2, Kim Ngo3, Alexander Hoffmann3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA,
2 Department of Computer Science, University of Miami, Coral Gables, FL,
3 Signaling Systems Laboratory, Department of Microbiology, Immunology, and Molecular Genetics, and Institute for Quantitative and Computational Biosciences, UCLA

The Nuclear Factor kappaB (NFκB) family of transcription factors plays a vital role in the regulation of inflammatory and immune responses, cell proliferation and development. Dysregulation of NFκB activity has been implicated in many types of cancer and inflammatory diseases. To characterize the role of NFκB in immune response gene expression, we have generated a knock-in mouse of a Transactivation Domain (TAD) mutant in RelA/p65, the predominant NFκB family member. RelA has two TADs in its C-terminal end, and our RelA-TADmut knock-in mouse carries a deletion of TAD1 and two point mutations, L449 and F473A, in TAD2. In this study, using RNA-seq analysis, we compared differential gene expression between Mouse Embryonic Fibroblast (MEF) and Bone Marrow-Derived Macrophage (BMDM) cells from Wild-Type (WT) and RelA-TADmut knock-in mice stimulated with TNF and LPS at 0, 0.5, 1, 3, and 8 hours. Reads were aligned to the mouse mm10 genome and RefSeq genes, and the data were analyzed for differential gene expression with the R package edgeR. We identified 189 genes that were induced more than fourfold at one or more time points in wild-type MEF cells following TNF stimulation, and 840 genes in MEF cells following LPS stimulation. Similarly, for BMDM cells, we identified 361 genes induced by TNF and 949 genes induced by LPS. Many of these genes are known NFκB targets. We analyzed those genes further using heatmap visualization and principal component analysis in order to identify which genes are down-regulated in the RelA-TADmut as compared to WT. Overall, the mutant had a more severe effect in response to TNF than LPS, which is consistent with our understanding that NFκB is the primary transcription factor during the TNF response, but only one of several during the LPS response. Our detailed results will be discussed during the poster presentation.
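
The "induced more than fourfold at one or more time points" criterion can be illustrated with a small filter over a time course. This is only a sketch: the abstract states that edgeR performed the differential expression analysis, whereas the pseudocount, the gene names, and the expression values below are assumptions made for the example.

```python
import math


def induced_genes(expression, baseline=0.0, min_fold=4.0, pseudocount=1.0):
    """Return genes whose expression exceeds min_fold times the baseline
    at one or more later time points, with their peak log2 fold change.

    expression maps gene -> {time_in_hours: normalized expression}.
    The pseudocount avoids division by zero for unexpressed genes."""
    hits = {}
    for gene, series in expression.items():
        base = series[baseline] + pseudocount
        log_fcs = [
            math.log2((value + pseudocount) / base)
            for time, value in series.items()
            if time != baseline
        ]
        if not log_fcs:
            continue  # no stimulated time points to compare against
        peak = max(log_fcs)
        if peak > math.log2(min_fold):
            hits[gene] = peak
    return hits


# Hypothetical genes and values over the 0, 0.5, 1, 3 and 8 hour time course.
expression = {
    "geneA": {0.0: 1.0, 0.5: 3.0, 1.0: 15.0, 3.0: 7.0, 8.0: 2.0},
    "geneB": {0.0: 10.0, 0.5: 12.0, 1.0: 20.0, 3.0: 15.0, 8.0: 11.0},
}
print(induced_genes(expression))
```

Running the same filter on wild-type and mutant time courses would give two gene sets whose difference highlights genes that depend on the mutated domains.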

REGINA LEE1, Jae Hoon Sul2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Psychiatry and Biobehavioral Sciences at UCLA

Structural variations (SVs) are genetic variations that involve changes in the structure of an individual’s chromosomes. The most common types of SVs include deletions, duplications, insertions, and inversions, which may potentially affect traits of the individual. In this project, we aim to identify novel SVs using the LUMPY software (Layer et al., 2014) and improve the quality of the previous SV calls. LUMPY utilizes multiple SV signals and their positions across samples to enable more sensitive SV discovery compared to other actively maintained SV discovery packages. Using a genome sequencing dataset of large families with bipolar disorder (454 individuals), we filtered out SVs with a high missing rate and monomorphic SVs from the merged VCF file. We then checked for Mendelian errors to measure the accuracy of SV calls. We also re-ran the LUMPY pipeline on the cleaned data and compared the results to the previous analysis. In the future, we plan to refine the LUMPY pipeline to further improve data quality.
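
The filtering and Mendelian error checks can be sketched as follows. Genotypes are encoded as alternate-allele counts (0, 1, 2) with `None` for a missing call, only autosomal diploid sites are considered, and the 10% missing-rate threshold is an arbitrary placeholder rather than the value used in the project.

```python
def transmittable(genotype):
    """Alleles (0 = reference, 1 = alternate) a parent can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_error(father, mother, child):
    """True if the child's genotype cannot arise from one allele per parent.
    Trios with any missing call are not counted as errors."""
    if father is None or mother is None or child is None:
        return False
    possible = {a + b for a in transmittable(father) for b in transmittable(mother)}
    return child not in possible


def missing_rate(genotypes):
    """Fraction of samples without a genotype call."""
    return sum(g is None for g in genotypes) / len(genotypes)


def is_monomorphic(genotypes):
    """True if all called samples share the same genotype."""
    return len({g for g in genotypes if g is not None}) <= 1


def keep_variant(genotypes, max_missing=0.1):
    """Keep a variant only if it is well called and varies across samples."""
    return missing_rate(genotypes) <= max_missing and not is_monomorphic(genotypes)


# Invented genotypes for one variant across four samples, then one trio.
print(keep_variant([0, 1, 2, 0]))
print(is_mendelian_error(father=0, mother=0, child=1))
```

The fraction of trios flagged by `is_mendelian_error` gives a rough, family-based estimate of genotyping accuracy for a call set.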

RUNJIA LI1,2, ZANCHEN LI1,3, Zijun Zhang4, Zhicheng Pan4, Yi Xing4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Computer Science,
3 Department of Chemistry and Biochemistry,
4 Department of Bioinformatics, UCLA

N6-Methyladenosine (m6A) is a widespread base modification in eukaryotic mRNA and plays a key role in translation regulation. Understanding m6A’s potential regulatory role in gene expression and alternative splicing requires knowledge of its topology in the transcriptome. Next-generation, massively parallel Methylated RNA Immunoprecipitation Sequencing (MeRIP-seq) technology has produced abundant sequence read data. A suitable peak detection (calling) algorithm is a crucial requisite for transcriptome-wide profiling of m6A sites from these data, as it must account for the statistical bias and background noise inherent to the data. Here, we designed a set of metrics to evaluate the m6A detection performance of three published peak callers: MACS2, MeTPeak, and RIPSeeker. m6A MeRIP-seq and control reads for the human GM12878 cell line were aligned to the hg19 reference genome, and peaks were generated from the alignment using each of the three peak callers. The called peaks were evaluated by comparing the number of peaks called, the overlap and intersections among peaks, enrichment at different transcript regions, and motif search results. From this analysis, we found that MACS2 outperforms MeTPeak in specificity and RIPSeeker in total peaks called. This result provides valuable data for training future peak calling algorithms on nanopore sequencing.
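
One of the metrics mentioned above, the overlap among peak sets, can be sketched as follows. This is a minimal illustration and not the evaluation code used in the study: peaks are assumed to be `(chromosome, start, end)` tuples with half-open coordinates, and the two toy peak sets are invented.

```python
from collections import defaultdict

def overlap_fraction(peaks_a, peaks_b):
    """Fraction of peaks in peaks_a that overlap one or more peaks in peaks_b.

    Peaks are (chrom, start, end) with half-open intervals, so two peaks
    that merely touch at a boundary do not count as overlapping.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))

    hits = 0
    for chrom, start, end in peaks_a:
        if any(start < b_end and b_start < end
               for b_start, b_end in by_chrom.get(chrom, [])):
            hits += 1
    return hits / len(peaks_a) if peaks_a else 0.0

# Hypothetical peaks from two callers.
caller_1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
caller_2 = [("chr1", 150, 250), ("chr2", 80, 120)]
print(round(overlap_fraction(caller_1, caller_2), 3))
```

The pairwise scan is quadratic in the number of peaks per chromosome, which is fine for a sketch; for genome-scale peak sets a sorted sweep or an interval tree would be the usual choice.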

SANG JI LYU1, Jae Hoon Sul2

1 Department of Computational and Systems Biology,
2 Department of Psychiatry and Biobehavioral Sciences, UCLA

Insertions and deletions (INDELs) are known to be linked to many diseases. An INDEL in the coding region of an mRNA, such as a single-base-pair change, can shift the reading frame during translation and result in a premature stop codon. Our main aim was to acquaint ourselves with Scalpel, a software package for detecting INDELs. More specifically, Scalpel was built to perform localized micro-assembly of specific regions of interest. We wanted to see whether Scalpel was able to perform with “high accuracy and increased power” (Narzisi, 2013) in detecting INDEL mutations. Our focus was on “whole genome versus whole exome studies” (Narzisi, 2013). Scalpel has predominantly been tested on exome capture data; however, it can also detect mutations in whole-genome data. To speed up processing and lower the memory requirement, we ran each chromosome separately. Our future plans are to learn more about Scalpel and to use it in de novo mode and somatic mode. Scalpel is said to be best suited for detecting de novo INDELs in a quad family and somatic INDELs in sequencing data from matched tumor and normal samples.

NICK MILLER1,2, GUILLERMO SANCHEZ1,3, Diane Lefaudeux4, Alexander Hoffmann4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Biological and Environmental Engineering, Cornell University, Ithaca, NY 14850
3 Department of Life and Physical Sciences, Fisk University, Nashville, TN 37208
4 Department of Microbiology, Immunology, & Molecular Genetics, UCLA, Los Angeles, CA 90095

Gene expression dynamics are regulated by mRNA transcription and degradation. Together they determine how long mRNA transcripts of a given gene can actively be translated, producing the proteins that determine biological function. While many studies have elucidated mechanisms of transcriptional control, far fewer have addressed mRNA degradation, in part because there is no accepted method for measuring mRNA half-lives. Here, we analyzed RNA-seq datasets produced following a block of transcriptional elongation with Actinomycin D. In this study, bone marrow-derived macrophages (BMDM) were cultured and either tolerized by LipidA (LPA) pre-stimulation or kept naive. The cells were then stimulated with LPA (time 0) to induce an immune response. Transcription was then blocked by adding Actinomycin D (ActD) at 0, 60, and 180 minutes after LPA treatment. RNA was sequenced for all conditions at 7 different time points after ActD addition. Read counts were normalized using spike-in RNAs (external controls) to account for the diminishing amount of mRNA over time. We generated counts for each mRNA and assessed the quality of the RNA-seq datasets. A linear model was used to fit a decay rate for each gene, and the corresponding half-life was derived for each experiment. Half-lives of 235 LPA-induced genes were calculated. For example, Tnf, Nfkbia, and Mmp13 yielded basal mRNA half-lives of 20, 17.8, and 86.3 minutes, respectively. We also derived confidence intervals and assessed the statistical significance of changes in the derived half-lives across conditions.
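
The step from a fitted decay rate to a half-life can be made concrete with a short sketch. It assumes first-order decay, so that the log of the normalized count falls linearly with time after ActD and the half-life equals ln(2) divided by the decay rate; the exact model used in the study is not specified beyond "a linear model". The time points and counts below are synthetic, and the function name is hypothetical.

```python
import math

def fit_half_life(times_min, counts):
    """Estimate an mRNA half-life (in minutes) by ordinary least squares
    on log-transformed counts: log(count) = intercept - rate * time.

    Assumes spike-in-normalized, strictly positive counts.
    """
    logs = [math.log(c) for c in counts]
    n = len(times_min)
    mean_t = sum(times_min) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times_min)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_min, logs))
    slope = sxy / sxx
    if slope >= 0:
        raise ValueError("counts do not decrease over time; no decay to fit")
    return math.log(2) / -slope

# Synthetic example: seven time points, a transcript decaying with a
# true half-life of 20 minutes from a starting count of 1000.
times = [0, 10, 20, 30, 45, 60, 90]
counts = [1000 * 0.5 ** (t / 20) for t in times]
print(round(fit_half_life(times, counts), 2))
```

With noisy real data, the standard error of the slope from the same regression can be propagated to a confidence interval for the half-life.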

GARRETT PARKER2, Serghei Mangul1,2, Sarah Van Driesche3, Lana S. Martin1, Kelsey C. Martin3,4,5, Eleazar Eskin1,6

1 Department of Computer Science, UCLA
2 Institute for Quantitative and Computational Biosciences, UCLA
3 Department of Biological Chemistry, UCLA
4 David Geffen School of Medicine, UCLA
5 Department of Psychiatry and Biobehavioral Sciences, UCLA
6 Department of Human Genetics, UCLA

Every sequencing library contains duplicate reads. While many duplicates arise during polymerase chain reaction (PCR), some duplicates derive from multiple identical fragments of mRNA present in the original lysate (termed “biological duplicates”). Unique Molecular Identifiers (UMIs) are random oligonucleotide sequences that allow differentiation between technical and biological duplicates. Here we introduce a computational method, UMI-Reducer, that processes mapped sequencing reads to differentiate PCR duplicates from biological duplicates. UMI-Reducer uses UMIs and the mapping position of reads to identify and collapse technical duplicates. The remaining true biological reads are then used for a bias-free estimate of mRNA abundance in the original lysate. This strategy is of particular use for libraries made from low amounts of starting material, which typically require additional cycles of PCR and are therefore most prone to PCR duplicate bias. UMI-Reducer provides an additional, novel functionality for processing reads that are assigned to more than one locus (multi-mapped reads): it stochastically assigns multi-mapped reads based on transcript abundance. The result of UMI-Reducer is a less biased sequencing output. UMI-Reducer is open-source Python software and is freely available for non-commercial use (GPL-3.0) at https://sergheimangul.wordpress.com/umi-reducer/.
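
The core idea of collapsing technical duplicates by UMI and mapping position can be sketched in a few lines. This is not the UMI-Reducer source code; it is a simplified illustration that assumes each mapped read is a dictionary with hypothetical `chrom`, `pos`, `strand`, and `umi` fields, and it treats reads as PCR duplicates only when all four match exactly.

```python
def collapse_duplicates(reads):
    """Keep the first read for each (chrom, pos, strand, UMI) combination.

    Reads sharing a mapping position but carrying different UMIs are
    kept, since they are interpreted as biological duplicates.
    """
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"], read["umi"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept

# Toy input: r1 and r2 are PCR duplicates; r3 maps to the same position
# with a different UMI, so it is a biological duplicate and is kept.
toy_reads = [
    {"id": "r1", "chrom": "chr1", "pos": 1000, "strand": "+", "umi": "ACGT"},
    {"id": "r2", "chrom": "chr1", "pos": 1000, "strand": "+", "umi": "ACGT"},
    {"id": "r3", "chrom": "chr1", "pos": 1000, "strand": "+", "umi": "TTGA"},
    {"id": "r4", "chrom": "chr2", "pos": 5000, "strand": "-", "umi": "ACGT"},
]
print([read["id"] for read in collapse_duplicates(toy_reads)])
```

Exact matching is the simplest possible rule; it does not account for sequencing errors within the UMI, which a production tool may need to handle.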

JERRY TRINH1,2,3, JELIAH JONES1,4, Baochen Shi2, Huiying Li2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA,
2 Department of Molecular & Medical Pharmacology, David Geffen School of Medicine, Crump Institute for Molecular Imaging, UCLA,
3 Department of Chemistry and Biochemistry, UCLA,
4 Department of Biological Science, Florida A&M University

Propionibacterium acnes is a major commensal of the skin microbiome and a key contributor to skin health. However, P. acnes strains from different lineages have been linked to various diseases, suggesting that strain-level differences may be essential in disease pathogenesis. Therefore, the ability to detect strain differences in microbiome data will be crucial to investigations of the skin microbiome and its role in disease. Our objective for this study is to determine whether Pathoscope 2.0, a metagenomic data analysis tool, is able to distinguish strains of P. acnes from different phylogenetic lineages in metagenomic shotgun sequencing data. We generated simulated P. acnes population datasets by sampling the sequence reads of multiple P. acnes strains and combining them at certain ratios. Pathoscope 2.0 was able to distinguish strains from distant lineages, but performed poorly when the strains were phylogenetically closer. We then tested Pathoscope on 26 clinical datasets. The relative abundances of 5 ribotype groups (RT1, RT2/6, RT3, RT4/5, RT8) in each sample correlated with 16S ribosomal RNA sequencing data with a mean Pearson’s correlation coefficient of 0.712 ± 0.339, although Pathoscope overestimated the relative abundances of RT4/5 and RT8. Our simulated microbiome data demonstrated that Pathoscope 2.0 was effective in identifying P. acnes strains from distant phylogenetic groups but could not distinguish those with high phylogenetic similarity. Furthermore, Pathoscope was able to estimate ribotype abundance from data obtained from clinical samples.

TIFFANY TU1, Yidan Sun2, and Jessica (Jingyi) Li2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Statistics, UCLA

Homology is an important area for gene expression studies, as it confirms similarities that exist in both biological structure and genetic function between two species. Often, homologs differ in mutations that occurred in a common ancestor, and such information can be crucial for biomedical research in drug discovery and development. This study focuses on developing a new algorithm for gene expression studies using tight spectral clustering with bipartite node covariates. Given gene expression (FPKM) data and a homolog bipartite network between two species, say human and mouse, we hope to identify tight gene clusters. Understanding tight gene clusters can help infer unknown genetic functions between the two species. That is, given a pre-specified positive integer k, our goal is to identify k “tight” clusters of nodes simultaneously on both sides of the bipartite network, producing tight and stable clusters without forcing all points into clusters. Edge and covariate information is incorporated into the tight clustering algorithm on the bipartite network. RNA-seq gene quantification data for human and mouse tissues such as liver, heart, and stomach were extracted from the Encyclopedia of DNA Elements (ENCODE) and aggregated to fit our model. A dataset of corresponding human and mouse homolog gene IDs is used as our bipartite node covariates.

ELIZABETH VANDERWALL1,2, Jennifer M. Lang3, Jenny C. Link3, Elizabeth J. Tarling3, Thomas A. Vallim3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Microbiology, Immunology & Molecular Genetics,
3 Division of Cardiology, David Geffen School of Medicine, UCLA

The gene ZFP36L1, which we recently discovered as a novel regulator of bile acid synthesis and lipid metabolism, is associated with reduced adiposity, steatosis, and lipid absorption and with increased bile acid levels. In addition, alterations in bile acid levels are known to both affect and be affected by the microbiome. We used liver-specific ZFP36L1 knockouts to investigate the effect of genotype on the microbiome, and we expected to find microbiome differences caused by genotype. However, because mice ingest each other’s microbiome, cohousing genotypically different mice that harbor unique communities should lead to microbiome equilibration. This would then allow us to determine whether genotype-mediated phenotypic traits can be transferred through the microbiome. We hypothesized that cohousing wild-type and ZFP36L1 knockout mice may result in broader metabolic outcomes, at least in part due to equalizing microbiomes. Three groups based on genotype and cohousing condition (genotype groups) were studied: ZFP36L1 liver-specific knockouts, wild-type mice cohoused with knockouts, and wild-type non-cohoused mice. DNA was extracted from the cecum contents of these mice, and the 16S rRNA gene was sequenced using Illumina sequencing to describe the microbiome community. Data were then analyzed in Quantitative Insights Into Microbial Ecology (QIIME) and phyloseq. Principal coordinate analysis (PCoA) ordination plots show that fat mass, body weight, and genotype were significantly associated with bacterial composition. These observations were especially apparent when comparing the ZFP36L1 knockout, wild-type non-cohoused, and wild-type cohoused individuals. Further analysis with DESeq2 demonstrated that bacterial taxa levels were not significantly different between the knockout and cohoused wild-type mice. However, wild-type non-cohoused mice had significantly different levels of bacterial taxa compared with both the knockouts and their cohoused wild-type counterparts.
Since the only difference between the two genetically identical wild-type groups was whether or not they were cohoused, we conclude that the loss of ZFP36L1 causes a shift in the microbiome, which the cohoused wild-type mice can also acquire. We will now determine whether the differences between the two wild-type microbiomes also correlate with other metabolic parameters, such as differences in fat mass, body weight, and bile acid levels. Together, we hope to determine which pathways ZFP36L1 controls that are then associated with changes in the microbiome. Analysis of these data will dictate how future mouse studies comparing loss of ZFP36L1 will be conducted and will enhance our understanding of how ZFP36L1 affects metabolism and the microbiome.

MAYRA VARILLAS1, TIFFANY TU1, Yiling Chen2, Jessica Jingyi Li2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Statistics, UCLA

Classification methods are used to predict labels from a set of attributes and are widely applied to biomedical data. In particular, binary classification restricts the situation to two cases, coded as 0 or 1. Traditional classification methods aim to minimize the overall misclassification rate and do not prioritize one type of error over the other. When analyzing clinical data, however, the consequences of false positives and false negatives are highly asymmetrical. To address this issue, we apply Neyman-Pearson (NP) classification to DNA methylation profiles of healthy and diseased patients. NP classification controls the false positive rate (or false negative rate) under a pre-specified threshold with high probability. The main objective is to develop a variable selection algorithm to select a subset of variables. A statistical model is developed so that the disease diagnosis misclassification rate can be kept under 0.1%. Variable selection is a crucial preprocessing step in data analysis used for predictive and inferential purposes; it benefits biomedical research because computational resources are used more efficiently, creating fewer problems for data analysis. Data preprocessing includes finding an appropriate dataset to fit our framework. Our contribution was to search for data for method development. After data cleaning in R, the data can be supplied to the algorithm.
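
The idea of keeping one error rate under a pre-specified threshold with high probability can be illustrated with an order-statistic thresholding rule. This sketch is an assumption about how such control can be achieved and is not necessarily the procedure used in this project. Given classification scores from held-out class-0 (for example, healthy) samples, it picks the smallest order statistic as the decision threshold such that the probability of the false positive rate exceeding `alpha` is at most `delta`; the names `alpha`, `delta`, and `np_threshold` are hypothetical.

```python
from math import comb

def np_threshold(class0_scores, alpha=0.05, delta=0.05):
    """Choose a score threshold so that, with probability >= 1 - delta,
    the rule "predict 1 if score > threshold" has a false positive rate
    of at most alpha.

    Assumes the class-0 scores are independent draws from a continuous
    distribution and were not used to train the scoring function.
    """
    scores = sorted(class0_scores)
    n = len(scores)
    for k in range(1, n + 1):
        # Probability that using the k-th smallest score as the
        # threshold yields a false positive rate above alpha.
        violation = sum(
            comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
            for j in range(k, n + 1)
        )
        if violation <= delta:
            return scores[k - 1]
    raise ValueError("too few class-0 samples for the requested alpha and delta")

# Toy scores: 59 held-out class-0 samples with scores 0..58.
toy_scores = list(range(59))
print(np_threshold(toy_scores, alpha=0.05, delta=0.05))
```

With `alpha = delta = 0.05`, fewer than 59 held-out class-0 samples cannot satisfy the guarantee, because even the largest score as threshold would violate it with probability 0.95 ** n, which stays above 0.05 until n reaches 59. Tight error targets therefore require correspondingly large samples.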

HANNAH WADDEL1,2, MISHA MUBASHER KHAN1,2, Brian Nadel2, Matteo Pellegrini2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Molecular, Cell and Developmental Biology, UCLA

Cell type deconvolution has been used to determine relative fractions of diverse cell type subsets in heterogeneous tissue samples. To utilize existing deconvolution tools, a reference data set is needed. We wanted to test whether single-cell RNA-seq data could be used as a reference set for cell type deconvolution tools, and we asked how few cells were needed to generate an accurate reference profile. We employed publicly available single-cell RNA-seq datasets from 10X Genomics to create a reference data set. Accuracy was tested by comparing predicted and actual cell type fractions in samples where the cell type abundances are known a priori. We found that the accuracy of prediction increased with the number of cells and reads per cell used to create the reference cell type profiles. We conclude that 1000 cells can accurately model cell type profiles, and that additional cells and reads generate only incremental improvements. We also conclude that single-cell RNA-seq generates reliable cell type profiles for deconvolution of bulk samples.

ADAM WEINER1, Jenny C. Link3, Elizabeth Tarling3, Thomas de Aguiar Vallim3

1 Department of Bioengineering, Henry Samueli School of Engineering,
2 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
3 Department of Medicine, Division of Cardiology, David Geffen School of Medicine, University of California, Los Angeles

Previous studies have identified small Maf (sMaf) transcription factors as regulators of bile acid genes. The three members of the sMaf family, MafG, MafF, and MafK, form heterodimers and homodimers with themselves and with other transcription factors such as Bach and NRF proteins; these complexes can either activate or repress transcription. The goal of our research was to find binding motifs of sMaf, NRF, and Bach proteins in order to draw conclusions about their dimerization behavior and thus their regulatory function. We used publicly available chromatin immunoprecipitation sequencing (ChIP-seq) data along with a pipeline of modeling, motif enrichment, and genomic region enrichment tools, including MACS, HOMER, and GREAT, to discover the binding motifs, regulated genes, and distance from the transcription start site (TSS) of different sMaf, Bach, and NRF response elements in a human hepatoma cell line (HepG2). Our results indicate that there are strong binding site similarities between MafF and MafK and between Bach1 and NRF2, suggesting that these two sets of proteins were likely the most common dimers in the cellular context we examined. We will now investigate how the binding of the transcription factors to motifs regulates gene expression, focusing on how the regulatory functions of each transcription factor can change upon dimerization. Understanding the regulation and preferential binding sites will allow us, in future studies, to determine mechanistically the role of sMaf, Bach, and NRF proteins in regulating specific genes, particularly those of lipid and bile acid metabolism.