2015 Bruins-In-Genomics Summer Undergraduate Research Program

2015 B.I.G. Summer Participants

| Lab PI | Mentors | Students |
|---|---|---|
| UTPAL BANERJEE | | Yan Zhao |
| ARNIE BERK | | Andrew Takeda |
| DAVID CASERO | | Forough Taghavifar, Yasamine Modarresi |
| HILARY COLLER | | Daniel Taylor |
| JASON ERNST | Artur Jaroszewicz | Earle Aguilar, Hana Wasserman |
| GUOPING FAN | | Katherine Sheu |
| ROBERT B. GOLDBERG | | Dominic Saadi |
| ANN M. HIRSCH | Allan W. Chong | Leah Briscoe |
| ALEXANDER HOFFMANN | Gajendra Suryavanshi | Faraz Behzadi, Andre Leon |
| WILLIAM HSU | | Ellen Peterson, Robert Seniors III, Ucheze Ononuju |
| TRACY JOHNSON | Srivats Venkataramanan & Stephen Douglass | Anoop Galivanche |
| JAMES LLOYD-SMITH | Michael G. Buhnerkempe & Katherine Prager | Melissa Barcelona |
| WILLIAM LOWRY | Roberto Spreafico | Ying Lin |
| BOGDAN PASANIUC | Nicholas Mancuso | Christopher Kuo |
| BOGDAN PASANIUC | Gleb Kichaev & Megan Roytman | Megan McGhie |
| YI XING | Shihao Shen | Christian St. Pierre |
| XIA YANG | | Christine Sun |

2015 B.I.G. Summer Poster Abstracts

EARLE B. AGUILAR, Artur Jaroszewicz, Jason Ernst

1 Bruins in Genomics, UCLA, Department of Computer Science, UCLA

Hi-C is a technique that examines the 3D architecture of the human genome. Distal interactions can bring functional elements into spatial proximity with each other, allowing for distal gene regulation. In this study we examine these intrachromosomal interactions, principally in H1 cells, and look for correlations between histone marks and exon inclusion levels. Hi-C data were made doubly stochastic by dividing the rows and columns by a normalization vector, and then normalized by the distance between loci. The resulting matrix, known as the observed over expected (O/E) matrix, was used to find pairs of loci with significant contact. Due to time constraints, the 95th percentile of the O/E matrix was used as a proxy for significance. These significant pairs of loci are presumably important for genomic regulation, including exon splicing. Within these pairs of loci, we look for correlations between histone modifications and exon inclusion levels by analyzing the corresponding ChIP-seq and RNA-seq data. To increase the robustness of the results presented here, we examine only non-constitutive exons with unambiguous local splicing patterns.
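The normalization steps described in this abstract can be illustrated with a small sketch. It is not the authors' pipeline: the function names, the damped iterative balancing scheme, the use of the mean of each diagonal as the "expected" contact level, and the toy 6-bin matrix are all assumptions made for illustration. The sketch also assumes a symmetric matrix in which every row has a nonzero sum.

```python
# Illustrative sketch only; names and the toy data are hypothetical.

def balance(matrix, iterations=200):
    """Scale a symmetric contact matrix so each row and column sums to ~1."""
    n = len(matrix)
    bias = [1.0] * n  # the normalization vector
    for _ in range(iterations):
        # Row sums of the currently scaled matrix M[i][j] / (bias[i] * bias[j]).
        sums = [sum(matrix[i][j] / (bias[i] * bias[j]) for j in range(n))
                for i in range(n)]
        # Damped (square-root) update keeps the scaling symmetric and stable.
        bias = [b * s ** 0.5 for b, s in zip(bias, sums)]
    return [[matrix[i][j] / (bias[i] * bias[j]) for j in range(n)]
            for i in range(n)]


def observed_over_expected(balanced):
    """Divide each entry by the mean contact level at its genomic distance."""
    n = len(balanced)
    expected = []
    for d in range(n):
        diagonal = [balanced[i][i + d] for i in range(n - d)]
        expected.append(sum(diagonal) / len(diagonal))
    return [[balanced[i][j] / expected[abs(i - j)] if expected[abs(i - j)] > 0 else 0.0
             for j in range(n)] for i in range(n)]


def top_pairs(oe, percentile=95):
    """Return off-diagonal locus pairs (i < j) at or above the given percentile."""
    pairs = [(oe[i][j], i, j) for i in range(len(oe)) for j in range(i + 1, len(oe))]
    values = sorted(v for v, _, _ in pairs)
    cutoff = values[min(len(values) - 1, int(len(values) * percentile / 100))]
    return [(i, j) for v, i, j in pairs if v >= cutoff]


# Toy symmetric contact matrix: counts decay with distance, except for an
# enriched contact between bins 0 and 3.
contacts = [
    [20, 10, 4, 12, 1, 1],
    [10, 20, 10, 4, 2, 1],
    [4, 10, 20, 10, 4, 2],
    [12, 4, 10, 20, 10, 4],
    [1, 2, 4, 10, 20, 10],
    [1, 1, 2, 4, 10, 20],
]

balanced = balance(contacts)
oe = observed_over_expected(balanced)
significant = top_pairs(oe)  # pairs treated as significant contacts
```

In this toy example the enriched pair of bins stands out because its O/E value is well above the values of other pairs at the same distance. A real analysis would work at much higher resolution and would need to handle empty or low-coverage bins, which this sketch does not.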

MELISSA A. BARCELONA, Michael G. Buhnerkempe, Katherine Prager, James O. Lloyd-Smith

1 Department of Ecology and Biology, UCLA,
2 Institute for Quantitative and Computational Biosciences, UCLA

Leptospirosis is an understudied zoonotic disease caused by pathogenic spirochetes in the genus Leptospira that affects mammals worldwide. The first documented leptospirosis outbreak in California sea lions (Zalophus californianus) occurred in 1970. No further outbreaks of leptospirosis in Z. californianus were documented until 1975; however, yearly seasonal outbreaks have been occurring since at least 1984, with large outbreaks occurring about every four years. It is currently unknown whether the apparent lack of leptospirosis in Z. californianus before and immediately after 1970 is due to a lack of surveillance or to initial emergence in 1970. We addressed this by analyzing the age distribution of leptospirosis cases in Z. californianus across years to determine whether initial emergence occurred in 1970 and to identify an age distribution that would make the population susceptible to large outbreaks. We found that the individuals infected during the 1970 outbreak included a higher proportion of adults than those infected during other outbreaks. Additionally, we found that large outbreak years from 1984 to the present had a higher proportion of juveniles and a lower proportion of adults than smaller outbreak years. This suggests that 1970 may have been the first actual outbreak of leptospirosis in Z. californianus and that a build-up of susceptible juveniles is necessary to generate large outbreaks in the system.

FARAZ BEHZADI*, ANDRE N. LEON*, Gajendra Suryavanshi, Roberto Spreafico, Alexander Hoffmann

1 Signaling Systems Laboratory, Institute for Quantitative and Computational Biosciences, UCLA

As the cost of next-generation sequencing continues to decline, experiments addressing gene expression control will be increasingly genome-wide and detailed. The predicted growth of the field necessitates the study of best practices for bioinformatics pipelines, from pre-processing to downstream analysis of RNA-sequencing data. Here we examined best practices for the analysis of inducible gene expression programs triggered by immune and inflammatory stimuli. Using R, R packages, Bioconductor packages, and a variety of data sets, both replicate and non-replicate, we examined the effects of numerous factors on traditional methods of downstream analysis. The factors examined include pseudocount addition, thresholds for data significance, and clustering methods. We also explored the possibility of improving analysis by using cubic splines (piecewise third-order polynomial functions) in combination with edgeR (a widely used R package for gene expression studies). The work highlighted the importance of setting reasonable thresholds for significance, the risk of low reliability in genes expressed at less than 5 counts per million, and the value of comparing clustering methods. In the future, we seek to build upon our work on cubic splines in gene expression experiments and to further examine the fundamental factors involved in downstream analysis of RNA-sequencing data.
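Two of the factors named in this abstract, pseudocount addition and the 5 counts-per-million reliability threshold, can be made concrete with a short sketch. The authors worked in R; the Python below is only an illustration, and the function names, the pseudocount of 1, the "at least one sample" filtering rule, and the toy counts are assumptions rather than details taken from the study.

```python
import math

# Illustrative sketch only; names, defaults, and data are hypothetical.

def counts_per_million(counts):
    """Convert raw read counts for one sample into counts per million (CPM)."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_expression(cpm, pseudocount=1.0):
    """Log-transform CPM values; the pseudocount keeps zero counts finite."""
    return {gene: math.log2(value + pseudocount) for gene, value in cpm.items()}


def reliable_genes(cpm_by_sample, min_cpm=5.0):
    """Keep genes that reach min_cpm in at least one sample (assumed rule)."""
    genes = set()
    for cpm in cpm_by_sample.values():
        genes.update(g for g, v in cpm.items() if v >= min_cpm)
    return sorted(genes)


# Toy data: two samples with one million reads each.
raw = {
    "sample1": {"geneA": 900_000, "geneB": 99_990, "geneC": 4, "geneD": 6},
    "sample2": {"geneA": 500_000, "geneB": 499_994, "geneC": 3, "geneD": 3},
}
cpm_by_sample = {name: counts_per_million(c) for name, c in raw.items()}
kept = reliable_genes(cpm_by_sample)
log_cpm = {name: log2_expression(cpm) for name, cpm in cpm_by_sample.items()}
```

The choice of pseudocount matters most for exactly the genes near the threshold: adding 1 to a CPM of 3 changes the log value far more than adding 1 to a CPM of 900,000, which is one reason low-count genes are treated with caution.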

LEAH P. BRISCOE1, Allan W. Chong1, and Ann M. Hirsch1,2

1 Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, CA 90095
2 Molecular Biology Institute, University of California, Los Angeles, CA 90095

Nitrogen-fixing bacteria that nodulate legumes produce a plethora of molecules that enable complex communication with their potential hosts. These root-dwelling bacteria, termed rhizobia, produce exopolysaccharides (EPS) as well as Nod factors, the products of nodulation genes, which are essential for effective nodulation and nitrogen fixation on legumes. Exopolysaccharides are well studied in Ensifer meliloti, an alphaproteobacterium that nodulates legumes. Some Burkholderia species are betaproteobacteria that also nodulate legumes. Comparison of the nod genes in Burkholderia tuberum with those of other rhizobia suggests that this species obtained some of its symbiotic genes through horizontal gene transfer from alphaproteobacteria. However, much less is known about the exo genes of B. tuberum. In this study, we compare the B. tuberum genome to those of other β-rhizobia and α-rhizobia to see whether B. tuberum’s EPS pathway differs from that of these other species. A different pathway may suggest that the EPS genes are involved in host range, just as recent evidence implicates EPS in host specificity in α-rhizobia (Kawaharada et al.). Using the Integrated Microbial Genomes database from the Joint Genome Institute, we obtained the protein sequences for a number of genes involved in exopolysaccharide synthesis in 19 α- and β-rhizobial species. The putative exo, exs, and exp genes in these species were compared using sequence alignment and assessment of functional domains. We found that the EPS genes in B. tuberum have a higher-than-expected similarity to genes in alphaproteobacteria; whether B. tuberum obtained these genes through horizontal gene transfer is the focus of future study.

ANOOP R. GALIVANCHE1, Srivats Venkataramanan2, Stephen Douglass2, and Tracy Johnson2

1 BIG Summer, UCLA
2 Department of Molecular, Cell, and Developmental Biology, UCLA

Abundance of chromatin remodeler Snf2 controls alternative splicing of stress response gene PTC7

Pre-mRNA splicing, the removal of noncoding intron sequences from pre-mRNAs, is a fundamental gene expression reaction. An important feature of splicing is that it occurs co-transcriptionally, while RNA is being synthesized from a chromatin template. Understanding the mechanism of co-transcriptional splicing and the role of chromatin and chromatin modifications has been a crucial, but poorly understood, challenge in molecular biology. Chromatin remodelers alter DNA availability to transcription and other gene expression machinery. The chromatin remodeler Snf2 has been implicated in RNA splicing by an as yet unknown mechanism. Our lab has recently shown a role for Saccharomyces cerevisiae Snf2 in regulating the expression of ribosomal protein genes (RPGs), an abundant class of intron-containing genes. Our analysis of previously published ChIP-seq data revealed an unexpected enrichment of Snf2 at the promoters of RPGs. These observations have led to a model whereby Snf2 drives RPG expression such that deletion of Snf2 causes a reduction in the abundance of RPG transcripts, which frees spliceosomes to associate with substrates with non-consensus splice signals. Splicing of PTC7, a stress response gene with non-consensus 5’ and 3’ splice sites, increases markedly when SNF2 is deleted. Previously published studies show that retention of the PTC7 intron also produces a translated protein. We hypothesize that Snf2 levels change to control the relative expression of the two functionally distinct Ptc7 protein isoforms through the spliceosome reallocation effect. Moreover, our bioinformatics analysis provides evidence suggesting Snf2 could also affect splicing through direct association with the PTC7 gene bo

CHRIS KUO, Nicholas Mancuso, Bogdan Pasaniuc

1 Departments of Biochemistry and Bioinformatics, UCLA

Estimating the variation in a trait attributable to genetics (i.e., heritability) is a fundamental component in the study of complex traits. Initial methods for heritability estimation relied on pedigree data, while recent approaches use genotype data, typically assayed through genotyping arrays that capture the common variation across the genome. While heritability studies are typically performed using array data, these chips fail to probe all genetic variants in a sample. Alternatively, modern sequencing platforms can be employed to better identify genetic variation, but at a greater cost. Low-coverage sequencing makes it possible to sequence many samples while decreasing the associated cost [1]. The challenges of estimating heritability directly from low-coverage sequencing data have yet to be thoroughly investigated. Here, we investigate recovering the underlying heritability of complex traits from low-coverage sequencing data and discuss approaches to minimize the negative impacts of sparse data on heritability estimates. Using simulated and real genotype data, we simulated reads and estimated genetic relationship matrices to model how various coverage levels bias heritability estimates. With simulated genotype data, we observed nearly complete overlap with the true heritability at 5x coverage. The heritability estimates from real genotype data were largely inflated. The cause of this inflation is currently being investigated. By understanding the effects of low-coverage sequencing on heritability estimates, we can better capture underlying genetic variation while simultaneously keeping costs low.

  1. CONVERGE Consortium (2015). “Sparse whole-genome sequencing identifies two loci for major depressive disorder.” Nature 523(7562): 588-591.
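The genetic relationship matrices mentioned in the heritability abstract above can be sketched as follows. The abstract does not state which estimator was used, so this example assumes a commonly used form in which each variant is standardized by its allele frequency and relatedness is the average product of standardized genotypes. The function name and the toy genotypes are hypothetical, and genotypes are assumed to be known exactly (0, 1, or 2 copies of an allele) rather than inferred from low-coverage reads.

```python
# Illustrative sketch only; the estimator choice and all names are assumptions.

def genetic_relationship_matrix(genotypes):
    """Estimate pairwise relatedness from an individuals x variants matrix.

    Each entry of `genotypes` is an allele count (0, 1, or 2).
    Variants with allele frequency 0 or 1 carry no information and are skipped.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    standardized = [[] for _ in range(n)]
    for snp in range(m):
        column = [genotypes[ind][snp] for ind in range(n)]
        p = sum(column) / (2 * n)  # sample allele frequency
        if p == 0.0 or p == 1.0:
            continue
        scale = (2 * p * (1 - p)) ** 0.5
        for ind in range(n):
            standardized[ind].append((column[ind] - 2 * p) / scale)
    used = len(standardized[0])
    if used == 0:
        raise ValueError("no polymorphic variants to estimate relatedness from")
    return [[sum(a * b for a, b in zip(standardized[j], standardized[k])) / used
             for k in range(n)] for j in range(n)]


# Toy data: individuals 0 and 1 share genotypes, as do individuals 2 and 3.
genotypes = [
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [2, 1, 0, 0],
    [2, 1, 0, 0],
]
grm = genetic_relationship_matrix(genotypes)
```

With low-coverage data the allele counts themselves are uncertain, so noise in the genotypes propagates into the matrix and from there into the heritability estimate, which is the kind of bias the abstract sets out to characterize.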

YING LIN1, Roberto Spreafico2, Yuan Xie1, Benjamin Garcia3, and William Lowry1

1 Molecular, Cell, and Developmental Biology, University of California, Los Angeles
2 Immunology, Molecular Biology, Genomics, University of California, Los Angeles
3 Department of Biochemistry and Biophysics, University of Pennsylvania

Early human development occurs in low oxygen tension, and previous studies showed that changes in oxygen tension functionally impact human neural development. Human neural progenitor cells (NPCs) derived in vitro from either embryonic stem cells or induced pluripotent stem cells show immature characteristics, as demonstrated by their propensity to make many more neurons than glia. By culturing NPCs in low oxygen or treating them with hypoxia mimetic compounds, we found that progenitors became more gliogenic. Interestingly, even a temporary exposure to low oxygen was sufficient to promote generation of glial cells. Therefore, we hypothesized that low oxygen tension induces epigenetic changes that can permanently influence cell fate in NPCs. Using mass spectrometry, we found that H3K27Ac and H3K9me3 were dramatically altered globally in response to lowered oxygen tension. To understand where in the genome these changes occurred, we conducted a pulse-chase experiment in NPCs exposed to low oxygen or a hypoxia mimetic and performed chromatin immunoprecipitation sequencing (ChIP-seq) analysis. Better knowledge of how oxygen tension affects the epigenome could help us develop methods to generate more mature human pluripotent stem cell derivatives in vitro.

MEGAN MCGHIE, Gleb Kichaev, Megan Roytman, and Bogdan Pasaniuc

1 Institute for Quantitative and Computational Biosciences,
2 Departments of Biostatistics and Human Genetics, UCLA

Although genome-wide association studies have found thousands of risk variants, the causal mechanism underlying risk loci is generally unknown. Recent work has shown that functional annotation and association data can be integrated to improve accuracy when choosing likely causal variants for further validation. Combined Annotation Dependent Depletion (CADD) scores are a summary annotation that incorporates multiple functional metrics (e.g., conservation) into one score for each variant in the genome. We investigated whether integrating CADD scores with association data improved accuracy at finding potential causal variants for aberrant phenotypes. To answer this question, we analyzed data from a large-scale rheumatoid arthritis study using various functional annotations together with the CADD scores. We found that CADD scores were among the top ten functional annotations enriched with causal variants across 480 tested annotations, with an enrichment of 1.8 (p-value = 0.00315) for CADD. When run in a model including the top five existing functional annotations (from groups including skin keratinocytes, T helper 2 cells, B lymphocytes, immune enhancer regions, and exon regions), the CADD score enrichment remained significant at 1.6 (p-value = 0.047). In addition to rheumatoid arthritis, we assessed the performance of CADD scores as an annotation for lipids data, including HDL, LDL, triglycerides, and total cholesterol. We found similar improvements in prediction accuracy, demonstrating that CADD scores can be applied to various traits in order to better detect plausible causal variants.

YASAMINE MODARRESI*1,2, FOROUGH TAGHAVIFAR*1,2, David Casero 2, 3

* Contributed equally
1 Dept. of Biology, CSUN
2 UCLA B.I.G. Program
3 Dept. of Pathology and Laboratory Medicine, UCLA

Transcriptome sequencing has been successfully employed to reveal differential gene expression and splicing in human cancer. Here, we aim to identify differentially expressed and differentially spliced genes in three common subtypes of non-Hodgkin lymphoma: diffuse large B-cell lymphoma (DLBCL), Burkitt lymphoma (BL), and follicular lymphoma (FL). We analyzed RNA sequencing data from a panel of 30 patients and performed differential expression and differential splicing analysis using Cuffdiff. We found 3637 genes that were differentially expressed in at least one of the subtypes, of which 829, 414, and 26 were specific to FL, BL, and DLBCL, respectively. We then identified genes that were differentially spliced among these subtypes and created a list of splicing signatures for each tumor type. These expression signatures were enriched in functions and pathways relevant to cancer and lymphomagenesis. Our results can be used to categorize different subtypes of lymphoma according to their gene expression profile, and provide, for the first time, a differential splicing signature that could inform future functional studies.

DOMINIC E. SAADI, Xiaomeng Wu, Min Chen, Jungim Hur, Robert B. Goldberg

1 Department of Molecular, Cell, and Developmental Biology, UCLA

Embryogenesis is a process of division and differentiation by which the embryo forms from the zygote. In plants, embryogenesis begins with a double fertilization event: one sperm cell merges with the egg cell, forming the zygote, while the other fuses with the central cell, forming the endosperm, a structure which provides nutrients for the developing embryo and germinating seedling. The zygote divides to give rise to two structures: the embryo proper, or the next generation plant, and the suspensor, a structure which nurses the embryo but degenerates as the embryo matures. The embryo proper and suspensor differ widely in morphology between species, which raises the question: what is the function of the suspensor in plant embryos, and how are different morphologies indicative of different functions? To investigate this question, we used laser capture microdissection (LCM) to isolate the globular stage embryo proper and suspensor of two closely related legume species which possess giant suspensors, the Scarlet Runner Bean (SRB) and Common Bean (CB), in order to profile the active genes within the tissues. After isolating the RNA, we performed RNA-seq on an Illumina platform and used edgeR to detect differentially expressed genes. We found the suspensor of the SRB and the CB to be enriched with transport-related and biosynthesis genes, suggesting that the suspensor acts as a hub for plant hormone synthesis and for transfer of nutrients and growth factors to the embryo proper.

CHRISTIAN ST. PIERRE, Shihao Shen, Yi Xing

1 BIG Summer
2 Microbiology, Immunology, & Molecular Genetics
3 Bioinformatics Interdepartmental Ph.D. Program

Understanding the growth of brain cancer and predicting individual outcomes pose a challenge to radiologists. With recent developments in high-throughput sequencing technologies, large databases have been created consisting of both genotypic and phenotypic data. The challenge is to integrate these data and extract useful information. Our study aims to use existing databases to determine significant genes correlated with characteristics of Glioblastoma Multiforme (GBM). MRI imaging features were used along with microarray expression data, both available through The Cancer Genome Atlas (TCGA). A preliminary correlation analysis uncovered 3101 significant genes (p-value < 0.05).

Examining the Role of DNA Methylation in Naive Pluripotent Stem Cells

KATHERINE SHEU1,2, and Guoping Fan1

1 Department of Human Genetics, UCLA,
2 Institute for Quantitative and Computational Biosciences, Department of Molecular, Cell, and Developmental Biology, UCLA

The conventional serum-based culture media for mouse embryonic stem cells gives rise to a cell population that is considered metastable and exhibits heterogeneous properties reminiscent of both the naive and primed pluripotent states. These two states of pluripotency can be represented in vivo by cells of the mouse inner cell mass and mouse post-implantation epiblast, respectively. Adding two inhibitors (2i) to the culture media can stabilize the cells in the naive ground state through inhibition of the MEK/ERK and GSK3-beta pathways. Epigenetic differences between naive and metastable pluripotency include global hypomethylation in the naive state, as well as downregulation of de novo DNA methyltransferases (Dnmt3a/3b). The proposed mechanism underlying the establishment of hypomethylation in the naive state is transcriptional repression of de novo Dnmt expression, but the interdependence of the stages of pluripotency on methylation level is unclear. Whether the loss of de novo DNA methyltransferases and subsequent hypomethylation is a requirement for production of naive pluripotency is examined using knock-out and Dnmt3a reconstitution cell lines. Understanding the dynamical control between these stem cell states will be important for studying the in vitro behavior of embryonic stem cell cultures and the epigenetic regulation of early development.

CHRISTINE SUN, Brandon Tsai, Qingying Meng, Le Shu, Bassem Shoucri, Raquel Garcia, Bruce Blumberg, Xia Yang

1 Institute for Quantitative and Computational Biosciences,
2 Department of Integrative Biology and Physiology, UCLA

Much evidence indicates that early exposure to drugs and chemicals, particularly obesogenic endocrine-disrupting chemicals (EDCs), is linked to adipogenesis and obesity. Among EDCs, tributyltin (TBT) has been shown to increase adipocyte sizes, reprogram mesenchymal stem cells to favor an adipogenic lineage, and induce hepatic steatosis in future generations. Therefore, it is important to understand the molecular mechanisms of TBT and its transgenerational effects on metabolic disorders. We hypothesize that prenatal TBT exposure reprograms fetal development and increases susceptibility to metabolic disorders in adulthood as well as in future generations by perturbing tissue-specific metabolic pathways and gene networks. We utilized RNA sequencing analysis to measure the transcriptional activities in liver tissues from the F1 and F3 generation mouse offspring, and identified 706 (F1: p<0.01) and 1525 (F3: p<0.05) differentially expressed genes at gene, isoform, and splicing levels. Using these gene sets we conducted gene ontology enrichment analysis and revealed relevant biological pathways perturbed by TBT, such as PPAR-alpha, adipocytokine signaling, developmental biology, and RAR/RXR signaling. We also associated the TBT-perturbed genes to metabolic diseases in humans using genome-wide association studies. Using key driver analysis, we identified 90 and 23 potential key regulators of the TBT-affected genes in the F1 and F3 generations, respectively. This systems biology study uncovered important molecular mechanisms behind TBT’s transgenerational effects on metabolic disorders. The mechanistic insights obtained will help pinpoint key perturbation points for future experimental validation and therapeutic target identification to counteract environmentally-induced obesity.

RNA sequencing of quiescent and proliferating fibroblasts reveals differentially expressed genes and splicing isoforms

DANIEL G. TAYLOR1, Mithun Mitra1,2, Elizabeth L. Johnson3, David C. Corney1,2,3, and Hilary A. Coller1,2

1 Department of Molecular, Cell and Developmental Biology, University of California, Los Angeles, Los Angeles, California, United States of America
2 Department of Biological Chemistry, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, United States of America
3 Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America

Cell quiescence is a reversible resting phase in the cell cycle and is important biologically, such as in stem cell maintenance, wound healing, and cancer dormancy. While it has previously been shown that several genes are differentially expressed in quiescent cells as compared to proliferating ones, these data are limited and fail to include detailed information regarding differential RNA processing events. Thus, we compared the differential gene and isoform expression between quiescent human dermal fibroblasts and proliferating ones through RNA sequencing. Analysis of the data revealed thousands of differentially regulated genes, such as the upregulation of extracellular matrix genes and downregulation of cell cycle genes in quiescent cells. Further, we discovered alternative splicing events through the use of the splicing software replicate Multivariate Analysis of Transcript Splicing (rMATS), and these genes were enriched for RNA binding proteins. Ultimately, these data suggest a multitude of genes and RNA processing events that are differentially regulated between quiescent and proliferating cell states, providing a framework for further analysis of the individual functions of these genes.

HANA WASSERMAN1,2,3, Artur Jaroszewicz4, and Jason Ernst4,5,6

1 BIG Summer Program, University of California Los Angeles, CA., USA
2 Dept. of Molecular Biology, Colorado College, CO., USA
3 Dept. of Computer Science, Colorado College, CO., USA
4 Bioinformatics Interdepartmental Program, University of California Los Angeles, CA., USA
5 Dept. of Biological Chemistry, University of California Los Angeles, CA., USA
6 Dept. of Computer Science, University of California Los Angeles, CA., USA

Intrachromosomal 3-dimensional conformation is believed to play an important role in gene expression regulation by controlling genomic regions’ accessibility to transcription factors. With techniques like Hi-C and ChIP-seq, it is now possible to study transcription factor binding within regions of intrachromosomal folding. Due to inherent biases in crosslinking in the library preparation step of the ChIP-seq assay, it remains unclear if signals originate from a direct binding event, such as to a transcription factor motif, or from a distal binding event due to 3-dimensional proximity. We employed high-resolution Hi-C data along with matched cell type ChIP-seq data, to investigate the occurrence of transcription factor motifs to determine how often ChIP-seq signals “cross over” from one genomic locus to another. Preliminary results from this study will give insight as to whether transcription factors interact distally or locally within intrachromosomal folding and how 3-dimensional DNA conformation more generally affects gene expression.