2016 Bruins-In-Genomics Summer Undergraduate Research Program

2016 B.I.G. Summer Alumni Updates

  • November 2016: Congratulations to Favour Esedebe, mentored by Dr. Roel Ophoff, and Kofi Amoah, mentored by Dr. Tracy Johnson, for winning prestigious poster awards at this year's ABRCMS meeting in Tampa, Florida. The Annual Biomedical Research Conference for Minority Students (ABRCMS) had over 4,000 attendees and 1,500 student presentations in 2016.

2016 B.I.G. Summer Participants

Lab PI | Mentor | Student
HILARY COLLER | Mithun Mitra | Alec Chiu, UCLA
JASON ERNST | Adriana Sperlea | Ashish Gauli, Fisk Univ.
 | | Scott De Taboada, UCLA
ELEAZAR ESKIN | Serghei Mangul | Benjamin Statz, UCLA
 | | Jeremy Rotman, UCLA
 | | Kevin Wesel, Westlake-Harvard High School
 | | Will Van Der Lay, UCLA
JONATHAN FLINT | | Samantha Jensen, Brigham Young Univ.
DANIEL GESCHWIND | Rebecca Walker | Sanan Venkatesh, UCLA
TRACY JOHNSON | | Kofi Amoah, Fisk Univ.
 | | Anna Marie Rowell, Univ. of Oklahoma
SRIRAM KOSURI | Kimberly Insigne | Brenda Ji, Wellesley College
HUIYING LI | Baochen Shi | Kathleen Muenzen, Scripps College
 | | Rebecca Oyetoro, Florida A&M Univ.
KIRK LOHMUELLER | Tanya N. Phung | Jesse Garcia, UCLA
ROEL OPHOFF | Catharine E. Krebs | Favour Esedebe, Fisk Univ.
 | | Nihal Eltom, Univ. of Colorado Denver
PÄIVI PAJUKANTA | Marcus Alvarez | Marissa Di, Univ. of Southern California
 | | Nelson Chou, Brown Univ.
BOGDAN PASANIUC | Huwenbo Shi | Sarah Spendlove, Brigham Young Univ.
 | Kathryn Burch | Terisha Paul, Fisk Univ.
ALVARO SAGASTI | Jeff P. Rasmussen | Anastasia Repouliou, Princeton Univ.
BRYAN A. SMITH | | Michael Thompson, UCLA
 | | Youngjun Park, UCLA
GRACE XIAO | Stephen Tran & Ashley Cass | Daniel Johnson, Morehouse College
 | | Deja Goodson, Florida A&M Univ.
 | | Nathan Spear, Geneva College
YI XING | Emad Bahrami-Samani | Joshua Sherfey, The Master's College
 | | Robert Seniors, Florida A&M Univ.
XIA YANG | Yuqi Yang | Ashok Arjunakani, Univ. of Illinois Urbana Champaign
 | | Ugoma Onubogu, Florida A&M Univ.

2016 B.I.G. Summer Poster Abstracts

KOFI AMOAH1, Calvin Leung2, Stephen Douglass3, Srivats Venkataramanan3, Tracy Johnson3

1 Biology, Department of Life and Physical Sciences, Fisk University, Nashville, TN 37208
2 Molecular Biology Institute, University of California, Los Angeles, CA 90024
3 Molecular, Cellular and Developmental Biology, University of California, Los Angeles, CA 90024

RNA splicing, which involves the excision of introns and ligation of exons, has been shown to be coupled with transcription, a process known as co-transcriptional splicing. Because splicing happens in the vicinity of chromatin, the chromatin environment may play an important role in splicing. H3K4me3 is a histone mark that has previously been shown to be involved in splicing in mammals. We sought to determine whether this mark is important for splicing in Saccharomyces cerevisiae. Interestingly, we have shown that PRP43, a gene that encodes a spliceosome disassembly factor, genetically interacts with SET1 and that Prp43 protein physically interacts with chromatin. We hypothesize that placement of H3K4me3 on intron-containing genes (ICGs) may affect RNA splicing outcomes. To test this hypothesis, ICGs were clustered based on their H3K4me3 enrichment using published ChIP-Seq data. These clusters were compared to RNA-Seq data from cells lacking Set1 (set1∆) and cells with defective Prp43 (prp43-1). First, we observe H3K4me3 enrichment near splicing signals in ICGs. Second, we confirm that prp43-1 has a significant splicing defect, and significant correlations were found between the clusters and the changes in splicing efficiencies of genes observed in the RNA-Seq data. Surprisingly, splicing of genes that show more subtle methylation appears to be more affected in both the prp43-1 and set1∆ strains. A closer examination of these genes reveals that they are the highly expressed ribosomal protein genes (RPGs). This leads us to two models: (1) H3K4me3 may be directly involved in recruiting the spliceosome to ICGs, and (2) deletion of the H3K4me3 mark induces changes in gene expression that lead to redistribution of the spliceosome in the cell to poorly spliced genes. Our results show the power of relating ChIP-Seq and RNA-Seq datasets to understand the mechanism of co-transcriptional splicing.


1 Department of Bioengineering, University of Illinois at Urbana-Champaign
2 Department of Biology, Florida Agricultural & Mechanical University
3 Department of Integrative Biology and Physiology, University of California, Los Angeles

Gastric cancer (GC) remains highly prevalent worldwide, causing the majority of cancer-related deaths in developing countries. Furthermore, GC shows high geographic variability and its underlying mechanisms are largely unknown. To better understand GC tumorigenesis, we used GWAS data from 8 different populations to create a gene network indicating the conserved key driver genes among all populations. For each population, Multiscale Embedded Gene Co-expression Network Analysis (MEGENA) was used to create pathway-like modules of similarly expressed genes. Then, pairwise comparisons were performed between the coexpression modules of different populations to identify the genes they have in common. Modules that shared genes with a module from every population were labeled as conserved. These modules were merged to create separate, independent modules with no gene overlap. Pathway and network meta-analysis resulted in 14 weighted GC gene networks. These networks were linked to several GC-associated processes such as ephrin receptor binding, pyrimidine metabolism, cation homeostasis, and lipid biosynthesis. The key driver genes of the networks were cross-referenced with known GC genes. Several key drivers were supported by the literature; however, the most significant ones, those involved in tumor suppression and transcriptional regulation processes, had no prior association with GC. Enrichment analysis comparing gene expression between GC patients and normal controls indicated that mitotic and intracellular signaling pathways are greatly perturbed in GC patients. Additionally, the pathways related to signal peptides and membrane were strongly downregulated. Overall, our findings indicate several novel biological processes and key drivers of GC pathogenesis.

FARAZ BEHZADI1,2, Zhang Cheng2, Alexander Hoffmann2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Microbiology, Immunology, & Molecular Genetics, University of California Los Angeles

Advances in technology have enabled high-throughput, quantitative measurements of key genetic and biochemical molecules that determine the cell’s functions. However, although high-throughput datasets are routinely analyzed bioinformatically to reveal correlations between measurements, identifying and characterizing molecular mechanisms requires a different mathematical modeling framework, which may better contribute to our comprehension of human health and disease and ultimately have superior predictive power. To develop a mathematical modeling framework for gene expression studies, we focused on the transcription-level genomics of the human macrophage and on understanding immune cell signaling in response to pathogens. A pipeline was developed that determines differential gene expression, clusters genes, assigns logical models to each cluster, and statistically validates them. First, differentially expressed genes are selected based on a three-fold induction cutoff. Then the optimal number of clusters is chosen by the Bayesian Information Criterion, and the genes are clustered into groups according to their expression patterns. Using previously published knowledge, related models are derived based on transcription factors (TFs), stimuli, and degradation half-life; additionally, logical AND/OR gates in the models account for the modes of multi-TF binding. Finally, models are hierarchically organized with the means of the data clusters in a process called model assignment. The quality of the models’ fits is then assessed with Spearman statistics and HOMER motif analysis to validate the results. The study identified models that fit up to 90% of the genes in certain clusters. Such findings are important because they provide accurate mathematical predictors of the signaling systems of the human macrophage.
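The first two pipeline steps in the abstract above (a fold-induction cutoff, then choosing the number of clusters by the Bayesian Information Criterion) can be illustrated with a minimal sketch. The function names, the toy gene labels, and the idea of passing in pre-computed log-likelihoods are all assumptions made for illustration; the abstract does not describe the authors' actual implementation or clustering model.

```python
from math import log


def induced_genes(expression, fold_cutoff=3.0):
    """Keep genes whose stimulated/basal expression ratio meets the cutoff.

    `expression` maps a gene label to a (basal, stimulated) pair.
    This layout is a hypothetical simplification of real expression data.
    """
    return sorted(
        gene
        for gene, (basal, stimulated) in expression.items()
        if basal > 0 and stimulated / basal >= fold_cutoff
    )


def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_parameters * log(n_observations) - 2.0 * log_likelihood


def best_cluster_count(fits, n_observations):
    """Pick the cluster count with the lowest BIC.

    `fits` is an iterable of (n_clusters, log_likelihood, n_parameters)
    tuples, assumed to come from whatever clustering model was fitted.
    """
    return min(fits, key=lambda fit: bic(fit[1], fit[2], n_observations))[0]


# Toy usage with made-up numbers.
toy_expression = {"geneA": (10.0, 45.0), "geneB": (10.0, 20.0), "geneC": (5.0, 15.0)}
selected = induced_genes(toy_expression)
toy_fits = [(1, -500.0, 2), (2, -450.0, 5), (3, -448.0, 8)]
chosen_k = best_cluster_count(toy_fits, n_observations=100)
```

The BIC penalizes each extra parameter by ln(n), so a model with more clusters is preferred only when its likelihood gain outweighs that penalty.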

ALEC M. CHIU1,2, Mithun Mitra1,3,4, Hilary A. Coller1,3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Chemistry and Biochemistry, UCLA
3 Department of Molecular, Cell, and Developmental Biology, UCLA
4 Department of Biological Chemistry, David Geffen School of Medicine

Triple negative breast cancer (TNBC) is a highly aggressive subtype of breast cancer with poor prognosis and a high chance of recurrence. It is clinically defined by the absence of three common target receptors: estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 (HER2)/neu. This makes TNBC unresponsive to standard effective breast cancer treatments targeting these receptors. To design specific therapies for TNBC, a detailed knowledge of the molecular events driving TNBC is required. In this study, we compared differential gene and isoform expression between matched normal and tumor tissues from TNBC patients using RNA-seq data obtained from The Cancer Genome Atlas (TCGA). We discovered thousands of differentially expressed genes (adjusted p-value < 0.05) from raw count-based (DESeq and DESeq2) and FPKM-based analyses (Cufflinks and Ballgown). We found that genes upregulated in tumors were significantly enriched for genes involved in DNA repair and microtubule dynamics. We also explored alternative isoform usage to discover thousands of differentially used exons (DEXSeq analysis), thousands of alternative splicing events (rMATS), and a dozen alternative polyadenylation events (DaPars). We discovered that alternative isoform usage was enriched in genes involved in translation and mRNA processing in TNBC. Furthermore, differential miRNA expression analysis revealed upregulation of miRNAs targeting genes involved in transcription. Overall, our transcriptomic analyses unveil a plethora of insights into critical genes and alternative isoform events in TNBC.

Exome Sequencing implicates coding mutations in familial hypobetalipoproteinemia

NELSON CHOU1, 2, MARISSA E. DI1, 2, Carlos Aguilar-Salinas3, Marcus Alvarez2, Päivi Pajukanta2, 4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Human Genetics, David Geffen School of Medicine at UCLA
3 Instituto Nacional de Ciencias Médicas y Nutrición, Salvador Zubiran, Mexico City, Mexico
4 Molecular Biology Institute at UCLA, Los Angeles, California, USA

Hypobetalipoproteinemia is a relatively rare monogenic disorder that causes lowered levels of cholesterol. This disorder impairs apolipoprotein B (apoB) metabolism, resulting in low total cholesterol (TC) and low-density lipoprotein cholesterol (LDL-C) levels, which is expected to reduce the risk for atherosclerosis. Furthermore, Amerindian populations have not been well studied thus far for lipid disorders, yet they are particularly susceptible to cardiometabolic diseases. Our goal was to find mutations affecting the coding sequence that cause familial hypobetalipoproteinemia (FHBL). To identify genes and variants implicated in FHBL, we exome sequenced two Mexican families with six affected individuals and six controls. Samples were sequenced on the Illumina HiSeq 2000 and aligned to the hg19 reference genome using bowtie2. We called variants using the GATK best practices guidelines. To prioritize variants and genes, we further filtered the call set by excluding common variants, i.e., those with a minor allele frequency greater than 5% in Latino populations. We focused on mutations that change the amino acid sequence, such as nonsynonymous, nonsense, and splice site mutations. We restricted our analysis to variants with at least one copy of the mutation in the cases, but not in the controls. At the end of the filtering process, we will identify variants and genes linked to hypobetalipoproteinemia. We will also compare our results with previously reported GWAS findings associated with TC/LDL-C and lipids. The identification of causal genes can lead to the development of preventative interventions, diagnostic tools, and targeted gene therapeutic strategies for lipid disorders.
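The three prioritization filters described above (allele frequency, coding consequence, and presence in cases but not controls) can be sketched as a simple predicate chain. The `Variant` record, its field names, and the consequence labels are hypothetical; a real analysis would read these values from an annotated VCF. The sketch also assumes that "common" means a minor allele frequency of 5% or more.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical, simplified variant record (not a real VCF schema)."""
    gene: str
    consequence: str       # e.g. "nonsynonymous", "nonsense", "splice_site"
    maf: float             # minor allele frequency in the reference population
    case_carriers: int     # affected individuals with at least one copy
    control_carriers: int  # unaffected individuals with at least one copy


# Assumed labels for variants that alter the protein product.
CODING_CONSEQUENCES = {"nonsynonymous", "nonsense", "splice_site"}


def prioritize(variants, max_maf=0.05):
    """Return variants that pass all three filters, in input order."""
    kept = []
    for v in variants:
        if v.maf >= max_maf:
            continue  # drop common variants
        if v.consequence not in CODING_CONSEQUENCES:
            continue  # drop variants that do not change the protein
        if v.case_carriers < 1 or v.control_carriers > 0:
            continue  # require presence in cases and absence in controls
        kept.append(v)
    return kept


# Toy usage with made-up gene labels and counts.
toy_variants = [
    Variant("GENE_A", "nonsense", 0.001, 3, 0),
    Variant("GENE_B", "nonsynonymous", 0.20, 4, 0),
    Variant("GENE_C", "synonymous", 0.001, 2, 0),
    Variant("GENE_D", "splice_site", 0.010, 2, 1),
]
candidates = [v.gene for v in prioritize(toy_variants)]
```

Keeping each filter as a separate early exit makes it easy to report how many variants each step removes.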

SCOTT C. DE TABOADA1, 2, ASHISH GAULI1, 3, Adriana Sperlea2, 5, Jennifer Zhou2, 4 and Jason Ernst2, 4, 5

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Jason Ernst Lab, Department of Biological Chemistry, UCLA
3 Department of Computer Science and Mathematics, Fisk University
4 Computer Science Department, UCLA
5 UCLA Bioinformatics Program

Transcription factors (TFs) are regulatory proteins that bind to specific DNA sequences (motifs), thereby affecting the rate at which genetic information is transcribed from DNA to messenger RNA. TF binding sites differ across cell types and experimental conditions. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a commonly used method to obtain the genome-wide binding profile of a TF in a specific cell type under specific conditions. However, profiling the binding landscape of every TF is infeasible due to constraints on materials and cost, which makes accurate computational prediction of in vivo TF binding sites desirable. It proves difficult to predict TF binding site locations across cell types, so for the current scope of our project we focused on predictions within a single cell type. Using k-mer counts as features, representing the number of times each sequence of length k appears in the given ChIP-seq data, we trained a support vector machine (SVM) classifier and obtained an area under the ROC curve (AUC) of 0.84. We then incorporated a binary feature from chromatin accessibility (DNase-seq) data, representing whether a bound sequence overlaps a region of open chromatin, and after retraining the AUC increased to 0.90. Moving forward, we hope to include both in vitro DNA shape information and RNA-seq data to train our classifier and further improve prediction accuracy. We also hope to extend our method across cell types to predict transcription factor binding sites in cells for which there is currently no experimental data.
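The feature construction described above can be sketched as follows: a fixed-length vector of k-mer counts over the DNA alphabet, optionally extended with one binary open-chromatin flag. The function names and the choice to ignore k-mers containing ambiguous bases are assumptions for illustration; the SVM training step itself is not shown.

```python
from collections import Counter
from itertools import product


def kmer_features(sequence, k=3):
    """Count every k-mer over A/C/G/T and return the counts in a fixed order.

    The vector has 4**k entries ordered lexicographically (AAA, AAC, ...).
    K-mers containing other characters (such as N) are ignored.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts["".join(kmer)] for kmer in product("ACGT", repeat=k)]


def with_accessibility(features, overlaps_open_chromatin):
    """Append a binary feature: 1 if the sequence overlaps open chromatin."""
    return features + [1 if overlaps_open_chromatin else 0]


# Toy usage: a 2-mer vector for a short made-up sequence.
vector = with_accessibility(kmer_features("ACGTAC", k=2), True)
```

Because the vector order is fixed, every sequence maps to the same feature positions, which is what a classifier such as an SVM requires.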

NIHAL A. ELTOM1, FAVOUR N. ESEDEBE2, Catharine E. Krebs3, Roel A. Ophoff4

1 Psychology and Biology, Department of Psychology, University of Colorado, Denver
2 Biochemistry and Molecular Biology, Department of Life and Physical Sciences, Fisk University
3 Department of Human Genetics, University of California, Los Angeles
4 Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University of California, Los Angeles

Genomic sequencing technologies have contributed to a better understanding of the genomic landscape and how it relates to heritable diseases. Specifically, RNA sequencing (RNA-seq) reveals gene expression by sequencing and quantifying RNA from cells. The Assay for Transposase-Accessible Chromatin with sequencing (ATAC-seq) similarly reveals the open chromatin sites of the genome. The broader scope of this project is to determine how genetic variation impacts open chromatin and gene expression in the context of disease using RNA-seq and ATAC-seq on fibroblast cell lines from cases and controls. The typical RNA-seq procedure for fibroblasts is to lyse cells directly as they grow adherent to culture plates. The typical ATAC-seq protocol for fibroblasts involves dissociating adherent cells from culture plates using trypsin, creating a single cell suspension. Because trypsin induces cellular stress and alters gene expression,1 we sought to determine whether there are major transcriptomic differences between RNA collected through the typical adherent RNA-seq protocol and a more ATAC-seq-like suspension protocol. In this pilot study, we collected RNA-seq data from fibroblast cells of subjects with bipolar disorder using the adherent and suspension methods. We ran basic quality control on the FASTQ files with FastQC, aligned the reads to a human genome reference (hg19) using TopHat, quantified transcripts using HTSeq, and performed principal component and differential expression analyses using DESeq2. Although there are 1161 differentially expressed genes between conditions, the PCA reveals greater variation between cell lines than between conditions. We therefore suggest that the typical RNA-seq procedure be used in the expanded project.

JESSE A. GARCIA¹, Tanya N. Phung², Clare Marsden¹, Kirk E. Lohmueller¹²³

1 Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA 90095, USA
2 Interdepartmental Program in Bioinformatics, University of California, Los Angeles, CA 90095, USA
3 Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA 90095, USA

Eastern gorillas comprise two subspecies: the mountain gorilla (Gorilla beringei beringei) and the eastern lowland gorilla (Gorilla beringei graueri). The population size of the mountain gorilla is estimated to be approximately 880 individuals, while that of the eastern lowland gorilla is estimated to be 4,000 individuals (Guschanski et al. 2009). High levels of inbreeding are likely in such small populations, which is predicted to result in long regions of homozygosity and a reduction of fitness due to the increased incidence of deleterious variants in homozygous form (inbreeding depression). However, Xue et al. (2015) suggested that the long-term impact of this inbreeding in eastern gorillas has led to the purging of deleterious recessive mutations in runs of homozygosity (ROHs). To examine this, we compute the positions of homozygous runs in western lowland gorillas (population size 100,000), eastern lowland gorillas, and mountain gorillas. Then we evaluate whether there are more deleterious variants in non-ROHs than in ROHs. Because recessive loss of function (LoF) variants would be expressed and selected against in ROHs but not in non-ROHs, we expect a significant difference between their distributions in ROHs and non-ROHs in populations where recessive LoF variants are established. In contrast, no difference in the distribution of LoF variants between ROHs and non-ROHs would suggest that recessive LoF variants no longer exist at a significant level in a population. Although not an automatic indicator of a loss of all deleterious recessive mutations in a gene pool, this is one potential signature of genetic purging driven by inbreeding. Initial examination of chromosome 1 detected this signature of purging in mountain gorillas, as found in Xue et al. (2015), but not in eastern lowland gorillas or western lowland gorillas.

Guschanski, Katerina et al. “Counting Elusive Animals: Comparing Field and Genetic Census of the Entire Mountain Gorilla Population of Bwindi Impenetrable National Park, Uganda.” Biological Conservation 142.2 (2009): 290–300. Web.

Xue, Yali et al. “Mountain Gorilla Genomes Reveal the Impact of Long-Term Population Decline and Inbreeding.” Science 348.6231 (2015): 242–245. Web.
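The ROH comparison described in the gorilla abstract above can be illustrated with a deliberately simplified sketch: find runs of consecutive homozygous sites, then compare the per-site rate of loss-of-function variants inside and outside those runs. Real ROH callers work with physical distances and tolerate occasional heterozygous calls; the site-count threshold, function names, and toy data here are assumptions, not the authors' method.

```python
def find_rohs(genotypes, min_sites=5):
    """Return (start, end) index pairs (end exclusive) of homozygous runs.

    `genotypes` is a list of (allele1, allele2) pairs ordered along a
    chromosome. A run must span at least `min_sites` consecutive sites.
    """
    runs = []
    start = None
    for i, (a, b) in enumerate(genotypes):
        if a == b:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_sites:
                runs.append((start, i))
            start = None
    if start is not None and len(genotypes) - start >= min_sites:
        runs.append((start, len(genotypes)))
    return runs


def lof_rates(n_sites, lof_sites, runs):
    """Per-site LoF rate inside ROHs and outside ROHs, as a pair."""
    in_roh = set()
    for start, end in runs:
        in_roh.update(range(start, end))
    n_in = len(in_roh)
    n_out = n_sites - n_in
    lof_in = sum(1 for i in lof_sites if i in in_roh)
    lof_out = len(lof_sites) - lof_in
    rate_in = lof_in / n_in if n_in else 0.0
    rate_out = lof_out / n_out if n_out else 0.0
    return rate_in, rate_out


# Toy usage with made-up genotypes: one long homozygous run, one short one.
toy = [(0, 0)] * 6 + [(0, 1)] + [(1, 1)] * 3 + [(0, 1)]
toy_runs = find_rohs(toy, min_sites=5)
toy_rates = lof_rates(len(toy), lof_sites=[2, 6, 8], runs=toy_runs)
```

A markedly lower LoF rate inside ROHs than outside would be in line with the purging signature the abstract describes, though a formal test would be needed to call the difference significant.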

DEJA GOODSON1,2*, DANIEL JOHNSON1,3*, NATHAN SPEAR1,4*, Stephen Tran3, Ashley Cass3, Bill Lowry7,9,10, Xinshu Xiao3,5-8

1 Bruins in Genomics, University of California, Los Angeles
2 Department of Biology, Florida A&M University
3 Bioinformatics Interdepartmental Program, University of California, Los Angeles
4 Department of Chemistry, Math and Physics, Geneva College
5 Department of Integrative Biology and Physiology, University of California, Los Angeles
6 Department of Bioengineering, University of California, Los Angeles
7 Johnson Comprehensive Cancer Center, University of California, Los Angeles
8 Molecular Biology Institute, University of California, Los Angeles
9 Molecular, Cell, and Developmental Biology, University of California, Los Angeles
10 Eli & Edythe Broad Center of Regenerative Medicine & Stem Cell Research, University of California, Los Angeles

RNA editing is a post-transcriptional mechanism that modifies RNA nucleotides. It occurs thousands of times in a single cell, and abnormal RNA editing has been linked to diseases such as cancer, heart disease, and diabetes. In this project, we consider the relationship between RNA editing and Rett Syndrome, a disease that affects 1 in 10,000 live female births. Mutations of a gene on the X chromosome, MECP2, are known to be the cause of the disease. Individuals with Rett Syndrome may exhibit sensory issues, spasticity, inability to express emotion, or speech loss. From two Rett Syndrome patients, induced pluripotent stem cells (iPSCs) were generated and differentiated into neural progenitor cells (NPCs) and finally into neurons. Cells carrying either mutant or wild-type MECP2 were collected from the same patients. At each of the three time points, RNA-seq was conducted and RNA editing sites were detected, showing, as expected, a high percentage of A-to-I editing. A hyper-editing pipeline was used to ensure high sensitivity. At each editing site, we used Fisher’s exact test to determine differential editing between the control and mutant samples. A total of 291 significant sites common to the patients were further analyzed. Consistent with findings from the literature, editing sites were observed most frequently in intronic regions, both overall and among the differentially edited sites. Gene Ontology analysis of differentially edited genes revealed 26 genes associated with diseases including mental retardation, ocular anomalies, and leukemia. Our results suggest a possible involvement of RNA editing abnormalities in Rett Syndrome.
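The per-site test mentioned above can be sketched with a small, dependency-free implementation of the two-sided Fisher's exact test on a 2x2 table. The table layout (rows are control and mutant samples, columns are edited and unedited read counts) and the toy counts are assumptions for illustration; the abstract does not specify how the counts were arranged.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Probability of seeing x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    p_value = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)  # tolerance for floating-point ties
    )
    return min(1.0, p_value)


# Toy usage: edited vs. unedited reads at one site (made-up counts).
control_edited, control_unedited = 8, 2
mutant_edited, mutant_unedited = 1, 5
p = fisher_exact_two_sided(control_edited, control_unedited,
                           mutant_edited, mutant_unedited)
```

When many sites are tested, as in a transcriptome-wide scan, the resulting p-values would normally also be corrected for multiple testing.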

TIMOTHY G. ISONIO1, Sriram Sankararaman2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Computer Science,
3 Department of Human Genetics, UCLA

Previous studies have shown that much information about the participants of a study can be inferred from genome-wide association study (GWAS) summary statistics. Given an individual’s genotype and summary statistics, it is possible to infer their participation in a study and to predict their phenotype. Here we examine how accurately phenotypes can be predicted from an individual’s genotype data and GWAS summary statistics in two cases: when it is known whether the individual participated in the study, and when it is not. We show that the inclusion of an individual in a study allows more accurate predictions to be made, using both simulated data and data from 1000 Genomes. We also compare the difference in phenotype prediction accuracy between the two models as a function of the number of SNPs n, the number of individuals m, the heritability of the trait h2, and effect sizes. Our results have implications for the privacy of individual attributes inferred from genomic data.
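One common way to predict a phenotype from a genotype and GWAS summary statistics is a linear score: the sum, over SNPs, of the allele count times the estimated effect size. The sketch below shows only that generic predictor; the abstract does not state which predictor the authors used, and the function name and toy numbers are assumptions.

```python
def linear_phenotype_score(genotype, effect_sizes):
    """Sum of allele count times per-SNP effect size.

    `genotype` holds allele counts (0, 1, or 2) for each SNP, and
    `effect_sizes` holds the matching per-allele effect estimates taken
    from summary statistics.
    """
    if len(genotype) != len(effect_sizes):
        raise ValueError("genotype and effect_sizes must have the same length")
    return sum(g * beta for g, beta in zip(genotype, effect_sizes))


# Toy usage with made-up values for three SNPs.
score = linear_phenotype_score([2, 1, 0], [0.5, -0.2, 0.1])
```

If an individual was part of the study, the effect estimates were partly fitted to that individual's own data, which is one intuition for why prediction can be more accurate for participants.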

SAMANTHA L. JENSEN1, Yun Qi Jiang2, Margaret Distler2, Eleazar Eskin3, and Jonathan Flint4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Psychiatry and Biobehavioral Sciences, UCLA David Geffen School of Medicine
3 Departments of Computer Science and Human Genetics, UCLA
4 Center for Neurobehavioral Genetics, UCLA Semel Institute for Neuroscience and Human Behavior

The pathology of psychiatric disease has long eluded researchers because of the complexity and heterogeneity of its expression. In a novel study last year, Dr. Jonathan Flint collected full genome data from 13,000 women who were determined to be clinically depressed. By using low-coverage sequencing, Dr. Flint was able to collect many more samples than would otherwise be possible, increasing his power to detect significant genetic influences on the disease. This larger sample size enabled the discovery of two SNPs significantly implicated in depression. While that kind of single nucleotide mutation can have extensive impact on phenotypic expression, larger structural variants have the potential to reveal unstudied psychiatric pathologies. As a continuation of Dr. Flint’s research, we undertook a full comparative study of available structural variant discovery tools. Using eight Mus musculus genomes with quantified and verified known structural variants, we determined the accuracy and predictive ability of fifteen of the most used programs at different levels of coverage. The tools that proved to be most accurate and user-friendly were then used to detect the presence of a common inversion on chromosome 17 in Dr. Flint’s dataset, demonstrating their ability to provide useful predictions even in such low-coverage data.

BRANDON S JEW1, Jae Hoon Sul2

1 Department of Chemistry and Biochemistry, University of California, Los Angeles
2 Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles

Identification of genetic variation from sequencing data requires the use of a multitude of bioinformatics tools that perform alignment, deduplication, local realignment, covariate counting, quality score recalibration, genotyping and variant calling. Each step can take a significant amount of computational time and is susceptible to errors due to hardware and system failures that can halt the entire pipeline. This complex implementation of the pipeline introduces issues with scalability and reliability for analyzing large-scale genomic data. Churchill [Kelly et al. Genome Biology (2015) 16:6] is an efficient and scalable pipeline for variant calling, which streamlines this entire calling process by using regional parallelization for analysis. However, this software runs into optimization issues when working with shared resources in high-performance computing (HPC) clusters. One main issue is the generation of many redundant files, which wastes disk storage, delays the job submission process, and prevents simultaneous variant calling on multiple individuals. This work improves the automation of the data analysis pipeline by scheduling these tasks efficiently without these drawbacks, significantly enhancing the efficiency of variant calling for large sample sizes. It also implements conversion software that converts aligned reads to a raw read format suitable for analysis with Churchill using significantly less memory than other tools, improving usability and reliability in HPC environments. This optimized method will allow efficient and accurate discovery of genetic variation in the rapidly growing pool of sequencing data generated for clinical studies. Future plans include the implementation of a checkpoint system to prevent disruption of analysis when errors are encountered.
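The general idea of regional parallelization mentioned above can be sketched as splitting each chromosome into fixed-size regions and handing the regions to a pool of workers. This is only a generic illustration: the region size, the placeholder worker function, and the toy chromosome lengths are assumptions, and the sketch does not reproduce how Churchill itself defines regions or handles reads that span region boundaries.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_regions(chrom_lengths, region_size):
    """Split each chromosome into consecutive (chrom, start, end) regions.

    Coordinates are 0-based with an exclusive end.
    """
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, region_size):
            regions.append((chrom, start, min(start + region_size, length)))
    return regions


def process_region(region):
    """Placeholder for per-region work (hypothetical, does no real calling)."""
    chrom, start, end = region
    return f"{chrom}:{start}-{end}", end - start


def run_in_parallel(chrom_lengths, region_size, workers=4):
    """Process all regions concurrently and return results in region order."""
    regions = split_into_regions(chrom_lengths, region_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_region, regions))


# Toy usage with made-up chromosome names and lengths.
results = run_in_parallel({"chrA": 250, "chrB": 100}, region_size=100)
```

Returning results in memory rather than writing one file per region is one way to avoid the redundant intermediate files the abstract identifies as a problem on shared clusters.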

BRENDA JI1, Kimberly Insigne2, and Sriram Kosuri3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Bioinformatics IDP Program, UCLA
3 Chemistry and Biochemistry Department, UCLA

3’ untranslated regions (UTRs) are important non-coding sequences of mRNA that can alter gene expression. As a result of alternative cleavage and polyadenylation, a single gene can have multiple, unique 3’ UTRs that can change the outcome of the same protein. In addition to playing a major role in gene expression, 3’ UTR isoforms can also control cell proliferation, development, and differentiation and even contribute to genetic diseases in humans. While there have been advances in understanding the power of 3’ UTRs, the exact effects of different regulatory elements and motifs in these regions are largely unknown. To gain a comprehensive understanding of the effect of these isoforms, we designed a 3’ UTR library that tiles the entire human genome and takes average conservation scores and UTR lengths into consideration. We also identified the locations of polyadenylation signal sequences, polyadenylation sites, and miRNA binding site motifs based on putative sequences and public databases, in anticipation of predicting the effects of these artificially designed 3’ UTR sequences. We can then test the function of hundreds of thousands of our sequences simultaneously in a single experiment using a massively parallel reporter assay. This high-throughput approach will allow us to identify novel regulatory elements, which can give more insight into the profound effects of genetic variation in the human transcriptome.
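Two of the design steps described above, locating polyadenylation signal motifs and cutting a UTR into tiles, can be sketched in a few lines. The sketch searches for the canonical signal hexamer in its DNA form (AATAAA) and tiles with a fixed step; the tile length, the step, the end-anchored final tile, and the toy sequence are assumptions, since the abstract does not give the library's actual design parameters.

```python
def find_motif(sequence, motif="AATAAA"):
    """Return 0-based start positions of every (possibly overlapping) match."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def tile(sequence, tile_length, step):
    """Cut a sequence into fixed-length tiles taken every `step` bases.

    If the regular tiles do not reach the end of the sequence, one extra
    tile anchored at the end is added so every base is covered.
    """
    if len(sequence) <= tile_length:
        return [sequence]
    last_start = len(sequence) - tile_length
    tiles = [sequence[i:i + tile_length] for i in range(0, last_start + 1, step)]
    if last_start % step != 0:
        tiles.append(sequence[-tile_length:])
    return tiles


# Toy usage on a made-up sequence.
signal_positions = find_motif("GGAATAAACCAATAAAT")
example_tiles = tile("ABCDEFGHIJK", tile_length=4, step=3)
```

Choosing a step smaller than the tile length makes neighboring tiles overlap, so a motif that straddles one tile boundary can still appear intact in an adjacent tile.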

KATHLEEN MUENZEN1, REBECCA OYETORO2, Baochen Shi3, and Huiying Li3

1 Department of Biology, Scripps College
2 Department of Biology, Florida Agricultural & Mechanical University
3 Crump Institute for Molecular Imaging, University of California, Los Angeles

The development of Next-Generation Sequencing (NGS) technologies over the past decade has enabled comprehensive analysis of microbial communities. Paired-end sequences cover a larger region of the 16S ribosomal RNA (rRNA) gene and provide higher phylogenetic resolution than single-end sequences, but at the expense of higher cost and more computing effort. In this study, we investigated whether the microbiome composition determined by single-end reads covering the hypervariable V1-V2 regions or the V3 region of the 16S rRNA gene is similar to that determined by paired-end sequences covering the V1-V3 regions. Using the paired-end 16S rRNA V1-V3 sequence data generated from 45 skin microbiome samples, we compared the taxonomic compositions of the human skin microbiome determined from the V1-V2 regions only, the V3 region only, and the V1-V3 regions of the 16S rRNA gene using the QIIME pipeline. The microbiome composition determined by the reads covering the V3 region showed a higher similarity to that determined by the V1-V3 regions (Pearson correlation coefficient r = 0.998 for V3 vs. V1-V3; r = 0.934 for V1-V2 vs. V1-V3). Sixteen prevalent genera were identified in all three analyses. Propionibacterium acnes and Staphylococcus aureus were significantly underrepresented by the V1-V2 regions compared to the microbiome determined by the V1-V3 regions. The microbiome composition determined by the V3 region was used to perform an in-depth comparative analysis of the human skin microbiome samples. This study demonstrates that the V3 hypervariable region of the 16S rRNA gene can determine a microbiome composition similar to that generated by paired-end reads covering the V1-V3 regions.
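The similarity measure reported above is the Pearson correlation coefficient between two composition profiles. A minimal sketch of that calculation follows; the relative-abundance vectors are made-up illustrative values, not data from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(var_x * var_y)


# Toy usage: relative abundances of the same taxa under two analyses
# (hypothetical numbers for illustration only).
profile_full_region = [0.60, 0.25, 0.10, 0.05]
profile_single_region = [0.58, 0.27, 0.09, 0.06]
r = pearson_r(profile_full_region, profile_single_region)
```

Because relative abundances are dominated by a few abundant taxa, a high r mainly shows agreement on those taxa; rare taxa can still differ between analyses.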


1 Department of Computer Science, University of California, Los Angeles
2 Department of Psychiatry and Biobehavioral Sciences

Short tandem repeats (STRs), or microsatellites, are highly variable repeated regions in the human genome capable of serving as unique genetic profiles for each individual. STRs are often used in DNA fingerprinting and linkage analysis and are known to be involved in fragile X syndrome and Huntington’s disease. In this project, we aim to identify STRs on a genome-wide scale from high-coverage (30x) whole genome sequencing (WGS) data of 450 individuals from large extended pedigrees in Colombia and Costa Rica. We used lobSTR software (Gymrek et al., Genome Research, 2012), which is designed to detect STRs from WGS data and identifies STRs with high accuracy. As a pilot project, using the lobSTR software, we called the STRs in 40 individuals from one family that has STR data from DNA electrophoresis, a widely accepted standard for STR allelotyping. We measured the accuracy of STR calls from lobSTR by comparing those calls with STR calls from DNA electrophoresis. After filtering out the STR loci with low-quality calls from lobSTR, we found that the overall concordance between the two sets of STR calls was very high (94%). In addition to measuring the concordance, we used PedCheck software to identify Mendelian inconsistencies in the STR calls within the family. The pipeline takes FASTQ files, or BAM files aligned with BWA-MEM, as input and produces a VCF file with STR data. A future plan is to call STRs in all 450 individuals and use the STR data for linkage analysis to identify the genetic basis of bipolar disorder using the large family data.
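The two quality checks described above, concordance between two call sets and Mendelian consistency within a family, can be sketched as follows. The data layout (alleles as repeat-count pairs keyed by sample and locus) and all names are assumptions for illustration; this is not how lobSTR or PedCheck represent their data.

```python
def concordance(calls_a, calls_b):
    """Fraction of shared (sample, locus) keys with identical genotypes.

    Each call is a pair of alleles; allele order within a pair is ignored.
    Returns 0.0 when the two call sets share no keys.
    """
    shared = [key for key in calls_a if key in calls_b]
    if not shared:
        return 0.0
    matches = sum(
        1 for key in shared if sorted(calls_a[key]) == sorted(calls_b[key])
    )
    return matches / len(shared)


def is_mendelian_consistent(child, father, mother):
    """True if the child could have one allele from each parent."""
    first, second = child
    return (first in father and second in mother) or (
        second in father and first in mother
    )


# Toy usage with made-up repeat counts.
sequencing_calls = {("s1", "L1"): (10, 12), ("s1", "L2"): (7, 7), ("s2", "L1"): (9, 11)}
reference_calls = {("s1", "L1"): (12, 10), ("s1", "L2"): (7, 8), ("s2", "L1"): (9, 11)}
agreement = concordance(sequencing_calls, reference_calls)
trio_ok = is_mendelian_consistent(child=(10, 12), father=(10, 11), mother=(12, 13))
```

A Mendelian inconsistency does not say which family member's call is wrong, so in practice such loci are usually flagged or filtered rather than corrected.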

TERISHA L. PAUL1, Kathryn Burch2 and Bogdan Pasaniuc2,3,4

1 QCBio BIG Summer Program, University of California, Los Angeles
2 Bioinformatics Interdepartmental Program, University of California, Los Angeles
3 Department of Pathology and Laboratory Medicine, University of California, Los Angeles
4 Department of Human Genetics, University of California, Los Angeles

Genome-wide association studies (GWAS) identify genetic variants associated with a given phenotype without taking the biological functions of those variants into account. Recent large-scale efforts by consortia such as ENCODE use a variety of biochemical assays and methods to identify and annotate functional elements in the genome. The goal of our project is to quantify the relevance of over 4,000 binary functional annotations to the interpretation of GWAS summary statistics for 30 complex traits, and to combine these two independent sources of information to produce disease-specific scores of pathogenicity for each variant. For a given trait and annotation, we construct a statistical model that combines GWAS results and annotations and quantify the significance of the annotation to the trait by performing a likelihood ratio test. We hypothesize that for a given trait, the functional annotations measured in tissues that are relevant to that trait will be most significant. Disease-specific scores of pathogenicity constructed from both GWAS and functional annotations could potentially aid in the discovery of treatments and therapeutic targets.
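A likelihood ratio test of the kind mentioned above compares the fit of a model without the annotation to a model with it. The sketch below assumes the two models are nested and differ by exactly one parameter, so the test statistic is referred to a chi-square distribution with one degree of freedom. That assumption, the function name, and the example log-likelihoods are illustrative and are not stated in the abstract.

```python
import math


def lrt_pvalue_1df(loglik_null, loglik_alt):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns (statistic, p_value), where the statistic is
    2 * (loglik_alt - loglik_null) and the p-value is the upper tail of a
    chi-square distribution with 1 degree of freedom.
    """
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    # For 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical maximized log-likelihoods (NOT values from the study).
stat, p_value = lrt_pvalue_1df(loglik_null=-1052.3, loglik_alt=-1047.9)
print(stat, p_value)
```

With several thousand annotations tested per trait, the resulting p-values would also need a multiple-testing correction, which this sketch leaves out.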

ANASTASIA REPOULIOU1,2, Jeff P. Rasmussen3, and Alvaro Sagasti3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Molecular Biology, Princeton University
3 Department of Molecular, Cell, and Developmental Biology, UCLA

Axon injury induces a process known as Wallerian degeneration that degrades severed axons. The cellular debris created by Wallerian degeneration must be cleared by phagocytes to allow efficient axonal regeneration. The mechanisms used by phagocytes to recognize, degrade, and repair degenerating axons are not fully understood. Larval zebrafish offer a simple vertebrate system to examine the transcriptional response to axon damage. Skin cells are the major phagocytes for axon debris in larval zebrafish skin. We analyzed the transcriptomes of skin cells upon widespread axonal degeneration in the skin. We induced axon damage by expressing the bacterial enzyme NfsB, which converts metronidazole (MTZ) into a toxic metabolite, in neurons that innervate the skin, and bathing the fish in MTZ. Following RNA-seq on FACS-purified skin cells, we detected transcripts representing over 20,000 genes. However, only a tiny fraction (<0.5% of genes) were differentially expressed in response to axonal degeneration. Of these differentially expressed genes, 24% also responded in control treatments, leaving a set of 32 skin genes that specifically responded to axonal degeneration. Some of the most meaningfully enriched GO terms were those involving glial cell proliferation and inflammatory response regulation. Furthermore, several genes were known to respond to axon or tissue damage in paradigms like central nervous system or heart injury, indicating that the skin might share a tissue repair transcriptional program with these cell types. This study serves as a springboard to examine the role of the identified skin cell responses in recognition, phagocytosis, and disposal of axon debris.

Studying the Microbiome by Analyzing the Coverage of Sequencing Reads Mapped to Viruses, Eukaryotes, and Bacteria

JEREMY ROTMAN1, David Koslicki2, Nicholas Wu3, Serghei Mangul1

1 UCLA, Computer Science, Los Angeles, CA,
2 OSU, Mathematics, Corvallis, OR,
3 TSRI, Department of Integrative Structural and Computational Biology, La Jolla, CA

Reads obtained from genomic sequencing can give us insight into more than just the host individual. In particular, the unmapped reads of RNA-Seq can tell us more about other organisms in the sample, such as viruses, eukaryotes, and bacteria. If we can determine which genomes are present in the microbiome, then we can also discover infections in the sample. Using MegaBLAST, we mapped reads to the various parts of the microbiome and then created plots to show how each genome is covered by its matching reads. After manually inspecting these coverage plots, we devised ways to automate the acceptance of plots based on criteria such as the length of consistent coverage. In addition, we looked at the distribution of the indices at which each read began. This helped us determine which reads are artifacts and which reads provide evidence that the mapped genome exists within the sample. We then applied our tool to cancer skin samples as well as indigenous African samples. The automated system effectively filtered the results so that we were not overwhelmed by extraneous hits. Additionally, our tool correctly identified a common retrovirus across all of the cancer skin samples and detected the presence of a few other viruses.
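One acceptance criterion named above is the length of consistent coverage. The sketch below illustrates a possible form of that criterion: it computes the longest uninterrupted stretch of a genome covered by at least one read. The interval convention (0-based, half-open), the function name, and the threshold are assumptions made for illustration and do not describe the authors' tool.

```python
def longest_covered_run(genome_length, alignments):
    """Length of the longest contiguous stretch covered by at least one read.

    `alignments` is an iterable of (start, end) pairs, 0-based and half-open.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (genome_length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, genome_length)
        if start < end:
            delta[start] += 1
            delta[end] -= 1

    longest = current = depth = 0
    for position in range(genome_length):
        depth += delta[position]
        current = current + 1 if depth > 0 else 0
        longest = max(longest, current)
    return longest


# Hypothetical read placements on a 100 bp genome (NOT data from the study).
reads = [(0, 10), (5, 20), (50, 60)]
MIN_RUN = 15  # hypothetical acceptance threshold
run = longest_covered_run(100, reads)
print(run, run >= MIN_RUN)
```

The distribution of read start positions mentioned in the abstract would be a separate check, for example flagging genomes where most reads begin at the same index.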

ANNA MARIE ROWELL1, Calvin Leung2, Stephen Douglass3, Srivats Venkataramanan3, and Tracy Johnson3

1 Department of Chemistry and Biochemistry, University of Oklahoma
2 Molecular Biology Institute, University of California, Los Angeles
3 Molecular, Cell, and Developmental Biology, University of California, Los Angeles

During RNA processing in eukaryotes, splicing, which involves the removal of introns and the ligation of exons to produce mature mRNA, is catalyzed by the spliceosome co-transcriptionally. Previous studies indicate that histone methylation both affects transcription rates and recruits various splicing components, strengthening the model that splicing and transcription are coupled. Here we examine the histone modification H3K36me3, placed by the histone H3 methyltransferase Set2, which has previously been associated with splicing, and how this mark may affect splicing in Saccharomyces cerevisiae. Our laboratory has also shown that the spliceosome disassembly factor prp43-1 associates with chromatin in a manner that is dependent on H3K36 trimethylation. H3K36me3 has also been shown to affect transcription in cells, which may lead to changes in splicing indirectly. We hypothesize that the spatial arrangement of methylation marks in H3K36 may perturb splicing of intron-containing genes. Analysis of published ChIP-seq data revealed distinct patterns of H3K36 trimethylation. We next sought to correlate these with changes in splicing efficiency, which was calculated from RNA-seq data from set2∆ and prp43-1 cells. We observe that depletion of methylation correlates with reduced splicing efficiency in intron-containing genes. However, for some lowly expressed genes, set2∆ correlates with decreased splicing efficiency. In fact, these are affected to an even greater extent than deletion of the canonical splicing factor prp43-1. From these data we propose two possible models: (1) changes in the expression of ribosomal protein genes (RPGs), the most abundant class of intron-containing genes, may have an indirect role in splicing, specifically, improved splicing and (2) these marks may have a direct role in recruiting splicing factors to the gene, but understanding these effects may require focused analysis of non-RPGs.
In conclusion, such methodology in analysis is advantageous in correlating multiple sets of high-throughput data.

ALEXANDER SALAS1 and Oliver I. Fregoso*2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Microbiology, Immunology, and Molecular Genetics, UCLA

The first line of defense in pathogenic infections is the innate immune response. Restriction factors (RFs) play a fundamental role in the innate immune response to viruses such as HIV by obstructing targeted steps of the viral lifecycle. Many RFs are induced by interferon (IFN), a key component of innate immunity. It has become increasingly imperative to identify and characterize these restriction factors in the pursuit of better treatments for viral infection. Here we implement an evolutionary analysis for positive selection (PS) of host genes as a novel means to identify uncharacterized restriction factors. Positive selection is the accumulation of amino acid altering mutations at a rate higher than neutral evolution, which is often a result of genetic conflict, referred to as an evolutionary arms race. Positive selection is a potent means to detect host-viral interactions at the protein-protein level, as host and virus are in direct conflict with one another. In the context of the HIV lifecycle, the importation of exogenous viral DNA and its integration into the host genome can elicit a strong DNA damage response (DDR). We hypothesize that DNA damage response factors regulate the lentiviral lifecycle as novel undelineated restriction factors. We applied an initial PS pipeline consisting of molecular phylogenetic analysis and statistical maximum likelihood methods to over 320 DDR factors and identified 67 genes displaying PS. To enrich for potential RFs, comparative screens for PS and IFN induction were done, identifying 8 gene candidates. From this computational framework we can now wisely carry out functional assays for physical data that will begin to elucidate both pro- and anti-viral mechanistic pathways of DDRs in the lentiviral lifecycle.

ROBERT SENIORS1*, JOSHUA SHERFEY1*, Emad Bahrami-Samani2, and Yi Xing1,2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Microbiology, Immunology, & Molecular Genetics, UCLA
3 Bioinformatics Interdepartmental Ph.D. Program, UCLA

RNA-Seq has emerged as a powerful approach to analyze differential gene expression. Here we aim to identify differentially expressed genes between two prostate cancer cell lines: PC3E, our control condition that exhibits epithelial cell properties, and GS689, our experimental condition that was recovered from a secondary metastatic liver tumor and exhibits mesenchymal and invasive characteristics. Using three biological replicates of the two cell lines, we employ a newly developed RNA-Seq workflow, the Lair. Integrating two previously published tools, kallisto and sleuth, the Lair quickly processes and analyzes RNA-Seq data. Quantification of transcript abundance is accomplished by kallisto, which utilizes pseudoalignment and multiple bootstraps to ensure maximum efficiency and accuracy. Furthermore, sleuth makes use of bootstraps produced by kallisto for differential analysis of transcript abundance and generates an interactive display containing test diagnostics and various graphical analyses of the data. Upon running the Lair, we found that a notable proportion of transcripts were differentially expressed between PC3E and GS689. Hypothesizing that the expression state of differentially expressed genes may be linked to cancerous behavior, we generated a list of differentially expressed genes from these results. Additionally, sleuth’s interactive display allowed for further exploratory analysis of the cancer datasets. This exploratory analysis allowed us to visualize data and notice trends that would be otherwise difficult, if not impossible, to recognize. Looking forward, we plan to further investigate our list of differentially expressed genes in order to narrow it to a high-confidence set of genes that are linked to cancerous behavior.

SARAH J. SPENDLOVE1, Huwenbo Shi2, and Bogdan Pasaniuc2,3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, University of California, Los Angeles
2 Bioinformatics Interdepartmental Program, University of California, Los Angeles
3 Department of Pathology and Laboratory Medicine, University of California, Los Angeles
4 Department of Human Genetics, University of California, Los Angeles

Genome-wide association studies (GWAS) have identified many loci that associate significantly with complex phenotypes, but there remains much missing heritability, or heritability that cannot be explained by GWAS loci. Learning about the genetic architecture behind complex traits can help us understand the biology behind these phenotypes. Recent scientific advances, including the development of the HESS tool, have allowed us to estimate local single nucleotide polymorphism (SNP) heritability, or the phenotypic variance explained by a set of SNPs at a locus in the genome. Estimating local SNP heritability allows us to investigate missing heritability because it aggregates the effects of all SNPs, not only ones that individually reach genome-wide significance in GWAS. Here we analyzed the local SNP heritability of 7 traits from 8 sets of GWAS summary statistics, with each dataset having information from 106,736 to 298,420 individuals. We analyzed educational attainment, subjective well-being, depressive symptoms, neuroticism, waist-to-hip ratio, hip circumference, and waist circumference. We plotted local heritability estimates to illustrate where the heritability of these traits lies. We showed that several of these traits have a significant amount of local SNP heritability in loci containing GWAS hits, but that much of the SNP heritability of these phenotypes is still unexplained by loci with GWAS hits. We also used our results to show that all of the complex traits we studied are polygenic. Overall, our research gives a succinct summary of the local SNP heritability of these traits that can be used to aid future research.

BENJAMIN STATZ1, Serghei Mangul1,2, Eleazar Eskin1,3

1 UCLA, Computer Science, Los Angeles, CA,
2 UCLA, Quantitative and Computational Biosciences, Los Angeles, CA,
3 UCLA, Human Genetics, Los Angeles, CA

Multiple tools exist for the analysis of the variable domains of B and T cell receptors. Their primary function is identification of the encoding genes – the variable, diversity (optional), and joining genes, which are randomly selected via V(D)J recombination; additionally, they may determine the complementarity-determining region 3 (CDR3) sequences which span the short, random insertions between these two or three genes. Though successful, these tools can often be improved in both efficiency and functionality. IgBLAST, the method currently used by the ROP pipeline, aligns V and J genes separately without considering their placement and relation to each other, increasing the number of unnecessary searches and missing potential alignments. Our modified approach involves two steps. First, the database of V genes is stored as a suffix tree containing only the ends of the genes equal to the read length, and the reads are aligned using a lowest common ancestor based algorithm to allow for mismatches. The reads containing V alignments are then selected and individually searched for subsequent J alignments. The resulting data can be used to automatically infer the CDR3 sequences. Preliminary tests on a single sample have shown improvements compared to IgBLAST, including a 30% increase in potential VJ alignments, a 13.9% increase in the number of reads on which potential VJ alignments were found, and a 3.4% increase in the number of CDR3 sequences detected. Additional testing will be conducted on larger data sets once the tool is finalized.

MICHAEL J. THOMPSON1, Avinash Nanjundiah1, Niko G. Balanis1, Bryan A. Smith2

1 Department of Molecular and Medical Pharmacology, UCLA
2 Department of Microbiology, Immunology, and Molecular Genetics, UCLA

In relative survival analyses of cancer patients, it has been found that cancer metastasizes quite commonly. Despite the prevalence of metastasis, the mechanisms underlying the phenomenon and its treatments remain poorly understood. Given recent advances in understanding cancer and its genetic profile, it is important to analyze the disease on a molecular level. Based on the ideas of the cancer stem cell hypothesis, our group examined commonalities between cancers and adult human stem cells. Using data from both RNA sequencing and array expression technologies, our group identified a gene expression signature of adult human stem cells. The forty-nine genes were chosen according to their average rank after calculating their signed log p value in a variety of adult human stem cell tissues. After validating the genes through a hypergeometric overlap test, we tested their presence in aggressive cancer datasets. The adult human stem cell signature was significantly more highly expressed in aggressive cancers than in their less aggressive counterparts, as predicted by previous literature. We further show that patients whose cancers express the signature had a significantly shorter survival time than those whose cancers did not. Future projects will analyze the functions and types of genes selected for the signature.
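A hypergeometric overlap test, as mentioned above, asks how likely it is that two gene sets drawn from the same background share at least the observed number of genes by chance. The sketch below computes that upper-tail probability with exact integer arithmetic. The function name and the example sizes (other than the 49-gene signature size) are hypothetical, and the abstract does not specify how the authors ran the test.

```python
from math import comb


def hypergeom_overlap_pvalue(population, set_a, set_b, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(population, set_a, set_b).

    population: number of genes in the background
    set_a, set_b: sizes of the two gene sets
    overlap: observed number of genes shared by both sets
    """
    if not 0 <= overlap <= min(set_a, set_b):
        raise ValueError("overlap must lie between 0 and min(set_a, set_b)")
    total = comb(population, set_b)
    # math.comb(n, k) is 0 when k > n, so impossible terms drop out.
    tail = sum(
        comb(set_a, k) * comb(population - set_a, set_b - k)
        for k in range(overlap, min(set_a, set_b) + 1)
    )
    return tail / total


# Hypothetical numbers: a 49-gene signature, a 200-gene comparison set,
# a 20,000-gene background, and 10 shared genes (NOT results from the study).
print(hypergeom_overlap_pvalue(20000, 49, 200, 10))
```

The result depends strongly on the background size, so the background should contain only genes that could have appeared in both sets.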

WILLIAM VAN DER WEY1, Serghei Mangul2, Eleazar Eskin2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Computer Science, UCLA
3 Department of Human Genetics, UCLA

A significant proportion of the reads generated by high-throughput RNA sequencing technologies typically remain unmapped. Read Origin Protocol (ROP), a bioinformatics tool designed to determine the origin of these unmapped reads, provides invaluable information about the microbial community in human samples. Reads that are generally discarded in a typical RNA-seq analysis are mapped to the genomes of microbial organisms. As an extension of ROP, we developed a pipeline for performing a functional analysis of the residual microbial reads. Although the specific composition of these microbial communities varies across individuals as well as tissues, we suggest that the functional profile of these communities will be relatively conserved in both cases. Using RNA-seq data from 53 tissue sites across 544 individuals, we ran the ROP tool and the functional profile pipeline to determine the metabolic capabilities of the human microbial communities across tissues. Unsurprisingly, the most abundant pathways in the majority of tissue types were related to the central metabolic pathway. Consistent with previous research, the diversity of metabolic capabilities was highest in microbe-rich tissues such as the skin, lung, and gut. Ultimately, ROP and the functional profiling pipeline provide insight into not only the metabolic pathways but also the use of those pathways in a microbial community.

NEERJA VASHIST1, Matthew S. Bramble1, Ascia Eskin1, and Eric Vilain1

1 Department of Human Genetics, David Geffen School of Medicine, University of California-Los Angeles

The mechanisms by which sex differences in the mammalian brain arise are poorly understood, but are influenced by a combination of underlying genetic differences and gonadal hormone exposure. Using a mouse neural stem cell (NSC) model to understand early events contributing to sexually dimorphic brain development, we identified novel interactions between chromosomal sex and hormonal exposure that are instrumental to early brain sex differences. RNA-sequencing identified 103 differentially expressed transcripts between XX and XY NSCs at baseline (FDR=0.05). Treatment with testosterone-propionate (TP) reveals sex-specific gene expression changes, causing 2854 and 792 transcripts to become differentially expressed on XX and XY backgrounds, respectively. These alterations in gene expression perturb the masculine- and feminine-specific baseline expression patterns. Within the TP-responsive genes, there was enrichment for epigenetic regulators that affect both histone modifications and DNA methylation patterning. We observed that TP caused a global decrease in 5-methylcytosine abundance in both sexes, a heritable effect that was maintained in cellular progeny. Additionally, we determined that TP was associated with residue-specific alterations in acetylation of histone tails. While the global decrease in DNA methylation was not sex-specific, the measured changes in acetylation were. Collectively, our results (1) provide novel transcriptional sex differences in NSCs, (2) demonstrate that TP can globally alter gene expression, and (3) show that epigenetic programming in NSCs is responsive to gonadal hormones. These findings highlight an unknown component of androgen action on the developing CNS, and contribute to a novel mechanism of action by which early hormonal organization is initiated and maintained.

SANAN VENKATESH1,2, Rebecca Walker3, Daniel Geschwind2

1 Department of Neuroscience, University of California Los Angeles
2 Department of Neurology, University of California Los Angeles
3 Bioinformatics IDP, University of California Los Angeles

Expression quantitative trait loci (eQTLs) are critical genomic loci that lead to significant differences in mRNA expression levels. While eQTL analyses have previously been performed only in adult brain, by examining eQTLs in developing human cerebral cortex we hope to shed light on biological processes corresponding to neurodevelopment and neurodevelopmental disorders such as autism spectrum disorder. In order to accurately draw associations between gene expression and genotype information, we must first explore the intricacies of both data types. I have examined the expression data using weighted correlation network analysis (WGCNA), which correlates gene expression profiles to identify clusters of highly correlated genes, each summarized by an eigengene. This identifies key groups of genes that may play distinct roles in biological function during development. I have also performed ancestry analysis with ADMIXTURE and IBD analysis with PLINK. Our WGCNA results show clear modules in our data. We will further analyze these modules for gene ontology enrichment. IBD analysis showed that our samples are all unrelated, and our ancestry analysis revealed that the majority of our samples are of Mexican and African ancestry, which was expected given how these samples were collected. The large diversity in our samples is unlike most previous studies, which were done in primarily European individuals. Therefore, carefully examining the properties of our genotypic and expression data has allowed us to accurately correct for population structure in our eQTL analysis.

KEVIN WESEL1, Serghei Mangul2, Eleazar Eskin2, 3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Computer Science, UCLA
3 Department of Human Genetics, UCLA

Read Origin Protocol (ROP) is a tool used to discover the source of all RNA-sequencing reads, especially the unmapped reads (reads not mapped to the annotated reference genome) that previously would go unused. However, ROP output, specifically the output for the lost repeat reads and elements in BLAST format, is lengthy and hard to interpret or graph. Here, we aimed to create Python scripts that would take an ROP lost repeat output file and return specific summary files of the data, providing the ability to compare tissue samples between individuals. Summary files included a vertical, human-readable file with relative frequency percentages of all reads that matched a certain element, and a horizontal summary file that could be concatenated with other files to produce boxplots in R illustrating the frequency of elements in certain body tissues. Novel boxplots were created using the script for the 2,977 body tissue samples from the GTEx project, and they illustrated an abundance of the MER20 element in the cerebellum, just as earlier GTEx boxplots from different data had shown. Additionally, the boxplots exhibited a newly observed larger fraction of the SVA(A) element in the left ventricle of the heart. Calculating the frequency of reads for each element has been crucial to highlight the body tissues in which certain RNA reads are likely to reside, leading to a better understanding of disease etiology. This Python script will be available with the new release of ROP (1.0.5), and the wiki for the code can be found at https://github.com/kevinwesel/SLRsummary/wiki/SLRsummary-Wiki.
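The two summary layouts described above can be illustrated with a short sketch. It is not the SLRsummary script itself. It assumes the repeat element assigned to each read has already been extracted from the ROP output, and every function name, the tab-separated layout, and the example labels are hypothetical.

```python
from collections import Counter


def relative_frequencies(element_per_read):
    """Percentage of reads assigned to each repeat element, most common first."""
    counts = Counter(element_per_read)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.most_common()}


def vertical_summary(freqs):
    """One element per line: a human-readable layout."""
    return "\n".join(f"{name}\t{pct:.2f}%" for name, pct in freqs.items())


def horizontal_summary(sample_id, freqs, elements):
    """One row per sample with a fixed column order, so rows from many
    samples can be concatenated into a single table for plotting."""
    cells = [f"{freqs.get(element, 0.0):.2f}" for element in elements]
    return "\t".join([sample_id] + cells)


# Hypothetical element assignments for four reads (NOT GTEx data).
reads = ["MER20", "MER20", "SVA_A", "L1"]
freqs = relative_frequencies(reads)
print(vertical_summary(freqs))
print(horizontal_summary("sample1", freqs, ["MER20", "SVA_A", "L1", "Alu"]))
```

Fixing the column order in the horizontal layout is what makes rows from different samples comparable, since an element absent from one sample still gets a 0.00 entry.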