B.I.G. Summer 2018 – Institute for Quantitative and Computational Biosciences

2018 Bruins-In-Genomics Summer Undergraduate Research Program

2018 B.I.G. Summer Alumni Updates

September 2018: Congratulations to our 2018 alumni who have been selected by the Annual Biomedical Research Conference for Minority Students (ABRCMS) to present their research from B.I.G. Summer. Ruth Adewale (Howard University ’19), mentored by Dr. Valerie Arboleda, and Janae Lyttle (Spelman College ’19), mentored by Dr. Luisa Iruela-Arispe, will present results from their summer research projects at ABRCMS 2018 in Indianapolis, Indiana.

2018 B.I.G. Summer Best Poster Award Winners

Congrats Ram Ayyala, Jacqueline Castellanos, and Emily Wesel!
Congrats Dana Franklin and Nnamdi Osakwe!
Congrats Mohammadreza Hajy Heydary and Anna Yaschenko!
Congrats Elaine Huang and Michele Ramos Correa!
Congrats Emily Kobayashi and Jacquelyn Roger!
Congrats Varuni Sarwal!

2018 B.I.G. Summer Participants

Lab PIs	Mentors	Students
VALERIE ARBOLEDA	Maria Palafox	Amy Freiberg, Univ. of Central Florida
		Ruth Adewale, Howard University
JASON ERNST	Adriana Sperlea	Brooke Felsheim, Washington University
		Jennifer Chien, Wellesley College
ELEAZAR ESKIN	Robert Brown	Jinjing Zhou, UCLA
		Sarah Faller, Duke University
	Serghei Mangul	Caitlin Loeffler, UCLA
		Emily Wesel, Westlake-Harvard High School
		Jacqueline Castellanos, Santa Monica College
		Keith Mitchell, UCLA
		Ram Ayyala, UCLA
		Varuni Sarwal, Indian Institute of Technology
ALEXANDER HOFFMANN	Quen Cheng	Kensei Kishimoto, UCLA
	Diane Lefaudeux	Nick Miller, Cornell University, Ithaca
	Simon Mitchell	Amy Tam, UCLA
LUISA IRUELA-ARISPE	Milagros Romay	Janae Lyttle, Spelman College
		Lyndon Rolle, Fisk University
HUIYING LI	Baochen Shi	Dana Franklin, Fisk University
		Nnamdi Osakwe, North Carolina Central Univ.
STANLEY NELSON	Hane Lee	Kayla Schimke, UC Santa Cruz
		Samuel Nkrumah, Fisk University
BENNETT NOVITCH	Ranmal Samarasinghe	Brandon Nanfito, UCLA
PÄIVI PAJUKANTA	Marcus Alvarez	Grant Schulte, UCLA
JEANETTE PAPP	Benjamin Chu	Gordon Mosher, UC Riverside
		Marcel Nwaukwa, Univ. of Arkansas at Pine Bluff
BOGDAN PASANIUC	Kathryn Burch	Angela Wei, Univ. of Kentucky
	Malika Kumar Freund	Daniela Perry, Cornell University
MATTEO PELLEGRINI	Colin Farrell	Huiling Huang, UCLA
		Simran Rajpal, UCLA
SRIRAM SANKARARAMAN	Robert Brown	Boyang Fu, Rutgers University
		Jingxian (Clara) Liu, Cornell University
	Arun Durvasula	Elliot Kang, Brown University
		Mario Paciuc, Rice University
	Ariel Wu	Anna Yaschenko, Univ. of Maryland, Baltimore Co.
		Mohammadreza Hajy Heydary, CSU Fullerton
JAE HOON SUL	Brandon Jew	Sabrina Iddir, Univ. of Chicago
	Huajun Zhou	Jiangyuan (Jerome) Luo, UCLA
GRACE XIAO	Kikuye Koyano	Elaine Huang, Lafayette College
		Michele Ramos Correa, CSU Northridge
JASMINE ZHOU	Shuo Li	Emily Kobayashi, UC Santa Cruz
		Jacquelyn Roger, UC Santa Cruz

2018 B.I.G. Summer Poster Abstracts

ADEWALE: Differential DNA Methylation in Patients with Global Development Delay Caused By De Novo Mutations in Gene KAT6A

RUTH ADEWALE¹, Courtney Rauchman², Valerie Arboleda²,³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Pathology and Laboratory Medicine, David Geffen School of Medicine,
³ Department of Human Genetics, David Geffen School of Medicine, UCLA

Mutations in the KAT6A gene cause a rare neurodevelopmental syndrome that includes intellectual disability, cardiac defects, gastrointestinal dysmotility, and dysmorphic features. We obtained patient-derived skin fibroblasts and extracted DNA for sodium bisulfite treatment to identify differentially methylated regions genome-wide. Bisulfite-treated DNA was then run on the Illumina Infinium MethylationEPIC chip. We ran 12 KAT6A-syndrome biological replicates and 13 control samples. Data were processed using Python and the ChAMP R pipeline. We hypothesize that there will be differential DNA methylation between patients and controls, and that these differences will inform our understanding of the mechanisms underlying KAT6A Syndrome. Quality control analyses identified that two samples had been switched on the chips. Using the ChAMP R pipeline, we identified 91 differentially methylated probes and 99 differentially methylated regions that passed our significance threshold. Overall, we have identified consistently differentially methylated probes and regions that can help us identify the downstream effects of KAT6A mutations.

AYYALA, CASTELLANOS, WESEL: Analyzing the Adaptive Immune Repertoire Diversity and Microbiomes of African Individuals using RNA Sequence Data

RAM AYYALA¹, JACQUELINE CASTELLANOS¹, EMILY WESEL¹, Serghei Mangul², Eleazar Eskin²,³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Computer Science,
³ Department of Human Genetics, David Geffen School of Medicine, UCLA

RNA-Sequence data from African individuals provide an unprecedented opportunity to study features of the adaptive immune system across multiple populations. We hypothesize that due to different lifestyles, these populations have been exposed to a wide variety of viruses, and as a result, should have distinct CDR3 sequences and VJ recombination. Using ImReP, we were able to assemble distinct CDR3 sequences of 130 individuals from eight distinct African tribes and estimate their immune diversity. Upon further analysis using statistical methods, we found a significant difference (p-value < 10⁻³) in the diversity of T-cell receptor repertoires across African populations. These results support the hypothesis that the African populations have highly diverse T-cell repertoires, which could be influenced by their lifestyle. In the future, we plan to focus on analyzing populations of different countries in order to obtain a global perspective of the adaptive immune repertoire diversity.

CHIEN, FELSHEIM: Applying ConsHMM to Provide a Novel Annotation Resource for Various Multiple Species Alignments of Major Model Organisms

JENNIFER CHIEN¹, BROOKE FELSHEIM¹, Adriana Sperlea²,³, Jason Ernst³,⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Bioinformatics Interdepartmental Program,
³ Department of Biological Chemistry,
⁴ Department of Computer Science, UCLA

ConsHMM is an approach for whole genome annotation that uses a multivariate hidden Markov model to identify the combinatorial and spatial patterns in a multiple species DNA sequence alignment. We applied ConsHMM to twenty different multiple sequence alignments with reference species of major model organisms. Here, we examine two avenues of comparison: between two models built from 30 and 100 species sequence alignments with the hg38 reference genome that identifies analogous conservation states across alignments and between models built from MultiZ and Ensembl alignment algorithms with the same reference genomes. Additionally, we present external enrichments for other genomic annotations, including CpG islands, exons, transcription start and end sites, and other existing measures of conservation. We showcase the ability of these annotations to identify differences between alignment algorithms, identify analogous patterns of constraint across species, and identify equivalent patterns of constraint for the same species based on different alignments.

FALLER, ZHOU: Probabilistic Model to Estimate Replication Rate in the Presence of Bias and Confounding in Genome-wide Association Studies (GWAS)

SARAH FALLER¹, JINJING ZHOU¹, Robert Brown², Eleazar Eskin²,³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Computer Science,
³ Department of Human Genetics, David Geffen School of Medicine, UCLA

GWAS have been successful in identifying many genetic variants that associate with specific phenotypes. Once a significant variant is found, replication studies are done to verify that the results are true positives. Our work attempts to eliminate statistical noise, ascertainment bias, and confounding factors to determine whether observed associations have a high probability of replication or are likely to be false positives. We hypothesize that many studies fail to replicate due to the winner’s curse and the presence of unknown confounding factors. We use a probabilistic model to correct for the amplified variance when estimating the underlying effect size. We then extend this model to define the conditional distribution of the replication study test statistic given the discovery test statistic, allowing us to compute the expected rate of replication. Using simulated data, we predicted the replication rate with 98.73% accuracy. Currently, we are working with real data to test our hypothesis.
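
The winner's curse mentioned above can be illustrated with a small simulation. The sketch below is not the authors' probabilistic model: it assumes a single variant with a fixed true standardized effect (`true_z`), unit-variance normal test statistics in both studies, and a one-sided discovery threshold of about 5.45. All of these values are illustrative choices.

```python
import random
import statistics


def simulate_winners_curse(true_z=3.0, threshold=5.45, n_studies=100_000, seed=0):
    """Simulate discovery and replication z-scores for one variant.

    Only discoveries that pass the threshold are followed up, which is the
    selection step that inflates the observed discovery statistic.
    """
    rng = random.Random(seed)
    discovery, replication = [], []
    for _ in range(n_studies):
        z_disc = rng.gauss(true_z, 1.0)
        if z_disc > threshold:
            discovery.append(z_disc)
            # The replication study is an independent draw around the same true effect.
            replication.append(rng.gauss(true_z, 1.0))
    return statistics.mean(discovery), statistics.mean(replication), len(discovery)


mean_disc, mean_rep, n_selected = simulate_winners_curse()
print(f"selected discoveries: {n_selected}")
print(f"mean discovery z: {mean_disc:.2f}, mean replication z: {mean_rep:.2f}")
```

Because only statistics that cleared the threshold are followed up, the average discovery z-score necessarily exceeds the true effect, while the replication statistics scatter around it. A replication-rate prediction has to account for that gap.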

FRANKLIN, OSAKWE: PanDOra: A Pan-Genome Database of the Human Oral Microbiome

DANA FRANKLIN¹, NNAMDI OSAKWE¹, Baochen Shi², Huiying Li²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Molecular and Medical Pharmacology, UCLA

The human oral microbiome is associated with health and disease. Over one thousand oral bacterial isolates have been sequenced to investigate the functional potentials in their genomes. Compared to individual genomes, a pan-genome describes the full complement of genes in a species. The pan-genome analysis estimates the number of new genes added per sequenced genome. Additionally, the pan-genome, as non-redundant genomic references, helps improve the resolution and efficiency of the metagenomic shotgun sequence analyses. However, currently the dataset of the pan-genome collection is not publicly available. In this study, we constructed a Pan-genome Database of the human Oral microbiome (PanDOra), which consists of pan-genomes of 375 bacterial species from 124 genera based on our collection of 1,309 human oral-associated microbial genomes. The pan-genome of each species can now be easily searched in and downloaded from our database. We calculated and provided the rarefaction curve for each species to determine if the pan-genome is open or closed. PanDOra is the first pan-genome database designed for human oral microbiome studies.

FREIBERG: Analysis of Mutation Patterns and Differing Amino Acid Compositions across the Reference Human Proteome

AMY FREIBERG¹, Maria Palafox², Valerie Arboleda²,³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Human Genetics, David Geffen School of Medicine,
³ Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, UCLA

The human proteome contains approximately 20,000 unique proteins. Genomic variants that occur within the exome can result in synonymous, missense, or truncating amino acid changes. While many rare genetic disorders have been attributed to truncating variants that result in a loss-of-function, missense changes are historically more challenging to interpret at a functional level. We hypothesize that specific types of missense changes are more likely to be deleterious than others due to the biochemical composition of the amino acid. Additionally, we believe that specific patterns of missense variation are implicated in rare genetic disorders. Our project aims to understand and predict how specific amino acid changes alter protein function. In this project, we analyzed human reference proteome data from UniProt. We studied the distribution of amino acids in the proteome to understand mutation patterns between amino acids. These data will provide a reference for future proteomic studies regarding various disease states.

FU, LIU: Benchmarking on GWAS Tools with Multiplex Genetic Architecture Models

BOYANG FU¹, CLARA LIU¹, Robert Brown², Sriram Sankararaman²,³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Computer Science,
³ Department of Human Genetics, David Geffen School of Medicine, UCLA

Genome-wide association studies (GWAS) are a powerful approach to detect associations between phenotypes and genotypes. However, current statistical methods for conducting association tests face two main obstacles. First, the true underlying causal architecture is unknown; this means that association methods must be sensitive to as many genetic architectures as possible. Second, these statistical techniques are computationally intensive, which increasingly becomes a bottleneck as the size of data increases. Software tools using different approaches have been developed to address these problems, but an updated benchmarking comparison is presently lacking. In this study, we systematically compare four commonly used GWAS tools: TASSEL, BOLT-LMM, GEMMA, and SKAT. We evaluate their performance on power, family-wise error rate, and run time under different causal architectures and with a range of simulated sample sizes. With most tools properly controlling the family-wise error rate, BOLT-LMM has the most power in detecting associated single nucleotide polymorphisms (SNPs) under most common-variant genetic architectures. At heritability 0.5, BOLT-LMM has a discovery rate of 0.19, higher than the rates of GEMMA and SKAT (0.097) and TASSEL (0.077). However, under a different genetic architecture where rare variants contribute to the phenotypes as genes, SKAT has a higher power of 0.16, while BOLT-LMM and GEMMA have powers of 0.094 and 0.064, respectively.

HAJY HEYDARY, YASCHENKO: Comparison of Method of Moments and Restricted Maximum Likelihood Estimators for Approximating SNP-Heritability and Genetic Correlation

MOHAMMADREZA HAJY HEYDARY¹, ANNA YASCHENKO¹, Yue Wu², and Sriram Sankararaman²,³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Computer Science,
³ Department of Human Genetics, David Geffen School of Medicine, UCLA

Heritability, the proportion of variation of a trait explained by genetic variation, and genetic correlation are crucial parameters in determining the genetic architecture of complex phenotypes. Linear Mixed Models (LMMs) have emerged as a critical tool for estimating heritability and genetic correlation, where the parameters of the LMMs are related to the heritability attributable to the SNPs analyzed and genetic correlations across phenotypes. Likelihood-based inference in LMMs, however, poses a severe computational burden and does not scale to sizeable datasets such as the UK Biobank. In this work, we propose a scalable randomized Method-of-Moments (MoM) estimator of SNP heritability and genetic correlations in LMMs. Our method, RHE-reg, leverages the structure of genotype data to obtain runtimes that are sub-linear in the number of individuals (assuming the number of SNPs is held constant). Compared with existing software, using both simulations and real data, we show that RHE-reg is accurate and has statistical efficiency comparable to that of likelihood-based inference. We also show that RHE-reg estimates are stable in situations of model misspecification. We further extend our estimator using a generalized MoM that achieves better statistical efficiency.
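
For intuition, a method-of-moments estimator of this kind can be written in a few lines when the traces are computed exactly. The sketch below is not RHE-reg itself: it assumes the standard Haseman-Elston-style moment equations for a single genetic variance component and uses exact rather than randomized trace computations. The function name `mom_variance_components` and the toy data are hypothetical.

```python
def mom_variance_components(genotypes, phenotype):
    """Solve the two moment equations for (sigma_g^2, sigma_e^2).

    genotypes: n x m list of rows (assumed standardized per SNP)
    phenotype: length-n list (assumed mean-centered)
    Moment equations, with K = X X^T / m:
        tr(K^2) * sg + tr(K) * se = y^T K y
        tr(K)   * sg + n     * se = y^T y
    """
    n, m = len(genotypes), len(genotypes[0])
    # Genetic relationship matrix K = X X^T / m, formed explicitly for clarity.
    K = [[sum(a * b for a, b in zip(row_i, row_j)) / m for row_j in genotypes]
         for row_i in genotypes]
    tr_k = sum(K[i][i] for i in range(n))
    tr_k2 = sum(v * v for row in K for v in row)  # equals tr(K^2) because K is symmetric
    y_k_y = sum(phenotype[i] * K[i][j] * phenotype[j]
                for i in range(n) for j in range(n))
    y_y = sum(v * v for v in phenotype)
    # Solve the 2x2 linear system by Cramer's rule.
    det = tr_k2 * n - tr_k * tr_k
    sigma_g = (n * y_k_y - tr_k * y_y) / det
    sigma_e = (tr_k2 * y_y - tr_k * y_k_y) / det
    return sigma_g, sigma_e


# Toy data: four individuals, two SNPs, phenotype built with no noise.
toy_genotypes = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
toy_phenotype = [1.5, 0.5, -0.5, -1.5]
sigma_g, sigma_e = mom_variance_components(toy_genotypes, toy_phenotype)
print("genetic variance:", sigma_g, "residual variance:", sigma_e)
```

Forming the relationship matrix explicitly costs time at least quadratic in the number of individuals, which is the step a randomized estimator is designed to avoid.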

HUANG, RAMOS CORREA: Assessing the Global Landscape of Extracellular MicroRNA Expression

ELAINE HUANG¹, MICHELE RAMOS CORREA¹, Kikuye Koyano², Grace Xiao³,⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Bioinformatics Interdepartmental Program,
³ Department of Integrative Biology and Physiology,
⁴ Molecular Biology Institute, UCLA

MicroRNAs (miRNAs) can be found extracellularly, making them suitable biomarkers for disease. While their expression has been characterized across tissues and cell types, comprehensive comparisons across cell types and biofluids are lacking. Here, we analyzed miRNA expression across 5 extracellular biofluids and 39 cell types, and we identified 99 highly expressed miRNAs. EdgeR was used to normalize miRNA expression levels across multiple datasets. Hierarchical clustering on normalized expression levels showed reduced batch effects and improved clustering of samples from similar fluids. High levels of correlation in expression between certain cell types and fluids suggest potential intracellular sources contributing to extracellular miRNA profiles. Fluid-specific miRNAs were identified in plasma exosomes, plasma, and serum; their gene targets were found to be significantly enriched in multiple biological pathways (e.g. signaling processes). Notably, hsa-miR-22-3p was expressed across all tested fluids. Our findings suggest directions for future investigations into the origins and functions of extracellular miRNA.

HUANG, RAJPAL: Meta-Analysis of DNA Methylation for a Library of Biomarkers

HUILING HUANG¹, SIMRAN RAJPAL¹, Maria Melkonyan², Colin Farrell³, Matteo Pellegrini³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Department of Math, Santa Monica College
³ Department of Molecular, Cell and Developmental Biology, UCLA

DNA carries distinct epigenetic markers through the methylation of cytosine. Analysis of DNA methylation is an extremely robust method for identifying distinct biomarkers. The cumulative effect of an epigenetic maintenance system is measured by the presence of DNA methylation biomarkers. We developed a tool to perform meta-analysis on DNA methylation data for a library of biomarkers. In this study, we gathered about 16,000 samples of methylation data measured in blood samples using the HumanMethylation450 BeadChip (Illumina) from the Gene Expression Omnibus (GEO), and a collection of published biomarkers. We also developed novel biomarkers. We developed a pipeline, which extracts and standardizes DNA methylation data, calculates biomarker scores—thus, predicting certain physiological status for each sample—and looks for correlations among different physiological traits and biomarker predictions. This pipeline provides results consistent with a normally distributed population. These predictions will allow us to use one data set to determine accurate physiological information without the need for self-reporting.

IDDIR: Examining Calling Biases Through Comparison of Variant Discovery Toolkit Versions

SABRINA IDDIR¹, Brandon Jew², Jae Hoon Sul³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Bioinformatics Interdepartmental Program,
³ Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, UCLA

Effective inquiry into the genetic basis of disease relies on variant calling consistency. There has been a recent, growing interest in combining sequence datasets across studies in order to increase sample size. However, many of these datasets are generated through identical algorithms—yet different variant calling processes—potentially resulting in discrepancies. Here we explore the analysis of genetic information from 10 individuals with bipolar disorder for variant calling using one version of the Genome Analysis Toolkit (GATK) for the dataset and repeated using two versions of GATK. We anticipate notable differences in genetic variants with potential for calling bias resulting in false positives or false negatives, which may shed light on future ability for data distribution. Divergences in results lend themselves to the development of troubleshooting techniques when approaching various data versions, while similarities indicate the capability of convenient and consistent data sharing across studies.

KANG, PACIUC: Hyperparameter Optimization on Archaic Ancestry Deep Learning Model

ELLIOT KANG¹, MARIO PACIUC¹, Arun Durvasula², Sriram Sankararaman²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Human Genetics, David Geffen School of Medicine, UCLA

As deep learning models are trained, model parameters are adjusted to reduce loss. However, models also depend on hyperparameters, settings that are not adjusted as the model trains but are fixed when the model is initialized. Determining the optimal hyperparameter configuration can be tedious, but several algorithms have been developed to speed up the process. We implemented two of these algorithms, Hyperband (Li et al.) and Harmonica (Yuan et al.), on the deep learning model developed by Durvasula et al., which combines genetic statistics from a population to infer archaic ancestry. The model was then run with these new hyperparameter configurations and compared to the original. We found that these new configurations had practically no effect on the effectiveness of the model, with all three configurations leading to a validation loss of 0.0416. However, the configuration found by Harmonica was more efficient, requiring fewer epochs to reach this validation loss.
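
Hyperband is built around successive halving: many configurations are tried on a small budget, and only the best fraction advances to a larger one. The sketch below shows that inner loop only, under stated assumptions. `toy_validation_loss` is a hypothetical stand-in for training the archaic-ancestry model, and the candidate learning rates, `eta`, and budgets are invented for illustration.

```python
import math


def toy_validation_loss(learning_rate, budget):
    """Hypothetical loss: lowest at learning_rate = 0.01, improving with more epochs."""
    return abs(math.log10(learning_rate) + 2.0) + 1.0 / budget


def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the best 1/eta of the configurations each round, growing the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Lower loss is better, so sort ascending and keep the front of the list.
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


candidate_learning_rates = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 0.3, 1.0]
best = successive_halving(candidate_learning_rates, toy_validation_loss)
print("selected learning rate:", best)
```

Full Hyperband repeats this loop over several brackets that trade off the number of starting configurations against the starting budget.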

KISHIMOTO: Studying RelB’s Role as a Negative Regulator of IFNβ by ChIP-seq Analysis

KENSEI KISHIMOTO¹, Quen Cheng², Chen Sheng Ng², Alexander Hoffmann²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Microbiology, Immunology and Molecular Genetics, UCLA

RelB belongs to the NFκB family of transcription factors that controls immune response genes. RelB knockout (Relb⁻/⁻) mice show hyper-inflammation, auto-immunity, and irregular organ development such as enlarged spleens. In Relb⁻/⁻ mouse embryonic fibroblast (MEF) cells, we see elevated expression of interferon-β (IFNβ). IFNβ is usually induced in response to viral infection and signals through the interferon-alpha receptor (IFNAR) to elicit anti-viral responses. Interestingly, Ifnar⁻/⁻ rescued the auto-immune phenotype of Relb⁻/⁻ in a double knockout. However, it remains unclear how RelB negatively regulates IFNβ, presumably by controlling the transcriptional activity of one of the hundreds of potential regulators of IFNβ. To identify the target genes of RelB, we conducted chromatin immunoprecipitation of RelB followed by high-throughput sequencing (ChIP-seq). Sequencing data was analyzed through a statistical analysis pipeline. The results and analysis will be discussed in the poster.

KOBAYASHI, ROGER: Investigating the Impact of Overlapping Read Mates on Variant Calling

EMILY KOBAYASHI¹, JACQUELYN ROGER¹, Shuo Li²,³, Xianghong Jasmine Zhou³

Accurate variant calling is an integral part of cancer research and its clinical applications. Variant calling on paired-end next-generation sequencing data is used in analyses of both genomic DNA (gDNA) and cell-free DNA (cfDNA). Overlaps between paired-end read mates occur when the DNA fragment size is shorter than twice the read length, and can cause unreliable variant calls due to double-counting of variant alleles in overlapping regions. We quantified the extent of the impact of overlapping read mates on variant calling by merging paired-end sequencing data and comparing variant profiles between unmerged and merged data. We found that read merging resulted in fewer called variants, underscoring the impact of variant double-counting. The decrease in the number of called variants was more pronounced for cfDNA than for gDNA; this is consistent with the shorter fragment length of cfDNA. Our findings suggest that the impact of read mate overlaps on variant calling should be considered in future analyses, especially for cfDNA paired-end sequencing data.
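
The overlap condition can be made concrete with a little arithmetic. The helper below is a minimal sketch, assuming both mates have the same length and no trimming has been applied. The function name and the example fragment sizes are illustrative rather than taken from the study.

```python
def mate_overlap(fragment_len, read_len):
    """Return the number of fragment bases covered by both mates of a read pair."""
    if fragment_len <= 0 or read_len <= 0:
        raise ValueError("fragment_len and read_len must be positive")
    # Two reads of read_len sequenced inward from each end cover 2 * read_len bases
    # in total, so anything beyond fragment_len is sequenced twice. The overlap
    # cannot exceed the fragment itself when the fragment is shorter than one read.
    return max(0, min(2 * read_len - fragment_len, fragment_len))


# Illustrative values: a short cfDNA-like fragment and a longer gDNA-like fragment,
# both sequenced with 150 bp reads.
print(mate_overlap(167, 150))  # short fragment: the mates overlap
print(mate_overlap(400, 150))  # long fragment: no overlap
```

Any variant allele that falls inside the overlapping bases is observed once per mate, which is how a single molecule can contribute two supporting reads.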

LOEFFLER: Database Issues for Microbial Genomes Confound Metagenomics Research

CAITLIN LOEFFLER¹, Serghei Mangul², Eleazar Eskin²,³, David Koslicki⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Department of Computer Science, UCLA
³ Department of Human Genetics, David Geffen School of Medicine, UCLA
⁴ Mathematics Department, Oregon State University

There has been a recent shift in the field of microbiology, from culture-based experiments to culture-independent studies that take samples directly from the environment. This shift has led to the rise of novel metagenomics analysis approaches, which require a reference database in which environmental reads can be identified. There are many reference databases available, including those from NCBI, Ensembl, and JGI. We analyzed the presence of fungal genera in these three databases. Our analysis reveals that these reference databases have an alarmingly large area where there is no overlap; any one database is missing entire genera of references that are included in another database. Thus, a researcher looking for the most complete reference database available is required to combine multiple databases. Such database construction slows the advancement of the field of microbiology and metagenomics. Our work is a preliminary step in a larger project that is working towards identification of fungal reads from environmental samples to construct the Mycobiome.

LUO: Evaluating the Performance of Structural Variation Detecting Algorithms Using Whole Genome Sequencing Data from Families with Bipolar Disorder

JIANGYUAN LUO¹, Huajun Zhou², Jae Hoon Sul³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Biological Chemistry, David Geffen School of Medicine,
³ Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, UCLA

Structural variation (SV) affects ~1% of the human genome, which may contain deleterious genes related to neurological disorders. Various algorithms detect different sets of SVs, and recently several approaches have been developed to merge SV calls from these algorithms. Here we evaluated whether merging results from different SV algorithms generates more accurate SV calls than does each individual algorithm. Whole genome sequencing data of 15 trios relating to bipolar disorder were analyzed through a pipeline of 8 SV algorithms, and the results were combined using a merging algorithm, FusorSV. We found that the result merged by FusorSV detects more variations with a slightly higher Mendelian error rate and false discovery rate than that of certain individual algorithms. This result suggests that more SVs can be detected when different algorithms are used to complement each other, although some of these SVs may have low quality.
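
One common way to compute a Mendelian error rate for a trio is to count the child's calls that appear in neither parent. The sketch below assumes calls have already been normalized to comparable keys (a hypothetical `(chrom, start, end, type)` tuple) and are matched by exact identity. Real SV comparisons usually need some tolerance around breakpoints, and this is not necessarily how the poster's pipeline defined the metric.

```python
def mendelian_error_rate(child_calls, father_calls, mother_calls):
    """Fraction of the child's SV calls that are found in neither parent."""
    child = set(child_calls)
    if not child:
        return 0.0
    parental = set(father_calls) | set(mother_calls)
    # A child call missing from both parents is either a rare de novo event
    # or, far more often, a calling error in one of the three samples.
    unexplained = child - parental
    return len(unexplained) / len(child)


# Hypothetical calls for one trio.
father = {("chr1", 1000, 5000, "DEL"), ("chr2", 200, 900, "DUP")}
mother = {("chr3", 700, 1500, "DEL")}
child = {
    ("chr1", 1000, 5000, "DEL"),
    ("chr2", 200, 900, "DUP"),
    ("chr3", 700, 1500, "DEL"),
    ("chr4", 10, 400, "INV"),
}
print(mendelian_error_rate(child, father, mother))
```

Comparing this rate between a merged call set and each individual caller gives a quality check that needs no external truth set.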

LYTTLE, ROLLE: Transcriptomic Characterization of Vascular Cell Types During Aging

JANAE M. LYTTLE¹, LYNDON ROLLE¹, Milagros C. Romay², M. Luisa Iruela-Arispe²,³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Molecular, Cell and Developmental Biology,
³ Molecular Biology Institute, UCLA

The world population is aging rapidly and this brings a new challenge to health care. Cardiovascular disease continues to be a leading cause of death in this population. To elucidate the molecular changes that occur in the aging vasculature, we performed RNASeq on a longitudinal cohort of mouse aortae ranging from 4 weeks to 78 weeks of age. Utilizing both cpm-based (EdgeR) and count-based (DESeq2) transcriptional analysis, our findings reveal a striking increase in the expression of inflammatory genes and a decrease in transcripts associated with adhesion and fibrosis, processes known to contribute to age-associated cardiovascular disorders such as dementia. Taken together, these findings suggest that key molecular processes that can modulate disease risk predominate the vascular transcriptome in an age-dependent manner. Findings support the concept that vascular aging itself is a crucial modulator in the development of cardiovascular diseases, independent of additional risk factors.

MILLER: Comparing mRNA Half-Life Estimates from 4sU-seq and ActD-seq Data

NICK MILLER¹, Diane Lefaudeux², Alexander Hoffmann²

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Microbiology, Immunology and Molecular Genetics, UCLA

Gene expression is regulated by mRNA synthesis and decay, which determine how much and how long mRNAs can be translated. ActD-seq allows measurement of decay rates by inhibiting transcription, but it might lead to artifacts due to its toxicity. Another method, 4sU labeling, incorporates labeled uridine residues on newly synthesized transcripts without cell disturbance. Using 4sU data from stimulated macrophages we obtained mRNA synthesis and decay rates with the DTA package by Schwalb et al., available on R’s Bioconductor. Despite well-correlated replicates, the calculated decay rates were poorly reproducible and largely inconsistent with previously obtained ActD-seq decay rates. This suggests that the 4sU approach is overly sensitive to small aberrations in experimental data. As such, we have developed an R Shiny user interface for ActD-seq data. The tool calculates mRNA half-lives with confidence estimates from user-inputted ActD-seq data.
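
A half-life estimate from a transcription-shutoff time course can be sketched as a log-linear fit. The example below assumes simple first-order decay after transcription is blocked and fits a straight line to the log abundances by ordinary least squares. It is written in Python for illustration and is not the R Shiny tool's implementation, it produces no confidence estimates, and the time points and abundances are synthetic.

```python
import math


def half_life_from_time_course(times, abundances):
    """Fit ln(abundance) = ln(A0) - k * t and return (k, half_life)."""
    logs = [math.log(a) for a in abundances]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    decay_rate = -sxy / sxx  # the fitted slope is -k
    if decay_rate <= 0:
        raise ValueError("abundance does not decrease over time; no half-life defined")
    # For first-order decay, the half-life is ln(2) / k.
    return decay_rate, math.log(2) / decay_rate


# Synthetic time course (hours after transcription block) with a decay rate of 0.5 per hour.
hours = [0.0, 0.5, 1.0, 2.0, 4.0]
counts = [100.0 * math.exp(-0.5 * t) for t in hours]
rate, half_life = half_life_from_time_course(hours, counts)
print(f"decay rate: {rate:.3f} per hour, half-life: {half_life:.3f} hours")
```

With noisy replicates, the spread of the fitted slope is what a confidence estimate on the half-life would be derived from.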

MITCHELL: Comprehensive Benchmarking of Error Correction Methods for Short Reads

KEITH MITCHELL¹, Serghei Mangul², Igor Mandric³, Brian Hill², Alex Zelikovsky³, Eleazar Eskin²,⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
² Department of Computer Science, UCLA
³ Department of Computer Science, Georgia State University
⁴ Department of Human Genetics, David Geffen School of Medicine, UCLA

Error-correction is an important computational technique that promises to improve the results of next-generation sequencing (NGS) analysis. The optimal application of these error-correction methods is often unclear because we lack standardized comparison platforms and ‘gold standard’ benchmarking datasets. Here we provide the first comprehensive assessment of error-correction algorithms (n=10) and demonstrate methods for producing ‘gold standard’ datasets for benchmarking. Using raw reads and high fidelity reads, our approach provides an accurate and robust baseline for performing a realistic evaluation of error correction on sequenced genomes. We assessed accuracy, precision, sensitivity, and gain of the error correction tools using data derived from simulated T and B-cell receptor repertoires, real immune repertoire data, and simulated whole-genome sequencing (WGS) data. Among the tools assessed, we observed that some of the error correction algorithms were able to produce a positive gain for immune repertoire data. However, the majority of error correction tools fail to have a positive gain for WGS and can prove detrimental to the raw reads.

MOSHER, NWAUKWA: Prior Weights for Iterative Hard Thresholding

GORDON DAVID MOSHER¹, MARCEL NWAUKWA¹, Benjamin Chu², Jeanette Papp⁴, Kenneth Lange²,³,⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Biomathematics, David Geffen School of Medicine,
³ Department of Biostatistics, Fielding School of Public Health,
⁴ Department of Human Genetics, David Geffen School of Medicine, UCLA

Iterative hard thresholding (IHT) is a relatively new and promising algorithm used to analyze genome-wide association studies (GWAS), which aim to identify single nucleotide polymorphisms (SNPs) associated with a particular trait. In IHT, we can guide the parameter selection by assigning prior weights to individual SNPs, resulting in greater predictive power. We implement two prior weighting algorithms: one for applying constant weight at SNP locations of interest, and another based on minor allele frequency (MAF). These weights are applied to the variables in the IHT analysis by changing the least squares formula to weighted least squares. Tests demonstrated the ability to direct parameter selection upward, downward, or in both directions; in the case of MAF weighting, the weights are able to pull up biologically rare SNPs. We are delighted that testing discovered four previously concealed, rare SNPs. Future work will incorporate gene and pathway knowledge to identify SNPs of interest.
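
The effect of prior weights on which SNPs survive thresholding can be seen in a toy version of IHT. The sketch below is an assumption-laden illustration, not the authors' implementation: it ranks candidate coefficients by prior weight times absolute value before keeping the top `k`, which is only one possible way to let weights steer selection. The design matrix is a toy identity matrix so that a step size of 1 is safe; for a general matrix the step must be small enough for the gradient iteration to converge.

```python
def hard_threshold(beta, k, weights):
    """Keep the k entries with the largest weighted magnitude; zero the rest."""
    ranked = sorted(range(len(beta)),
                    key=lambda j: weights[j] * abs(beta[j]), reverse=True)
    keep = set(ranked[:k])
    return [b if j in keep else 0.0 for j, b in enumerate(beta)]


def iht(X, y, k, weights=None, step=1.0, n_iter=50):
    """Projected gradient descent on least squares with a k-sparse constraint."""
    n, p = len(X), len(X[0])
    weights = weights or [1.0] * p
    beta = [0.0] * p
    for _ in range(n_iter):
        residual = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # Gradient step on the least-squares loss: beta + step * X^T residual.
        candidate = [beta[j] + step * sum(X[i][j] * residual[i] for i in range(n))
                     for j in range(p)]
        beta = hard_threshold(candidate, k, weights)
    return beta


# Toy problem: four "SNPs" with an identity design matrix.
X = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
y = [3.0, 0.5, 2.0, 1.8]
print(iht(X, y, k=2))                                # unweighted selection
print(iht(X, y, k=2, weights=[1.0, 1.0, 1.0, 2.0]))  # up-weight the last SNP
```

Doubling the weight on the last variable lets it displace a variable with a slightly larger raw effect, which mirrors how a prior could pull a rare SNP into the selected set.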

NANFITO: Stem Cell Derived Rett Brain Organoids Demonstrate Developmental Differences in Interneuron Migration

BRANDON NANFITO¹, Ranmal Samarasinghe²,³,⁴, Osvaldo Miranda²,⁴, Bennett Novitch²,⁴

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Neurobiology, David Geffen School of Medicine,
³ Department of Neurology, David Geffen School of Medicine,
⁴ Eli and Edythe Broad Center for Regenerative Medicine and Stem Cell Research, UCLA

Pluripotent stem-cell-derived human brain organoids are an emerging three-dimensional technology with the potential to provide novel insights into neurological disease. We produced human brain organoids derived from induced pluripotent stem cells obtained from a patient with Rett syndrome. Rett syndrome is a neuro-regressive disease associated with a number of disorders, including intellectual and motor disability and epilepsy. Interneuron dysfunction is thought to be critical to this phenotype, and our preliminary data suggest that Rett organoids have slower rates of cellular migration, which correlate with fewer migrated interneurons when compared to isogenic controls at certain time points. Extracellular recordings and two-photon calcium imaging results demonstrate epileptiform-like changes that may be a consequence of the observed cell migration changes. Despite slower migration in Rett organoids, their interneuron numbers become equal to those of control organoids past a certain age. Future studies will therefore focus on further delineating the effect of interneuron pathology on ictogenic changes in Rett brain organoids.

NKRUMAH, SCHIMKE: Verifying Identity of DNA and RNA Specimens Through SNP Analysis

SAMUEL NKRUMAH¹, KAYLA SCHIMKE¹, Alden Huang², Hane Lee², Stanley Nelson^2,3

Clinical molecular laboratories have incorporated next-generation sequencing of genomic DNA as an essential component to diagnosing rare congenital disorders. Additionally, there has been increased interest in performing whole-transcriptomic profiling of patient samples to further inform variant interpretation and, ultimately, improve diagnostic yield. As this practice becomes more commonplace, reliable methods of unequivocally identifying individuals from multi-omics data are required. Here we describe a computational workflow to fingerprint samples from whole-genome, whole-exome, and whole-transcriptome sequencing derived from both blood and fibroblast samples. Our method is accurate, efficient, and easily incorporated into existing pipelines.

PERRY: Using Clustering Algorithms to Merge Multi-Sample Histone Modification Regions

DANIELA PERRY¹, Malika Kumar Freund², Bogdan Pasaniuc^2,3

ChIP-seq has emerged as a powerful tool for understanding gene regulation, and its results can be used to map protein-DNA interactions to locations spanning the genome. Currently, in analyses comparing multiple ChIP-seq samples, no consensus exists on how to cohesively identify the presence of particular protein features (“peaks”) at different loci. Existing methods merge ChIP-seq peaks with overlapping start and end sites, yielding biologically nonsensical results. As an alternative, we propose a clustering scheme to group peaks across samples. Using ChIP-seq to identify H3K27ac marks in healthy ovarian surface epithelial cells from 54 women, we compare the regulatory regions found using existing methods to those found using our method. We observe differences in peak distribution depending on whether we apply no transformation, merge overlapping peaks, or apply clustering techniques. Using clustering to identify ChIP-seq peaks is a promising method for identifying biologically meaningful regulatory regions across the genome.
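The contrast between merging and clustering can be shown with a toy example. The sketch below is illustrative only: the abstract does not name the clustering algorithm, so a simple one-dimensional gap-based grouping of peak midpoints stands in for it. The function names and the `max_gap` parameter are hypothetical, and all peaks are assumed to lie on one chromosome.

```python
def merge_overlapping(peaks):
    """Merge (start, end) peaks that overlap, chaining through neighbors."""
    merged = []
    for start, end in sorted(peaks):
        if merged and start <= merged[-1][1]:
            # Overlaps the current region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


def cluster_by_midpoint(peaks, max_gap):
    """Group peaks whose midpoints are within max_gap of the previous one."""
    ordered = sorted(peaks, key=lambda p: (p[0] + p[1]) / 2)
    clusters = []
    last_mid = None
    for peak in ordered:
        mid = (peak[0] + peak[1]) / 2
        if clusters and mid - last_mid <= max_gap:
            clusters[-1].append(peak)
        else:
            clusters.append([peak])
        last_mid = mid
    return clusters


if __name__ == "__main__":
    # Hypothetical peaks from several samples at one locus.
    peaks = [(100, 200), (190, 400), (390, 900), (120, 210)]
    print(merge_overlapping(peaks))          # one region spanning 100 to 900
    print(cluster_by_midpoint(peaks, 50))    # three separate groups
```

In this toy case, chained overlaps collapse four peaks into a single 800 bp region, while midpoint clustering keeps the two tightly co-located peaks together and leaves the broader ones separate.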

SARWAL: Comprehensive Benchmarking of Structural Variant Callers

VARUNI SARWAL¹, Serghei Mangul², Russell Jared Littman², Ram Ayyala¹, Emily Wesel¹, Jacqueline Castellanos¹, Margaret Distler³, Eleazar Eskin^2,4, and Jonathan Flint³

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Computer Science,
³ Semel Institute for Neuroscience and Human Behavior,
⁴ Department of Human Genetics, David Geffen School of Medicine, UCLA

Structural variants (SVs), including insertions, deletions, and duplications, vary in their pathogenicity. Currently there are no benchmarking studies available for SV detection, making it difficult to choose appropriate SV discovery software. To determine which tools offer the best sensitivity and precision, we compared the ability of over 30 SV tools to detect deletions. We assessed the performance of these tools on mouse and human data in which all SVs have been confirmed by PCR. We subsampled our data at coverage levels ranging from 80x down to 1x to determine the effect of coverage on each tool’s performance. Generally, as coverage increased, precision decreased and sensitivity increased. We observed that deletion length had a dramatic effect on each tool’s accuracy. Clever and Sniffles are the tools with the best balance of sensitivity and precision. Combining the results of various tools across different deletion length ranges produced a pseudo-tool that outperformed all individual tools in sensitivity and precision. Our recommendations can help researchers choose the best SV detection software.

SCHULTE: Identifying Differential Splicing Patterns in Human Subcutaneous Adipose Tissue after Bariatric Surgery-Induced Weight Loss

GRANT SCHULTE¹, Marcus Alvarez², Zong Miao³, Päivi Pajukanta^2,4

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Department of Human Genetics, David Geffen School of Medicine,
³ Bioinformatics Interdepartmental Program,
⁴ Molecular Biology Institute, UCLA

Obesity is a major health concern in the United States, with nearly 40% of Americans being obese and 70% being overweight. To better understand the mechanisms of weight loss, we analyzed 262 and 168 subcutaneous adipose tissue samples taken during and one year after bariatric surgery, respectively, from the Finnish Kuopio Obesity Bariatric Surgery (KOBS) cohort. We hypothesized that RNA splicing patterns in adipose tissue would change due to this immense surgery-induced weight loss. To test this hypothesis, we performed RNA-Seq and quantified isoform expression using Kallisto, an ultra-efficient splice-aware pseudoaligner. We then performed differential expression analysis using the Sleuth R package, correcting for age, sex, and the first three expression principal components. We identified various genes that exhibited differential splicing in adipose tissue after weight loss. To determine if weight loss affects specific metabolic pathways, we plan to use DAVID to identify functional enrichments. In conclusion, we find evidence of differential isoform usage after weight loss in several genes expressed in subcutaneous adipose tissue.

TAM: Modeling NFκB–inducible Gene Expression to Quantify Transactivation Potential of NFκB

AMY TAM¹, Simon Mitchell², Alexander Hoffmann²

¹ BIG Summer Research Program, Institute for Quantitative and Computational Biosciences,
² Department of Microbiology, Immunology and Molecular Genetics, UCLA

NFκB/RelA is a potent transcription factor that activates hundreds of immune response genes. It possesses two distinct transactivation domains, TA1 and TA2. RNA-sequencing data for target genes show that both transactivation domains are necessary to induce expression of some genes, but one is sufficient for others. To quantify the transactivation potential of each transactivation domain with gene-specific resolution, we developed a kinetic mathematical model of RelA-dependent gene expression and a workflow to fit the model to the available RNA-sequencing data. The approach developed here enables quantification of the transactivation potential of each transactivation domain for individual target genes. We will report our findings on the target gene-specific functions of TA1 and TA2.

WEI: Enrichment of SNP-Heritability of Psychiatric Disorders in Gene and Isoform Modules

ANGELA WEI¹, Kathryn Burch², Claudia Giambartolomei³, Michael Gandal⁴, Bogdan Pasaniuc^3,5,6

¹ BIG Summer Program, Institute for Quantitative and Computational Biosciences,
² Bioinformatics Interdepartmental Program,
³ Department of Pathology and Laboratory Medicine, David Geffen School of Medicine,
⁴ Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine,
⁵ Department of Human Genetics, David Geffen School of Medicine,
⁶ Department of Biomathematics, David Geffen School of Medicine, UCLA

Gene co-expression networks, or modules, are groups of genes that share similar expression patterns. Integrating gene modules with large-scale genetic data can help uncover mechanisms underlying complex diseases, such as psychiatric disorders. Here, we integrate 34 gene and 56 isoform modules obtained from cerebral cortex samples (n=1,322) of patients with autism (ASD), schizophrenia (SCZ), and bipolar disorder (BIP) with genome-wide association studies (GWAS) of BIP (n=46,918), SCZ (n=105,318), and ASD (n=10,610). To quantify the enrichment of SNP-heritability within each module, we define annotations that mark SNPs within a window around the module’s gene bodies as 1 and all other SNPs as 0, and use stratified linkage disequilibrium score regression to examine whether SNPs inside each module have larger effects than expected by chance. At FDR < 0.05, 31 isoform and 16 gene modules were significantly enriched for SCZ, and 16 isoform and 4 gene modules were significantly enriched for BIP; no modules were significantly enriched for ASD.
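The binary annotation described above can be sketched as follows. This is a minimal illustration, assuming a single chromosome and a symmetric window; the abstract does not state the window size, and the function name `annotate_snps` and its parameters are hypothetical.

```python
def annotate_snps(snp_positions, gene_bodies, window):
    """Mark each SNP 1 if it lies within `window` bp of any gene body, else 0.

    snp_positions: base-pair positions of SNPs on one chromosome
    gene_bodies:   (start, end) coordinates of the genes in one module
    window:        number of base pairs added on each side of a gene body
    """
    return [
        int(any(start - window <= pos <= end + window
                for start, end in gene_bodies))
        for pos in snp_positions
    ]


if __name__ == "__main__":
    # Hypothetical module with one gene spanning 1,000 to 2,000 bp.
    print(annotate_snps([50, 950, 1500, 5000], [(1000, 2000)], window=100))
```

An annotation vector like this, built once per module, is the kind of input that stratified LD score regression takes to partition SNP-heritability.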
