2018 Bruins-In-Genomics Summer Undergraduate Research Program

2018 B.I.G. Summer Alumni Updates

  • September 2018: Congratulations to our 2018 alumni who have been selected by the Annual Biomedical Research Conference for Minority Students (ABRCMS) to present their research from B.I.G. Summer. Ruth Adewale (Howard University ’19), mentored by Dr. Valerie Arboleda, and Janae Lyttle (Spelman College ’19), mentored by Dr. Luisa Iruela-Arispe, will present results from their summer research projects at ABRCMS 2018 in Indianapolis, Indiana.

2018 B.I.G. Summer Best Poster Award Winners

2018 B.I.G. Summer Participants

Lab PI | Mentor | Student
VALERIE ARBOLEDA | Maria Palafox | Amy Freiberg, Univ. of Central Florida
 | | Ruth Adewale, Howard University
JASON ERNST | Adriana Sperlea | Brooke Felsheim, Washington University
 | | Jennifer Chien, Wellesley College
ELEAZAR ESKIN | Robert Brown | Jinjing Zhou, UCLA
 | | Sarah Faller, Duke University
 | Serghei Mangul | Caitlin Loeffler, UCLA
 | | Emily Wesel, Westlake-Harvard High School
 | | Jacqueline Castellanos, Santa Monica College
 | | Keith Mitchell, UCLA
 | | Ram Ayyala, UCLA
 | | Varuni Sarwal, Indian Institute of Technology
ALEXANDER HOFFMANN | Quen Cheng | Kensei Kishimoto, UCLA
 | Diane Lefaudeux | Nick Miller, Cornell University, Ithaca
 | Simon Mitchell | Amy Tam, UCLA
LUISA IRUELA-ARISPE | Milagros Romay | Janae Lyttle, Spelman College
 | | Lyndon Rolle, Fisk University
HUIYING LI | Baochen Shi | Dana Franklin, Fisk University
 | | Nnamdi Osakwe, North Carolina Central Univ.
STANLEY NELSON | Hane Lee | Kayla Schimke, UC Santa Cruz
 | | Samuel Nkrumah, Fisk University
BENNETT NOVITCH | Ranmal Samarasinghe | Brandon Nanfito, UCLA
PÄIVI PAJUKANTA | Marcus Alvarez | Grant Schulte, UCLA
JEANETTE PAPP | Benjamin Chu | Gordon Mosher, UC Riverside
 | | Marcel Nwaukwa, Univ. of Arkansas at Pine Bluff
BOGDAN PASANIUC | Kathryn Burch | Angela Wei, Univ. of Kentucky
 | Malika Kumar Freund | Daniela Perry, Cornell University
MATTEO PELLEGRINI | Colin Farrell | Elaine Huang, Lafayette College
 | | Simran Rajpal, UCLA
SRIRAM SANKARARAMAN | Robert Brown | Boyang Fu, Rutgers University
 | | Jingxian (Clara) Liu, Cornell University
 | Arun Durvasula | Elliot Kang, Brown University
 | | Mario Paciuc, Rice University
 | Ariel Wu | Anna Yaschenko, Univ. of Maryland, Baltimore Co.
 | | Mohammadreza Hajy Heydary, CSU Fullerton
JAE HOON SUL | Brandon Jew | Sabrina Iddir, Univ. of Chicago
 | Huajun Zhou | Jiangyuan (Jerome) Luo, UCLA
GRACE XIAO | Kikuye Koyano | Huiling Huang, UCLA
 | | Michele Ramos Correa, CSU Northridge
JASMINE ZHOU | Shuo Li | Emily Kobayashi, UC Santa Cruz
 | | Jacquelyn Roger, UC Santa Cruz

2018 B.I.G. Summer Poster Abstracts


ADEWALE: Differential DNA Methylation in Patients with Global Development Delay Caused By De Novo Mutations in Gene KAT6A

RUTH ADEWALE1, Courtney Rauchman2, Valerie Arboleda2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Pathology and Laboratory Medicine, David Geffen School of Medicine,
3 Department of Human Genetics, David Geffen School of Medicine, UCLA

Mutations in the KAT6A gene cause a rare neurodevelopmental syndrome that includes intellectual disability, cardiac defects, gastrointestinal dysmotility, and dysmorphic features. We obtained patient-derived skin fibroblasts and extracted DNA for sodium bisulfite treatment to identify differentially methylated regions genome-wide. Bisulfite-treated DNA was then run on the Illumina Infinium MethylationEPIC chip. We ran 12 KAT6A-syndrome biological replicates and 13 control samples. Data were processed using Python and the ChAMP R pipeline. We hypothesized that DNA methylation would differ between patients and controls, and that these differences would inform our understanding of the mechanisms underlying KAT6A syndrome. Quality control analyses identified two samples that had been switched on the chips. Using the ChAMP R pipeline, we identified 91 differentially methylated probes and 99 differentially methylated regions that passed our significance threshold. Overall, we have identified consistently differentially methylated probes and regions that can help us identify the downstream effects of KAT6A mutations.


AYYALA, CASTELLANOS, WESEL: Analyzing the Adaptive Immune Repertoire Diversity and Microbiomes of African Individuals using RNA Sequence Data

RAM AYYALA1, JACQUELINE CASTELLANOS1, EMILY WESEL1, Serghei Mangul2, Eleazar Eskin2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Computer Science,
3 Department of Human Genetics, David Geffen School of Medicine, UCLA

RNA-Sequence data from African individuals provide an unprecedented opportunity to study features of the adaptive immune system across multiple populations. We hypothesize that due to different lifestyles, these populations have been exposed to a wide variety of viruses, and as a result, should have distinct CDR3 sequences and VJ recombination. Using ImReP, we were able to assemble distinct CDR3 sequences of 130 individuals from eight distinct African tribes and estimate their immune diversity. Upon further analysis using statistical methods, we found a significant difference (p-value < 10^-3) in the diversity of T-cell receptor repertoires across African populations. These results support the hypothesis that the African populations have highly diverse T-cell repertoires, which could be influenced by their lifestyle. In the future, we plan to focus on analyzing populations of different countries in order to obtain a global perspective of the adaptive immune repertoire diversity.
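The abstract does not say which diversity measure was used. As a minimal sketch, assuming Shannon entropy over CDR3 clonotype frequencies (a common choice, but an assumption here), repertoire diversity could be computed as follows. The function name and the toy CDR3 strings are hypothetical.

```python
import math
from collections import Counter

def shannon_diversity(cdr3_sequences):
    """Shannon entropy (natural log) of clonotype frequencies."""
    counts = Counter(cdr3_sequences)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

# Toy repertoires with made-up CDR3 strings (not real data).
even = ["CASSL", "CASSP", "CASRG", "CAWSV"]      # four equally common clonotypes
skewed = ["CASSL", "CASSL", "CASSL", "CASSP"]    # one dominant clonotype
print(shannon_diversity(even), shannon_diversity(skewed))
```

An evenly distributed repertoire scores higher than one dominated by a single clonotype.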


CHIEN, FELSHEIM: Applying ConsHMM to Provide a Novel Annotation Resource for Various Multiple Species Alignments of Major Model Organisms

JENNIFER CHIEN1, BROOKE FELSHEIM1, Adriana Sperlea2,3, Jason Ernst3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Bioinformatics Interdepartmental Program,
3 Department of Biological Chemistry,
4 Department of Computer Science, UCLA

ConsHMM is an approach for whole genome annotation that uses a multivariate hidden Markov model to identify the combinatorial and spatial patterns in a multiple species DNA sequence alignment. We applied ConsHMM to twenty different multiple sequence alignments with reference species of major model organisms. Here, we examine two avenues of comparison: between two models built from 30 and 100 species sequence alignments with the hg38 reference genome that identifies analogous conservation states across alignments and between models built from MultiZ and Ensembl alignment algorithms with the same reference genomes. Additionally, we present external enrichments for other genomic annotations, including CpG islands, exons, transcription start and end sites, and other existing measures of conservation. We showcase the ability of these annotations to identify differences between alignment algorithms, identify analogous patterns of constraint across species, and identify equivalent patterns of constraint for the same species based on different alignments.


FALLER, ZHOU: Probabilistic Model to Estimate Replication Rate in the Presence of Bias and Confounding in Genome-wide Association Studies (GWAS)

SARAH FALLER1, JINJING ZHOU1, Robert Brown2, Eleazar Eskin2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Computer Science,
3 Department of Human Genetics, David Geffen School of Medicine, UCLA

GWAS have been successful in identifying many genetic variants that associate with specific phenotypes. Once a significant variant is found, replication studies are done to verify that the results are true positives. Our work attempts to eliminate statistical noise, ascertainment bias, and confounding factors to determine whether observed associations have a high probability of replication or are likely to be false positives. We hypothesize that many studies fail to replicate due to the winner’s curse and the presence of unknown confounding factors. We use a probabilistic model to correct for the amplified variance when estimating the underlying effect size. We then extend this model to define the conditional distribution of the replication study test statistic given the discovery test statistic, allowing us to compute the expected rate of replication. Using simulated data, we predicted the replication rate with 98.73% accuracy. Currently, we are working with real data to test our hypothesis.
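The winner's curse mentioned above can be illustrated with a small simulation. This is only a sketch, not the authors' probabilistic model: it assumes a single true effect with z-score mean 2.0, unit-variance noise, a two-sided threshold of 1.96, and a replication study with the same power as the discovery study. All of these values are illustrative.

```python
import random
import statistics

def simulate_winners_curse(true_z=2.0, threshold=1.96, n_studies=100_000, seed=0):
    """Return (mean discovery z among significant hits, replication rate)."""
    rng = random.Random(seed)
    discovery = [rng.gauss(true_z, 1.0) for _ in range(n_studies)]
    # Only associations that pass the threshold get reported and followed up.
    significant = [z for z in discovery if abs(z) > threshold]
    # Each follow-up draws a fresh statistic around the same true effect.
    replication = [rng.gauss(true_z, 1.0) for _ in significant]
    replicated = sum(abs(z) > threshold for z in replication)
    return statistics.mean(significant), replicated / len(significant)

mean_sig, rep_rate = simulate_winners_curse()
print(mean_sig, rep_rate)
```

Selecting on significance inflates the average discovery statistic above the true value, which is why a naive effect estimate overstates the chance of replication.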


FRANKLIN, OSAKWE: PanDOra: A Pan-Genome Database of the Human Oral Microbiome

DANA FRANKLIN1, NNAMDI OSAKWE1, Baochen Shi2, Huiying Li2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Molecular and Medical Pharmacology, UCLA

The human oral microbiome is associated with health and disease. Over one thousand oral bacterial isolates have been sequenced to investigate the functional potentials in their genomes. Compared to individual genomes, a pan-genome describes the full complement of genes in a species. The pan-genome analysis estimates the number of new genes added per sequenced genome. Additionally, the pan-genome, as non-redundant genomic references, helps improve the resolution and efficiency of the metagenomic shotgun sequence analyses. However, currently the dataset of the pan-genome collection is not publicly available. In this study, we constructed a Pan-genome Database of the human Oral microbiome (PanDOra), which consists of pan-genomes of 375 bacterial species from 124 genera based on our collection of 1,309 human oral-associated microbial genomes. The pan-genome of each species can now be easily searched in and downloaded from our database. We calculated and provided the rarefaction curve for each species to determine if the pan-genome is open or closed. PanDOra is the first pan-genome database designed for human oral microbiome studies.
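A rarefaction curve of the kind described tracks how the pan-genome grows as genomes are added. The sketch below is a minimal illustration, assuming each genome is represented as a set of gene identifiers and that the curve is averaged over random genome orderings. The function name and toy gene sets are hypothetical and are not taken from PanDOra.

```python
import random

def rarefaction_curve(genomes, n_permutations=100, seed=0):
    """Mean pan-genome size after adding 1..N genomes in random order."""
    rng = random.Random(seed)
    totals = [0] * len(genomes)
    for _ in range(n_permutations):
        order = genomes[:]
        rng.shuffle(order)
        seen = set()
        for i, genes in enumerate(order):
            seen |= genes              # union of all genes seen so far
            totals[i] += len(seen)
    return [t / n_permutations for t in totals]

# Toy species with four genomes (made-up gene identifiers).
genomes = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "e"}, {"a", "b", "c"}]
curve = rarefaction_curve(genomes)
print(curve)
```

A curve that keeps rising as genomes are added suggests an open pan-genome, while a plateau suggests a closed one.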


FREIBERG: Analysis of Mutation Patterns and Differing Amino Acid Compositions across the Reference Human Proteome

AMY FREIBERG1, Maria Palafox2, Valerie Arboleda2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Human Genetics, David Geffen School of Medicine,
3 Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, UCLA

The human proteome contains approximately 20,000 unique proteins. Genomic variants that occur within the exome can result in synonymous, missense, or truncating amino acid changes. While many rare genetic disorders have been attributed to truncating variants that result in a loss-of-function, missense changes are historically more challenging to interpret at a functional level. We hypothesize that specific types of missense changes are more likely to be deleterious than others due to the biochemical composition of the amino acid. Additionally, we believe that specific patterns of missense variation are implicated in rare genetic disorders. Our project aims to understand and predict how specific amino acid changes alter protein function. In this project, we analyzed human reference proteome data from UniProt. We studied the distribution of amino acids in the proteome to understand mutation patterns between amino acids. These data will provide a reference for future proteomic studies regarding various disease states.


FU, LIU: Benchmarking on GWAS Tools with Multiplex Genetic Architecture Models

BOYANG FU1, CLARA LIU1, Robert Brown2, Sriram Sankararaman2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Computer Science,
3 Department of Human Genetics, David Geffen School of Medicine, UCLA

Genome-wide association studies (GWAS) are a powerful approach for detecting associations between phenotypes and genotypes. However, current statistical methods for conducting association tests are faced with two main obstacles. First, the true underlying causal architecture is unknown; this means that association methods must be sensitive to as many genetic architectures as possible. Second, these statistical techniques are computationally intensive, which increasingly becomes a bottleneck as the size of data increases. Software tools using different approaches have been developed to address these problems, but an updated benchmarking comparison is presently lacking. In this study, we systematically compare four commonly used GWAS tools: TASSEL, BOLT-LMM, GEMMA, and SKAT. We evaluate their performance on power, family-wise error rate, and run time under different causal architectures and with a range of simulated sample sizes. With most tools properly controlling the family-wise error rate, BOLT-LMM has the most power in detecting associated single nucleotide polymorphisms (SNPs) under most common-variant genetic architectures. At heritability 0.5, BOLT-LMM has a discovery rate of 0.19, higher than the rates of GEMMA and SKAT (0.097) and TASSEL (0.077). However, under a different genetic architecture in which rare variants contribute to the phenotype at the gene level, SKAT has a higher power of 0.16, while BOLT-LMM and GEMMA have powers of 0.094 and 0.064, respectively.


HAJY HEYDARY, YASCHENKO: Comparison of Method of Moments and Restricted Maximum Likelihood Estimators for Approximating SNP-Heritability and Genetic Correlation

MOHAMMADREZA HAJY HEYDARY1, ANNA YASCHENKO1, Yue Wu2, and Sriram Sankararaman2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Computer Science,
3 Department of Human Genetics, David Geffen School of Medicine, UCLA

Heritability, the proportion of variation of a trait explained by genetic variation, and genetic correlation are crucial parameters in determining the genetic architecture of complex phenotypes. Linear Mixed Models (LMMs) have emerged as a critical tool for estimating heritability and genetic correlation, where the parameters of the LMMs are related to the heritability attributable to the SNPs analyzed and genetic correlations across phenotypes. Likelihood-based inference in LMMs, however, poses a severe computational burden and does not scale to sizeable datasets such as the UK Biobank. In this work, we propose a scalable randomized Method-of-Moments (MoM) estimator of SNP heritability and genetic correlations in LMMs. Our method, RHE-reg, leverages the structure of genotype data to obtain runtimes that are sub-linear in the number of individuals (assuming the number of SNPs is held constant). Compared with existing software, using both simulations and real data, we show that RHE-reg is accurate and has statistical efficiency comparable to that of likelihood-based inference. We also show that RHE-reg estimates are stable in situations of model misspecification. We further extend our estimator using a generalized MoM that achieves better statistical efficiency.
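To make the method-of-moments idea concrete, here is a minimal sketch of a plain (non-randomized) Haseman-Elston style estimator. It is not RHE-reg itself: the randomization that the abstract credits for scalability is omitted, and the simulated genotypes are Gaussian stand-ins. Function and variable names are hypothetical.

```python
import numpy as np

def he_regression(X, y):
    """Method-of-moments estimate of SNP heritability.

    X: N x M standardized genotype matrix; y: length-N standardized phenotype.
    """
    n, m = X.shape
    K = X @ X.T / m                          # genetic relationship matrix
    # Match y y^T to sg2 * K + se2 * I in the least-squares sense.
    A = np.array([[np.sum(K * K), np.trace(K)],
                  [np.trace(K), n]])
    b = np.array([y @ K @ y, y @ y])
    sg2, se2 = np.linalg.solve(A, b)
    return sg2 / (sg2 + se2)

# Simulated data (assumed sizes; Gaussian columns stand in for genotypes).
rng = np.random.default_rng(0)
n, m, h2 = 1000, 500, 0.5
X = rng.standard_normal((n, m))
X = (X - X.mean(axis=0)) / X.std(axis=0)
beta = rng.standard_normal(m) * np.sqrt(h2 / m)
y = X @ beta + rng.standard_normal(n) * np.sqrt(1 - h2)
y = (y - y.mean()) / y.std()
h2_hat = he_regression(X, y)
print(h2_hat)
```

Forming K explicitly costs O(N²M) time, which is the kind of bottleneck a randomized estimator is meant to avoid.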


HUANG, RAMOS CORREA: Assessing the Global Landscape of Extracellular MicroRNA Expression

ELAINE HUANG1, MICHELE RAMOS CORREA1, Kikuye Koyano2, Grace Xiao3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Bioinformatics Interdepartmental Program,
3 Department of Integrative Biology and Physiology,
4 Molecular Biology Institute, UCLA

MicroRNAs (miRNAs) can be found extracellularly, making them suitable biomarkers for disease. While their expression has been characterized across tissues and cell types, comprehensive comparisons across cell types and biofluids are lacking. Here, we analyzed miRNA expression across 5 extracellular biofluids and 39 cell types, and we identified 99 highly expressed miRNAs. EdgeR was used to normalize miRNA expression levels across multiple datasets. Hierarchical clustering on normalized expression levels showed reduced batch effects and improved clustering of samples from similar fluids. High levels of correlation in expression between certain cell types and fluids suggest potential intracellular sources contributing to extracellular miRNA profiles. Fluid-specific miRNAs were identified in plasma exosomes, plasma, and serum; their gene targets were found to be significantly enriched in multiple biological pathways (e.g. signaling processes). Notably, hsa-miR-22-3p was expressed across all tested fluids. Our findings suggest directions for future investigations into the origins and functions of extracellular miRNA.


HUANG, RAJPAL: Meta-Analysis of DNA Methylation for a Library of Biomarkers

HUILING HUANG1, SIMRAN RAJPAL1, Maria Melkonyan2, Colin Farrell3, Matteo Pellegrini3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Math, Santa Monica College
3 Department of Molecular, Cell and Developmental Biology, UCLA

DNA carries distinct epigenetic markers through the methylation of cytosine. Analysis of DNA methylation is an extremely robust method for identifying distinct biomarkers. The cumulative effect of an epigenetic maintenance system is measured by the presence of DNA methylation biomarkers. We developed a tool to perform meta-analysis on DNA methylation data for a library of biomarkers. In this study, we gathered about 16,000 samples of methylation data measured in blood samples using the HumanMethylation450 BeadChip (Illumina) from the Gene Expression Omnibus (GEO), and a collection of published biomarkers. We also developed novel biomarkers. We developed a pipeline, which extracts and standardizes DNA methylation data, calculates biomarker scores—thus, predicting certain physiological status for each sample—and looks for correlations among different physiological traits and biomarker predictions. This pipeline provides results consistent with a normally distributed population. These predictions will allow us to use one data set to determine accurate physiological information without the need for self-reporting.


IDDIR: Examining Calling Biases Through Comparison of Variant Discovery Toolkit Versions

SABRINA IDDIR1, Brandon Jew2, Jae Hoon Sul3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Bioinformatics Interdepartmental Program,
3 Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, UCLA

Effective inquiry into the genetic basis of disease relies on variant calling consistency. There has been a recent, growing interest in combining sequence datasets across studies in order to increase sample size. However, many of these datasets are generated through identical algorithms—yet different variant calling processes—potentially resulting in discrepancies. Here we explore the analysis of genetic information from 10 individuals with bipolar disorder for variant calling using one version of the Genome Analysis Toolkit (GATK) for the dataset and repeated using two versions of GATK. We anticipate notable differences in genetic variants with potential for calling bias resulting in false positives or false negatives, which may shed light on future ability for data distribution. Divergences in results lend themselves to the development of troubleshooting techniques when approaching various data versions, while similarities indicate the capability of convenient and consistent data sharing across studies.


KANG, PACIUC: Hyperparameter Optimization on Archaic Ancestry Deep Learning Model

ELLIOT KANG1, MARIO PACIUC1, Arun Durvasula2, Sriram Sankararaman2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Human Genetics, David Geffen School of Medicine, UCLA

As deep learning models are trained, model parameters are adjusted to reduce loss. However, models also depend on hyperparameters, settings that are not adjusted as the model trains but are fixed when the model is initialized. Determining the optimal hyperparameter configuration can be tedious, but several algorithms have been developed to speed up the process. We implemented two of these algorithms, Hyperband (Li et al.) and Harmonica (Yuan et al.), on the deep learning model developed by Durvasula et al., which combines genetic statistics from a population to infer archaic ancestry. The model was then run with these new hyperparameter configurations and compared to the original. We found that these new configurations had practically no effect on the effectiveness of the model, with all three configurations leading to a validation loss of 0.0416. However, the configuration found by Harmonica was more efficient, requiring fewer epochs to reach this validation loss.
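Hyperband is built on successive halving: many configurations are tried on a small budget, and only the best fraction is promoted to a larger budget. The sketch below shows a single successive-halving bracket, not the full Hyperband schedule and not the authors' implementation. The `toy_loss` function is a hypothetical stand-in for training the model for `budget` epochs and returning its validation loss.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configs each round while growing the budget."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = scored[:max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0]

# Hypothetical loss surface: best learning rate is 0.01, more budget lowers loss.
def toy_loss(learning_rate, budget):
    return (learning_rate - 0.01) ** 2 + 1.0 / budget

best = successive_halving([0.0001, 0.001, 0.01, 0.1, 1.0], toy_loss)
print(best)
```

Full Hyperband repeats this procedure over several brackets that trade off the number of configurations against the starting budget.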


KISHIMOTO: Studying RelB’s Role as a Negative Regulator of IFNβ by ChIP-seq Analysis

KENSEI KISHIMOTO1, Quen Cheng2, Chen Sheng Ng2, Alexander Hoffmann2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Microbiology, Immunology and Molecular Genetics, UCLA

RelB belongs to the NF-κB family of transcription factors that controls immune response genes. RelB knockout (Relb-/-) mice show hyper-inflammation, auto-immunity, and irregular organ development such as enlarged spleens. In Relb-/- mouse embryonic fibroblast (MEF) cells, we see elevated expression of interferon-β (IFNβ). IFNβ is usually induced in response to viral infection and signals through the interferon-alpha receptor (IFNAR) to elicit anti-viral responses. Interestingly, Ifnar-/- rescued the auto-immune phenotype of Relb-/- in a double knockout. However, it remains unclear how RelB negatively regulates IFNβ; presumably it controls the transcriptional activity of one of the hundreds of potential regulators of IFNβ. To identify the target genes of RelB, we conducted chromatin immunoprecipitation of RelB followed by high-throughput sequencing (ChIP-seq). Sequencing data were analyzed through a statistical analysis pipeline. The results and analysis will be discussed in the poster.


KOBAYASHI, ROGER: Investigating the Impact of Overlapping Read Mates on Variant Calling

EMILY KOBAYASHI1, JACQUELYN ROGER1, Shuo Li2,3, Xianghong Jasmine Zhou3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Bioinformatics Interdepartmental Program,
3 Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, UCLA

Accurate variant calling is an integral part of cancer research and its clinical applications. Variant calling on paired-end next-generation sequencing data is used in analyses of both genomic DNA (gDNA) and cell-free DNA (cfDNA). Overlaps between paired-end read mates occur when the DNA fragment size is shorter than twice the read length, and can cause unreliable variant calls due to double-counting of variant alleles in overlapping regions. We quantified the extent of the impact of overlapping read mates on variant calling by merging paired-end sequencing data and comparing variant profiles between unmerged and merged data. We found that read merging resulted in fewer called variants, underscoring the impact of variant double-counting. The decrease in the number of called variants was more pronounced for cfDNA than for gDNA; this is consistent with the shorter fragment length of cfDNA. Our findings suggest that the impact of read mate overlaps on variant calling should be considered in future analyses, especially for cfDNA paired-end sequencing data.
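The overlap condition stated above (fragment shorter than twice the read length) can be written as a small function. This is a sketch that assumes both mates have the same read length and that reads longer than the fragment are trimmed to it. The fragment sizes in the example are illustrative, not values from the study.

```python
def mate_overlap(fragment_length, read_length):
    """Number of bases covered by both mates of a paired-end fragment."""
    if fragment_length <= 0 or read_length <= 0:
        raise ValueError("lengths must be positive")
    covered_by_each = min(read_length, fragment_length)
    return max(0, 2 * covered_by_each - fragment_length)

# Illustrative sizes: a short cfDNA-like fragment and a longer gDNA-like one.
print(mate_overlap(167, 150))   # 133
print(mate_overlap(400, 150))   # 0
```

Any variant allele that falls inside the overlapping bases is observed twice from the same molecule, which is the double-counting the abstract refers to.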


LOEFFLER: Database Issues for Microbial Genomes Confound Metagenomics Research

CAITLIN LOEFFLER1, Serghei Mangul2, Eleazar Eskin2,3, David Koslicki4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Computer Science, UCLA
3 Department of Human Genetics, David Geffen School of Medicine, UCLA
4 Mathematics Department, Oregon State University

There has been a recent shift in the field of microbiology, from culture-based experiments to culture-independent studies that take samples directly from the environment. This shift has led to the rise of novel metagenomics analysis approaches, which require a reference database in which environmental reads can be identified. There are many reference databases available, including those from NCBI, Ensembl, and JGI. We analyzed the presence of fungal genera in these three databases. Our analysis reveals that these reference databases have an alarmingly large area where there is no overlap; any one database is missing entire genera of references that are included in another database. Thus, a researcher looking for the most complete reference database available is required to combine multiple databases. Such database construction slows the advancement of the field of microbiology and metagenomics. Our work is a preliminary step in a larger project that is working towards identification of fungal reads from environmental samples to construct the Mycobiome.


LUO: Evaluating the Performance of Structural Variation Detecting Algorithms Using Whole Genome Sequencing Data from Families with Bipolar Disorder

JIANGYUAN LUO1, Huajun Zhou2, Jae Hoon Sul3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Biological Chemistry, David Geffen School of Medicine,
3 Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, UCLA

Structural variation (SV) affects ~1% of the human genome, which may contain deleterious genes related to neurological disorders. Various algorithms detect different sets of SVs, and recently several approaches have been developed to merge SV calls from these algorithms. Here we evaluated whether merging results from different SV algorithms generates more accurate SV calls than does each individual algorithm. Whole genome sequencing data of 15 trios relating to bipolar disorder were analyzed through a pipeline of 8 SV algorithms, and the results were combined using a merging algorithm, FusorSV. We found that the result merged by FusorSV detects more variations, with a slightly higher Mendelian error rate and false discovery rate than those of certain individual algorithms. This result suggests that more SVs can be detected when different algorithms are used to complement each other, although some of these SVs may have low quality.


LYTTLE, ROLLE: Transcriptomic Characterization of Vascular Cell Types During Aging

JANAE M. LYTTLE1, LYNDON ROLLE1, Milagros C. Romay2, M. Luisa Iruela-Arispe2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Molecular, Cell and Developmental Biology,
3 Molecular Biology Institute, UCLA

The world population is aging rapidly, and this brings a new challenge to health care. Cardiovascular disease continues to be a leading cause of death in this population. To elucidate the molecular changes that occur in the aging vasculature, we performed RNA-seq on a longitudinal cohort of mouse aortae ranging from 4 weeks to 78 weeks of age. Utilizing both cpm-based (edgeR) and count-based (DESeq2) transcriptional analysis, our findings reveal a striking increase in the expression of inflammatory genes and a decrease in transcripts associated with adhesion and fibrosis, processes known to contribute to age-associated cardiovascular disorders such as dementia. Taken together, these findings suggest key molecular processes that can modulate disease risk predominate the vascular transcriptome in an age-dependent manner. Findings support the concept that vascular aging itself is a crucial modulator in the development of cardiovascular diseases, independent of additional risk factors.


MILLER: Comparing mRNA Half-Life Estimates from 4sU-seq and ActD-seq Data

NICK MILLER1, Diane Lefaudeux2, Alexander Hoffmann2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Microbiology, Immunology and Molecular Genetics, UCLA

Gene expression is regulated by mRNA synthesis and decay, which determine how much and how long mRNAs can be translated. ActD-seq allows measurement of decay rates by inhibiting transcription, but it might lead to artifacts due to its toxicity. Another method, 4sU labeling, incorporates labeled uridine residues into newly synthesized transcripts without cell disturbance. Using 4sU data from stimulated macrophages, we obtained mRNA synthesis and decay rates with the DTA package by Schwalb et al., available on R’s Bioconductor. Despite well-correlated replicates, the calculated decay rates were poorly reproducible and largely inconsistent with previously obtained ActD-seq decay rates. This suggests that the 4sU approach is overly sensitive to small aberrations in experimental data. As such, we have developed an R Shiny user interface for ActD-seq data. The tool calculates mRNA half-lives with confidence estimates from user-inputted ActD-seq data.
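After transcription is blocked, a half-life can be estimated by fitting an exponential decay to the remaining mRNA abundance. The sketch below assumes simple first-order decay and an ordinary least-squares fit on log abundance. It is written in Python for illustration and is not the R Shiny tool described above. It omits the confidence estimates, and the time points are made up.

```python
import math

def half_life_from_decay(times, abundances):
    """Fit log(abundance) = a - k * t by least squares; return ln(2) / k."""
    logs = [math.log(a) for a in abundances]
    n = len(times)
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    slope = (sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
             / sum((t - t_mean) ** 2 for t in times))
    return math.log(2) / -slope          # slope is -k for a decaying transcript

# Made-up time course (hours after transcription block), true half-life of 2 h.
times = [0, 1, 2, 4]
abundances = [100 * 0.5 ** (t / 2) for t in times]
print(half_life_from_decay(times, abundances))
```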


MITCHELL: Comprehensive Benchmarking of Error Correction Methods for Short Reads

KEITH MITCHELL1, Serghei Mangul2, Igor Mandric3, Brian Hill2, Alex Zelikovsky3, Eleazar Eskin2,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences, UCLA
2 Department of Computer Science, UCLA
3 Department of Computer Science, Georgia State University
4 Department of Human Genetics, David Geffen School of Medicine, UCLA

Error-correction is an important computational technique that promises to improve the results of next-generation sequencing (NGS) analysis. The optimal application of these error-correction methods is often unclear because we lack standardized comparison platforms and ‘gold standard’ benchmarking datasets. Here we provide the first comprehensive assessment of error-correction algorithms (n=10) and demonstrate methods for producing ‘gold standard’ datasets for benchmarking. Using raw reads and high fidelity reads, our approach provides an accurate and robust baseline for performing a realistic evaluation of error correction on sequenced genomes. We assessed accuracy, precision, sensitivity, and gain of the error correction tools using data derived from simulated T and B-cell receptor repertoires, real immune repertoire data, and simulated whole-genome sequencing (WGS) data. Among the tools assessed, we observed that some of the error correction algorithms were able to produce a positive gain for immune repertoire data. However, the majority of error correction tools fail to have a positive gain for WGS and can prove detrimental to the raw reads.
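The abstract does not define its metrics, so the sketch below uses definitions that are common in error-correction benchmarking and should be read as assumptions: a true positive is an error that was fixed, a false positive is a correct base that was altered, a false negative is an error left in place, and gain is (TP - FP) / (TP + FN). The counts in the example are invented.

```python
def error_correction_metrics(tp, fp, fn, tn):
    """Summary metrics for an error-correction tool from base-level counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        # Gain is negative when a tool introduces more errors than it removes.
        "gain": (tp - fp) / (tp + fn),
    }

# Invented counts for a tool that damages more bases than it repairs.
metrics = error_correction_metrics(tp=80, fp=100, fn=20, tn=9800)
print(metrics)
```

Under these definitions a tool can have high accuracy and still have negative gain, which is how an error corrector can be detrimental to the raw reads.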


MOSHER, NWAUKWA: Prior Weights for Iterative Hard Thresholding

GORDON DAVID MOSHER1, MARCEL NWAUKWA1, Benjamin Chu2, Jeanette Papp4, Kenneth Lange2,3,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Biomathematics, David Geffen School of Medicine,
3 Department of Biostatistics, Fielding School of Public Health,
4 Department of Human Genetics, David Geffen School of Medicine, UCLA

Iterative hard thresholding (IHT) is a relatively new and promising algorithm for analyzing genome-wide association studies (GWAS), which aim to identify single nucleotide polymorphisms (SNPs) associated with a particular trait. In IHT, we can guide the parameter selection by assigning prior weights to individual SNPs, resulting in greater predictive power. We implement two prior weighting algorithms: one applies a constant weight at SNP locations of interest, and the other is based on minor allele frequency (MAF). These weights are applied to the variables in the IHT analysis by changing the least squares formula to weighted least squares. Tests demonstrated the ability to direct parameter selection upward, downward, or in both directions; in the case of MAF weighting, the weights are able to pull up biologically rare SNPs. Testing uncovered four rare SNPs that had previously been concealed. Future work will incorporate gene and pathway knowledge to identify SNPs of interest.
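As a rough sketch of the idea, IHT alternates a gradient step on the least-squares loss with a projection that keeps only the k largest coefficients. One plausible way for prior weights to enter is to scale coefficients before ranking them in the projection step. That placement is an assumption made here for illustration, and the authors' weighted-least-squares formulation may differ. All names and the simulated data are hypothetical.

```python
import numpy as np

def weighted_iht(X, y, k, weights=None, n_iter=200):
    """Iterative hard thresholding with optional per-SNP prior weights."""
    n, p = X.shape
    w = np.ones(p) if weights is None else np.asarray(weights, dtype=float)
    step = 1.0 / np.linalg.norm(X, 2) ** 2            # 1 / largest eigenvalue of X^T X
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta = beta + step * (X.T @ (y - X @ beta))   # gradient step
        keep = np.argsort(-np.abs(w * beta))[:k]      # rank by weighted magnitude
        pruned = np.zeros(p)
        pruned[keep] = beta[keep]                     # keep the top k, zero the rest
        beta = pruned
    return beta

# Noiseless toy problem: 3 causal predictors out of 40 (simulated, not SNP data).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 40))
beta_true = np.zeros(40)
beta_true[[0, 5, 10]] = [3.0, -2.0, 1.5]
y = X @ beta_true
beta_hat = weighted_iht(X, y, k=3)
print(np.flatnonzero(beta_hat))
```

Passing a weight greater than 1 for a SNP makes it more likely to survive the top-k cut, and a weight below 1 makes it less likely.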


NANFITO: Stem Cell Derived Rett Brain Organoids Demonstrate Developmental Differences in Interneuron Migration

BRANDON NANFITO1, Ranmal Samarasinghe2,3,4, Osvaldo Miranda2,4, Bennett Novitch2,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Neurobiology, David Geffen School of Medicine,
3 Department of Neurology, David Geffen School of Medicine,
4 Eli and Edythe Broad Center for Regenerative Medicine and Stem Cell Research, UCLA

Pluripotent stem-cell-derived human brain organoids are an emerging three-dimensional technology with the potential to provide novel insights into neurological disease. We produced human brain organoids derived from induced pluripotent stem cells obtained from a patient with Rett syndrome. Rett syndrome is a neuro-regressive disease associated with a number of disorders, including intellectual and motor disability and epilepsy. Interneuron dysfunction is thought to be critical to this phenotype, and our preliminary data suggest that Rett organoids have slower rates of cellular migration, which correlate with fewer migrated interneurons than in isogenic controls at certain time points. Extracellular recordings and two-photon calcium imaging demonstrate epileptiform-like changes that may be a consequence of the observed cell migration changes. Despite slower migration in Rett organoids, their interneuron numbers match those of control organoids past a certain age. Future studies will therefore focus on further delineating the effect of interneuron pathology on ictogenic changes in Rett brain organoids.


NKRUMAH, SCHIMKE: Verifying Identity of DNA and RNA Specimens Through SNP Analysis

SAMUEL NKRUMAH1, KAYLA SCHIMKE1, Alden Huang2, Hane Lee2, Stanley Nelson2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Pathology and Laboratory Medicine, David Geffen School of Medicine,
3 Department of Human Genetics, David Geffen School of Medicine, UCLA

Clinical molecular laboratories have incorporated next-generation sequencing of genomic DNA as an essential component of diagnosing rare congenital disorders. Additionally, there has been increased interest in performing whole-transcriptome profiling of patient samples to further inform variant interpretation and, ultimately, improve diagnostic yield. As this practice becomes more commonplace, reliable methods of unequivocally identifying individuals from multi-omics data are required. Here we describe a computational workflow to fingerprint samples from whole-genome, whole-exome, and whole-transcriptome sequencing data derived from both blood and fibroblast samples. Our method is accurate, efficient, and easily incorporated into existing pipelines.
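
The abstract does not spell out how samples are fingerprinted. One plausible approach, sketched here purely as an assumption, is to compare genotypes at SNP sites covered in both samples and report the fraction that agree. The function name, genotype coding, and `min_sites` cutoff are all hypothetical.

```python
def genotype_concordance(sample_a, sample_b, min_sites=3):
    """Fraction of shared SNP sites at which two samples have the same genotype.

    Each sample maps a site key such as ("chr1", 12345) to a genotype
    coded as 0, 1, or 2 copies of the alternate allele.
    """
    shared = sample_a.keys() & sample_b.keys()
    if len(shared) < min_sites:
        return None  # too few overlapping sites to make a call
    matches = sum(sample_a[site] == sample_b[site] for site in shared)
    return matches / len(shared)


# Made-up genotypes: one DNA sample and two RNA samples.
dna = {("chr1", 100): 0, ("chr1", 200): 1, ("chr2", 50): 2, ("chr3", 10): 1}
rna_same = {("chr1", 100): 0, ("chr1", 200): 1, ("chr2", 50): 2}
rna_other = {("chr1", 100): 2, ("chr1", 200): 0, ("chr2", 50): 2, ("chr3", 10): 1}

same_score = genotype_concordance(dna, rna_same)
other_score = genotype_concordance(dna, rna_other)
```

A real workflow would also need to handle sites that are covered in DNA but poorly expressed in RNA, which is why the sketch restricts the comparison to shared sites.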


PERRY: Using Clustering Algorithms to Merge Multi-Sample Histone Modification Regions

DANIELA PERRY1, Malika Kumar Freund2, Bogdan Pasaniuc2,3

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Human Genetics, David Geffen School of Medicine,
3 Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, UCLA

ChIP-seq has emerged as a powerful tool for understanding gene regulation, and its results can be used to map protein-DNA interactions to locations across the genome. Currently, in analyses comparing multiple ChIP-seq samples, no consensus exists on how to consistently identify the presence of particular protein features (“peaks”) at different loci. Existing methods merge ChIP-seq peaks with overlapping start and end sites, yielding biologically nonsensical results. As an alternative, we propose grouping peaks with a clustering scheme. Using ChIP-seq to identify H3K27ac marks in healthy ovarian surface epithelial cells from 54 women, we compare the regulatory regions found using existing methods to those found using our method. We observe different peak distributions when we apply no transformation, merge overlapping peaks, and apply clustering techniques. Using clustering to identify ChIP-seq peaks is a promising way to identify biologically meaningful regulatory regions across the genome.
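
The contrast between merging overlapping peaks and clustering them can be illustrated with a small sketch. The abstract does not name the clustering algorithm, so the midpoint-gap grouping below is only a stand-in, and the function names, the `max_gap` parameter, and the peak coordinates are assumptions.

```python
def merge_overlapping(peaks):
    """Union any peaks whose intervals overlap (the conventional merge)."""
    merged = []
    for start, end in sorted(peaks):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def cluster_by_midpoint(peaks, max_gap):
    """Group peaks whose midpoints lie within max_gap of the previous midpoint."""
    if not peaks:
        return []
    ordered = sorted(peaks, key=lambda p: (p[0] + p[1]) / 2)
    clusters = [[ordered[0]]]
    for peak in ordered[1:]:
        prev = clusters[-1][-1]
        if (peak[0] + peak[1]) / 2 - (prev[0] + prev[1]) / 2 <= max_gap:
            clusters[-1].append(peak)
        else:
            clusters.append([peak])
    return clusters


# Made-up peaks pooled from several samples; one broad peak bridges two loci.
peaks = [(100, 300), (120, 310), (290, 900), (880, 1100), (890, 1090)]
merged = merge_overlapping(peaks)
clusters = cluster_by_midpoint(peaks, max_gap=50)
```

Here a single broad peak chains everything into one 1,000 bp merged region, while grouping by midpoint keeps the two narrow loci separate from the broad peak.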


SARWAL: Comprehensive Benchmarking of Structural Variant Callers

VARUNI SARWAL1, Serghei Mangul2, Russell Jared Littman2, Ram Ayyala1, Emily Wesel1, Jacqueline Castellanos1, Margaret Distler3, Eleazar Eskin2,4, and Jonathan Flint3 

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Computer Science,
3 Semel Institute for Neuroscience and Human Behavior,
4 Department of Human Genetics, David Geffen School of Medicine, UCLA

Structural variants (SVs), including insertions, deletions, and duplications, vary in their pathogenicity. Currently, no benchmarking studies are available for SV detection, making it difficult to choose appropriate SV discovery software. To determine which tools offer the best sensitivity and precision, we compared the ability of over 30 SV tools to detect deletions. We assessed the performance of these tools on mouse and human data in which all SVs have been confirmed by PCR. We subsampled our data to varying coverage (from 80x down to 1x) to determine the effect of coverage on each tool’s performance. Generally, as coverage increased, precision decreased and sensitivity increased. We observed that deletion length had a dramatic effect on each tool’s accuracy. CLEVER and Sniffles had the best balance of sensitivity and precision. Combining the results of various tools for different deletion length ranges produced a pseudo-tool that outperformed all individual tools in sensitivity and precision. Our recommendations can help researchers choose the best SV detection software.
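
To make the sensitivity and precision comparison concrete, the sketch below scores a set of called deletions against a validated set. The matching rule (both breakpoints within a fixed tolerance) is an assumption, since the abstract does not state how calls were matched to confirmed deletions, and the function name and coordinates are made up.

```python
def evaluate_deletions(called, truth, tolerance=100):
    """Match called deletions to validated ones and score the caller.

    Deletions are (chrom, start, end) tuples. A call counts as correct when
    it lies on the same chromosome as a not-yet-matched true deletion and
    both breakpoints fall within `tolerance` base pairs.
    """
    unmatched = list(truth)
    true_pos = 0
    for chrom, start, end in called:
        for t in unmatched:
            if (t[0] == chrom and abs(t[1] - start) <= tolerance
                    and abs(t[2] - end) <= tolerance):
                unmatched.remove(t)  # each true deletion is matched once
                true_pos += 1
                break
    false_pos = len(called) - true_pos
    false_neg = len(unmatched)
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    sensitivity = true_pos / (true_pos + false_neg) if truth else 0.0
    return precision, sensitivity


truth = [("chr1", 1000, 1500), ("chr1", 5000, 9000), ("chr2", 300, 350)]
called = [("chr1", 1010, 1490), ("chr1", 5000, 7000),
          ("chr2", 280, 360), ("chr3", 10, 500)]
precision, sensitivity = evaluate_deletions(called, truth)
```

Running the same function on calls restricted to a given deletion length range would give the per-range scores needed to compare tools by deletion size.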


SCHULTE: Identifying Differential Splicing Patterns in Human Subcutaneous Adipose Tissue after Bariatric Surgery-Induced Weight Loss

GRANT SCHULTE1, Marcus Alvarez2, Zong Miao3, Päivi Pajukanta2,4

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Human Genetics, David Geffen School of Medicine,
3 Bioinformatics Interdepartmental Program,
4 Molecular Biology Institute, UCLA

Obesity is a major health concern in the United States, with nearly 40% of Americans obese and 70% overweight. To better understand the mechanisms of weight loss, we analyzed 262 and 168 subcutaneous adipose tissue samples taken during and one year after bariatric surgery, respectively, from the Finnish Kuopio Obesity Bariatric Surgery (KOBS) cohort. We hypothesized that RNA splicing patterns in adipose tissue would change as a result of this substantial surgery-induced weight loss. To test this hypothesis, we performed RNA-Seq and quantified isoform expression using Kallisto, an ultra-efficient splice-aware pseudoaligner. We then performed differential expression analysis using the Sleuth R package, correcting for age, sex, and the first three expression principal components. We identified several genes that exhibited differential splicing in adipose tissue after weight loss. To determine whether weight loss affects specific metabolic pathways, we plan to use DAVID to identify functional enrichments. In conclusion, we find evidence of differential isoform usage after weight loss in several genes expressed in subcutaneous adipose tissue.
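
The notion of differential isoform usage can be illustrated by converting transcript abundances into within-gene fractions and comparing them across time points. This sketch is not the Kallisto or Sleuth workflow described above; the function name, transcript and gene identifiers, and abundance values are all made up.

```python
from collections import defaultdict


def isoform_usage(tpm, transcript_to_gene):
    """Convert transcript-level abundances into within-gene usage fractions."""
    gene_total = defaultdict(float)
    for transcript, value in tpm.items():
        gene_total[transcript_to_gene[transcript]] += value
    usage = {}
    for transcript, value in tpm.items():
        total = gene_total[transcript_to_gene[transcript]]
        usage[transcript] = value / total if total > 0 else 0.0
    return usage


# Hypothetical abundances for one individual at surgery and one year later.
t2g = {"tx1": "GENE_A", "tx2": "GENE_A", "tx3": "GENE_B"}
at_surgery = isoform_usage({"tx1": 30.0, "tx2": 10.0, "tx3": 5.0}, t2g)
one_year = isoform_usage({"tx1": 10.0, "tx2": 30.0, "tx3": 5.0}, t2g)
usage_shift = {tx: one_year[tx] - at_surgery[tx] for tx in t2g}
```

In this toy case the total expression of GENE_A is unchanged, yet its dominant isoform switches, which is the kind of change a gene-level expression test would miss.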


TAM: Modeling NFκB–inducible Gene Expression to Quantify Transactivation Potential of NFκB

AMY TAM1, Simon Mitchell2, Alexander Hoffmann2

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Department of Microbiology, Immunology and Molecular Genetics, UCLA

NFκB/RelA is a potent transcription factor that activates hundreds of immune response genes. It possesses two distinct transactivation domains, TA1 and TA2. RNA-sequencing data for target genes show that both transactivation domains are necessary to induce expression of some genes, but one is sufficient for others. To quantify the transactivation potential of each domain with gene-specific resolution, we developed a kinetic mathematical model of RelA-dependent gene expression and a workflow to fit the model to the available RNA-sequencing data. This approach enables quantification of the transactivation potential of each domain for individual target genes. We will report our findings on the target gene-specific functions of TA1 and TA2.
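
The abstract does not give the model equations. As a minimal sketch under the assumption of first-order kinetics, mRNA could follow dm/dt = k_syn * NFκB(t) - k_deg * m, with k_syn standing in for transactivation potential. The function names, parameter values, and input time course below are hypothetical.

```python
def simulate_mrna(nfkb_activity, k_syn, k_deg, dt):
    """Euler integration of dm/dt = k_syn * nfkb(t) - k_deg * m, with m(0) = 0."""
    m = 0.0
    trajectory = []
    for activity in nfkb_activity:
        trajectory.append(m)
        m += dt * (k_syn * activity - k_deg * m)
    return trajectory


def fit_k_syn(nfkb_activity, observed, k_deg, dt):
    """Least-squares estimate of k_syn, using that the output is linear in k_syn."""
    unit = simulate_mrna(nfkb_activity, 1.0, k_deg, dt)
    denom = sum(u * u for u in unit)
    return sum(u * o for u, o in zip(unit, observed)) / denom


# Synthetic example: high nuclear NFkB activity followed by a lower plateau.
activity = [1.0] * 50 + [0.2] * 50
observed = simulate_mrna(activity, k_syn=2.5, k_deg=0.5, dt=0.1)
estimated = fit_k_syn(activity, observed, k_deg=0.5, dt=0.1)
```

Fitting k_syn separately for each gene and each RelA variant (for example, with one or both transactivation domains disrupted) would be one way to obtain gene-specific estimates of each domain's contribution.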


WEI: Enrichment of SNP-Heritability of Psychiatric Disorders in Gene and Isoform Modules


ANGELA WEI1, Kathryn Burch2, Claudia Giambartolomei3, Michael Gandal4, Bogdan Pasaniuc3,5,6

1 BIG Summer Program, Institute for Quantitative and Computational Biosciences,
2 Bioinformatics Interdepartmental Program,
3 Department of Pathology and Laboratory Medicine, David Geffen School of Medicine,
4 Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine,
5 Department of Human Genetics, David Geffen School of Medicine,
6 Department of Biomathematics, David Geffen School of Medicine, UCLA

Gene co-expression networks, or modules, are groups of genes that share similar expression patterns. Integrating gene modules with large-scale genetic data can help uncover mechanisms underlying complex diseases, such as psychiatric disorders. Here, we integrate 34 gene and 56 isoform modules obtained from cerebral cortex samples (n=1,322) of patients with autism (ASD), schizophrenia (SCZ), and bipolar disorder (BIP) with genome-wide association studies (GWAS) of BIP (n=46,918), SCZ (n=105,318), and ASD (n=10,610). To quantify the enrichment of SNP-heritability within each module, we define binary annotations that mark SNPs within a window around the gene bodies of module genes as 1 and all other SNPs as 0, and use stratified linkage disequilibrium score regression to examine whether SNPs inside each module have larger effects than expected by chance. At FDR < 0.05, 31 isoform and 16 gene modules were significantly enriched for SCZ, and 16 isoform and 4 gene modules for BIP; no modules were significantly enriched for ASD.
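
The binary module annotation described above can be sketched in a few lines. The window size is not stated in the abstract, so the 10 kb default is an assumption, as are the function name and the example coordinates.

```python
def annotate_snps(snps, module_genes, window=10_000):
    """Return 1 for each SNP within `window` bp of any module gene body, else 0.

    snps: list of (chrom, position)
    module_genes: list of (chrom, gene_start, gene_end)
    """
    annotation = []
    for chrom, pos in snps:
        in_module = any(
            g_chrom == chrom and g_start - window <= pos <= g_end + window
            for g_chrom, g_start, g_end in module_genes
        )
        annotation.append(int(in_module))
    return annotation


# Made-up coordinates for a two-gene module and four SNPs.
genes = [("chr1", 50_000, 60_000), ("chr2", 1_000, 5_000)]
snps = [("chr1", 45_000), ("chr1", 75_000), ("chr2", 5_500), ("chr3", 2_000)]
module_annotation = annotate_snps(snps, genes)
```

One such 0/1 vector per module would then serve as the annotation whose heritability enrichment is tested.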