TITLE: “Sources of zeros in single-cell RNA-seq data and how they affect data analysis.”
ABSTRACT: Single-cell RNA sequencing (scRNA-seq) technologies have revolutionized biomedical sciences by enabling genome-wide profiling of gene expression levels at an unprecedented single-cell resolution. A distinct characteristic of scRNA-seq data is its large proportion of zeros, far higher than that seen in bulk RNA-seq data. Researchers view these zeros differently: some regard them as biological signals representing no or low gene expression, while others regard them as false signals or missing data to be corrected. As a result, the scRNA-seq field faces much controversy over how to handle zeros in data analysis. In this paper, we first discuss the sources of biological and non-biological zeros in scRNA-seq data. Second, we summarize the advantages, disadvantages, and suitable users of three input data types: original counts, imputed counts, and binarized counts. Third, we evaluate the impacts of non-biological zeros on cell clustering and differential gene expression analysis. Finally, we discuss the open questions regarding non-biological zeros, the need for benchmarking, and the importance of transparent analysis.
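
To make two of the quantities mentioned above concrete, the following is a minimal sketch, not taken from the paper, of how the proportion of zeros and a binarized version of a count matrix might be computed. The toy matrix, its values, and the variable names (`counts`, `zero_fraction`, `binarized`, `detection_rate`) are all assumptions made for illustration. The sketch does not attempt to separate biological from non-biological zeros, and it does not perform imputation.

```python
import numpy as np

# Toy cells-by-genes count matrix (hypothetical values, for illustration only).
counts = np.array([
    [0, 3, 0, 12],
    [5, 0, 0, 7],
    [0, 0, 1, 0],
])

# Proportion of zero entries across the whole matrix.
zero_fraction = float(np.mean(counts == 0))

# Binarized counts: 1 where a gene has a nonzero count in a cell, 0 otherwise.
binarized = (counts > 0).astype(int)

# Per-gene detection rate: fraction of cells with a nonzero count.
detection_rate = binarized.mean(axis=0)

print(f"zero fraction: {zero_fraction:.3f}")
print("binarized counts:")
print(binarized)
print("per-gene detection rate:", detection_rate)
```

Binarization keeps only whether a gene was detected in a cell, so any information carried by the magnitude of nonzero counts is discarded. Whether that trade-off is acceptable depends on the downstream analysis.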