Best Practices for Differential Accessibility Analysis in Single-Cell Epigenomics

As a seasoned IT professional, well-versed in providing practical tips and in-depth insights on technology, computer repair, and IT solutions, I am excited to share my expertise on the topic of differential accessibility (DA) analysis in single-cell epigenomics.

Understanding the Landscape of Single-Cell Epigenomics

The remarkable diversity of cell types and tissues that compose the human body arises from a single genome. This diversity is orchestrated by cell-type- and context-specific epigenetic programs that regulate the accessibility of specific regions of the genome. Epigenetic regulatory programs also choreograph tissue- and cell-type-specific responses to the myriad endogenous and exogenous perturbations that humans encounter over their lifetime.

The advent of single-cell assays, such as the assay for transposase-accessible chromatin using sequencing (ATAC-seq), has enabled the discovery of these regulatory programs at an unprecedented scale and resolution. Landmark single-cell ATAC-seq (scATAC-seq) studies have established atlases of chromatin accessibility during fetal development, throughout the nervous system, and even across the entire human body.

However, the rapid evolution and widespread application of scATAC-seq technology have also exposed a lack of consensus on how to analyze the resulting data. Even fundamental questions, such as whether chromatin accessibility should be considered a qualitative or quantitative measurement, remain debated. Arguably the most important of these questions is how to identify differentially accessible (DA) regions of the genome, as DA analysis is the methodological framework that permits the discovery of regulatory programs directing cell identity and perturbation responses.

Evaluating the Accuracy of Single-Cell DA Methods

To address this challenge, we undertook a comprehensive survey of the literature to chart the landscape of statistical methods that have been used to perform DA analysis in scATAC-seq data. Our analysis revealed that the Wilcoxon rank-sum test is the most widely used DA method, but no single method is used in more than 15% of published studies. Moreover, we observed substantial disagreement on fundamental principles of DA analysis, such as whether or not to binarize measures of genome accessibility.

To systematically evaluate the accuracy of these diverse DA methods, we leveraged a compendium of scATAC-seq datasets with matching bulk ATAC-seq or scRNA-seq data. By using the bulk or scRNA-seq data as a reference, we were able to assess the concordance between single-cell and bulk DA analyses for each of the 11 most widely used DA methods.

Our primary analysis revealed that most DA methods achieved comparable performance, with relatively small differences separating the ten top-performing methods. Notably, methods that aggregated cells within biological replicates to form ‘pseudobulks’ consistently ranked near the top. In contrast, negative binomial regression and a previously described permutation test were outliers that achieved substantially lower concordance to the bulk data than other DA methods.
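The pseudobulk idea mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical data layout (a dictionary of per-cell peak counts and a dictionary mapping each cell to its biological replicate); the function name `pseudobulk` and the toy values are illustrative and are not taken from any particular package.

```python
def pseudobulk(counts, replicate_of):
    """Sum per-cell peak counts within each biological replicate.

    counts:       dict of cell barcode -> list of per-peak counts (assumed layout)
    replicate_of: dict of cell barcode -> replicate label (assumed layout)
    Returns a dict of replicate label -> list of summed per-peak counts.
    """
    totals = {}
    for cell, row in counts.items():
        rep = replicate_of[cell]
        if rep not in totals:
            totals[rep] = [0] * len(row)
        acc = totals[rep]
        for i, value in enumerate(row):
            acc[i] += value
    return totals


# Toy example: four cells, three peaks, two replicates.
cells = {
    "c1": [1, 0, 2],
    "c2": [0, 0, 1],
    "c3": [3, 1, 0],
    "c4": [0, 1, 0],
}
reps = {"c1": "rep1", "c2": "rep1", "c3": "rep2", "c4": "rep2"}
pb = pseudobulk(cells, reps)
print(pb)
```

The resulting replicate-by-peak table is what a bulk-style statistical test would then operate on, so the unit of replication becomes the biological replicate rather than the individual cell.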

We conducted a series of sensitivity analyses to test the robustness of these observations. Importantly, we found that the relative performance of single-cell DA methods was largely unchanged when applying different DA approaches to establish the experimental ground truth within the bulk data, when varying the number of top-ranked DA peaks used to calculate the concordance, or when filtering peaks that were not accessible in at least 5% of cells.
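One way to quantify concordance between two rankings of peaks, and to see how it depends on the number of top-ranked peaks considered, is an area under a concordance curve. The sketch below is an assumption about how such a score could be computed, not a description of the exact metric used in the analysis; the function names and peak identifiers are hypothetical.

```python
def concordance_curve(ranked_a, ranked_b, k_max):
    """For each k in 1..k_max, count features shared by the top-k of both rankings."""
    if k_max > min(len(ranked_a), len(ranked_b)):
        raise ValueError("k_max exceeds the length of a ranking")
    return [len(set(ranked_a[:k]) & set(ranked_b[:k])) for k in range(1, k_max + 1)]


def area_under_concordance_curve(ranked_a, ranked_b, k_max):
    """Area under the concordance curve, scaled so identical rankings score 1.0."""
    curve = concordance_curve(ranked_a, ranked_b, k_max)
    # Identical rankings share k features at every k, giving k_max * (k_max + 1) / 2.
    best = k_max * (k_max + 1) / 2
    return sum(curve) / best


# Hypothetical peak rankings, most significant first.
single_cell_rank = ["p1", "p2", "p3", "p4"]
bulk_rank = ["p1", "p3", "p2", "p5"]

curve = concordance_curve(single_cell_rank, bulk_rank, 4)
score = area_under_concordance_curve(single_cell_rank, bulk_rank, 4)
print(curve, score)
```

Changing `k_max` corresponds to the sensitivity analysis of varying how many top-ranked DA peaks enter the concordance calculation.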

Leveraging Multimodal Data for Improved DA Analysis

To further validate our findings, we also leveraged a collection of single-cell multimodal datasets that profiled both chromatin accessibility and gene expression within the same individual cells. By aggregating chromatin accessibility around promoters into gene-level activity scores, we were able to compare DA analysis of the ATAC data to differential expression (DE) analysis of the matching RNA data.
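As a rough illustration of a gene-level activity score, the sketch below sums the counts of all peaks that overlap a promoter window. The promoter coordinates, the gene names, and the choice to use a plain sum are assumptions made for the example; the text does not specify the window size or weighting actually used.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def gene_activity(peaks, peak_counts, promoters):
    """Sum counts of peaks overlapping each gene's promoter window.

    peaks:       list of (chrom, start, end) tuples
    peak_counts: list of counts, one per peak (for a single cell or pseudobulk)
    promoters:   dict of gene -> (chrom, start, end) promoter window (assumed)
    """
    scores = {}
    for gene, (chrom, start, end) in promoters.items():
        total = 0
        for (p_chrom, p_start, p_end), count in zip(peaks, peak_counts):
            if p_chrom == chrom and overlaps(p_start, p_end, start, end):
                total += count
        scores[gene] = total
    return scores


# Hypothetical peaks, counts, and promoter windows.
peaks = [("chr1", 900, 1400), ("chr1", 5000, 5500), ("chr2", 100, 600)]
peak_counts = [4, 7, 2]
promoters = {
    "GENE_A": ("chr1", 0, 1000),
    "GENE_B": ("chr2", 500, 2500),
    "GENE_C": ("chr1", 2000, 3000),
}
scores = gene_activity(peaks, peak_counts, promoters)
print(scores)
```

Once accessibility is expressed per gene in this way, DA results on the ATAC data can be ranked and compared against DE results for the same genes in the matching RNA data.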

In this analysis, we identified a more pronounced difference between DA methods that aggregated cell-level chromatin accessibility profiles into ‘pseudobulks’ and those that did not. Specifically, the pseudobulk methods outperformed the other approaches, achieving the highest concordance between DA of chromatin accessibility and DE of gene expression.

We conducted a series of sensitivity analyses to confirm the robustness of this result. Importantly, we found that the relative performance of single-cell DA methods was largely unchanged when accounting for genes with overlapping promoter regions, when varying the number of top-ranked DA genes used to calculate the concordance, or when filtering genes whose promoters were not accessible in at least 1% of cells.

Taken together, these experiments emphasize that many widely used single-cell DA methods can produce thousands of false discoveries, particularly when sequencing depth or the number of profiled cells is limited. In contrast, DA methods that aggregate cells within replicates to form pseudobulks showed a markedly improved capacity to avoid false discoveries.

Mitigating Biases in Single-Cell DA Analysis

In addition to evaluating the accuracy and false discovery control of single-cell DA methods, we also explored potential biases in their performance. Specifically, we hypothesized that single-cell DA methods may preferentially identify peaks that are open in a larger proportion of cells, peaks that are supported by a greater number of sequencing reads, or peaks that are wider.

Our analysis confirmed these hypotheses, revealing that methods that treated chromatin accessibility as a quantitative measurement (t-test, Wilcoxon rank-sum test, negative binomial regression, LRclusters) preferentially called peaks supported by a greater number of reads and open in a greater number of cells as being DA. In contrast, methods that treated accessibility as a binary phenotype (Fisher’s exact test, LRpeaks, binomial test, permutation test), as well as pseudobulk DA methods, exhibited less bias towards highly accessible peaks.
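To make the binary treatment of accessibility concrete, here is a minimal sketch that binarizes per-cell counts for one peak and applies a two-sided Fisher's exact test to the resulting 2x2 table. It is written with the standard library only and is an illustration under stated assumptions (toy counts, one common definition of the two-sided p-value), not the implementation used by any of the evaluated methods.

```python
from math import comb


def binarize(counts):
    """Treat any nonzero count as 'open' (1) and zero as 'closed' (0)."""
    return [1 if c > 0 else 0 for c in counts]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups of cells; columns are open / closed.
    The p-value sums all tables with the same margins that are no more
    probable than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def weight(x):
        # Unnormalized hypergeometric probability of x open cells in group 1.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / denom


# Toy counts for one peak in two groups of five cells each.
group_a = binarize([2, 1, 0, 3, 1])   # 4 open, 1 closed
group_b = binarize([0, 0, 1, 0, 0])   # 1 open, 4 closed
open_a, open_b = sum(group_a), sum(group_b)
p_value = fisher_exact_two_sided(
    open_a, len(group_a) - open_a, open_b, len(group_b) - open_b
)
print(p_value)
```

Because the test only sees whether each cell is open or closed, a cell with three reads in the peak contributes exactly as much as a cell with one, which is one intuition for why binary methods are less drawn to peaks with high read counts.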

We also found that negative binomial regression consistently called wider peaks as DA, in contrast to the other methods evaluated. Importantly, we observed that these biases were not limited to the top-ranked DA peaks, but were also reflected in the false discoveries identified in our earlier experiments.

Navigating the Complexities of Differential Accessibility Analysis

Our systematic evaluation of single-cell DA methods has revealed several important considerations for analysts seeking to perform accurate and robust DA analysis of scATAC-seq data:

  1. Aggregation Matters: Methods that aggregate cells within replicates to form ‘pseudobulks’ consistently outperformed other approaches in terms of biological accuracy, false discovery control, and computational scalability.

  2. Binarization Can Improve Performance: Contrary to the argument that binarization discards important quantitative information, we found that binarizing scATAC-seq data can actually improve the biological accuracy of DA analysis by mitigating biases towards highly accessible peaks.

  3. Normalization Strategies Vary in Impact: While most DA methods operate directly on count matrices, those that rely on normalized data (t-test, Wilcoxon rank-sum test, LRclusters) were more sensitive to the choice of normalization approach.

  4. Accounting for Technical Variation is Crucial: Failing to properly account for technical variation between replicates can lead to an inflated false discovery rate, underscoring the importance of methods that model this source of variation.

To empower users to apply these best practices, we have translated our understanding into an R package, Libra, that implements optimized workflows for DA analysis of scATAC-seq data. Libra seamlessly integrates with popular single-cell analysis frameworks, such as Seurat and Scanpy, to provide a robust and scalable solution for accurate differential accessibility analysis.

As the field of single-cell epigenomics continues to evolve, we anticipate that the insights and recommendations provided in this article will serve as a valuable resource for both novice and experienced analysts navigating the complexities of this rapidly advancing technology. By embracing these best practices, researchers will be better equipped to uncover the regulatory programs that underlie cellular identity and response to perturbations, ultimately advancing our understanding of health and disease.

To learn more about Libra and explore our other IT solutions, I encourage you to visit the IT Fix website. Our team of experts is dedicated to providing practical, cutting-edge guidance to help you stay ahead of the curve in the ever-changing world of technology.
