
Benchmarking second-generation methods for cell-type deconvolution of bulk tissue transcriptomes

November 7, 2024

Understanding the Power of Cell-Type Deconvolution

In silico cell-type deconvolution estimates the cellular composition of complex tissues from bulk transcriptomics data. First-generation deconvolution methods relied on precomputed expression signatures covering a limited range of cell types and tissues. Second-generation tools are more flexible: they use single-cell RNA sequencing (scRNA-seq) data to build custom signatures that can deconvolve arbitrary cell types, tissues, and organisms.

This flexibility, however, makes it harder to assess the deconvolution performance of second-generation tools accurately. Researchers at the University of Innsbruck conducted a comprehensive benchmark study to disentangle the sources of variation and bias that can affect deconvolution results. Using a diverse panel of real and simulated data, the study sheds light on the strengths, limitations, and complementarity of state-of-the-art deconvolution methods, and on how data characteristics and confounders influence their performance.

The omnideconv Ecosystem: A Unified Approach to Deconvolution

To simplify the usage of second-generation deconvolution methods, the researchers have developed the omnideconv ecosystem, which includes several key components:

omnideconv R Package: Provides a unified interface to multiple R- and Python-based deconvolution methods, harmonizing their workflows, input/output data, and semantics. Users can run a deconvolution analysis with just one or two simple commands.
SimBu: A pseudo-bulk simulation method that enables the efficient generation of synthetic bulk RNA-seq datasets, allowing the researchers to model cell-type-specific mRNA levels and other important biases that deconvolution methods need to account for.
deconvData: A data repository that provides access to the real and simulated RNA-seq samples, as well as the matched ground-truth cell fractions used in the benchmark study.
deconvBench: A Nextflow pipeline that reproduces and extends the presented benchmarking study, ensuring reproducibility and reusability.
deconvExplorer: A web application that allows users to interactively investigate deconvolution signatures and results.

The omnideconv ecosystem aims to aid researchers in deconvolving RNA-seq samples more easily, while also offering guidance for method selection in different scenarios. Its flexibility also allows for easy extension, enabling the inclusion of upcoming deconvolution methods and the benchmarking and optimization of these tools for specific applications.
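To make the idea of pseudo-bulk simulation concrete, here is a minimal Python sketch. It is not SimBu's actual API or algorithm. It assumes that a pseudo-bulk sample is built by drawing annotated single cells according to target fractions and summing their counts, with an optional per-cell-type scaling factor standing in for mRNA content. The function name `simulate_pseudobulk`, its parameters, and the toy counts are all hypothetical.

```python
import random


def simulate_pseudobulk(cells, fractions, n_cells=100, scaling=None, seed=0):
    """Sum sampled single-cell profiles into one pseudo-bulk profile.

    cells:     dict mapping cell type -> list of per-cell count vectors
    fractions: dict mapping cell type -> target cell fraction (should sum to 1)
    scaling:   optional dict mapping cell type -> mRNA scaling factor
               (hypothetical parameter, defaults to 1.0 for every type)
    """
    rng = random.Random(seed)
    n_genes = len(next(iter(cells.values()))[0])
    bulk = [0.0] * n_genes
    for cell_type, fraction in fractions.items():
        n_draw = round(fraction * n_cells)  # number of cells drawn for this type
        factor = 1.0 if scaling is None else scaling.get(cell_type, 1.0)
        for _ in range(n_draw):
            profile = rng.choice(cells[cell_type])  # sample with replacement
            for gene, count in enumerate(profile):
                bulk[gene] += factor * count
    return bulk


# Toy reference: two genes, two cell types (made-up counts).
toy_cells = {"T cell": [[10, 0], [10, 0]], "B cell": [[0, 5]]}
toy_fractions = {"T cell": 0.5, "B cell": 0.5}

unscaled = simulate_pseudobulk(toy_cells, toy_fractions, n_cells=10)
scaled = simulate_pseudobulk(toy_cells, toy_fractions, n_cells=10,
                             scaling={"T cell": 2.0})
```

Because the target fractions are set by construction, they can serve as ground truth when scoring a deconvolution method on the simulated sample.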

Benchmarking Second-Generation Deconvolution Methods

The researchers selected eight second-generation deconvolution methods for their comprehensive benchmark study: AutoGeneS, BayesPrism, Bisque, CIBERSORTx, DWLS, MuSiC, SCDC, and Scaden. These methods were chosen as they:

Leverage annotated scRNA-seq data to directly perform cell-type deconvolution.
Do not strictly require context-specific marker genes.
Provide deconvolution results in the form of relative cell fractions.

The benchmark study was designed to address several key challenges in cell-type deconvolution, including:

Cell-type-specific mRNA bias: Cell types with higher overall mRNA content per cell tend to be overestimated, while those with lower content tend to be underestimated.
Unknown cellular content: Cell types present in the bulk RNA-seq data but not in the scRNA-seq reference used for method training.
Transcriptional similarity between cell types: Part of one cell type's fraction can be wrongly attributed to a transcriptionally similar cell type (“spillover”).
Variation in scRNA-seq data: Differences in technology, tissue, and disease context.

The researchers also assessed the impact of the resolution of cell type annotations and how deconvolution performance and computational scalability are affected by the size of the reference scRNA-seq data.
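The cell-type-specific mRNA bias listed above can be illustrated with a small worked example. The Python sketch below uses made-up numbers and hypothetical helper names (`mrna_fractions`, `correct_mrna_bias`). It assumes that a method which ignores mRNA content effectively reports each cell type's share of total mRNA rather than its share of cells, and that the per-cell mRNA content is known for the correction step.

```python
def mrna_fractions(cell_fractions, mrna_per_cell):
    """Convert cell fractions into each type's share of total mRNA."""
    weighted = {ct: cell_fractions[ct] * mrna_per_cell[ct] for ct in cell_fractions}
    total = sum(weighted.values())
    return {ct: w / total for ct, w in weighted.items()}


def correct_mrna_bias(mrna_fracs, mrna_per_cell):
    """Divide by per-cell mRNA content and renormalize to cell fractions."""
    scaled = {ct: mrna_fracs[ct] / mrna_per_cell[ct] for ct in mrna_fracs}
    total = sum(scaled.values())
    return {ct: s / total for ct, s in scaled.items()}


# Made-up example: equal numbers of cells, but type A carries 3x the mRNA.
true_cells = {"A": 0.5, "B": 0.5}
mrna_content = {"A": 3.0, "B": 1.0}

observed = mrna_fractions(true_cells, mrna_content)    # what an uncorrected method sees
recovered = correct_mrna_bias(observed, mrna_content)  # after rescaling
```

In this toy case, type A makes up half of the cells but three quarters of the mRNA, so an uncorrected estimate would overstate it. Rescaling by the assumed mRNA content recovers the 50/50 split.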

Key Findings and Insights

The comprehensive benchmark study yielded several important findings and insights:

Data Transformation: Maintaining the data in linear scale consistently showed the best deconvolution performance, outperforming logarithmic and variance stabilization transformations.
Normalization Strategy: The choice of normalization strategy had a significant impact on some methods, but not on others.
Method Performance: The top-performing bulk deconvolution methods were ordinary least squares (OLS), non-negative least squares (NNLS), robust linear regression (RLR), and FARDEEP. The best-performing methods using scRNA-seq data as reference were DWLS, MuSiC, and SCDC.
Impact of Reference Size: Increasing the size of the scRNA-seq reference generally improved deconvolution performance, with some methods benefiting more than others. However, larger references also introduced systematic estimation biases for certain cell types in some cases.
Computational Efficiency: Some methods handled the full scRNA-seq reference (approximately 153,000 cells) within reasonable memory constraints, while others required substantial downsampling of the reference data.
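A plausible reason why linear-scale data works well, offered here as background rather than as a result of the study, is that bulk expression is a weighted sum of cell-type expression only in linear scale; the logarithm of a sum is not the sum of the logarithms. The Python sketch below checks this for a single made-up marker gene and two cell types. All values and variable names are illustrative assumptions.

```python
import math

# Made-up marker gene: highly expressed in cell type A, absent in cell type B.
fraction_a = 0.5
expr_a, expr_b = 1023.0, 0.0

# In linear scale the bulk signal is a weighted sum of the pure profiles.
bulk_linear = fraction_a * expr_a + (1.0 - fraction_a) * expr_b


def log_transform(x):
    """log2(x + 1), a common transformation of expression values."""
    return math.log2(x + 1.0)


# Solve bulk = f * a + (1 - f) * b for f, once per scale.
estimate_linear = (bulk_linear - expr_b) / (expr_a - expr_b)

log_a, log_b = log_transform(expr_a), log_transform(expr_b)
log_bulk = log_transform(bulk_linear)
estimate_log = (log_bulk - log_b) / (log_a - log_b)
```

The linear-scale estimate returns the true fraction of 0.5, whereas applying the same mixing model to log-transformed values gives roughly 0.9, a large overestimate of cell type A.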

Practical Guidance for Deconvolution Workflows

Based on the benchmark study’s findings, the authors offer the following general guidelines for anyone performing cell-type deconvolution:

Data Transformation: Use the original linear scale for your input data, rather than applying logarithmic or variance stabilization transformations.
Normalization Strategy: Avoid row scaling, column min-max, column z-score, or quantile normalization, and instead choose from the other scaling/normalization approaches evaluated in the study.
Method Selection: Consider using regression-based bulk deconvolution methods (e.g., RLR, CIBERSORT, or FARDEEP), and also try methods that use scRNA-seq data as a reference (DWLS, MuSiC, SCDC).
Marker Selection: Employ a stringent marker selection strategy that focuses on the differences between the cell types with the highest expression values.
Reference Matrix: Ensure that your reference matrix includes all relevant cell types present in the mixtures you are deconvolving.

By following these guidelines and using the tools and resources in the omnideconv ecosystem, researchers can approach cell-type deconvolution with greater confidence and efficiency.
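As an illustration of the regression-based approach mentioned in the guidelines, the following Python sketch models a bulk profile as a non-negative combination of cell-type signature columns and then normalizes the coefficients to fractions. It is not the implementation used by any of the benchmarked tools. The solver uses simple multiplicative updates as a dependency-free stand-in for a dedicated NNLS routine, and the signature matrix, function names, and fractions are made up.

```python
def nnls_multiplicative(signature, bulk, n_iter=500, eps=1e-12):
    """Approximate argmin ||S f - b||^2 subject to f >= 0.

    signature: list of gene rows, each a list of non-negative values per cell type
    bulk:      list of non-negative bulk expression values, one per gene
    Uses multiplicative updates (a stand-in for a proper NNLS solver).
    """
    n_genes, n_types = len(signature), len(signature[0])
    # Precompute S^T b and S^T S once.
    st_b = [sum(signature[g][k] * bulk[g] for g in range(n_genes))
            for k in range(n_types)]
    st_s = [[sum(signature[g][k] * signature[g][j] for g in range(n_genes))
             for j in range(n_types)] for k in range(n_types)]
    f = [1.0 / n_types] * n_types  # strictly positive starting point
    for _ in range(n_iter):
        denom = [sum(st_s[k][j] * f[j] for j in range(n_types))
                 for k in range(n_types)]
        f = [f[k] * st_b[k] / (denom[k] + eps) for k in range(n_types)]
    return f


def to_fractions(coefficients):
    """Normalize non-negative coefficients so they sum to 1."""
    total = sum(coefficients)
    return [c / total for c in coefficients]


# Made-up signature: 4 genes (rows) x 3 cell types (columns).
signature = [[10.0, 0.0, 1.0],
             [0.0, 8.0, 1.0],
             [2.0, 2.0, 9.0],
             [5.0, 1.0, 0.0]]
true_fractions = [0.6, 0.3, 0.1]

# Noise-free bulk profile built from the signature and the true fractions.
bulk = [sum(s * f for s, f in zip(row, true_fractions)) for row in signature]

estimated = to_fractions(nnls_multiplicative(signature, bulk))
```

On this noise-free toy input the estimate should closely match the true fractions. If a column were dropped from the signature, the normalized estimates for the remaining cell types would have to absorb its share, which is one way to see why the reference should cover all cell types present in the mixture.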

Conclusion: Unlocking the Potential of Transcriptomic Deconvolution

The comprehensive benchmark study conducted by the University of Innsbruck researchers has shed valuable light on the strengths, limitations, and complementarity of second-generation cell-type deconvolution methods. By providing the scientific community with the omnideconv ecosystem, they have made it easier for researchers to apply, benchmark, and optimize these powerful tools for their specific needs.

As single-cell genomics continues to advance, integrating scRNA-seq data into deconvolution workflows will become increasingly important for transcriptomic analysis of complex biological systems. The insights and resources shared in this study should serve as a useful guide for researchers working in this rapidly evolving field.