The Limitations of Variant Call Format (VCF) in the Era of Big Genomics Data
As seasoned IT professionals, we’re well-versed in the challenges of managing and processing large datasets, especially in the field of bioinformatics. One such challenge that has become increasingly pressing is the management of Variant Call Format (VCF) data at the Biobank scale.
VCF is the standard file format for exchanging genetic variation data, encoding DNA sequence polymorphisms among a set of samples along with associated quality control metrics and metadata. However, VCF's row-wise encoding, whether as text or packed binary, is optimized for retrieving all of the data for a given variant; accessing data by field or by sample is inefficient.
Today's Biobank-scale datasets consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF data. At this scale, row-wise storage is fundamentally unsuitable, and a more scalable approach is needed: the common workflows for processing these datasets, such as quality control, genome-wide association studies (GWAS), and exploratory analyses, cannot be supported efficiently by VCF.
The limitations of VCF’s design include:
- Inefficient Access: Because of VCF's row-wise layout, information for a particular sample or field cannot be retrieved without reading and decompressing the entire dataset.
- File-oriented Paradigm: VCF is ill-suited to the realities of modern datasets, which are too large to download and are often required by data-access agreements to stay in situ. These large files now live in cloud environments, where the file systems that classical file-oriented tools depend on must be expensively emulated on top of the basic building blocks of object storage.
- Substantial Costs: The multiple layers of inefficiencies around processing VCF data at scale in the cloud make it time-consuming and expensive, leading to these vast datasets not being utilized to their full potential.
To address these challenges, we need a new generation of tools that operate directly on a primary data representation that supports efficient access across a range of applications, with native support for cloud object storage. Such a representation can be termed “analysis-ready” and “cloud-native”, and it must also be accessible, using protocols that are “open, free, and universally implementable”.
Introducing the VCF Zarr Specification: A Cloud-Native Solution
In this article, we present the VCF Zarr specification, an encoding of the VCF data model using Zarr, a cloud-native format for storing multi-dimensional data that is widely used in scientific computing. The VCF Zarr specification decouples the VCF data model from its row-oriented file definition, allowing the data to be compactly stored and efficiently analyzed in a cloud-native, FAIR manner.
Key Benefits of the VCF Zarr Specification:
- Efficient Data Access: By storing each VCF field as a separate chunked array, the VCF Zarr specification makes retrieving subsets of the data far more efficient, better supporting common workflows such as quality control, GWAS, and exploratory analyses.
- Improved Compression: We show that the VCF Zarr format is much more compact than standard VCF-based approaches, and competitive with specialized methods for storing genotype data in terms of compression ratios.
- Enhanced Computation Performance: Because Zarr stores data in an analysis-ready format, computation is greatly facilitated: various benchmarks run substantially faster than bcftools-based pipelines and are competitive with state-of-the-art file-oriented methods.
- Cloud-Native Design: Zarr’s elegant simplicity and first-class support for cloud object stores have led to it gaining substantial traction across the sciences, making it an ideal choice for storing and analyzing large-scale genetic variation data.
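To make the columnar idea concrete, here is a minimal sketch using plain NumPy arrays in a dict to stand in for a real Zarr store. The field names loosely follow the VCF Zarr naming convention (e.g. `variant_position`, `call_genotype`), and the array sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_variants, n_samples, ploidy = 1000, 50, 2

# A VCF Zarr store is a group of independently stored arrays, one per field.
# We mimic that layout here with a plain dict of NumPy arrays.
store = {
    "variant_position": np.sort(rng.integers(1, 10**6, size=n_variants)),
    "variant_quality": rng.random(n_variants).astype(np.float32),
    "call_genotype": rng.integers(
        0, 2, size=(n_variants, n_samples, ploidy), dtype=np.int8
    ),
}

# Row-wise VCF: reading positions means decoding every full variant record.
# Columnar layout: touch only the one array you need.
positions = store["variant_position"]           # no genotype bytes read
first_sample = store["call_genotype"][:, 0, :]  # one sample's genotypes only

print(positions.shape, first_sample.shape)
```

In a real store, each of these arrays is also split into compressed chunks, so even within one field only the relevant chunks need to be fetched and decompressed.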
Benchmarking the VCF Zarr Specification
To demonstrate the benefits of the VCF Zarr specification, we have conducted a series of benchmarks using a large and highly realistic simulation of French-Canadian genetic data, as well as a case study on the Genomics England aggV2 dataset.
Compression Performance
Our benchmarks show that the VCF Zarr format is far more efficient than standard VCF-based approaches in terms of compression ratios. When comparing the total stored bytes for VCF data produced by subsets of the French-Canadian simulation, we found that the Zarr and Savvy (a specialized genotype storage library) formats had almost identical compression performance, both significantly outperforming gzip-compressed VCF.
Computation Performance
In terms of computation performance, we used the bcftools +af-dist plugin (which computes a table of deviations from Hardy-Weinberg expectations in allele frequency bins) as a representative example. Our benchmarks show that the Zarr-based implementation is substantially faster than the equivalent bcftools-based pipeline, even when using the highly optimized BCF format. This is due to Zarr’s ability to efficiently retrieve and process only the required data, rather than having to read and decompress the entire dataset.
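To sketch what such a computation looks like over columnar data, the following NumPy example computes the core of an af-dist-style summary: per-variant ALT allele frequencies from a genotype array, and observed versus Hardy-Weinberg-expected heterozygote counts in frequency bins. This is an illustrative reimplementation rather than the actual benchmarked code, and it ignores missing data and multi-allelic sites:

```python
import numpy as np

def af_dist(call_genotype, n_bins=10):
    """Bin variants by ALT allele frequency and compare observed
    heterozygote counts with the Hardy-Weinberg expectation 2p(1-p).

    call_genotype: int array of shape (variants, samples, ploidy),
    with 0 = REF and 1 = ALT (missing data is not handled here).
    """
    n_variants, n_samples, ploidy = call_genotype.shape
    alt_counts = call_genotype.sum(axis=2)             # per call: 0, 1 or 2
    p = alt_counts.sum(axis=1) / (n_samples * ploidy)  # ALT frequency per variant
    observed_het = (alt_counts == 1).sum(axis=1)       # heterozygous calls
    expected_het = 2 * p * (1 - p) * n_samples
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    obs = np.bincount(bins, weights=observed_het, minlength=n_bins)
    exp = np.bincount(bins, weights=expected_het, minlength=n_bins)
    return obs, exp

rng = np.random.default_rng(1)
gt = rng.integers(0, 2, size=(500, 100, 2), dtype=np.int8)
obs, exp = af_dist(gt)
print(obs.astype(int))
print(exp.round(1))
```

The key point is that this only needs the genotype field: in a Zarr store, none of the other per-call or per-variant fields are read or decompressed.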
Subset Extraction Performance
As datasets grow ever larger, the ability to efficiently access subsets of the data becomes increasingly important. Our benchmarks demonstrate that Zarr’s two-dimensional chunking allows for extremely efficient subset extraction, with the computation time depending very weakly on the overall dataset size. In contrast, the row-wise encoding of VCF and other alternatives means that CPU time scales linearly with the number of samples, as the entire dataset must be read and decompressed.
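A back-of-the-envelope sketch shows why two-dimensional chunking helps: extracting a slice of samples only touches the chunks that intersect that slice, so the bytes read scale with the subset, not with the dataset. The chunk sizes below are invented for illustration:

```python
import math

def chunks_touched(n_variants, chunk_v, chunk_s, want_samples):
    """Chunks read when extracting `want_samples` contiguous samples
    across all variants from an array chunked (chunk_v, chunk_s)."""
    return math.ceil(n_variants / chunk_v) * math.ceil(want_samples / chunk_s)

# Extracting 10 samples over 1M variants, chunks of 10,000 variants x 1,000 samples:
for n_samples in (10_000, 100_000, 1_000_000):
    touched = chunks_touched(1_000_000, 10_000, 1_000, 10)
    total = math.ceil(1_000_000 / 10_000) * math.ceil(n_samples / 1_000)
    print(f"{n_samples:>9} samples: {touched} of {total} chunks read")
```

The number of chunks touched is the same whether the dataset holds ten thousand samples or a million, whereas a row-wise format must decode every record in full.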
Case Study: Genomics England aggV2 Dataset
To further illustrate the practical benefits of the VCF Zarr specification, we performed a case study on the Genomics England aggV2 dataset, which comprises approximately 722 million annotated single-nucleotide variants and small indels from 78,195 samples. By converting this dataset to the VCF Zarr format using the vcf2zarr utility, we achieved a 5X reduction in storage and greater than 300X reduction in CPU usage for some representative benchmarks, compared to the original VCF data.
The Path Forward: Integrating VCF Zarr into the Bioinformatics Ecosystem
The VCF Zarr specification and the vcf2zarr conversion utility presented in this article provide a necessary starting point for cloud-native biobank repositories and open up many possibilities. However, significant investment and development would be needed to provide a viable alternative to standard bioinformatics workflows.
Two initial directions for development that could quickly yield significant results are:
- Compatibility with Existing Workflows: Providing a “vcztools” command-line utility that implements a subset of bcftools functionality on a VCF Zarr dataset. This would speed up common queries by orders of magnitude and reduce the need for user orchestration of operations among manually split VCF chunks.
- Zarr-Native Applications: Creating new applications that can take advantage of the efficient data representation provided by VCF Zarr, particularly in the Python data science ecosystem. Tools like Xarray and Dask can leverage Zarr’s grid-based array representation to enable scalable, distributed computation on genetic variation data.
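The grid-based pattern those tools exploit can be sketched in plain NumPy: a reduction is mapped over chunks independently and then combined, which is what Dask parallelizes when the chunks come from a Zarr store. The chunk size and the per-chunk statistic here are arbitrary choices for illustration:

```python
import numpy as np

def chunked_alt_counts(genotypes, chunk_v=100):
    """Per-sample ALT allele totals, computed one variant-chunk at a time.

    With Dask over a Zarr store, each chunk task would run in parallel and
    read only its own compressed chunk; here we loop serially to show the
    map-reduce structure.
    """
    n_variants = genotypes.shape[0]
    total = np.zeros(genotypes.shape[1], dtype=np.int64)
    for start in range(0, n_variants, chunk_v):
        chunk = genotypes[start:start + chunk_v]  # one chunk's worth of data
        total += chunk.sum(axis=(0, 2))           # reduce over variants, ploidy
    return total

rng = np.random.default_rng(2)
gt = rng.integers(0, 2, size=(1_000, 20, 2), dtype=np.int8)
counts = chunked_alt_counts(gt)
# The chunked reduction agrees with the all-at-once answer:
print(np.array_equal(counts, gt.sum(axis=(0, 2))))
```

Because each chunk is processed independently, the same structure distributes naturally across cores or cluster nodes, with each worker fetching only its own chunks from object storage.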
By embracing the VCF Zarr specification and investing in these types of solutions, we can greatly reduce the costs of analyzing large-scale genetic variation data, effectively opening access to biobanks for a much broader group of researchers than currently possible.
Conclusion
Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies, has the potential to greatly reduce these costs and enable a diverse ecosystem of next-generation tools for analyzing genetic variation data directly from cloud-based object stores.
As seasoned IT professionals, we’re excited to see the adoption of the VCF Zarr specification and the development of innovative solutions that can unlock the full potential of Biobank-scale genetic variation data. By staying at the forefront of these advancements, we can continue to provide practical, in-depth insights to our readers and empower the broader bioinformatics community.