Understanding the Evolving Needs of HPC Workloads
High-performance computing (HPC) has become an essential tool across diverse scientific and engineering domains, from life sciences and geology to computational fluid dynamics and financial modeling. As the complexity and scale of HPC workloads continue to grow, the underlying storage and file system infrastructure must evolve to meet the increasing demands for performance, scalability, and flexibility.
In the past, HPC storage systems were often designed and tuned for specific application profiles, optimizing for a particular set of workload characteristics. However, the rapid growth of data-intensive scientific simulation and analysis has introduced new challenges: HPC workflows now demand a careful balance of computing power, I/O capability, and intelligent data management.
To address these evolving needs, IT professionals must take a strategic approach to optimizing storage management and file system selection for HPC environments. By carefully analyzing application I/O behaviors, benchmarking system performance, and leveraging the latest advancements in parallel file systems and cloud technologies, organizations can unlock the full potential of their HPC infrastructure and accelerate scientific discovery.
Analyzing Application I/O Behavior
The first step in optimizing an HPC storage system is to understand the I/O characteristics of the applications running on the system. By analyzing real-world I/O traces, IT professionals can gain valuable insights into the unique patterns and requirements of their HPC workloads.
For example, a study conducted at the Icahn School of Medicine at Mount Sinai found that their genomics analysis projects primarily use a three-step computational workflow:
- Primary Analysis: Concatenating the raw data from genomic sequencers into input files, which requires significant storage capacity but minimal computation.
- Secondary Analysis: Aligning sequences to a reference genome and calling variants, a highly compute-intensive process.
- Tertiary Analysis: Downstream analysis of the called variants, including pathway analysis, expression profiling, and network modeling, which can be both compute- and I/O-intensive.
The researchers also observed that over 80% of the files in their HPC storage system were smaller than 1 MB, highlighting the importance of efficient file system management and storage allocation strategies.
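A survey of this kind is straightforward to script. The following minimal sketch walks a directory tree and reports the fraction of files under 1 MiB; the scanned path is whatever is passed on the command line, and the approach assumes that per-file stat calls are acceptably cheap, which may not hold on a heavily loaded parallel file system.

```python
#!/usr/bin/env python3
"""Survey file sizes under a project directory and report how many are < 1 MiB."""
import os
import sys

SMALL_FILE_THRESHOLD = 1 << 20  # 1 MiB

def survey(root):
    total = small = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # file vanished or permission denied; skip it
            total += 1
            if size < SMALL_FILE_THRESHOLD:
                small += 1
    return total, small

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    total, small = survey(root)
    if total:
        print(f"{small}/{total} files ({100 * small / total:.1f}%) are smaller than 1 MiB")
```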
By understanding the specific I/O patterns and data characteristics of their HPC applications, the IT team was able to make informed decisions about the most appropriate storage and file system configurations to support their evolving workloads.
Optimizing Parallelism and Data Management
One of the key challenges in HPC storage optimization is ensuring efficient parallelism and data management across the system. This is particularly critical for in-memory data analytics frameworks, parallel file systems, and object storage solutions.
For in-memory data analytics frameworks, the IT team can focus on optimizing parallelism by:
- Analyzing Memory Usage: Identifying memory-intensive jobs and ensuring adequate memory allocation per node to avoid performance degradation (a packing sketch follows this list).
- Implementing Intelligent Job Scheduling: Developing queue structures and scheduling policies that can handle the bursty nature of genomics workloads, with thousands of jobs submitted at once.
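As a hedged illustration of the memory-analysis step, the sketch below estimates how many copies of a job can run concurrently on a node without oversubscribing memory. The node size, headroom, and per-job peak-memory figures are hypothetical; in practice they would come from scheduler accounting records.

```python
"""Estimate how many concurrent jobs fit on a node without oversubscribing memory.

The node size, headroom, and peak-memory figures are hypothetical; in practice
they would be pulled from completed-job records in the scheduler's accounting.
"""
NODE_MEMORY_GB = 192       # usable memory per compute node (assumed)
OS_HEADROOM_GB = 8         # reserved for OS and file system caches (assumed)

observed_peak_gb = {       # hypothetical peak resident set sizes per task
    "bwa_mem_alignment": 28,
    "star_rna_alignment": 40,
    "variant_calling": 12,
}

usable = NODE_MEMORY_GB - OS_HEADROOM_GB
for job, peak in observed_peak_gb.items():
    slots = usable // peak
    print(f"{job}: request ~{peak} GB per task, up to {slots} concurrent tasks per node")
```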
When it comes to parallel file systems, the focus shifts to data management strategies, such as:
- Leveraging Parallel Metadata Servers: Ensuring that the file system can effectively handle the massive number of tiny files common in genomics workflows by distributing metadata across multiple servers.
- Enabling Sub-block Allocation: Letting small files consume only a fraction of a full block so capacity is not wasted, while larger files are still allocated efficiently.
- Implementing Tiered Storage: Utilizing a combination of high-performance flash storage for metadata and caching, along with high-capacity HDD arrays for bulk data storage.
For object storage solutions, the optimization approach may involve:
- Optimizing Data Placement and Replication: Ensuring that data is strategically placed and replicated across the object storage cluster to maximize performance and availability.
- Leveraging Intelligent Data Tiering: Automating the movement of data between storage tiers based on access patterns and performance requirements (a lifecycle-policy sketch follows this list).
- Integrating with Parallel File Systems: Seamlessly integrating object storage with parallel file systems to provide a unified, high-performance storage solution.
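For object stores that expose an S3-compatible lifecycle API, tiering can be automated with a bucket lifecycle rule. The sketch below uses boto3; the bucket name, prefix, and day thresholds are illustrative assumptions rather than recommendations.

```python
"""Sketch: automate S3-style data tiering with a bucket lifecycle rule (boto3).

Bucket name, prefix, and day thresholds are illustrative assumptions.
"""
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="hpc-project-results",                 # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-cold-results",
                "Filter": {"Prefix": "completed-runs/"},  # hypothetical prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},    # warm tier after 30 days
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},  # cold archive after 6 months
                ],
            }
        ],
    },
)
```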
By carefully addressing parallelism and data management challenges, IT professionals can ensure that applications use the available storage resources efficiently and deliver the required performance and throughput.
Leveraging Advancements in File Systems and Cloud Technologies
The rapid evolution of parallel file systems and the emergence of cloud-based storage solutions have introduced new opportunities for optimizing HPC storage management.
Parallel File Systems:
Modern parallel file systems, such as IBM Spectrum Scale (formerly GPFS) and Lustre, offer a range of advanced features that can significantly benefit HPC workloads:
- Parallel Metadata Servers: Distributing metadata processing across multiple servers to handle the massive number of small files common in genomics and bioinformatics workflows.
- Sub-block Allocation: Efficiently allocating storage space for files of varying sizes, minimizing wastage for small files.
- Tiered Storage: Integrating multiple storage tiers, including high-performance flash and high-capacity HDDs, to balance performance and cost.
- Data Placement and Migration: Leveraging intelligent data placement and automated migration policies to optimize data access and storage utilization, as illustrated in the sketch below.
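Parallel file systems apply migration through their own policy engines, so the following is only a file-system-agnostic sketch of the underlying idea: given a metadata scan, move the least-recently-accessed files to the capacity tier until a space target on the flash tier is met. The scan records and the 20 TB target are hypothetical.

```python
"""Pick migration candidates from a metadata scan until a space target is met.

File-system-agnostic sketch of a migration policy; the scan records and the
20 TB target are hypothetical.
"""

# (path, days since last access, size in bytes) -- hypothetical scan output
scan = [
    ("/project/runA/sample1.bam", 210, 180 * 10**9),
    ("/project/runB/sample2.fastq.gz", 45, 95 * 10**9),
    ("/project/runC/matrix.h5", 400, 60 * 10**9),
]

TARGET_BYTES = 20 * 10**12  # free 20 TB on the flash tier (assumed target)

def select_candidates(records, target):
    """Choose the least-recently-accessed files first until the target is met."""
    freed, chosen = 0, []
    for path, _age_days, size in sorted(records, key=lambda r: r[1], reverse=True):
        if freed >= target:
            break
        chosen.append(path)
        freed += size
    return chosen, freed

files, freed = select_candidates(scan, TARGET_BYTES)
print(f"Would migrate {len(files)} files, freeing {freed / 10**12:.2f} TB")
```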
Cloud-based Storage Solutions:
The availability of cloud-based storage services, such as Amazon S3 and Azure Blob Storage, can also play a role in HPC storage optimization. While capacity and data-egress costs can make cloud storage prohibitive as the primary tier for large-scale HPC workloads, cloud technologies can still be valuable for specific use cases:
- Archiving and Backup: Leveraging cloud storage for long-term data archiving and backup, reducing the burden on on-premises storage systems (see the upload sketch after this list).
- Hybrid Cloud Workflows: Integrating cloud storage with on-premises HPC resources to enable seamless data movement and collaboration across geographic boundaries.
- Burst Capacity: Utilizing cloud storage to handle temporary spikes in storage demand, complementing the on-premises HPC storage infrastructure.
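As a concrete example of the archiving use case, the sketch below packs a completed project directory into a single compressed object and uploads it to an archival storage class with boto3; the paths, bucket name, and choice of storage class are assumptions.

```python
"""Archive a completed project directory to cold cloud storage (boto3 sketch).

Paths, bucket name, and storage class are illustrative assumptions.
"""
import tarfile
import boto3

PROJECT_DIR = "/archive-staging/project_2021_runs"   # hypothetical completed project
TARBALL = "/tmp/project_2021_runs.tar.gz"
BUCKET = "hpc-cold-archive"                          # hypothetical bucket

# Pack the directory into a single compressed object to keep object counts low.
with tarfile.open(TARBALL, "w:gz") as tar:
    tar.add(PROJECT_DIR, arcname="project_2021_runs")

# Upload straight into an archival storage class.
boto3.client("s3").upload_file(
    TARBALL, BUCKET, "project_2021_runs.tar.gz",
    ExtraArgs={"StorageClass": "DEEP_ARCHIVE"},
)
```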
By staying up-to-date with the latest advancements in parallel file systems and cloud storage technologies, IT professionals can design and implement HPC storage solutions that are more flexible, scalable, and cost-effective, ultimately supporting the evolving needs of their scientific and engineering communities.
Benchmarking and System Optimization
Effective HPC storage optimization requires a data-driven approach, which includes thorough benchmarking and system performance analysis. IT teams should identify the most critical applications and workloads, and then benchmark them across different hardware and software configurations to determine the optimal setup.
For example, the Icahn School of Medicine at Mount Sinai team conducted extensive benchmarking of key genomics applications, such as BWA for DNA alignment and STAR for RNA alignment, on various CPU architectures (AMD, Intel Haswell, and Intel Xeon Gold). They found that the Intel Xeon Gold processors offered significant performance improvements over the older AMD and Intel Haswell processors, with up to a 2.5x speedup in some cases.
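A benchmarking harness for this kind of comparison does not need to be elaborate. The sketch below times repeated runs of a candidate command and reports the median wall-clock time; the BWA invocation shown is a placeholder, not the exact benchmark used at Mount Sinai.

```python
"""Time repeated runs of a candidate workload and report the median wall-clock time.

The command below is a placeholder alignment invocation; substitute the real
benchmark command and input data before use.
"""
import statistics
import subprocess
import time

COMMAND = ["bwa", "mem", "-t", "16", "ref.fa", "reads_R1.fq", "reads_R2.fq"]  # placeholder
RUNS = 3

timings = []
for _ in range(RUNS):
    start = time.perf_counter()
    # Discard the alignment output; only wall-clock time matters here.
    subprocess.run(COMMAND, check=True, stdout=subprocess.DEVNULL)
    timings.append(time.perf_counter() - start)

print(f"Median wall-clock time over {RUNS} runs: {statistics.median(timings):.1f} s")
```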
Armed with these benchmarking results, the IT team was able to make an informed decision to upgrade their HPC system to the Intel Xeon Gold processors, ensuring that their genomics workflows could achieve faster time-to-solution and higher overall throughput.
In addition to hardware optimization, the IT team also focused on improving the efficiency of their HPC storage system by:
- Implementing Parallel File Systems: Deploying IBM Spectrum Scale (GPFS) to handle the massive number of small files and provide advanced data management capabilities.
- Optimizing Queue Structures and Scheduling Policies: Developing a flexible queue system that could effectively manage the bursty nature of genomics workloads, with the ability to prioritize jobs based on user needs.
- Integrating Cloud Technologies: Leveraging cloud-based storage solutions for archiving and backup, while maintaining the core HPC infrastructure on-premises.
By continuously benchmarking their HPC system, analyzing application performance, and iterating on the storage and system configurations, the IT team was able to achieve significant improvements in the overall efficiency and productivity of their HPC environment.
Ensuring Sustainable and Cost-effective HPC Operations
As HPC systems become more complex and the demand for computational resources continues to grow, IT professionals must also focus on ensuring the long-term sustainability and cost-effectiveness of their HPC infrastructure.
One key aspect of this is the implementation of a well-designed chargeback model. The Icahn School of Medicine at Mount Sinai team developed a chargeback scheme that covers approximately 90% of their actual HPC operational costs, ensuring the financial sustainability of their Minerva HPC system.
The chargeback model is based on a combination of Facilities and Administration (F&A) fees from NIH-funded grants and a storage fee charged directly to researchers. The storage fee, currently set at $109 per terabyte per year, covers not only the cost of storage but also provides access to compute cycles, GPUs, and archival storage.
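To make the arithmetic concrete: a lab keeping 50 TB of active data would owe roughly 50 × $109 ≈ $5,450 per year under this scheme. A minimal invoicing sketch follows; the rate comes from the model above, while the lab names and allocations are hypothetical.

```python
"""Minimal chargeback invoice sketch for the storage-fee model described above.

The $109/TB/year rate comes from the model above; the lab allocations are hypothetical.
"""
STORAGE_FEE_PER_TB_YEAR = 109  # USD per terabyte per year

allocations_tb = {             # hypothetical per-lab active storage allocations
    "genomics_lab": 50,
    "imaging_lab": 120,
    "structural_biology": 15,
}

for lab, tb in allocations_tb.items():
    print(f"{lab}: {tb} TB -> ${tb * STORAGE_FEE_PER_TB_YEAR:,} per year")

total = sum(allocations_tb.values()) * STORAGE_FEE_PER_TB_YEAR
print(f"Total recovered: ${total:,} per year")
```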
By implementing this chargeback model, the IT team was able to:
- Incentivize Efficient Resource Utilization: The storage fee encouraged researchers to be more mindful of their data usage and move inactive data to archival storage, freeing up valuable storage space for active projects.
- Diversify Funding Sources: Relying on a mix of F&A fees and direct user charges helped to reduce the institution’s financial burden and ensure the long-term viability of the HPC system.
- Maintain Transparency and Accountability: Regular reporting and invoicing kept users informed of their usage and costs, fostering a sense of shared responsibility for the HPC infrastructure.
In addition to the chargeback model, the IT team also focused on maximizing the return on investment (ROI) of their HPC system by tracking the research productivity enabled by Minerva. Over the past seven years, the HPC system has facilitated the publication of over 900 research articles, demonstrating the substantial impact it has had on the institution’s scientific output.
By adopting a comprehensive approach to sustainable HPC operations, including cost-effective storage management, intelligent chargeback models, and ROI tracking, IT professionals can ensure that their HPC infrastructure remains a vital and long-lasting resource for their research communities.
Conclusion
As the demands on HPC systems continue to evolve, IT professionals must take a strategic and data-driven approach to optimizing storage management and file system selection. By understanding application I/O behaviors, leveraging the latest advancements in parallel file systems and cloud technologies, and implementing sustainable operational models, organizations can unlock the full potential of their HPC infrastructure and accelerate scientific discovery.
The Icahn School of Medicine at Mount Sinai’s experience with the Minerva HPC system demonstrates the importance of continuous optimization, benchmarking, and cost-effective management. By adopting these best practices, IT professionals can ensure that their HPC environments remain responsive, efficient, and aligned with the needs of their research communities, now and in the future.