As high performance computing (HPC) platforms continue to evolve, large-scale simulations face critical challenges:
Modern supercomputers incorporate heterogeneous GPU architectures, such as the NVIDIA A100 GPUs on Polaris and the AMD MI250X GPUs on Frontier. While the compute unified device architecture (CUDA) provides mature optimizations for NVIDIA hardware, maintaining comparable performance on AMD GPUs requires translating code to the heterogeneous-computing interface for portability (HIP), which exposes differences in memory hierarchy, cache behavior, and compute unit configuration. Achieving performance parity across these platforms remains a non-trivial task.
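For illustration, the fragment below shows how device code and its launch map from CUDA to HIP; it is a minimal example rather than code from our solver, and names such as axpy are hypothetical. The API correspondence is close to one-to-one, which makes the translation itself mechanical even though matching performance is not.

```cpp
// Illustrative only: a minimal kernel that compiles unchanged under nvcc and hipcc.
// The device code is identical; only the runtime API and launch syntax differ.
__global__ void axpy(int n, double a, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// CUDA (NVIDIA):  cudaMalloc(&d_x, bytes);  axpy<<<blocks, threads>>>(n, a, d_x, d_y);
// HIP  (AMD):     hipMalloc(&d_x, bytes);
//                 hipLaunchKernelGGL(axpy, dim3(blocks), dim3(threads), 0, 0, n, a, d_x, d_y);
```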
Traditionally, distributed simulations rely on CPU-driven communication tasks that require significant host-device data transfers. These transfers incur substantial latency penalties, especially as simulation sizes scale to tens of millions of elements across multiple nodes. Host-side operations, such as cell ownership updates and message buffer management, introduce further overhead that limits scalability.
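The sketch below illustrates this host-staged pattern for a simple halo exchange, with hypothetical function and buffer names: every message pays for a device-to-host copy before the MPI call and a host-to-device copy after it.

```cpp
// Sketch of the traditional host-staged exchange described above.
// Function and buffer names are placeholders, not identifiers from our code.
#include <mpi.h>
#include <cuda_runtime.h>

void halo_exchange_host_staged(const double* d_send, double* d_recv,
                               double* h_send, double* h_recv,
                               int count, int neighbor, MPI_Comm comm) {
    // 1. Copy the outgoing message from device to host.
    cudaMemcpy(h_send, d_send, count * sizeof(double), cudaMemcpyDeviceToHost);

    // 2. CPU-driven exchange through host buffers.
    MPI_Sendrecv(h_send, count, MPI_DOUBLE, neighbor, 0,
                 h_recv, count, MPI_DOUBLE, neighbor, 0,
                 comm, MPI_STATUS_IGNORE);

    // 3. Copy the incoming message back to the device.
    cudaMemcpy(d_recv, h_recv, count * sizeof(double), cudaMemcpyHostToDevice);
}
```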
In large-scale simulations, such as fluid-structure interaction (FSI) models involving millions of red blood cells, inter-node communication can dominate the runtime. Efficient communication strategies and load balancing become essential to ensure that the compute resources, particularly GPUs, are fully utilized.
We addressed these challenges through a combination of performance portability strategies and GPU-optimized communication techniques, creating a unified solution that delivers significant speedups and scalability across both NVIDIA and AMD platforms.
We adopted a dual programming approach to maintain performance portability: CUDA kernels were ported to HIP to ensure compatibility with AMD GPUs while preserving the optimized kernel structure on NVIDIA platforms.
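One common way to realize such a dual-path codebase is a thin compile-time portability layer that maps a shared set of runtime calls to either CUDA or HIP; the sketch below shows the idea under that assumption and is not necessarily the exact abstraction used in our code.

```cpp
// A minimal sketch of a compile-time portability layer (assumed approach;
// the exact abstraction in the actual codebase may differ).
#if defined(__HIP_PLATFORM_AMD__)
  #include <hip/hip_runtime.h>
  #define gpuMalloc            hipMalloc
  #define gpuMemcpy            hipMemcpy
  #define gpuDeviceSynchronize hipDeviceSynchronize
#else
  #include <cuda_runtime.h>
  #define gpuMalloc            cudaMalloc
  #define gpuMemcpy            cudaMemcpy
  #define gpuDeviceSynchronize cudaDeviceSynchronize
#endif

// Kernels are written once in the shared source; nvcc and hipcc both accept
// this form, so the optimized structure of the original CUDA kernels is kept.
__global__ void scale(int n, double a, double* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
```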
The result is a unified codebase that achieves strong performance across heterogeneous platforms, providing a scalable solution for future GPU hardware.
To eliminate host-device bottlenecks, we re-engineered the communication layer to operate entirely on the GPU. Key optimizations include moving formerly host-side tasks, such as cell ownership updates and message buffer management, onto the device and removing intermediate host copies from the message path.
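The sketch below illustrates the general pattern, assuming a GPU-aware MPI build and hypothetical kernel and buffer names: boundary data are packed by a device kernel and the device pointers are handed directly to MPI, so no host staging buffer appears in the message path. Unpacking on the receiving side can be handled the same way by a device kernel, keeping the entire exchange GPU-resident.

```cpp
// Sketch of a GPU-resident halo exchange, assuming a GPU-aware MPI build
// (device pointers passed directly to MPI). Kernel and buffer names are illustrative.
#include <mpi.h>
#include <cuda_runtime.h>

// Pack boundary cells into the send buffer on the device; no host staging.
__global__ void pack_boundary(const double* field, const int* boundary_ids,
                              double* d_send, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) d_send[i] = field[boundary_ids[i]];
}

void halo_exchange_gpu_resident(const double* d_field, const int* d_boundary_ids,
                                double* d_send, double* d_recv,
                                int count, int neighbor, MPI_Comm comm) {
    pack_boundary<<<(count + 255) / 256, 256>>>(d_field, d_boundary_ids, d_send, count);
    cudaDeviceSynchronize();  // ensure the send buffer is complete before MPI reads it

    // A GPU-aware MPI moves the data GPU-to-GPU (or GPU-to-NIC) without an
    // intermediate copy through host memory.
    MPI_Sendrecv(d_send, count, MPI_DOUBLE, neighbor, 0,
                 d_recv, count, MPI_DOUBLE, neighbor, 0,
                 comm, MPI_STATUS_IGNORE);
}
```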
Our optimizations were tested on large-scale FSI simulations involving up to 32 million red blood cells across hundreds of nodes. The fully GPU-optimized communication layer, combined with efficient load balancing, demonstrated excellent weak scaling performance.
By ensuring that communication tasks remained GPU-resident and removing host-device dependencies, we achieved consistent performance improvements at scale.
This work presents a unified solution to address the challenges of portability and communication bottlenecks in GPU-accelerated simulations. The dual CUDA-HIP programming approach ensures that the codebase runs efficiently across NVIDIA and AMD platforms, while the fully GPU-resident communication layer eliminates host-induced overheads. These optimizations collectively enable large-scale, high-fidelity simulations with excellent weak scaling performance.