This role involves designing, optimizing, and maintaining high-performance networking solutions for GPU supercomputing clusters supporting AI training and inference workloads.
Responsibilities include developing RDMA-based communication systems, implementing GPUDirect RDMA, automating network management with infrastructure-as-code (IaC) tools, integrating networking solutions with Kubernetes, and troubleshooting network issues to ensure reliability and performance.
The ideal candidate has experience with NVIDIA RDMA technologies and with HPC or AI network environments, programming skills in languages such as Python, Go, Rust, and C/C++, and familiarity with networking technologies such as RDMA, InfiniBand, TCP/IP, and BGP. Knowledge of cluster management (Slurm), communication frameworks (NCCL, MPI), and container orchestration is essential.
This position offers a salary range of $200,000 to $400,000, visa sponsorship, and benefits including health coverage, bonuses, a 401(k), paid leave, and more.