Rate - + Expenses paid for travel
Will do. They said they now have three positions in total. Candidates also need to be pre-sales minded, since they will be meeting with clients during the pre-sales process. Really strong communication skills are required.
- Machine Learning Performance Engineer - CUDA Python -
U.S. Citizenship Status: U.S. Citizen; Green Card; Other legal status
Duration: 6 month contract with the likelihood to extend
Location: Remote, but candidates must be willing to travel to different customer sites.
*Must be willing to travel
*Must have strong pre-sales abilities, e.g. presentation and communication skills
*Must be willing to help train WWT employees and customers
Position Category: Infrastructure
Job Description: Your part here is optimizing the performance of our models - both training and inference. We care about efficient large-scale training, low-latency inference in real-time systems, and high-throughput inference in research. Part of this is improving straightforward CUDA, but the interesting part needs a whole-systems approach, including storage systems, networking, and host- and GPU-level considerations. Zooming in, we also want to ensure our platform makes sense even at the lowest level - is all that throughput actually goodput? Does loading that vector from the L2 cache really take that long?
• An understanding of modern ML techniques and toolsets
• The experience and systems knowledge required to debug a training run's performance end to end
• Low-level GPU knowledge of PTX, SASS, warps, cooperative groups, Tensor Cores, and the memory hierarchy
• Debugging and optimization experience using tools like CUDA-GDB, Nsight Systems, and Nsight Compute
• Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN, and cuBLAS
• Intuition about the latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization, and asynchronous memory loads
• Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimization, and NVLink, and how to use these networking technologies to link up GPU clusters
• An understanding of the collective algorithms supporting distributed GPU training in NCCL or MPI
• An inventive approach and the willingness to ask hard questions about whether we're taking the right approaches and using the right tools