The role
We are looking for a Staff ML Performance Engineer to join our Training Tech team, which optimizes large-scale ML jobs so that we can scale our models to the next order of magnitude. A successful candidate will increase the efficiency of training and inference workloads, allowing Wayve to train larger models faster.
Key responsibilities:
• Profile ML workloads to identify their bottlenecks, e.g. using NVIDIA Nsight Systems
• Design and implement efficiency improvements to maximize MFU and throughput, e.g. parallelism, model compilation, mixed precision
• Design and implement observability tools to identify bottlenecks and drive performance improvements, e.g. to track MFU, throughput, latency, etc.
• Design and implement benchmarking tools, e.g. to track efficiency gains or regressions
• Collaborate closely with Research teams to integrate training efficiency improvements and create a culture of performance optimization
About you
In order to set you up for success in this role, we're looking for the following skills and experience.
Essential:
• 10+ years of industry experience driving performance engineering across ML systems, GPU compute infrastructure, distributed platforms, or a similar field
• Experience optimizing large-scale jobs on GPU compute clusters
• Experience working in platform teams and with research teams
• Experience writing, reporting, and tracking performance benchmarks in an open and accessible way
• Ability to write high-quality, well-structured, and tested Python code
• BS or MS in Machine Learning, Computer Science, Engineering, or a related technical discipline, or equivalent experience
Desirable:
• Experience working with concurrent, parallel, and distributed computing
• Experience using NVIDIA Nsight Systems or other system profilers
• Experience implementing GPU kernels (CUDA, Triton, etc.)
• Knowledge of computing fundamentals: what makes code fast, secure, and reliable