Staff Engineer - Perf and Benchmarking

CoreWeave, Inc

Sunnyvale, CA

JOB DETAILS
SALARY
$188,000–$275,000 Per Year
SKILLS
Auditing, Automation, Benchmarking, Bid Analysis, CUDA (Compute Unified Device Architecture), Cloud Computing, Communication Skills, Competitive Analysis/Strategy, Continuous Deployment/Delivery, Continuous Integration, Cross-Functional, Data Analysis, Data Sets, Data Warehousing, Distributed Computing, Establish Priorities, GPU (Graphics Processing Unit), Leadership, Machine Tool, Memory Hardware, Mentoring, Network Operations Center, OLAP (OnLine Analytical Processing), Open Source, Operational Support Systems (OSS), PCI Express (PCI-E), Performance Analysis, Performance Engineering, Production Systems, Publications, Sales Support, Technical Leadership, Telemetry, Vehicle Fleets, Warehousing
LOCATION
Sunnyvale, CA
POSTED
30+ days ago

About this role

We're looking for a Staff Engineer to be the technical lead of CoreWeave's Benchmarking & Performance team. You will be responsible for our planet-scale performance data warehouse: ingesting, storing, transforming, and analyzing performance events from every data center across our global infrastructure.

You will also be integral to publishing industry-leading end-to-end performance benchmarks. If MLPerf (Training & Inference), working closely with NVIDIA (Megatron-LM, TensorRT-LLM, and DGX Cloud), and the open-source community (llm-d, vLLM, and all popular ML frameworks) speak to you, come help us demonstrate CoreWeave's performance and reliability leadership in the field.

What you'll do

Strategy & Leadership - Define the multi-year benchmarking strategy and roadmap; prioritize models/workloads (LLMs, diffusion, vision, speech) and hardware tiers. Build, lead, and mentor a high-performing team of performance engineers and data analysts. Establish governance for claims: documented methodologies, versioning, reproducibility, and audit trails.

Perf Ownership - Lead end-to-end MLPerf Inference and Training submissions: workload selection, cluster planning, runbooks, audits, and result publication. Coordinate optimization tracks with NVIDIA (CUDA, cuDNN, TensorRT/TensorRT-LLM, Triton, NCCL) to hit competitive results; drive upstream fixes where needed.

Internal Latency & Throughput Benchmarks - Design a Kubernetes-native, repeatable benchmarking service that exercises CoreWeave stacks across SUNK (Slurm on Kubernetes), Kueue, and Kubeflow pipelines. Measure and report p50/p95/p99 latency, jitter, tokens/s, time-to-first-token, cold-start/warm-start, and cost-per-token/request across models, precisions (BF16/FP8/FP4), batch sizes, and GPU types. Maintain a corpus of representative scenarios (streaming, batch, multi-tenant) and data sets; automate comparisons across software releases and hardware generations.

Tooling & Automation - Build CI/CD pipelines and K8s controllers/operators to schedule benchmarks at scale; integrate with observability stacks (Prometheus, Grafana, OpenTelemetry) and results warehouses. Implement supply-chain integrity for benchmark artifacts (SBOMs, Cosign signatures).

Cross-functional & Community - Partner with NVIDIA, key ISVs, and OSS projects (vLLM, Triton, KServe, PyTorch/DeepSpeed, ONNX Runtime) to co-develop optimizations and upstream improvements. Support Sales/SEs with authoritative numbers for RFPs and competitive evaluations; brief analysts and press with rigorous, defensible data.

Who you are

10+ years building distributed systems or HPC/cloud services, with deep expertise in large-scale ML training or similar high-performance workloads.

Proven track record of architecting or building planet-scale data systems (e.g., telemetry platforms, observability stacks, cloud data warehouses, large-scale OLAP engines).

Deep understanding of GPU performance (CUDA, NCCL, RDMA, NVLink/PCIe, memory bandwidth), model-server stacks (Triton, vLLM, TensorRT-LLM, TorchServe), and distributed training frameworks (PyTorch FSDP/DeepSpeed/Megatron-LM).

Proficiency with Kubernetes and ML control planes; familiarity with SUNK, Kueue, and Kubeflow in production environments.

Excellent communicator able to interface with executives, customers, auditors, and OSS communities.

Nice to have

Experience with time-series databases, log-structured merge trees (LSM), or custom storage engine development.

Experience running MLPerf submissions (Inference and/or Training) or equivalent audited benchmarks at scale.

Contributions to MLPerf, Triton, vLLM, PyTorch, KServe, or similar OSS projects.

Experience benchmarking multi-region fleets and large clusters (thousands of GPUs).

Publications/talks on ML performance, latency engineering, or large-scale benchmarking methodology.

Base Salary Range

The base salary range for this role is $188,000 to $275,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).

About the Company

CoreWeave, Inc