**Minimum Qualifications**

+ 7+ years of industry experience building large-scale distributed systems or cloud infrastructure
+ Strong programming skills in Python, Go, C++, or similar systems languages
+ Extensive experience with compute infrastructure and workload scheduling
+ Strong expertise in distributed systems, scalability, reliability, and performance engineering
+ Experience with Kubernetes, container orchestration, or large-scale cluster management systems
+ Experience designing backend services or infrastructure platforms operating at production scale
+ Strong communication and collaboration skills across engineering and research teams
+ Bachelor's degree in Computer Science, Engineering, or a related field

**Preferred Qualifications**

+ Experience building schedulers, resource managers, or orchestration systems for distributed workloads
+ Experience with accelerator infrastructure such as TPUs and GPUs
+ Experience with distributed ML training or inference systems
+ Familiarity with frameworks such as JAX, PyTorch, TensorFlow, Ray, or Pathways
+ Experience operating large-scale multi-tenant infrastructure in cloud or hybrid environments
+ Background in performance optimization, fault tolerance, or resource efficiency for large distributed systems
+ MS or PhD in Computer Science, Engineering, or a related field

You will work on distributed systems that manage thousands of accelerators and enable reliable, efficient execution of large-scale training and inference jobs.