Research Engineer - LLM Infra training - Seed Infra

Beijing ByteDance Technology Co Ltd

Seattle, WA

JOB DETAILS
SKILLS
Artificial Intelligence (AI), Benchmarking, C++ Programming Language, CUDA (Compute Unified Device Architecture), Computer Programming, Conferences, Cross-Functional, GPU (Graphics Processing Unit), Leadership, Memory Hardware, Memory Management, Mentoring, Open Source, Performance Analysis, Performance Management, Performance Tuning/Optimization, Process Improvement, Publications, Python Programming/Scripting Language, Reinforcement Learning, Research & Development (R&D), Research Skills, Systems Reliability
POSTED
30+ days ago

Team Information: The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

Responsibilities

  • Conduct research and development on large-scale LLM training infrastructure and efficiency
  • Design and optimize distributed training strategies for LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters
  • Investigate system reliability and resilience techniques, such as fast checkpointing, fault tolerance, and failure diagnosis for long-running training workloads
  • Research and optimize network, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements
  • Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods
  • Bridge cutting-edge research and large-scale production deployment by translating research ideas into scalable, real-world AI infrastructure solutions

Minimum Qualifications

  • Experience with large-scale distributed training for LLMs
  • Strong programming skills in Python and/or C++
  • Strong background in ML systems / training infrastructure development
  • Proficiency in parallelism strategies (DDP, FSDP, model/pipeline/expert parallelism)
  • Solid understanding of training stack internals (PyTorch, CUDA, NCCL)
  • Experience in performance optimization (memory, communication, throughput)

Preferred Qualifications

  • Hands-on experience with distributed training frameworks and large-scale LLM infrastructure
  • Experience leading or mentoring engineering teams or cross-functional projects
  • Publications in top-tier AI, systems, or HPC conferences (ICML, OSDI, SOSP, NSDI, SIGCOMM, MLSys) or strong open-source contributions
  • Familiarity with benchmarking AI accelerators or large-scale LLM evaluation (e.g., ByteMLPerf)

About the Company

Beijing ByteDance Technology Co Ltd