Research Engineer - LLM Infra training - Seed Infra

Beijing ByteDance Technology Co Ltd

Seattle, WA

JOB DETAILS

SKILLS

Artificial Intelligence (AI), Benchmarking, C++ Programming Language, CUDA (Compute Unified Device Architecture), Computer Programming, Conferences, Cross-Functional, GPU (Graphics Processing Unit), Leadership, Memory Hardware, Memory Management, Mentoring, Open Source, Performance Analysis, Performance Management, Performance Tuning/Optimization, Process Improvement, Publications, Python Programming/Scripting Language, Reinforcement Learning, Research & Development (R&D), Research Skills, Systems Reliability

LOCATION

Seattle, WA

POSTED

30+ days ago

Team Information: The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

Responsibilities

Conduct research and development on large-scale LLM training infrastructure and efficiency
Design and optimize distributed training strategies for LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters
Investigate system reliability and resilience techniques, such as fast checkpointing, fault tolerance, and failure diagnosis for long-running training workloads
Research and optimize network, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements
Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods
Bridge cutting-edge research and large-scale production deployment by translating research ideas into scalable, real-world AI infrastructure solutions

Minimum Qualifications

Experience with large-scale distributed training for LLMs
Strong programming skills in Python and/or C++
Strong background in ML systems / training infrastructure development
Proficiency in parallelism strategies (DDP, FSDP, model/pipeline/expert parallelism)
Solid understanding of training stack internals (PyTorch, CUDA, NCCL)
Experience in performance optimization (memory, communication, throughput)

Preferred Qualifications

Hands-on experience with distributed training frameworks and large-scale LLM infrastructure
Experience leading or mentoring engineering teams or cross-functional projects
Publications in top-tier AI, systems, or HPC conferences (ICML, OSDI, SOSP, NSDI, SIGCOMM, MLSys) or strong open-source contributions
Familiarity with benchmarking AI accelerators or large-scale LLM evaluation (e.g., ByteMLPerf)
