Research Engineer Graduate (AI Training Systems Reliability & Performance - Seed Infra) - 2026 Start (PhD)

Beijing ByteDance Technology Co Ltd

Seattle, WA

JOB DETAILS

SKILLS

Artificial Intelligence (AI), C++ Programming Language, CUDA (Compute Unified Device Architecture), Communication Systems, Computer Engineering, Computer Programming, Computer Science, Debugging Tools, GPU (Graphics Processing Unit), Large-Scale Systems, Onboarding, Operational Audit, Performance Analysis, Performance Tuning/Optimization, Python Programming/Scripting Language, Reinforcement Learning, Reliability Engineering, Software Development, Systems Reliability, Systems Scalability

LOCATION

Seattle, WA

POSTED

30+ days ago

About the Team

The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

We are looking for talented individuals to join our team in 2026. As a graduate, you will get opportunities to pursue bold ideas, tackle complex challenges, and unlock limitless growth. Launch your career where inspiration is infinite at our company. Successful candidates must be able to commit to an onboarding date by the end of 2026. Please state your availability and graduation date clearly in your resume.

Responsibilities

Improve the reliability and performance of large-scale training systems across pre-training, fine-tuning, evaluation, and inference
Build observability, profiling, and debugging tools for distributed ML workloads
Identify and optimize performance bottlenecks across GPU, networking, and storage layers
Contribute to distributed training frameworks in multi-GPU and multi-node environments
Collaborate with model and infrastructure teams to improve system scalability and efficiency
Support incident analysis and operational stability

Minimum Qualifications

Individuals who are completing or have recently completed a PhD degree in Software Development, Computer Science, Computer Engineering, or a related technical discipline.
Strong programming skills in C++ and Python
Solid understanding of PyTorch training workflows and distributed runtime behavior
Familiarity with CUDA execution, NCCL communication, and GPU systems fundamentals

Preferred Qualifications

Experience with performance profiling and debugging tools (e.g., torch.profiler, Nsight)
Familiarity with distributed training or parallelization strategies (e.g., FSDP, Megatron-LM)
Ability to analyze and optimize performance in complex ML training systems

About the Company

Beijing ByteDance Technology Co Ltd
