About the Team
The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.
We are looking for talented individuals to join our team in 2026. As a graduate, you will get opportunities to pursue bold ideas, tackle complex challenges, and unlock limitless growth. Launch your career where inspiration is infinite at our Company.
Successful candidates must be able to commit to an onboarding date by the end of 2026. Please state your availability and graduation date clearly in your resume.
Responsibilities
- Improve the reliability and performance of large-scale training systems across pre-training, fine-tuning, evaluation, and inference
- Build observability, profiling, and debugging tools for distributed ML workloads
- Identify and optimize performance bottlenecks across GPU, networking, and storage layers
- Contribute to distributed training frameworks in multi-GPU and multi-node environments
- Collaborate with model and infrastructure teams to improve system scalability and efficiency
- Support incident analysis and operational stability
Minimum Qualifications
- Currently completing, or recently completed, a PhD in Software Development, Computer Science, Computer Engineering, or a related technical discipline
- Strong programming skills in C++ and Python
- Solid understanding of PyTorch training workflows and distributed runtime behavior
- Familiarity with CUDA execution, NCCL communication, and GPU systems fundamentals
Preferred Qualifications
- Experience with performance profiling and debugging tools (e.g., torch.profiler, Nsight)
- Familiarity with distributed training or parallelization strategies (e.g., FSDP, Megatron-LM)
- Ability to analyze and optimize performance in complex ML training systems
Beijing ByteDance Technology Co Ltd