Research Engineer - LLM/VLM Inference Optimization (Seed Infra)

Beijing ByteDance Technology Co Ltd

Seattle, WA

JOB DETAILS
SKILLS
Algorithms, Artificial Intelligence (AI), C Programming Language, C++ Programming Language, CPU (Central Processing Unit), CUDA (Compute Unified Device Architecture), Computer Science, Concurrency, Data Modeling, Data Structures, Debugging Skills, Ecosystems, Electrical Engineering Software, GPU (Graphics Processing Unit), Inference Engine, Kernel Programming, Large-Scale Systems, Machine Learning, OpenCL, Parallel Computing, Performance Analysis, Performance Modeling, Performance Tuning/Optimization, Python Programming/Scripting Language, Reinforcement Learning, Software Engineering, Systems/Internals Programming

About the Team

The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

Responsibilities

  1. Design, develop, and optimize high-performance inference systems for large-scale LLMs and VLMs, covering inference engines, serving frameworks, and end-to-end deployment pipelines.
  2. Build state-of-the-art model inference engines through advanced performance optimization techniques such as compiler-level optimizations, parallel computing, graph fusion, efficient CUDA kernel development, low-precision computation, streaming inference, speculative decoding, and high-concurrency request optimization.
  3. Collaborate closely with other research teams to identify performance bottlenecks, conduct in-depth performance analysis, and optimize large models; contribute to the development of model toolchains and the broader technical ecosystem.

Minimum Qualifications:

  1. Bachelor's degree or above in Computer Science, Electrical Engineering, Software Engineering, or a related field.
  2. Strong proficiency in C/C++ and Python; solid foundations in algorithms, data structures, and systems programming; familiarity with containerization and server-side debugging.
  3. Hands-on experience with at least one mainstream machine learning framework (e.g., PyTorch, TensorFlow).
  4. Experience deploying or optimizing LLM/VLM inference at production scale, with demonstrated impact on latency, throughput, or serving cost.
  5. Familiarity with GPU architecture and experience optimizing compute-intensive operators (e.g., FlashAttention, GEMM, GEMV, Conv2D).

Preferred Qualifications:

  1. Experience with large-scale LLM serving infrastructure or equivalent production LLM deployment experience.
  2. Experience in GPU programming (CUDA/OpenCL) and familiarity with frameworks such as TensorRT, Triton, or CUTLASS.
  3. Experience in performance modeling, profiling, and optimization, or strong knowledge of CPU/GPU architectures.
  4. Familiarity with model/data parallelism frameworks for distributed inference.
