Artificial Intelligence (AI), Benchmarking, CUDA (Compute Unified Device Architecture), Caching, Distributed Computing, GPU (Graphics Processing Unit), Incident Response, Load Testing, Performance Analysis, Performance Modeling, Performance Testing, Performance Tuning/Optimization, Python Programming/Scripting Language
Indent: SF_OP_204606-1-1
Role: Senior AI Platform / LLM Infrastructure Engineer
Location: Charlotte, NC (Hybrid)
Rate: $75/hr - $77/hr
We are hiring a Senior AI Platform Engineer to build and optimize on-prem LLM inference platforms. The role focuses on high-performance model serving, GPU workloads, and scalable ML infrastructure using modern inference frameworks and Kubernetes.
Must-Have Skills
• LLM Inference Frameworks: vLLM, TensorRT-LLM, Triton Inference Server, SGLang
• Model Optimization: Continuous Batching, Speculative Decoding, KV Cache / Prefix Caching, FP8 / AWQ / GPTQ
• Distributed/Parallel Systems: Tensor Parallelism
• Platform & Orchestration: Kubernetes, KServe, OpenShift AI, Helm / Operators
• GPU & Performance: CUDA, NCCL, MIG, GPU Orchestration (Run:AI)
• Monitoring: Prometheus, Grafana, ML Observability
• Programming: Python
• GenAI Tools: Arize AI, Claude (CoWork)
• Load / Performance Testing: GuideLLM, Locust
Key Responsibilities
• Build and manage LLM inference platforms running on on-premises GPU infrastructure
• Optimize model performance using advanced inference techniques (batching, caching, quantization)
• Deploy and operate ML workloads on Kubernetes (KServe/OpenShift AI)
• Enable GPU scheduling and orchestration for large-scale workloads
• Implement monitoring and performance benchmarking frameworks
• Drive SRE practices for platform reliability and scalability (observability, incident handling)
• Collaborate with AI/ML teams to enable production-grade GenAI deployments