HPC Kubernetes Architect

GTN Technical Staffing

Dallas, TX

Posted: 30+ days ago

Location: Dallas, TX (Hybrid) - relocation assistance available

$175-250K base + bonus + 100% company-paid benefits

Overview
We are seeking an HPC Kubernetes Architect to lead the design and adoption of GPU-accelerated Kubernetes platforms supporting AI/ML, simulation, and high-performance computing workloads.

This is a customer-facing architecture role responsible for translating complex workload requirements into scalable, high-performance platform designs. You will guide customers across the full solution lifecycle, from discovery and architecture through proof-of-concept, deployment, and long-term optimization. You will also partner closely with internal engineering and product teams to shape next-generation HPC and AI infrastructure.

This role is ideal for someone who combines deep Kubernetes and GPU platform expertise with strong consultative and customer engagement skills.

Key Responsibilities

Customer Architecture & Solution Delivery
•Serve as the primary architectural advisor for customers adopting Kubernetes-based HPC and AI platforms
•Translate workload, performance, and scaling requirements into reference architectures and deployable solutions
•Lead proof-of-concept engagements, including benchmarking, workload profiling, and performance validation
•Guide customers through onboarding, deployment, and integration with enterprise and HPC environments
•Present architecture strategies in technical workshops, design sessions, and customer engagements

Kubernetes & GPU Platform Architecture
•Design and optimize Kubernetes clusters for GPU-intensive workloads using NVIDIA GPU Operator, DCGM, and device plugins
•Implement GPU scheduling strategies including MIG partitioning, GPU sharing, and advanced workload placement
•Extend Kubernetes functionality through custom operators and controllers (Go/Python)
•Integrate HPC schedulers such as Slurm or Volcano with Kubernetes environments

Infrastructure Integration & Performance
•Architect end-to-end platform integration across compute, storage, networking, and orchestration layers
•Support high-performance storage systems (Lustre, GPFS, Ceph, VAST) within Kubernetes environments
•Design and optimize high-performance networking (InfiniBand, RDMA, RoCE) for containerized workloads
•Drive performance tuning across compute, network, and storage to maximize throughput and efficiency

Security, Multi-Tenancy & Governance
•Design secure, multi-tenant Kubernetes environments with RBAC, namespace isolation, and policy enforcement
•Implement governance frameworks using OPA/Gatekeeper and workload-level controls
•Ensure alignment with enterprise security and compliance requirements

Observability, Automation & DevOps
•Implement monitoring and telemetry using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry
•Support GitOps-based CI/CD workflows using ArgoCD, FluxCD, Helm, and Kustomize
•Contribute to infrastructure-as-code practices and platform automation

Cross-Functional & Ecosystem Collaboration
•Partner with product, engineering, and operations teams to align customer needs with the platform roadmap
•Collaborate with ecosystem vendors (e.g., NVIDIA, networking and storage partners) to integrate emerging technologies
•Provide forward-looking guidance on GPU, interconnect, and orchestration trends

Required Experience
•Extensive experience designing and operating Kubernetes platforms in HPC or GPU-intensive environments
•Deep expertise with the NVIDIA GPU ecosystem (GPU Operator, device plugins, MIG, DCGM)
•Strong understanding of Kubernetes internals (CRDs, controllers, RBAC, scheduling)
•Experience integrating distributed storage systems with Kubernetes for HPC workloads
•Experience with high-performance networking (InfiniBand, RDMA, RoCE) in containerized environments
•Proven ability to design scalable, secure, and resilient architectures for AI/ML and HPC workloads
•Proficiency in Go or Python for automation and operator development
•Experience with performance tuning, workload profiling, and benchmarking
•Strong customer-facing and solution design experience

Preferred Experience
•Experience delivering end-to-end customer solutions from design through production deployment
•Familiarity with containerized HPC tools (e.g., Singularity/Apptainer)
•Experience with GitOps and infrastructure automation practices
•Contributions to Kubernetes or NVIDIA-related open-source projects
•Kubernetes certifications (CKA, CKAD, CKS) and/or cloud certifications (AWS, Azure)
•Advanced degree in Computer Science, Engineering, or a related field

This is a high-impact role at the intersection of Kubernetes, AI infrastructure, and high-performance computing, offering the opportunity to shape how next-generation compute platforms are designed, deployed, and scaled.

About the Company

GTN Technical Staffing