Engineering Manager – Kubernetes Platform (AI / Distributed Compute)
Location: Dallas, TX (Hybrid)
•Competitive base salary + performance bonus
•100% company-paid benefits
Overview
We are seeking an Engineering Manager to lead the development and evolution of a large-scale Kubernetes platform supporting compute-intensive workloads across distributed environments.
This role blends technical leadership with hands-on platform expertise, focusing on building highly reliable, high-performance infrastructure that supports advanced data processing, AI/ML workloads, and large-scale compute operations. You will play a key role in shaping the platform strategy, guiding architectural decisions, and driving continuous improvement across performance, scalability, and automation.
The ideal candidate is a strong technical leader with experience managing engineering teams while remaining close to system design and platform engineering challenges.
Key Responsibilities
Team Leadership & Technical Direction
•Lead, mentor, and grow a team of engineers responsible for platform development and operations
•Define technical direction, roadmap, and best practices across platform engineering initiatives
•Provide hands-on guidance in system design, performance optimization, and infrastructure strategy
Platform Architecture & Performance
•Design and evolve Kubernetes-based infrastructure supporting high-throughput, distributed workloads
•Optimize resource allocation, workload scheduling, and system performance across shared compute environments
•Ensure platform scalability, reliability, and efficient utilization of compute resources
Automation & Reliability
•Drive automation across infrastructure and platform operations using Infrastructure-as-Code and CI/CD practices
•Establish and enhance observability, monitoring, and incident response processes
•Define and track key performance and reliability metrics across large-scale environments
Cross-Functional Collaboration
•Partner with engineering, data, and infrastructure teams to integrate storage, networking, and compute systems
•Collaborate on system design decisions involving distributed storage, high-speed networking, and runtime environments
•Engage with external partners and vendors to improve tooling and platform capabilities
Capacity Planning & Operations
•Oversee platform health, capacity planning, and long-term scalability across distributed infrastructure
•Ensure operational readiness for high-demand workloads and evolving system requirements
Required Experience
•7+ years of experience in platform engineering, infrastructure engineering, or site reliability engineering (SRE)
•2+ years of experience leading or managing engineering teams
•Strong experience operating Kubernetes in large-scale production environments
•Experience supporting compute-intensive workloads (e.g., AI/ML, data processing, or distributed systems)
•Deep understanding of Linux systems, networking fundamentals, and performance optimization
•Experience working with shared, multi-tenant infrastructure environments
•Hands-on experience with Infrastructure-as-Code tools (e.g., Terraform, Ansible)
•Familiarity with observability and monitoring tools (e.g., Prometheus, Grafana, logging platforms)
•Strong communication skills with the ability to align technical execution with business objectives
Preferred Experience
•Familiarity with workload orchestration or scheduling frameworks (e.g., Slurm or similar)
•Experience with container runtimes such as containerd or CRI-O
•Exposure to distributed storage systems or high-performance networking concepts
•Contributions to open-source projects within Kubernetes, infrastructure, or AI/ML ecosystems