Lead DevOps/MLOps Engineer

Razor Talent

Reston, VA (remote)

JOB DETAILS
SALARY
$120,000–$160,000 Per Year
SKILLS
Amazon Web Services (AWS), Automation, Autoscaling, Continuous Deployment/Delivery, Continuous Integration, DevOps, Distributed Computing, Docker, GPU (Graphics Processing Unit), Image Management, Machine Tool, Metrics, Reporting Dashboards
LOCATION
Reston, VA (remote)
POSTED
16 days ago

We're looking for a strong DevOps engineer who can help scale and operationalize our infrastructure as the platform grows. This is not a pure platform-architecture role; the focus is CI/CD, infrastructure automation, deployment reliability, observability, and GPU-oriented workload scaling.
What You'll Own
  • Improve CI/CD pipelines, deployment workflows, and release reliability
  • Standardize infrastructure and deployment patterns across environments
  • Improve observability through logging, metrics, tracing, dashboards, and rollout monitoring
  • Partner closely with backend engineering on:
    • deployment strategies
    • infrastructure automation
    • environment consistency
    • migration workflows
    • possible Kubernetes migration efforts
  • Support ML-oriented infrastructure as a secondary responsibility:
    • SageMaker workloads
    • Ray clusters
    • GPU scaling patterns
    • distributed batch execution
    • autoscaling behavior
    • runtime/image management
    • artifact delivery/versioning
The Kind of Problems You'll Work On
  • Deployment safety and rollback strategies
  • Infrastructure consistency across environments
  • Release automation and environment promotion flows
  • Autoscaling and runtime stability
  • GPU workload orchestration and scaling efficiency
  • Operational tooling that reduces friction for engineering teams
Stack
  • AWS
  • Terraform
  • Docker
  • Kubernetes
  • CI/CD systems
  • SageMaker
  • Ray
  • GPU compute infrastructure
You'll Probably Do Well Here If
  • You've operated production infrastructure at meaningful scale
  • You're strong in practical DevOps execution and operational reliability
  • You care about automation, observability, and deployment safety
  • You're comfortable improving developer workflows and infrastructure tooling
  • You've worked with distributed systems or GPU-oriented workloads before

About the Company

Razor Talent