We are seeking a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and availability of mission-critical systems and services. This role bridges software engineering and infrastructure operations, focusing on building and maintaining highly reliable distributed systems in cloud-native environments.
The ideal candidate has strong experience in systems engineering, cloud infrastructure, automation, incident response, and performance optimization. They are passionate about improving system reliability through automation, observability, and engineering best practices.
To support collaboration with engineering and operations teams, candidates must currently reside in one of the following metropolitan areas in the United States:
Dallas
Houston
Austin
Atlanta
Jacksonville
Miami
Nashville
Charlotte
Phoenix
Candidates outside of these locations will not be considered.
Design, build, and maintain highly reliable and scalable distributed systems
Define and implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs)
Improve system uptime, resilience, and fault tolerance across services
Identify and eliminate single points of failure in infrastructure and applications
Participate in capacity planning and scalability engineering
Manage and optimize cloud infrastructure across AWS, Azure, and/or GCP
Work with containerized environments using Docker and Kubernetes
Automate infrastructure provisioning using Infrastructure as Code (Terraform, CloudFormation, etc.)
Optimize cloud resource utilization, performance, and cost efficiency
Support hybrid and multi-cloud infrastructure environments
Implement and maintain monitoring, logging, and tracing systems
Build dashboards and alerting systems using observability tools (e.g., Prometheus, Grafana, Datadog, ELK stack)
Analyze system performance metrics and identify bottlenecks
Improve detection of and response to system anomalies
Keep alerting actionable and low-noise
Lead and participate in incident response and on-call rotations
Troubleshoot production issues and perform root cause analysis (RCA)
Implement corrective and preventive actions to reduce recurrence
Coordinate cross-functional response during outages and incidents
Write incident reports and use their findings to improve operational readiness
Automate repetitive operational tasks and infrastructure processes
Improve CI/CD pipelines for reliability and efficiency
Implement self-healing systems and automated remediation
Collaborate with DevOps teams to improve deployment workflows
Reduce manual operational overhead through tooling and scripting
Collaborate with Security teams to ensure secure system configurations
Implement best practices for system hardening and access control
Support compliance requirements (SOC 2, ISO 27001, NIST, etc.)
Ensure secure handling of production systems and data
Participate in vulnerability remediation and risk reduction efforts
Work closely with Software Engineers, DevOps, Security, and Cloud teams
Improve system design through reliability-focused architecture reviews
Advocate for reliability engineering best practices across teams
Contribute to engineering standards and operational guidelines
Support product teams in delivering stable and scalable services
Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field
3+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or Systems Engineering
Strong experience with cloud platforms (AWS, Azure, and/or GCP)
Proficiency with Linux systems and networking fundamentals
Experience with containerization (Docker) and orchestration (Kubernetes)
Experience with CI/CD pipelines and automation tools
Strong scripting/programming skills (Python, Go, Bash, or similar)
Experience with monitoring and observability tools
Strong problem-solving and incident troubleshooting skills
Must currently reside in one of the approved locations listed above
Experience with high-scale distributed systems
Knowledge of microservices and event-driven architectures
Familiarity with Infrastructure as Code (Terraform, Pulumi, CloudFormation)
Experience applying established SRE practices (e.g., Google SRE principles)
Knowledge of database systems (SQL/NoSQL) and performance tuning
Experience with service mesh technologies (Istio, Linkerd)
Familiarity with security practices in cloud-native environments
Experience with high-availability or mission-critical systems
Certifications in cloud or DevOps technologies
System uptime / availability percentage
Reduction in service outages and incidents
Achievement of SLO/SLA targets
Mean Time Between Failures (MTBF)
Mean Time to Detect (MTTD)
Mean Time to Recover (MTTR)
Reduction in the number of recurring incidents
Incident response effectiveness and resolution time
System latency and throughput improvements
Capacity utilization and scaling efficiency
Measured gains from performance optimization work
Infrastructure bottleneck reduction
Percentage of operational tasks automated
Reduction in manual intervention for production issues
CI/CD pipeline efficiency improvements
Deployment success rate and frequency
Engineering team satisfaction with platform reliability
Adoption of reliability best practices
Contribution to architectural improvements
Effectiveness in cross-team incident coordination
Head of Site Reliability Engineering
Director of Infrastructure
Cloud Engineering Manager
Head of Platform Engineering
Chief Technology Officer (CTO)
Full-Time
Remote (Candidates must reside in approved locations)
Hybrid opportunities may be available based on business needs
Participation in the on-call rotation for production systems
Agile and DevOps-driven engineering environment
High-availability, production-critical systems environment
Collaboration with Engineering, DevOps, Cloud, and Security teams
Strong focus on automation, observability, and reliability engineering
Fast-paced environment supporting scalable distributed systems
Continuous improvement culture with emphasis on resilience and performance
Career growth opportunities into Senior SRE, Platform Engineering Lead, or Infrastructure Architect roles