Site Reliability Engineer

CoreWork Staffing

Florida, Florida

JOB DETAILS

SKILLS

Access Control, Amazon Web Services (AWS), Architectural Services, Automation, Bash Scripting, Best Practices, Capacity Management, Capacity Utilization, Cloud Computing, Computer Programming, Computer Science, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, Corrective Action, Cross-Functional, Database Technology, DevOps, Distributed Computing, Docker, Engineering Management, GCP (Google Cloud Platform), Go Programming Language (Golang), High Availability, ISO (International Organization for Standardization), Identify Issues, Incident Management, Incident Response, Information Technology & Information Systems, Microservices, Microsoft Azure, NoSQL, On Call, Operational Improvement, Operations Guidelines, Performance Analysis, Performance Management, Performance Metrics, Performance Tuning/Optimization, Product Support, Production Support, Production Systems, Python Programming/Scripting Language, Regulatory Compliance, Reliability Engineering, Reporting Dashboards, Resource Utilization, Risk Management, Root Cause Analysis, SQL (Structured Query Language), Scripting (Scripting Languages), Service Level Agreement (SLA), Software Engineering, Systems Administration/Management, Systems Analysis, Systems Engineering, Systems Reliability, Systems Scalability, U.S. National Institute of Standards and Technology (NIST)

POSTED

1 day ago

Site Reliability Engineer (SRE)

Position Overview

We are seeking a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and availability of mission-critical systems and services. This role bridges software engineering and infrastructure operations, focusing on building and maintaining highly reliable distributed systems in cloud-native environments.

The ideal candidate has strong experience in systems engineering, cloud infrastructure, automation, incident response, and performance optimization. They are passionate about improving system reliability through automation, observability, and engineering best practices.

Location Requirement

To support collaboration with engineering and operations teams, candidates must currently reside in one of the following metropolitan areas in the United States:

Dallas
Houston
Austin
Atlanta
Jacksonville
Miami
Nashville
Charlotte
Phoenix

Candidates outside of these locations will not be considered.

Key Responsibilities

System Reliability & Engineering

Design, build, and maintain highly reliable and scalable distributed systems
Define and implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs)
Improve system uptime, resilience, and fault tolerance across services
Identify and eliminate single points of failure in infrastructure and applications
Participate in capacity planning and scalability engineering

Cloud Infrastructure & Platform Engineering

Manage and optimize cloud infrastructure across AWS, Azure, and/or GCP
Work with containerized environments using Docker and Kubernetes
Automate infrastructure provisioning using Infrastructure as Code (Terraform, CloudFormation, etc.)
Optimize cloud resource utilization, performance, and cost efficiency
Support hybrid and multi-cloud infrastructure environments

Observability, Monitoring & Alerting

Implement and maintain monitoring, logging, and tracing systems
Build dashboards and alerting systems using observability tools (e.g., Prometheus, Grafana, Datadog, ELK stack)
Analyze system performance metrics and identify bottlenecks
Improve detection and response to system anomalies
Keep alerts actionable and low-noise

Incident Response & Production Support

Lead and participate in incident response and on-call rotations
Troubleshoot production issues and perform root cause analysis (RCA)
Implement corrective and preventive actions to reduce recurrence
Coordinate cross-functional response during outages and incidents
Document incident reports and improve operational readiness

Automation & DevOps Engineering

Automate repetitive operational tasks and infrastructure processes
Improve CI/CD pipelines for reliability and efficiency
Implement self-healing systems and automated remediation
Collaborate with DevOps teams to improve deployment workflows
Reduce manual operational overhead through tooling and scripting

Security, Reliability & Compliance

Collaborate with Security teams to ensure secure system configurations
Implement best practices for system hardening and access control
Support compliance requirements (SOC 2, ISO 27001, NIST, etc.)
Ensure secure handling of production systems and data
Participate in vulnerability remediation and risk reduction efforts

Collaboration & Engineering Partnership

Work closely with Software Engineers, DevOps, Security, and Cloud teams
Improve system design through reliability-focused architecture reviews
Advocate for reliability engineering best practices across teams
Contribute to engineering standards and operational guidelines
Support product teams in delivering stable and scalable services

Qualifications

Required

Bachelor's degree in Computer Science, Information Technology, Engineering, or related field
3+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or Systems Engineering
Strong experience with cloud platforms (AWS, Azure, and/or GCP)
Proficiency with Linux systems and networking fundamentals
Experience with containerization (Docker) and orchestration (Kubernetes)
Experience with CI/CD pipelines and automation tools
Strong scripting/programming skills (Python, Go, Bash, or similar)
Experience with monitoring and observability tools
Strong problem-solving and incident troubleshooting skills
Must currently reside in one of the approved locations listed above

Preferred (Nice-to-Have)

Experience with high-scale distributed systems
Knowledge of microservices and event-driven architectures
Familiarity with Infrastructure as Code (Terraform, Pulumi, CloudFormation)
Experience with SRE frameworks (Google SRE principles)
Knowledge of database systems (SQL/NoSQL) and performance tuning
Experience with service mesh technologies (Istio, Linkerd)
Familiarity with security practices in cloud-native environments
Experience with high-availability or mission-critical systems
Certifications in cloud or DevOps technologies

Key Performance Indicators (KPIs)

System Reliability

System uptime / availability percentage
Reduction in service outages and incidents
Achievement of SLO/SLA targets
Mean Time Between Failures (MTBF)

Incident Management

Mean Time to Detect (MTTD)
Mean Time to Recover (MTTR)
Reduction in the number of recurring incidents
Incident response effectiveness and resolution time

Performance & Scalability

System latency and throughput improvements
Capacity utilization and scaling efficiency
Performance optimization improvements
Infrastructure bottleneck reduction

Automation & Efficiency

Percentage of operational tasks automated
Reduction in manual intervention for production issues
CI/CD pipeline efficiency improvements
Deployment success rate and frequency

Collaboration & Engineering Impact

Engineering team satisfaction with platform reliability
Adoption of reliability best practices
Contribution to architectural improvements
Effectiveness in cross-team incident coordination

Reporting To

Head of Site Reliability Engineering
Director of Infrastructure
Cloud Engineering Manager
Head of Platform Engineering
Chief Technology Officer (CTO)

Employment Type & Work Setup

Full-Time
Remote (Candidates must reside in approved locations)
Hybrid opportunities may be available based on business needs
Participation in on-call rotation for production systems
Agile and DevOps-driven engineering environment

Work Environment & Conditions

High-availability, production-critical systems environment
Collaboration with Engineering, DevOps, Cloud, and Security teams
Strong focus on automation, observability, and reliability engineering
Fast-paced environment supporting scalable distributed systems
Continuous improvement culture with emphasis on resilience and performance
Career growth opportunities into Senior SRE, Platform Engineering Lead, or Infrastructure Architect roles

About the Company

CoreWork Staffing
