Site Reliability Engineer

CoreWork Staffing

Florida, Florida

JOB DETAILS
SKILLS
Access Control, Amazon Web Services (AWS), Architectural Services, Automation, Bash Scripting, Best Practices, Capacity Management, Capacity Utilization, Cloud Computing, Computer Programming, Computer Science, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, Corrective Action, Cross-Functional, Database Technology, DevOps, Distributed Computing, Docker, Engineering Management, GCP (Google Cloud Platform), Go Programming Language (Golang), High Availability, ISO (International Organization for Standardization), Identify Issues, Incident Management, Incident Response, Information Technology & Information Systems, Microservices, Microsoft Azure, NoSQL, On Call, Operational Improvement, Operations Guidelines, Performance Analysis, Performance Management, Performance Metrics, Performance Tuning/Optimization, Product Support, Production Support, Production Systems, Python Programming/Scripting Language, Regulatory Compliance, Reliability Engineering, Reporting Dashboards, Resource Utilization, Risk Management, Root Cause Analysis, SQL (Structured Query Language), Scripting (Scripting Languages), Service Level Agreement (SLA), Software Engineering, Systems Administration/Management, Systems Analysis, Systems Engineering, Systems Reliability, Systems Scalability, U.S. National Institute of Standards and Technology (NIST)
LOCATION
Florida, Florida
POSTED
1 day ago

Site Reliability Engineer (SRE)

Position Overview

We are seeking a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and availability of mission-critical systems and services. This role bridges software engineering and infrastructure operations, focusing on building and maintaining highly reliable distributed systems in cloud-native environments.

The ideal candidate has strong experience in systems engineering, cloud infrastructure, automation, incident response, and performance optimization. They are passionate about improving system reliability through automation, observability, and engineering best practices.

Location Requirement

To support collaboration with engineering and operations teams, candidates must currently reside in one of the following metropolitan areas in the United States:

  • Dallas

  • Houston

  • Austin

  • Atlanta

  • Jacksonville

  • Miami

  • Nashville

  • Charlotte

  • Phoenix

Candidates outside of these locations will not be considered.

Key Responsibilities

System Reliability & Engineering

  • Design, build, and maintain highly reliable and scalable distributed systems

  • Define and implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs)

  • Improve system uptime, resilience, and fault tolerance across services

  • Identify and eliminate single points of failure in infrastructure and applications

  • Participate in capacity planning and scalability engineering
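
For candidates less familiar with how SLIs and SLOs relate, the following is a minimal sketch in Python, not a description of any tooling used in this role. It assumes an availability SLI measured as the fraction of successful requests and a 30-day SLO window. The function names and the example figures are hypothetical and chosen only for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime allowed (in minutes) by an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be in (0, 1) to have a budget")
    allowed_failure = 1.0 - slo_target
    observed_failure = 1.0 - sli
    return 1.0 - observed_failure / allowed_failure


if __name__ == "__main__":
    # Hypothetical figures: a 99.9% SLO and 500 failures in 1,000,000 requests.
    sli = availability_sli(999_500, 1_000_000)
    print(f"Allowed downtime: {error_budget_minutes(0.999):.1f} min per 30 days")
    print(f"Measured SLI: {sli:.4%}")
    print(f"Error budget remaining: {budget_remaining(sli, 0.999):.0%}")
```

In this sketch the SLO is the target, the SLI is the measurement, and the error budget is the gap between the target and 100%. An SLA would typically be a contractual commitment set looser than the internal SLO.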

Cloud Infrastructure & Platform Engineering

  • Manage and optimize cloud infrastructure across AWS, Azure, and/or GCP

  • Work with containerized environments using Docker and Kubernetes

  • Automate infrastructure provisioning using Infrastructure as Code (Terraform, CloudFormation, etc.)

  • Optimize cloud resource utilization, performance, and cost efficiency

  • Support hybrid and multi-cloud infrastructure environments

Observability, Monitoring & Alerting

  • Implement and maintain monitoring, logging, and tracing systems

  • Build dashboards and alerting systems using observability tools (e.g., Prometheus, Grafana, Datadog, ELK stack)

  • Analyze system performance metrics and identify bottlenecks

  • Improve detection of and response to system anomalies

  • Ensure alerts are actionable and low-noise

Incident Response & Production Support

  • Lead and participate in incident response and on-call rotations

  • Troubleshoot production issues and perform root cause analysis (RCA)

  • Implement corrective and preventive actions to reduce recurrence

  • Coordinate cross-functional response during outages and incidents

  • Document incident reports and improve operational readiness

Automation & DevOps Engineering

  • Automate repetitive operational tasks and infrastructure processes

  • Improve CI/CD pipelines for reliability and efficiency

  • Implement self-healing systems and automated remediation

  • Collaborate with DevOps teams to improve deployment workflows

  • Reduce manual operational overhead through tooling and scripting

Security, Reliability & Compliance

  • Collaborate with Security teams to ensure secure system configurations

  • Implement best practices for system hardening and access control

  • Support compliance requirements (SOC 2, ISO 27001, NIST, etc.)

  • Ensure secure handling of production systems and data

  • Participate in vulnerability remediation and risk reduction efforts

Collaboration & Engineering Partnership

  • Work closely with Software Engineers, DevOps, Security, and Cloud teams

  • Improve system design through reliability-focused architecture reviews

  • Advocate for reliability engineering best practices across teams

  • Contribute to engineering standards and operational guidelines

  • Support product teams in delivering stable and scalable services

Qualifications

Required

  • Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field

  • 3+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or Systems Engineering

  • Strong experience with cloud platforms (AWS, Azure, and/or GCP)

  • Proficiency with Linux systems and networking fundamentals

  • Experience with containerization (Docker) and orchestration (Kubernetes)

  • Experience with CI/CD pipelines and automation tools

  • Strong scripting/programming skills (Python, Go, Bash, or similar)

  • Experience with monitoring and observability tools

  • Strong problem-solving and incident troubleshooting skills

  • Must currently reside in one of the approved locations listed above

Preferred (Nice-to-Have)

  • Experience with high-scale distributed systems

  • Knowledge of microservices and event-driven architectures

  • Familiarity with Infrastructure as Code (Terraform, Pulumi, CloudFormation)

  • Experience with SRE frameworks (Google SRE principles)

  • Knowledge of database systems (SQL/NoSQL) and performance tuning

  • Experience with service mesh technologies (Istio, Linkerd)

  • Familiarity with security practices in cloud-native environments

  • Experience with high-availability or mission-critical systems

  • Certifications in cloud or DevOps technologies

Key Performance Indicators (KPIs)

System Reliability

  • System uptime / availability percentage

  • Reduction in service outages and incidents

  • Achievement of SLO/SLA targets

  • Mean Time Between Failures (MTBF)

Incident Management

  • Mean Time to Detect (MTTD)

  • Mean Time to Recover (MTTR)

  • Reduction in the number of recurring incidents

  • Incident response effectiveness and resolution time
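
As an illustration of how the detection and recovery metrics above might be computed, here is a minimal Python sketch. It assumes each incident records when impact started, when it was detected, and when it was resolved, and it measures MTTR from the start of impact to resolution (some teams measure from detection instead). The `Incident` record and the sample timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with three timestamps."""
    started: datetime   # when user impact began
    detected: datetime  # when monitoring or a person noticed
    resolved: datetime  # when service was restored


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """MTTD: average delay between start of impact and detection."""
    seconds = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_to_recover(incidents: list[Incident]) -> timedelta:
    """MTTR: average time from start of impact to resolution (assumed definition)."""
    seconds = [(i.resolved - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    incidents = [
        Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4),
                 datetime(2024, 1, 5, 10, 40)),
        Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 10),
                 datetime(2024, 1, 9, 14, 50)),
    ]
    print("MTTD:", mean_time_to_detect(incidents))
    print("MTTR:", mean_time_to_recover(incidents))
```

Both functions raise `statistics.StatisticsError` on an empty list, which is a reasonable signal that there is nothing to average for the period.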

Performance & Scalability

  • System latency and throughput improvements

  • Capacity utilization and scaling efficiency

  • Performance optimization improvements

  • Infrastructure bottleneck reduction

Automation & Efficiency

  • Percentage of operational tasks automated

  • Reduction in manual intervention for production issues

  • CI/CD pipeline efficiency improvements

  • Deployment success rate and frequency

Collaboration & Engineering Impact

  • Engineering team satisfaction with platform reliability

  • Adoption of reliability best practices

  • Contribution to architectural improvements

  • Effectiveness in cross-team incident coordination

Reporting To

  • Head of Site Reliability Engineering

  • Director of Infrastructure

  • Cloud Engineering Manager

  • Head of Platform Engineering

  • Chief Technology Officer (CTO)

Employment Type & Work Setup

  • Full-Time

  • Remote (Candidates must reside in approved locations)

  • Hybrid opportunities may be available based on business needs

  • Participation in on-call rotation for production systems

  • Agile and DevOps-driven engineering environment

Work Environment & Conditions

  • High-availability, production-critical systems environment

  • Collaboration with Engineering, DevOps, Cloud, and Security teams

  • Strong focus on automation, observability, and reliability engineering

  • Fast-paced environment supporting scalable distributed systems

  • Continuous improvement culture with emphasis on resilience and performance

  • Career growth opportunities into Senior SRE, Platform Engineering Lead, or Infrastructure Architect roles


About the Company

CoreWork Staffing