Infrastructure SRE Architect & Engineering Lead

Apolis

Chicago, IL

JOB DETAILS

SALARY

$65–$68 Per Hour

SKILLS

Amazon Web Services (AWS), Ansible, Automation, Automation Systems, Backlog Prioritization, Budget Management, Budgeting, Business Services, Change Control, Cloud Computing, Cloud Storage, Communication Skills, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, Cost Control, Cross-Functional, DevOps, Embedded Systems, Event Correlation, Incident Management, Incident Response, Instrumentation, Leadership, Management Strategy, Metrics, Microsoft Windows Azure, Operational Measurement, Performance Metrics, Predictive Modeling, Process Improvement, Productivity Model, Reliability Analysis, Reliability Engineering, Reporting Dashboards, Risk Analysis, Service Delivery, Software Engineering, Splunk, Systems Analysis, Technical Leadership, Telemetry, Trend Analysis

LOCATION

Chicago, IL

POSTED

1 day ago

Role: Infrastructure SRE Architect & Engineering Lead
Location: USA Remote

Job Description:
The Infrastructure SRE Architect & Engineering Lead is responsible for defining and driving the enterprise-scale reliability, observability, and automation strategy across infrastructure services. This role operates at the intersection of architecture, engineering, and operations—establishing standards, guiding engineering practices, and ensuring that reliability engineering principles are embedded into day-to-day service delivery.
As a senior technical leader, this role challenges traditional operations models by introducing measurable reliability frameworks, advanced observability patterns, and automation-driven operations. The position requires leading cross-functional transformation initiatives, influencing platform and infrastructure teams, and continuously improving service resilience, performance, and efficiency through data-driven insights and engineering discipline.

Responsibilities

Define and govern enterprise observability and reliability engineering standards, including SLO frameworks, service health models, and instrumentation strategies
Lead the design and evolution of observability architectures, including dashboards, alerting strategies, and telemetry integration patterns
Establish and drive reliability practices such as SLO management, error budget governance, and proactive risk identification
Oversee the development and scaling of automation capabilities, including self-healing workflows, validation pipelines, and configuration compliance controls
Provide technical leadership for reliability analysis, including identification of systemic risks, failure patterns, and resilience gaps across infrastructure domains
Drive continuous improvement through structured analytics, including performance trends, capacity insights, and cost optimization opportunities
Partner with platform engineering and client stakeholders to evaluate and implement new observability and automation capabilities
Lead post-incident review frameworks focused on detection effectiveness, diagnostic quality, and prevention strategies
Maintain and prioritize a strategic backlog of reliability and automation initiatives aligned to business objectives
Mentor engineering teams and promote adoption of SRE principles, modern operational practices, and engineering-driven service delivery

Required Skills

Strong expertise in Site Reliability Engineering (SRE) principles, including SLO/SLI design, error budget management, and reliability modeling
Deep knowledge of observability platforms (e.g., Datadog, Dynatrace, Prometheus, Grafana, Splunk) and telemetry design (metrics, logs, traces)
Advanced experience designing automation solutions using tools such as Ansible, Terraform, or cloud-native orchestration frameworks
Experience building and operationalizing monitoring, alerting, and incident response frameworks at scale
Strong understanding of infrastructure platforms (cloud, compute, storage, network) and their reliability characteristics
Demonstrated ability to perform system-level analysis, including trend analysis, capacity modeling, and failure pattern identification
Experience leading large-scale engineering or transformation initiatives across distributed teams
Strong stakeholder management and communication skills with the ability to influence senior technical and business leaders

Desired Skills

Experience implementing SRE practices within managed services or enterprise IT operating models
Familiarity with AIOps, event correlation, and predictive analytics platforms
Experience with CI/CD pipelines and integrating observability and automation into software delivery lifecycles
Knowledge of FinOps practices related to observability and telemetry cost optimization
Exposure to platform engineering concepts and internal developer platforms (IDPs)
Relevant certifications such as AWS/Azure Architect, Google Professional Cloud DevOps Engineer, or Certified Kubernetes Administrator (CKA)
Experience defining and measuring operational KPIs tied to business outcomes and service performance

About the Company

Apolis

Since 1996, RJT has provided successful SAP, Oracle, and IT consulting solutions and staffing services to clients around the world. The new Apolis brings you the same personalized service fortified with a greater array of IT solutions, global expertise, and cost-management strategies.

We are a global IT consultancy that seamlessly integrates experts and leading-edge solutions into your organization so you can focus on what really matters.

COMPANY SIZE

500 to 999 employees

INDUSTRY

Computer/IT Services

EMPLOYEE BENEFITS

Paid Sick Days, Employee Referral Program, Employee Events, Retirement / Pension Plans

WEBSITE

https://www.apolisrises.com/
