Infrastructure SRE Architect & Engineering Lead

Ampcus Incorporated

Chicago, IL

JOB DETAILS
SKILLS
Amazon Web Services (AWS), Ansible, Automation, Automation Systems, Backlog Prioritization, Budget Management, Budgeting, Business Services, Change Control, Cloud Computing, Cloud Storage, Communication Skills, Consulting, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, Cost Control, Cross-Functional, DevOps, Embedded Systems, Event Correlation, Incident Management, Incident Response, Instrumentation, Leadership, Management Strategy, Metrics, Microsoft Windows Azure, Operational Measurement, Performance Metrics, Predictive Modeling, Process Improvement, Productivity Model, Reliability Analysis, Reliability Engineering, Reporting Dashboards, Risk Analysis, Service Delivery, Software Engineering, Splunk, Systems Analysis, Technical Leadership, Telemetry, Trend Analysis
LOCATION
Chicago, IL
POSTED
1 day ago

Ampcus Inc. is a certified global provider of a broad range of technology and business consulting services. We are seeking a highly motivated candidate to join our talented team.

Job Title: Infrastructure SRE Architect & Engineering Lead

Location(s): Chicago, IL (Remote)

Job Description
The Infrastructure SRE Architect & Engineering Lead is responsible for defining and driving the enterprise-scale reliability, observability, and automation strategy across infrastructure services. This role operates at the intersection of architecture, engineering, and operations—establishing standards, guiding engineering practices, and ensuring that reliability engineering principles are embedded into day-to-day service delivery.

As a senior technical leader, this role challenges traditional operations models by introducing measurable reliability frameworks, advanced observability patterns, and automation-driven operations. The position requires leading cross-functional transformation initiatives, influencing platform and infrastructure teams, and continuously improving service resilience, performance, and efficiency through data-driven insights and engineering discipline.

Responsibilities
  • Define and govern enterprise observability and reliability engineering standards, including SLO frameworks, service health models, and instrumentation strategies.
  • Lead the design and evolution of observability architectures, including dashboards, alerting strategies, and telemetry integration patterns.
  • Establish and drive reliability practices such as SLO management, error budget governance, and proactive risk identification.
  • Oversee the development and scaling of automation capabilities, including self-healing workflows, validation pipelines, and configuration compliance controls.
  • Provide technical leadership for reliability analysis, including identification of systemic risks, failure patterns, and resilience gaps across infrastructure domains.
  • Drive continuous improvement through structured analytics, including performance trends, capacity insights, and cost optimization opportunities.
  • Partner with platform engineering and client stakeholders to evaluate and implement new observability and automation capabilities.
  • Lead post-incident review frameworks focused on detection effectiveness, diagnostic quality, and prevention strategies.
  • Maintain and prioritize a strategic backlog of reliability and automation initiatives aligned to business objectives.
  • Mentor engineering teams and promote adoption of SRE principles, modern operational practices, and engineering-driven service delivery.
Required Skills
  • Strong expertise in Site Reliability Engineering (SRE) principles, including SLO/SLI design, error budget management, and reliability modeling.
  • Deep knowledge of observability platforms (e.g., Datadog, Dynatrace, Prometheus, Grafana, Splunk) and telemetry design (metrics, logs, traces).
  • Advanced experience designing automation solutions using tools such as Ansible, Terraform, or cloud-native orchestration frameworks.
  • Experience building and operationalizing monitoring, alerting, and incident response frameworks at scale.
  • Strong understanding of infrastructure platforms (cloud, compute, storage, network) and their reliability characteristics.
  • Demonstrated ability to perform system-level analysis, including trend analysis, capacity modeling, and failure pattern identification.
  • Experience leading large-scale engineering or transformation initiatives across distributed teams.
  • Strong stakeholder management and communication skills with the ability to influence senior technical and business leaders.
Desired Skills
  • Experience implementing SRE practices within managed services or enterprise IT operating models.
  • Familiarity with AIOps, event correlation, and predictive analytics platforms.
  • Experience with CI/CD pipelines and integrating observability and automation into software delivery lifecycles.
  • Knowledge of FinOps practices related to observability and telemetry cost optimization.
  • Exposure to platform engineering concepts and internal developer platforms (IDPs).
  • Relevant certifications such as AWS/Azure Architect, Google Professional Cloud DevOps Engineer, or Certified Kubernetes Administrator (CKA).
  • Experience defining and measuring operational KPIs tied to business outcomes and service performance.
 


Ampcus is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran status, or disability.

About the Company

Ampcus Incorporated

Ampcus Inc. is a global technology and business consulting firm specializing in Digital Transformation, Big Data, Analytics, Cyber Security, Testing, IV&V, Infrastructure Management, and Enterprise Solutions. Ampcus Inc. is an SBA 8(a) certified, women- and minority-owned global provider of a broad range of consulting services. From strategy to execution, our disciplined yet flexible approach starts and ends with our clients. By listening hard and working harder, we make their goals our goals. We are an ISO 9000, ISO 20000, ISO 27000 and CMMi Level certified company.

Ampcus consultants have significant business, engineering and technology experience. Our consultants have over 20 years of business experience and an average of over 10 years of engineering and technology experience. This means that the project teams understand how systems work and how the technology impacts the business processes of organizations.

We believe that the success of an engagement is determined by strong project management, clear communication, and a mutual commitment to working collaboratively. Our methodology begins by listening to the customer's needs, then working with their teams to gain a clear understanding of the requirements, while providing a knowledge transfer of best practices for the organization. As a recognized leader providing customized software services, management and engineering solutions to companies around the world, our ability to deliver is a "granted" that makes companies put their trust in us to answer their day-to-day business challenges and put them on a path to greater success. We are the choice for our clients because we look at our clients' business from a growth perspective.

Industry: Information Technology and Services

Specialties: Digital Transformation, Big Data and Analytics, Infrastructure Management Services, Testing and IV&V, Cyber Security, Active Directory and E-mail Infrastructure, Project Management, Training, ERP, CRM, EAI, and BI

COMPANY SIZE
500 to 999 employees
INDUSTRY
Staffing/Employment Agencies
WEBSITE
http://www.ampcus.com