Automation, Best Practices, Cataloguing, Coaching, Continuous Improvement, Ecosystems, Event Management, ITIL (IT Infrastructure Library), Incident Management, Incident Response, Knowledge Management, Metadata, Microsoft Windows Azure, Network Operations Center, On Call, Performance Metrics, ServiceNow, Splunk, Standards Development
Pay Rate: $65/hour
Duration: 6 Months
Location: Fairfield, CA
Responsibilities:
- Establish and maintain a department-wide alert rationalization framework.
- Lead continuous improvement efforts to reduce alert fatigue while preserving the detection of true incidents.
- Define and enforce alerting standards including severity definitions, required metadata, naming conventions, and routing rules.
- Create a standardized Alert Design Checklist and approval workflow.
- Act as the gatekeeper who determines whether an alert routes to 24x7 Eyes-on-Glass, on-call engineering, or business-hours handling.
- Establish a consistent approach to cataloging response instructions for actionable alerts.
- Define and publish KPIs demonstrating alerting health and operational performance.
- Facilitate governance forums with service owners and engineering leads to review alert quality and backlog.
- Coach service teams on best practices and drive adoption of observability patterns.
Requirements:
- Minimum 5 years of experience in IT Operations, SRE, Observability, Monitoring Engineering, or Incident Management.
- Demonstrated success in reducing noise and improving actionability across enterprise alerting ecosystems.
- Experience with common monitoring/observability tools such as Splunk, AppDynamics, Dynatrace, Datadog, Prometheus/Grafana, Azure Monitor, CloudWatch, or ServiceNow Event Management.
- Strong understanding of incident response workflows, operational coverage models, CMDB/service ownership concepts, and knowledge management.
- Excellent stakeholder management skills and ability to drive standards across teams.
Preferred Skills:
- Experience designing or operating an Operations Command Center / NOC / SOC-style “eyes-on-glass” model.
- Familiarity with ITIL Event Management, SRE principles, and service reliability practices.
- Experience with automation for alert enrichment, correlation, and routing.
- Background in governance frameworks and operating rhythm design.
Axelon Services Corporation