Position Responsibilities

As a Senior Staff Engineer, you will:
- Lead the strategy and execution for incident retrospective and correction of error (COE) processes across the engineering organization
- Help conduct deep technical root cause analysis and incident forensics across distributed systems using observability data, logs, metrics, and traces
- Establish continuous improvement loops through automated trend analysis, pattern recognition algorithms, and predictive analytics
- Design, code, and deploy automation platforms and self-service tools using Python, Go, Java, or C# that scale incident retrospective workflows and eliminate manual tracking
- Build production-grade data pipelines, analytics systems, and real-time dashboards to measure incident trends, COE effectiveness, and action item completion rates
- Write code for workflow automation, integrations with observability platforms, and APIs that connect incident management tools across the engineering ecosystem
- Leverage SQL and NoSQL databases to store, query, and analyze incident data at scale using Azure tools and cloud-native services
- Develop and maintain systems that ensure rigorous follow-through on action items, remediation plans, and preventive measures with automated tracking
- Partner with service engineering teams to implement preventive measures and architectural improvements based on incident patterns
- Present data-driven insights and incident trend analysis to leadership and engineering teams to drive preventive action
- Influence and educate leadership on incident patterns, prevention strategies, and reliability best practices
- Mentor engineers on coding best practices and automation techniques, and strengthen technical expertise across the engineering community
- Stay current with industry advances in SRE, observability, incident management, and automation; educate teams on emerging practices

Qualifications

- Experience building automation platforms and self-service tools for workflow management, analytics, or engineering productivity
- Fluency in at least two modern languages such as Python, Go, Java, C++, or C#, including object-oriented design
- Experience building microservices architectures, REST APIs, and distributed systems
- Experience with data pipelines, analytics platforms, and visualization tools for operational metrics and KPIs
- Experience with SQL and NoSQL databases (e.g., PostgreSQL, MongoDB, Cassandra, CosmosDB) for data storage and analytics
- Experience with observability platforms (Prometheus, Grafana, Datadog, Splunk, ELK) and distributed systems monitoring, logging, and tracing
- Experience with cloud providers (Azure, AWS, or GCP) and cloud-native architectures
- Experience with CI/CD pipelines, infrastructure as code, and container orchestration (Kubernetes, Docker)
- Experience writing workflow automation code (YAML pipelines, GitHub Actions, Azure DevOps pipelines)
- Strong understanding of distributed systems architecture, design patterns, reliability, and scaling
- Knowledge of retrospective facilitation, continuous improvement processes, and blameless culture principles
- Strong architecture and design skills with the ability to influence engineering direction and the technical roadmap
- Experience solving complex analytical problems with data-driven approaches
- Proven ability to partner with cross-functional engineering teams and drive systemic improvements
- Excellent communication skills with the ability to present technical insights to leadership and influence decision-making
- Experience leveraging GenAI or LLMs is a plus

Experience

- 10+ years of professional platform development or general development experience
- 8+ years of experience with architecture and design
- 6+ years of experience with open-source frameworks
- 4+ years of experience with AWS, GCP, Azure, or another cloud service

Education

- Bachelor's degree in Computer Science, Information Systems, or equivalent education or work experience

Annual Salary

The above annual salary range is a general guideline.

Position Description

The Senior Staff Engineer in Availability and Incident Management will engineer solutions and empower the engineering community with automated processes, data-driven insights, and technical tools that reduce incident recurrence, improve system reliability, and accelerate incident resolution.