Senior Site Reliability Engineer - Sunnyvale, CA, US

Kody

Sunnyvale, CA

JOB DETAILS

SKILLS

Amazon Web Services (AWS), Automation, Banking Services, Budget Management, Budgeting, Capacity and Performance Management, Cloud Architecture, Cloud Computing, Communication Skills, Cross-Functional, DevOps, Disaster Recovery, Distributed Computing, Establish Priorities, High Availability, Identify Issues, Incident Management, Incident Response, Leadership, Linux Operating System, Mentoring, Messaging Technology, Metrics, On Call, Operational Improvement, Operations Processes, PCI-DSS, Payment Processing, Performance Tuning/Optimization, PostgreSQL, Process Improvement, Production Systems, Redis, Regulatory Compliance, Reliability Engineering, Root Cause Analysis, Scalable System Development, Software Engineering, System Operations, Systems Scalability, Team Player, Technical Leadership, Telemetry

LOCATION

Sunnyvale, CA

POSTED

12 days ago

About the Role

Senior Site Reliability Engineer (Payments Infrastructure)
Kody is seeking a Senior Site Reliability Engineer to ensure the reliability, availability, scalability, and operational excellence of our global payment platform. You will own production observability, incident response, service-level management, and cloud infrastructure reliability across mission-critical payment processing systems operating in Europe, Asia, and North America.

Responsibilities

Participate in a follow-the-sun production on-call rotation as a primary incident responder.
Diagnose, triage, mitigate, and coordinate resolution of production incidents across payment services, Kubernetes platforms, databases, messaging systems, and cloud infrastructure.
Define and maintain SLOs, SLIs, error budgets, alerting standards, and operational readiness processes.
Drive reliability improvements through automation, observability, capacity planning, performance optimization, and post-incident reviews.
Partner with engineering teams to improve resilience, security, and operational maturity in PCI-DSS-regulated environments.
Lead incident management during SEV1/SEV2 events and improve response effectiveness and MTTR.

Requirements

5+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Cloud Infrastructure roles supporting mission-critical production systems.
Strong hands-on experience with AWS, Kubernetes (EKS), Terraform, PostgreSQL, Redis, Kafka, Linux, networking, and modern observability platforms.
Deep understanding of distributed systems, cloud-native architectures, high availability, disaster recovery, capacity planning, and performance optimization.
Proven experience operating payment, banking, fintech, or other highly regulated systems with stringent security, compliance, and uptime requirements.
Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, incident management, alert governance, and operational excellence.

Leadership & Operational Excellence

Demonstrates strong ownership and accountability, taking end-to-end responsibility for service reliability and customer impact.
Possesses a strong sense of urgency during production incidents while maintaining sound judgment and structured decision-making under pressure.
Applies a systematic and methodical approach to troubleshooting, root-cause analysis, and incident resolution in complex distributed environments.
Brings a data-driven mindset, using metrics, telemetry, trends, and service-level indicators to prioritize reliability investments and operational improvements.
Continuously drives engineering excellence through iterative improvement, automation, standardization, and elimination of operational toil.
Has a proven ability to lead cross-functional incident response efforts, coordinate stakeholders, and communicate effectively during high-severity production events.
Champions a culture of operational readiness, continuous learning, post-incident improvement, and blameless accountability.
Demonstrates strong mentoring and technical leadership skills, influencing engineering teams to build reliable, scalable, and resilient systems by design.

Benefits

Lead a dynamic, innovative team in a rapidly growing company.
Competitive package.
Collaborative, inclusive environment where your contributions are recognized and valued.

About the Company

Kody
