Lead Site Reliability Engineer

McGraw-Hill Education

New York, NY (remote)

JOB DETAILS

SALARY

$124,000–$155,000 Per Year

SKILLS

Agile Programming Methodologies, Amazon Web Services (AWS), Analysis Skills, Automation, Cloud Computing, Continuous Deployment/Delivery, Continuous Integration, DevOps, Ecosystems, Enterprise Applications, GitHub, Health Maintenance, Identify Issues, Injections, Internet Security, Leadership, Network Configuration Management, Performance Analysis, Performance Metrics, Problem Solving Skills, Production Support, Reliability Engineering, Risk, Root Cause Analysis, Software Administration, Software Development, Software Engineering, Sprint Planning, Systems Engineering, Team Lead/Manager, Telemetry

LOCATION

New York, NY

POSTED

Today

Could your creative thinking build the future? A Lead Site Reliability Engineer at McGraw Hill makes a difference for learners and educators across the world. Our team needs individuals with new ideas who connect with people in innovative ways.

Impact the Moment

McGraw Hill, a leading provider of digital educational resources and content, is seeking a Lead Site Reliability Engineer to lead a team of six engineers for our Digital Platform Group. You will support our K‑12 learning platforms that serve millions of students and educators nationwide, ensuring their reliability, scalability, and performance. Working closely with engineering and product teams, you will leverage expertise in AWS, Terraform, and observability tools to drive automation, enhance resiliency, and maintain the health of our cloud‑based infrastructure.

Remote position: open to applicants authorized to work in the United States.

What you will be doing:

- Lead a 6‑member SRE team supporting production infrastructure and services
- Manage backlog, sprint planning, and team velocity
- Own reliability, uptime, security, cost, and performance of services
- Define and monitor SLOs for application workloads
- Plan on‑call rotations and work to reduce alert fatigue
- Forecast seasonal growth and plan capacity
- Mentor engineers and foster professional growth
- Report status and issues to leadership monthly
- Partner with development teams
- Collaborate with CyberSecurity on risk mitigation
- Collaborate with FinOps on cost reduction
- Design and troubleshoot highly‑distributed, cloud‑based production systems
- Maintain infrastructure‑as‑code and monitoring‑as‑code practices
- Improve system resiliency through failure injection and chaos testing
- Participate in on‑call rotation and resolve operational issues
- Optimize existing systems for performance and cost
- Ensure telemetry provides visibility into application performance
- Support agile development practices and code reviews

We're looking for someone with:

- 5+ years of experience in SRE, DevOps, or Software Engineering roles supporting enterprise applications.
- Strong problem‑solving, triage, and root cause analysis skills with a systems engineering mindset.
- Deep expertise in the AWS ecosystem, with hands‑on experience across core services including ECS, RDS, EKS, IAM, CloudWatch, and networking configurations.
- Expertise with Terraform for managing and automating scalable cloud infrastructure.
- Skill with CI/CD pipelines (e.g., GitHub Actions) and managing end‑to‑end software delivery lifecycles.
- Strong familiarity with telemetry and observability tools (e.g., New Relic, Datadog), including querying logs and metrics for performance monitoring.

Why work for us?

The work you do at McGraw Hill will matter. We are collectively designing content that will build the future of education. Play your part and experience a sense of fulfillment that will inspire you to even greater heights.

The pay range for this position is between $124,000 and $155,000 annually. Base pay may vary based on experience and location. A full range of medical and other benefits may be provided.

