San Francisco, CA
Posted 30+ days ago
What You'll Be Working On:
• Leading and growing a team of SREs embedded within Crusoe's Managed AI and Managed Services product areas, setting technical direction and fostering a culture of ownership and continuous improvement
• Contributing as an IC: reviewing code, building tooling, and driving automation to reduce toil and improve the reliability and scalability of production managed services
• Owning SLA/SLO performance, incident response, and on-call health for managed service offerings; leading blameless post-mortems and driving systemic remediation
• Partnering with embedded product and platform engineering teams to influence infrastructure design, observability strategy, and operational readiness for new and existing managed services
• Defining and tracking reliability, performance, and operational maturity metrics across the team; translating data into prioritized roadmap investments
• Serving as a technical escalation point for high-severity production incidents affecting enterprise customers, and collaborating with Cloud Support and Customer Success on resolution and communication

You'll own the production health of the services Crusoe delivers to enterprise customers, including Managed Kubernetes, Managed Inference, and AutoClusters, partnering closely with embedded engineering teams to raise the bar on operational excellence, automation, and customer experience.