Database Operations: Manage database reliability, performance, and scaling (where not handled by dedicated DB teams)
Service Mesh & Networking: Implement and maintain service discovery, load balancing, and network policies
Developer Tools: Create and maintain tools and platforms that improve developer productivity and system reliability

Required Qualifications

Technical Skills
Programming Languages: Proficiency in at least two of: Python, Shell, PHP, Java, or similar languages
Cloud Platforms: Experience with AWS, GCP, or Azure infrastructure and services
Containerization: Hands-on experience with Docker, Kubernetes, and container orchestration
Monitoring & Observability: Experience with Prometheus, Grafana, ELK stack, or similar tools
Infrastructure as Code: Proficiency with Terraform, CloudFormation, or similar tools
Version Control: Expert-level Git usage and collaborative development practices

Operational Experience
Production Systems: 3+ years managing large-scale production environments
On-call Experience: Comfortable with 24/7 on-call responsibilities and incident response
System Administration: Strong Linux/Unix system administration skills
Networking: Understanding of TCP/IP, DNS, load balancing, and network security
Database Systems: Experience with SQL and NoSQL databases in production environments

SRE-Specific Knowledge
SLI/SLO Management: Experience defining and maintaining service level objectives
Error Budget Policy: Understanding of error budget concepts and implementation
Toil Reduction: Track record of identifying and eliminating repetitive manual work
Capacity Planning: Experience with performance testing and capacity management

Preferred Qualifications
Bachelor's degree in Computer Science, Engineering, or equivalent experience
Experience with microservices architecture and distributed systems
Knowledge of security best practices and compliance frameworks
Experience with chaos engineering and reliability testing
Previous experience in an SRE or DevOps role at a technology company
Contributions to open-source projects or technical communities

Success Metrics
Reliability: Maintain or improve service availability and reliability metrics
Toil Reduction: Measurable reduction in manual operational work through automation
Incident Response: Effective participation in incident response with focus on prevention
Code Quality: High-quality, well-tested code contributions to infrastructure and tooling
Collaboration: Effective partnership with development teams to improve system reliability

Team Culture & Values
Blameless Post-mortems: Learn from failures without blame or punishment
Automation First: Prefer automated solutions over manual processes
Measuring Everything: Data-driven decision making and continuous improvement
Sharing Knowledge: Document and share expertise across the team
Work-Life Balance: Sustainable on-call practices and reasonable operational load

Growth Opportunities
Opportunity to work on cutting-edge infrastructure and reliability challenges
Exposure to large-scale distributed systems and modern cloud technologies
Professional development budget for conferences, training, and certifications
Career progression path toward senior SRE, staff engineer, or management roles
Collaboration with engineering teams across the organization

Work Location: This role is fully remote for candidates who reside outside the 50-mile radius of our San Ramon office.

Performance Monitoring: Build alerting systems and performance monitoring to proactively identify and resolve issues before they impact users
Incident Response: Participate in on-call rotations and lead incident response efforts, including post-mortem analysis and remediation.