Senior Site Reliability Engineer (Remote, Poland)

Tech Insights

New York, NY (remote)

JOB DETAILS
SALARY
$18,800–$20,000 Per Year
SKILLS
AWS Lambda, Amazon Web Services (AWS), Application Programming Interface (API), Architectural Design, Architectural Services, Artificial Intelligence (AI), Artificial Intelligence (AI) Agents, Automation, Bash Scripting, Budgeting, Cloud Computing, Coaching, Computer Science, Continuous Deployment/Delivery, Continuous Integration, Cost Control, Cost Modeling, DevOps, Diversity, Docker, English Language, Establish Priorities, Fitness, GitHub, Incident Management, Incident Response, Java, Leadership, Machine Tool, Mentoring, Microservices, Multiplatform/Cross-Platform, Problem Solving Skills, Project/Program Management, Python Programming/Scripting Language, Registered Training Organisation (RTO), Reliability Engineering, Reporting Dashboards, Requirements Management, Risk, Semiconductors, Software Engineering, Software as a Service (SaaS), System Architecture, System Operations, Team Building, Team Lead/Manager, Technical Leadership, Training/Teaching, Willing to Travel
LOCATION
New York, NY
POSTED
3 days ago

WHY WORK WITH US
- Company-sponsored training and development opportunities
- Comprehensive benefits package (health, wellness, life insurance, fitness, English classes)
- Flexible vacation policy
- Community involvement opportunities through charitable alliances
- Wellness resources and support
- Inclusive environment that prioritizes diversity, equity, and accessibility
- High-growth company driven by high performance
- Expected salary range: 18,800 - 20,000 PLN

THE OPPORTUNITY
TechInsights is building the reliability and AI operations foundation for its next chapter: an AI‑first intelligence platform that runs the most demanding semiconductor intelligence workflows in the world. We're looking for a Senior Site Reliability Engineer who wants to own that foundation.

This role is a senior individual contributor at the technical leadership tier of our Site Reliability Engineering team. You'll own strategic reliability initiatives end‑to‑end: setting technical direction, defining SLOs and error budgets across our production platform, designing reliability patterns for the AI agent pipelines that power our platform's AI‑first capabilities, and enabling our development and AI Engineering teams to build and ship with confidence.

What sets this role apart is its scope. You're not just keeping the lights on; you're building the observability, Internal Developer Platform (IDP), and service catalog that a fast‑scaling AI platform needs from day one. You'll be the reliability voice in architectural decisions, the engineer who closes the loop between agent failure modes and platform resilience, and the mentor who builds the team's capability rather than their own indispensability.

If you have deep SRE experience and want to apply it to AI workloads (agent loop observability, blast radius management, LLM infrastructure reliability), this is the role where that expertise becomes a differentiator.

This role is a remote position for candidates based in Poland.

WHAT YOU'LL DO

Platform Reliability & AI Operations
- Own SLOs, SLIs, and error budgets for all production services; drive error budget discipline across engineering
- Design reliability patterns for AI agent pipelines: LLM observability, tool‑use tracking, failure detection, and graceful degradation
- Architect for blast radius containment: agent failures must have bounded customer impact through isolation, circuit breaking, and rapid recovery
- Mature our Canada Central/West active‑active architecture toward 24‑hour RTO with full regional failover
- Lead incident response and post‑incident reviews that produce durable fixes; maintain DR procedures through regular testing

Developer & AI Engineering Enablement
- Serve as the primary reliability liaison to Software and AI Engineering, translating requirements into actionable standards
- Partner with AI Engineering on compute provisioning, model serving, inference latency, and workload isolation
- Own CI/CD pipeline strategy (Bitbucket Pipelines, GitHub Actions): set standards, optimize deployment frequency, and ensure teams can ship confidently
- Drive IDP adoption and enable teams on SRE practices: on‑call readiness, SLO definition, runbook development, and self‑service tooling
- Represent reliability in architectural discussions; surface risk before it's committed to design

Observability, IDP & Service Catalog
- Own the service catalog: a living inventory of all services, AI agents, dependencies, ownership, and SLOs
- Operate Datadog as the single pane of glass for service health, infrastructure, and agentic pipeline telemetry
- Extend observability to AI workloads: LLM latency, token consumption, agent completion rates, and pipeline throughput
- Build golden path templates in Backstage and/or Atlassian Compass so teams ship reliably without routine SRE involvement
- Apply AIOps in Datadog to automate anomaly detection, incident triage, and remediation recommendations

FinOps, IaC & Continuous Improvement
- Own infrastructure as code via Terraform and GitOps; enforce IaC policy in partnership with Trust Assurance
- Own FinOps visibility into AWS cost segments; model cloud cost impact as AI/ML workloads scale
- Formally mentor junior and intermediate SRE engineers, with accountability for their technical growth and career progression
- Build AI‑assisted automation to progressively reduce toil and scale the team's operational capacity

WHAT YOU'LL BRING

Technical Requirements
- Bachelor's degree in Computer Science, Engineering, or equivalent combination of education and experience
- 6–8 years of progressive experience in site reliability engineering, platform engineering, or DevOps, with demonstrated technical leadership at the senior individual contributor level
- Deep expertise in AWS (EKS, Lambda, CloudWatch, AWS Config) and multi‑region architecture patterns
- Proficiency with Terraform and GitOps; experience with policy‑as‑code (Sentinel, OPA/Rego, or equivalent)
- Hands‑on Datadog experience at operational depth: dashboards, SLO tracking, alerting, log management, distributed tracing
- Strong containerization expertise: Docker, Kubernetes (EKS preferred)
- Proficiency in Python and/or Bash; experience building operational tooling; solid understanding of Java and Spring Boot microservice architecture sufficient to make reliability and deployment decisions for EKS‑hosted services
- Deep expertise in CI/CD pipeline design and optimization using Bitbucket Pipelines and GitHub Actions
- Familiarity with IDP tooling (Backstage, Atlassian Compass, or equivalent) strongly preferred
- Experience with AI/ML workload infrastructure, LLM API integration, or agentic system operations considered a strong asset

Professional Skills
- Leads and owns strategic reliability initiatives end‑to‑end with a high degree of autonomy; accountable for outcomes, not just tasks
- Sets technical direction and influences team and department strategy
- Solves complex, ambiguous reliability problems through systematic analysis and first‑principles thinking
- Formally mentors junior and intermediate engineers; builds team capability through coaching and knowledge transfer
- Communicates technical reliability concepts clearly to engineering, product, and leadership audiences
- Approaches operational work with an AI‑first posture: builds automation and intelligent tooling as the default

Preferred Qualifications
- Experience designing reliability architecture for agentic AI systems: agent loop observability, blast radius isolation, graceful degradation for LLM‑dependent services
- AWS certifications: Solutions Architect Professional, DevOps Engineer Professional, or equivalent
- FinOps Certified Practitioner or demonstrated cloud cost management experience at scale
- IDP implementation or developer experience program leadership
- Experience in semiconductor, SaaS, or data‑intensive platform environments
- Experience operating in environments with export‑controlled or regulated data
- Knowledge of BCP/DR program management and formal recovery testing

WORKING ARRANGEMENT
This is a remote position for candidates based in Poland. Occasional travel may be required.

About the Company

Tech Insights