Senior Observability Engineer

Artech LLC

Woodland Hills, CA

JOB DETAILS
SALARY
$40–$48 Per Hour
SKILLS
AWS Lambda, Amazon Simple Storage Service (S3), Amazon Web Services (AWS), Analysis Skills, Application Programming Interface (API), Artificial Intelligence (AI), Budgeting, CPU (Central Processing Unit), Capacity Analysis, Centralized Operations/Management, Cloud Computing, Continuous Deployment/Delivery, Continuous Integration, Cost Control, Data Import/Export, Data Management, Distributed Computing, Docker, Fleet Management, Forecasting, GCP (Google Cloud Platform), Guidewire, High Availability, Identify Issues, Instrumentation, Insurance, Jenkins, Memory Hardware, Metrics, Microservices, Microsoft Outlook, Microsoft Azure, Microsoft Word, Multiplatform/Cross-Platform, Noise Reduction, Predictive Modeling, Reporting Dashboards, Root Cause Analysis, Salesforce.com, Service Level Agreement (SLA), Software as a Service (SaaS), Telemetry, Vehicle Fleets
LOCATION
Woodland Hills, CA
POSTED
2 days ago
Request ID: 94973-1
Title: Senior Observability Engineer

Location: Onsite, Woodland Hills, CA
Duration: 6+ Months with Possible Extension.
Pay Range: $40-$48/Hour on W2/C2C (all-inclusive)

 
Job Description:
Skills: AI Agents
Experience Required: 10+ years
 
Role overview:
We are seeking a seasoned Observability expert who does more than manage dashboards and who lives and breathes telemetry architecture. In this role, you will elevate customer observability maturity across infrastructure, applications, and business transactions.
You will own, design, and optimize the following core domains:


1. Operations & Noise Reduction
• Alert-to-Incident Signal Optimization: Analyze and optimize our Alert-to-Incident noise ratio (targeting a baseline better than 10:1). Drive the evolution from chaotic alerting to high-fidelity, actionable incident creation.
• Dynamic Baselining & Anomaly Detection: Shift the paradigm away from rigid static thresholds. Implement dynamic baselines that account for time-of-day, day-of-week, and seasonal traffic patterns.
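
To give candidates a feel for the dynamic baselining work, here is a minimal sketch, not a description of our platform. It assumes hourly request-rate samples, groups them by (weekday, hour), and flags a value when it deviates from that bucket's median by more than k times the median absolute deviation. The function names, the synthetic traffic, and the defaults (k=5.0, min_spread=1.0) are all illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def build_baseline(samples):
    """Group (timestamp, value) samples by (weekday, hour).

    Returns {(weekday, hour): (median, median_absolute_deviation)}.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    baseline = {}
    for key, values in buckets.items():
        med = median(values)
        mad = median([abs(v - med) for v in values])
        baseline[key] = (med, mad)
    return baseline


def is_anomalous(baseline, ts, value, k=5.0, min_spread=1.0):
    """Compare a value against the baseline for its own weekday/hour bucket."""
    key = (ts.weekday(), ts.hour)
    if key not in baseline:
        return False  # no history for this bucket, so do not alert
    med, mad = baseline[key]
    # min_spread avoids a zero-width band when history is perfectly flat
    return abs(value - med) > k * max(mad, min_spread)


# Synthetic history (an assumption for illustration): four weeks of hourly
# samples, busy during weekday business hours and quiet otherwise.
start = datetime(2024, 1, 1)  # a Monday
history = []
for i in range(4 * 7 * 24):
    ts = start + timedelta(hours=i)
    week = i // (7 * 24)
    busy = ts.weekday() < 5 and 9 <= ts.hour < 17
    history.append((ts, (1000 if busy else 100) + 10 * week))

baseline = build_baseline(history)
monday_10am = datetime(2024, 1, 29, 10)
monday_3am = datetime(2024, 1, 29, 3)
print(is_anomalous(baseline, monday_10am, 1040))  # normal business-hours load
print(is_anomalous(baseline, monday_3am, 1040))   # same value at 03:00
```

The same request rate is treated as normal at 10:00 and as an anomaly at 03:00, which a single static threshold cannot express.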


2. Guardrails, Standards, & Observability-as-Code
• Observability-as-Code (OaC): Drive the maturity of our telemetry infrastructure by ensuring all dashboards, alerts, SLOs, and monitor configurations are defined, versioned, and deployed as code.
• CI/CD Instrumentation Gates: Establish and enforce automated instrumentation compliance gates within our deployment pipelines to ensure code is observable before it hits production.
• Fleet Health Management: Centrally manage, version, and monitor the health of our OpenTelemetry (OTel) collectors and agent fleets.


3. Advanced Diagnostics & Next-Gen Tech
• Automated Root Cause Analysis (RCA): Implement platform capabilities that automatically surface the probable root cause the moment an incident fires.
• Change & Deployment Correlation: Ensure all deployments, configuration changes, feature flag toggles, and database migrations are automatically annotated on dashboards and correlated to active incident timelines.
• GenAI/LLM-Assisted Triage: Evaluate and adopt GenAI/LLM capabilities for advanced log pattern explanation and accelerated incident troubleshooting.


4. Telemetry Architecture & Data Strategy
• Cloud-Native & Third-Party Monitoring: Ensure deep telemetry integration across cloud-managed services (AWS/Azure/GCP, EKS/AKS, Lambda, RDS) and critical third-party SaaS dependencies (e.g., Guidewire, Salesforce, Earnix, Uniphore, payment gateways).
• Lakehouse & Data Pipeline Integration: Architect pipelines to export raw telemetry data to our data lakehouse (S3/ADLS) to power advanced ML pipelines and predictive analytics.
• Predictive Capacity Analytics: Leverage the observability platform for capacity forecasting, predicting utilization trends for CPU, memory, queue depth, and storage before saturation occurs.
• Log Standardization: Drive org-wide standards for log structure and serialization to ensure seamless cross-platform parsing and querying.
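
As an illustration of the capacity forecasting described above, the sketch below fits a least-squares line to daily utilization percentages and estimates how many days remain before a saturation limit is reached. It is a simplified example under stated assumptions: the function name, the 90% limit, and the sample data are hypothetical, and real utilization rarely follows a straight line.

```python
def days_until_saturation(daily_utilization, limit=90.0):
    """Estimate days until a linear trend crosses `limit` (percent).

    Returns None when the trend is flat or decreasing, and 0.0 when the
    fitted value for the latest day is already at or above the limit.
    """
    n = len(daily_utilization)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(daily_utilization) / n
    # Ordinary least-squares slope and intercept
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_utilization))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    current = intercept + slope * (n - 1)
    if current >= limit:
        return 0.0
    return (limit - current) / slope


# Hypothetical disk utilization growing two points per day
print(days_until_saturation([50, 52, 54, 56, 58]))
```

A production forecast would normally account for seasonality and confidence intervals; the linear fit only shows the shape of the calculation.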


5. Culture, SLOs, & Business Impact
• End-to-End Business Transaction Tracing: Map and trace complex, multi-service customer journeys (e.g., policy quote → bind → pay) to provide full-context business transaction visibility.
• SLO/SLA Governance: Define, implement, and track Service Level Objectives (SLOs) across all production services.
• Developer Empowerment & Self-Service: Democratize observability by fostering a proactive culture where developers instrument their own services during active development, backed by standardized, self-service health dashboards.
  • Monitoring, logging, tracing design (metrics, logs, traces)
  • Dashboarding, alerting, and telemetry pipelines
  • Observability platform design & optimization
  • Root Cause Analysis (RCA), incident analysis
  • SLO / SLI / SLA definition and error budgets
  • Strong understanding of AWS / Azure / GCP environments
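
For the SLO and error budget item in the list above, a minimal sketch of the underlying arithmetic may help set expectations. It assumes a request-based availability SLO; the function name, field names, and figures are illustrative and not taken from any specific platform.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compute the error budget for a request-based availability SLO.

    slo_target is a fraction such as 0.999 (99.9% of requests succeed).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
    }


# Hypothetical 30-day window: 1,000,000 requests, 250 failures, 99.9% SLO
budget = error_budget(0.999, 1_000_000, 250)
print(f"consumed {budget['budget_consumed']:.0%} of the error budget")
```

With these numbers about a quarter of the budget is spent, which is the kind of signal used to decide whether a release can proceed or reliability work takes priority.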
 
Expertise in:
  • Microservices architecture
  • Distributed systems & event-driven systems
  • High availability & scalability patterns
  • CI/CD pipelines (GitLab, Jenkins)
  • Infrastructure as Code (Terraform, CloudFormation)
  • Containerization (Docker, Kubernetes troubleshooting)
  • Release observability & rollback readiness
Advanced / Differentiator Skills:
  • AIOps / AI-driven observability
  • Predictive alerting / anomaly detection
  • Observability cost optimization
  • Chaos engineering basics
  • API & integration observability
     
Company Benefits & Culture
  • Inclusive and diverse work environment
  • Opportunities for professional growth and development
  • Comprehensive health and wellness benefits

I appreciate your quick response. Please feel free to reach out with any questions you may have.

Thanks

 

About the Company

Artech LLC