ARM (Advanced RISC Machine), Amazon Relational Database Service (RDS), Amazon Web Services (AWS), Analysis Skills, Architectural Services, Artificial Intelligence (AI), Business Services, CPU (Central Processing Unit), Cloud Computing, Computer Science, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, Contract Requirements, Cost Control, Database Administration, Database Analysis, Go Programming Language (Golang), Hybrid Cloud, IT Service Management (ITSM), Identity Data Management, Information Technology & Information Systems, Instrumentation, Java, Leadership, Lift/Move 25 Pounds, Memory Hardware, Metrics, Microsoft .NET, Microsoft SQL Server, Microsoft Windows Azure, Middleware, Multi-tier Architecture, MySQL, Network Monitoring, Network Performance/Analysis, NoSQL, Node.js, On Call, Operational Audit, Oracle Database, Performance Analysis, Performance Engineering, Performance Metrics, Platform as a Service (PaaS), PostgreSQL, Problem Solving Skills, Process Improvement, Python Programming/Scripting Language, Query Analysis, Redis, Reliability Engineering, Replication and Remote Mirroring, Reporting Dashboards, Risk, Risk Analysis, Root Cause Analysis, SQL (Structured Query Language), Securities and Exchange Commission (SEC), Service Level Agreement (SLA), ServiceNow, Software Engineering, Splunk, Technical Leadership, Technical Recruiting, Telemetry, Trend Analysis, World Wide Web Consortium (W3C)
Job Summary
The Cloud Engineer - Senior (Observability) supports the SEC ISS contract by engineering, operating, and continuously improving the enterprise observability platform across hybrid cloud and containerized environments. This hands-on role involves instrumenting services with distributed tracing, code-level profiling, and custom metrics; building and tuning Datadog (or comparable) dashboards, alerts, APM, log pipelines, RUM, and synthetic monitors; and using telemetry to solve production performance, reliability, and capacity problems. The engineer collaborates with cloud, platform, and application teams to embed observability into Azure, AWS, and container platforms (OpenShift/Kubernetes), aiming to reduce alert noise, mean time to detect (MTTD), and mean time to resolve (MTTR). This position provides senior technical leadership for APM/distributed tracing strategy, SLO/SLI engineering, and data-driven operational decision-making in a 24x7x365 operating environment.
Primary Responsibilities
- Observability Platform Engineering
  - Engineer and operate the enterprise observability stack (Datadog or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring.
  - Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signals and minimize noise.
  - Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C TraceContext propagation, and unified service tagging across the estate.
  - Develop and maintain integrations between observability platforms, ITSM (ServiceNow), CI/CD pipelines, and on-call/paging workflows.
  - Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost-controlled.
- Cloud and Container Monitoring Engineering
  - Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services.
  - Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/Redis), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health -- correlating database spans with upstream APM traces.
  - Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM.
  - Build standardized, reusable monitoring modules deployable via infrastructure-as-code (Terraform, Bicep, ARM) and CI/CD.
  - Support hybrid visibility across on-premises, cloud, and containerized workloads with correlated telemetry.
- Performance Engineering and Problem Solving
  - Lead data-driven investigation and resolution of complex performance, latency, saturation, and reliability issues across the estate.
  - Use APM distributed traces, service/dependency maps, continuous code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM-to-backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and downstream dependencies.
  - Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence.
  - Define and implement trace-based SLOs, deployment tracking, and change-correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes.
  - Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root-cause analysis, and owning post-incident observability gaps.
- Capacity, Reliability, and Continuous Improvement
  - Analyze operational telemetry and trend data to identify capacity risks, recurring constraints, and opportunities for efficiency.
  - Build and maintain capacity and performance dashboards and reports that communicate posture, risk, and recommendations to technical and leadership stakeholders.
  - Define capacity thresholds, alert baselines, and trigger points for scaling, technology refresh, and resource reallocation.
  - Drive continuous improvement of observability coverage, alert quality, runbook linkage, and operational maturity aligned to SEC SLA/KPI expectations.
Required Qualifications
- Citizenship/Work Authorization: Must meet contract requirements.
- Clearance: Ability to obtain and maintain SEC Public Trust (or higher if required).
- Education: Bachelor's degree in a relevant field (e.g., Information Technology, Computer Science, Engineering).
- Experience:
  - Minimum 8 years of experience in IT infrastructure or platform engineering roles, including 5 years focused on observability, performance engineering, or site reliability engineering.
  - Demonstrated experience engineering and operating an enterprise observability platform (Datadog strongly preferred; equivalent experience with Dynatrace, New Relic, Splunk Observability, or Grafana/Prometheus stacks considered).
  - Proven experience building APM and distributed tracing coverage for production multi-tier applications -- including language-specific tracer deployment, custom instrumentation of business transactions, service/dependency mapping, continuous profiling, and RUM-to-backend trace correlation -- across cloud and containerized workloads.
  - Proven experience leading complex production performance and reliability problem-solving from telemetry to remediation.
Location: This position is based in a vibrant city, offering a rich blend of cultural, recreational, and professional opportunities.
Benefits
PEAK's benefit offerings for associates include medical, dental, and vision coverage, a Flexible Spending Account (FSA), a Dependent Care Savings Account (DCA), and a 401(k) plan. PEAK believes that taking care of our team is essential for success, and we are proud to provide benefits that enhance both your well-being and your future. Additionally, our associates may be eligible for Paid Sick Leave as required by federal, state, or local laws.
Equal Opportunity Employer (EEO)
PEAK Technical Staffing is committed to creating a diverse and inclusive environment and is proud to be an Equal Opportunity Employer. PEAK does not discriminate on the basis of race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other characteristic protected by applicable law. All employment decisions are made based on qualifications, merit, and business need. We encourage all individuals to apply.
Americans with Disabilities Act (ADA)
The physical and mental requirements described in this job description are representative of those that must be met by an employee to successfully perform the essential functions of the position. Reasonable accommodations may be made to enable qualified individuals with disabilities to perform the essential functions. Must be able to perform the essential physical functions of the position, including sitting, standing, walking, stooping, kneeling, and lifting up to 25 pounds, with or without reasonable accommodation.
Candidate Privacy
To read our Candidate Privacy Information Statement, which explains how we will use your information, please navigate to https://peaktechnical.com/privacy-policy/ and https://peaktechnical.com/ca-residents-privacy-rights/
AI Recruiting Disclosure
We use AI-assisted tools to help review applications and compare your experience to job requirements, but all hiring decisions are made by human recruiters. You may request a human-only process or opt out of automated communication at any time. Required notices and our latest bias audit are available on our website: www.peaktechnical.com/ai-disclosure.