Technology - Infrastructure Engineer IV

Artech LLC

New York, NY

JOB DETAILS
SALARY
$85–$95 Per Hour
SKILLS
ARM (Advanced RISC Machine), Analysis Skills, Application Hosting, Application Programming Interface (API), Automation, Automotive Repair and Maintenance, BGP, Best Practices, Budgeting, Capacity and Performance Management, Cloud Applications, Cloud Computing, Communication Skills, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, DNS (Domain Name System), Data Collection, Data Management, Documentation, Event Correlation, Event Management, Finance, Financial Services, Git, Government, Healthcare, Healthcare Software, Hybrid Cloud, IT Service Management (ITSM), Identify Issues, Incident Management, Instrumentation, Leadership, Machine Tool, Metrics, Microsoft ASP.NET (Active Server Page), Microsoft Windows Azure, Multiplatform/Cross-Platform, Network Configuration Management, Network Performance/Analysis, Network Security, Performance Analysis, Problem Solving Skills, Process Improvement, Product/Service Launch, Regulatory Compliance, Reporting Dashboards, Right-Sizing, Root Cause Analysis, Scripting (Scripting Languages), Security Information and Event Management (SIEM), Security Monitoring, ServiceNow, Signal-to-noise Ratio (SNR), Software Engineering, Splunk, Telemetry, Test Automation, Testing, Trend Analysis, User Interface/Experience (UI/UX), Wide Area Network (WAN), Writing Skills
LOCATION
New York, NY
POSTED
8 days ago
Position: Cloud Monitoring and Observability Engineer (Azure)
Location: New York City, NY (full-time on-site)
Duration: 12 Months
 
Salary Range: $80.00 - $95.00/Hour on W2 (Without Benefits).
Applicants must be willing to work on W2.
 
Job Summary:
  • BNY is seeking a Cloud Monitoring and Observability Engineer to own the design, implementation, and continuous improvement of observability solutions for critical applications at the firm. This is a hands-on role for a self-directed problem solver who thrives with minimal guidance, proactively identifies gaps, and drives resolution end-to-end.
  • The ideal candidate brings deep expertise across leading observability platforms—spanning APM, NPM, infrastructure monitoring, and cloud-native telemetry—and applies industry best practices to deliver reliable, actionable insight across every layer of the stack.
  • You will partner with application engineering, cloud platform, network, and security teams while taking full ownership of monitoring strategy and execution.
Key Responsibilities:
  • Architect and continuously refine a unified observability strategy covering application, infrastructure, network, and user-experience layers within Azure and on-premises hosted applications, proactively identifying coverage gaps and driving improvements without waiting to be asked.
  • Integrate Azure-native telemetry (Azure Monitor, Log Analytics/KQL, Application Insights) with enterprise platforms (AppDynamics, Dynatrace, ThousandEyes, SolarWinds, Prometheus/Grafana) to deliver correlated, cross-domain visibility.
  • Define and enforce telemetry standards (metrics, logs, and distributed traces) aligned to SLIs and SLOs; establish data collection pipelines using OpenTelemetry and equivalent frameworks.
  • Build and maintain high-signal dashboards, synthetic tests, and alerting workflows; rigorously tune thresholds, anomaly detection, and de-duplication to maximize signal-to-noise ratio.
  • Instrument services with APM tooling for business-transaction tracing, dependency mapping, code-level diagnostics, and root-cause analysis.
  • Implement network performance and digital-experience monitoring (ThousandEyes, NetScout), including path visualization, BGP/DNS tests, and endpoint-agent configuration, to correlate network health with application performance.
  • Embed observability into CI/CD and infrastructure-as-code workflows so every new service launches with monitoring from day one.
  • Author and maintain runbooks, escalation paths, and post-incident review artifacts; lead data-driven root-cause analysis and remediation during incidents.
  • Perform capacity and performance trend analysis; deliver actionable recommendations for optimization, right-sizing, and resilience hardening.
  • Ensure all monitoring solutions satisfy security and compliance requirements; maintain audit-ready documentation and evidence.
Required Qualifications:
  • 5+ years designing and operating enterprise monitoring/observability for cloud or hybrid environments, including mission-critical applications.
  • Demonstrated ability to work independently: diagnosing complex, cross-domain issues, proposing solutions, and driving them to completion with minimal oversight.
Proven, production-level expertise with at least one tool in each category (or equivalent):
APM:
  • AppDynamics, Dynatrace, or New Relic, including business-transaction tracing, service maps, anomaly detection, and alert-policy design.
NPM / Digital Experience Monitoring:
  • ThousandEyes and NetScout, including synthetic testing, path visualization, and WAN/internet performance analysis.
Infrastructure Monitoring & Event Management:
  • SolarWinds, Datadog, Moogsoft, BigPanda, or Prometheus/Grafana, including availability/capacity dashboards, alert routing, and event correlation.
Azure Monitoring:
  • Azure Monitor, Log Analytics (KQL), and Application Insights with third-party integration experience.
Strong grasp of observability best practices:
  • Distributed tracing, structured logging, metric cardinality management, and OpenTelemetry instrumentation pipelines.
  • Scripting and automation proficiency (PowerShell, Python, or Bash) for agent deployment, monitoring-as-code, synthetic-test creation, and reporting.
  • Solid networking fundamentals (DNS, BGP, HTTP, TLS, TCP/IP) and the ability to correlate application and network telemetry for end-to-end troubleshooting.
  • Working understanding of CI/CD best practices using Git-backed pipelines, including gated merge requests, automated testing stages, and progressive deployment strategies to ensure changes are consistently validated before reaching production.
  • Strong analytical and problem-solving skills with a track record of methodical root-cause analysis across application, infrastructure, and network layers.
  • Clear, concise documentation and communication skills; ability to translate complex observability data into actionable guidance for engineering, operations, and leadership stakeholders.
Preferred Qualifications:
  • Experience in regulated industries (financial services, government, healthcare) with compliance-aware monitoring design.
  • Familiarity with log aggregation and SIEM/SOAR platforms (Splunk, Elastic) and their integration with APM/NPM tooling.
  • ITSM platform integration experience (e.g., ServiceNow) for incident, change, and problem management workflows.
  • Hands-on infrastructure-as-code experience (ARM/Bicep/Terraform) with observability baked into deployment templates.
  • Grounding in SRE practices—error budgets, reliability reviews, and capacity/performance planning.
Ability to write instrumentation and automation code in one or more of the following:
Java:
  • OpenTelemetry SDK/agent integration, custom instrumentation, APM tagging.
.NET (C#):
  • ASP.NET service instrumentation, auto-instrumentation configuration, custom exporters and health probes.
Python:
  • Automation scripts, custom collectors/exporters, synthetic tests, and monitoring API integration.
 

About the Company

Artech LLC