Artificial Intelligence (AI), Automation, Best Practices, Business Operations, Cloud Computing, Communication Skills, Cross-Functional, Data Quality, Data Visualization, DevOps, Distributed Computing, Enterprise Architecture, Financial Services, Leadership, Mentoring, Metrics, Operations Security (OPSEC), Performance Tuning/Optimization, Reliability Engineering, Root Cause Analysis, Service Level Agreement (SLA), Splunk, System Architecture, Team Lead/Manager, Team Player, Technical Leadership, Technical/Engineering Design, Telemetry
Job Overview
We are seeking an experienced Solution Architect / Team Lead to lead the design and implementation of a next-generation, AI-driven Observability and Anomaly Detection Platform. This platform will support middle-office operations with advanced monitoring, telemetry engineering, anomaly detection, data validation, visualization, and intelligent remediation.
The ideal candidate will bring deep expertise in enterprise observability, AI/ML integration, and scalable platform architecture using tools such as Splunk, Grafana, Datadog, OpenTelemetry, and modern cloud-native technologies.
This is a strategic leadership role requiring strong architecture skills, technical depth, and the ability to collaborate across engineering, operations, security, and business teams.
Key Responsibilities
- Design and implement scalable, secure, and cost-effective observability architectures
- Lead the development of enterprise monitoring and anomaly detection platforms
- Build and optimize telemetry pipelines for logs, metrics, traces, and events
- Enable AI/ML-driven anomaly detection, root cause analysis, and automated remediation
- Integrate observability solutions with business SLAs, SLOs, and reliability objectives
- Define platform governance, monitoring standards, and operational best practices
- Collaborate with cross-functional teams including infrastructure, DevOps, security, and business stakeholders
- Drive platform operationalization, enablement, and adoption across teams
- Evaluate and implement observability tools such as Splunk, Grafana, Datadog, Prometheus, and OpenTelemetry
- Ensure platform scalability, security, compliance, and performance optimization
- Lead architecture reviews, technical design discussions, and implementation strategies
- Mentor engineering teams and provide technical leadership on observability initiatives
Required Skills & Experience
- 10+ years of experience in Solution Architecture or Enterprise Platform Architecture
- Strong expertise in Observability and Monitoring platforms
- Hands-on experience with Splunk, Grafana, Datadog, OpenTelemetry, and Prometheus / ELK Stack
- Experience building telemetry and monitoring pipelines
- Strong understanding of AI/ML integration for anomaly detection and AIOps
- Experience with root cause analysis and intelligent remediation frameworks
- Expertise in cloud-native and distributed system architectures
- Strong knowledge of platform security, governance, and operational standards
- Experience defining SLAs, SLOs, and reliability engineering practices
- Strong stakeholder management and cross-team collaboration skills
- Excellent communication and leadership abilities
Preferred Qualifications
- Experience in AIOps or intelligent observability platforms
- Exposure to GenAI integration within enterprise monitoring systems
- Knowledge of Kubernetes, container monitoring, and cloud observability
- Experience in financial services or middle-office platforms
- Familiarity with DevOps, SRE, and automation frameworks
Key Skills
- Solution Architecture
- Observability Platforms
- Splunk
- Grafana
- Datadog
- OpenTelemetry
- AIOps
- AI/ML Integration
- Telemetry Engineering
- Root Cause Analysis
- Platform Security
- SLA/SLO Management
- Cloud Monitoring
- Enterprise Architecture
- DevOps & SRE