DevOps / SRE (with Observability Engineering)

Mindlance

Coppell, TX

JOB DETAILS
SKILLS
Acceptance Testing, Analysis Skills, Artificial Intelligence (AI), Automation, Best Practices, Capacity and Performance Management, Cloud Computing, Coaching, Continuous Improvement, Customer Escalations, DevOps, Disaster Recovery, Embedded Systems, Go Programming Language (Golang), Improvement Metrics, Incident Management, Incident Response, Java, Knowledge Base, Large-Scale Systems, Leadership, Machine Tool, Mentoring, Messaging Technology, Metrics, Operational Improvement, Operations, Performance Analysis, Performance Metrics, Performance Testing, Performance Tuning/Optimization, Product Support, Programming Languages, Project/Program Management, Prototyping, Python Programming/Scripting Language, Reengineering, Release Management/Engineering, Reliability Engineering, Requirements Management, Risk, Risk Management, Root Cause Analysis, Software Administration, Software Development, Software Engineering, Team Player, Technical Leadership, Technical Support, User Documentation
POSTED
1 day ago
Hybrid onsite at Dallas, TX, 75019 / Jersey City, NJ, 07310

Contract to hire (CTH)

Description:
The Development family is responsible for creating, designing, deploying, and supporting applications, programs, and software solutions. Work may include research, new development, prototyping, modification, reuse, re-engineering, maintenance, or any other activities related to software products used internally or externally on product platforms supported by the firm. The software development process requires in-depth subject matter expertise in existing and emerging development methodologies, tools, and programming languages. Software Developers work closely with business partners and/or external clients to define requirements and implement solutions. The Application Support Engineering role specializes in maintaining and providing technical support for all applications that are beyond the development stage and are running in the daily operations of the firm. The role works closely with development teams, infrastructure partners, and internal/external clients to escalate and resolve technical support incidents.

Your Primary Responsibilities:
Design & Delivery Partnership: Participate in design reviews, sprint zero, and delivery planning to champion non-functional requirements (NFRs), including resiliency, observability, fault tolerance, holiday and special-day processing, and disaster recovery.
Major Release Management Partnership: Collaborate with Major Release Management to ensure each Risk release meets SRE standards for observability and resiliency (SLIs/SLOs, monitoring, knowledge base articles). Ensure releases are subject to required deployment validations.
Monitoring, Observability & AI Enablement: Define and evolve monitoring, alerting, SLIs, and SLOs, leveraging AI/ML-driven analytics for anomaly detection, incident correlation, and early risk identification.
Minimizing Application Recovery Time: Make design recommendations that enable quick detection of outage conditions and allow the application to recover without manual intervention, and/or create knowledge-base guidance for the application support team to follow for improved application recovery times. Participate in major incident response and root cause analysis to drive continual, systemic recovery time improvements.
Automation & Self-Healing: Drive automation and intelligent tooling (including AI-assisted remediation) to reduce manual toil and improve consistency and recovery times.
Operational Readiness & Risk Management: Attend project management meetings and present operational readiness with application support (EAS L2), raising any operational risks and concerns. Test NFRs in UAT environments to validate the effectiveness and completeness of operational capabilities. Validate operational readiness with stakeholders prior to release, partner with Embedded Risk and Security teams, and proactively surface and mitigate technology and operational risks.
Capacity & Performance Optimization: Lead capacity planning and performance analysis to ensure Risk platforms scale reliably under high load.
Metrics & Continuous Improvement: Establish KPIs and operational metrics to demonstrate reliability improvements and operational maturity.
People & Culture: Build a strong SRE culture, enhanced by AI-driven insights, across Risk Application Support and Development through mentorship and best-practice coaching; leverage approved AI tools to analyze code, collaborate on knowledge base articles, and accelerate improvements in observability, performance, security, and maintainability.

Qualifications:
Minimum of 8 years of related technical and management experience

Bachelor's degree preferred or equivalent experience
Cloud certifications are a plus

Talents Needed for Success:
Proven experience with SRE or DevOps practices, including CI/CD pipelines, infrastructure as code, and automation frameworks
Strong understanding of monitoring and observability platforms (e.g., Grafana) and experience designing and fine-tuning robust monitoring systems
Programming proficiency in one or more languages such as Python, Java, Go, or similar, for automation and tooling development

Familiarity with cloud platforms, containerized environments, and/or hybrid infrastructure models
Experience in financial services, capital markets, or regulated environments
Demonstrated participation in disaster recovery, performance, and resiliency testing
Knowledge of AI concepts, data platforms, messaging systems, and large-scale batch or real-time processing systems
Strong collaboration skills across technology and business teams
Hands-on experience leading and participating in incident and problem management, including root cause analysis

EEO:
Mindlance is an Equal Opportunity Employer and does not discriminate in employment on the basis of Minority/Gender/Disability/Religion/LGBTQI/Age/Veterans.
