AI / ML Engineer (with Observability)

Mindlance

Coppell, TX

JOB DETAILS
SKILLS
Amazon Web Services (AWS), Analysis Skills, Application Programming Interface (API), Artificial Intelligence (AI), Artificial Intelligence (AI) Agents, Automation, Cloud Computing, Communication Skills, Continuous Deployment/Delivery, Continuous Integration, Conversation Engine, Corrective Action, Cross-Functional, Data Management, Data Science, Distributed Computing, Ecosystems, Event Correlation, Instrumentation, Leadership, Machine Tool, Metrics, Microservices, Network Systems, Operational Audit, Predictive Modeling, Problem Solving Skills, Python Programming/Scripting Language, Reporting Dashboards, Science Library, Snowflake Schema, Statistics, Systems Analysis, Telemetry, Time Management
POSTED
22 days ago
Hybrid onsite at Dallas, TX, 75019 / Tampa, FL, 33647

Contract-to-hire (CTH)
2 rounds of interviews


Overview
We are seeking a passionate, hands-on AI/ML Engineer to accelerate our Enterprise Observability strategy. This role will design, build, and operationalize AI/ML capabilities that enhance end-to-end telemetry pipelines, anomaly detection, intelligent alerting, and proactive system resiliency.
You will work at the intersection of AI/ML engineering, Observability platforms, and automation, developing solutions that improve detection, diagnosis, and prevention of operational issues across distributed systems.
________________________________________
Key Responsibilities
• Design and deploy AI/ML models supporting anomaly detection, baselining, event correlation, and predictive operational analytics.
• Build and integrate AI-enabled capabilities into enterprise Observability platforms, including Grafana, APM/RUM tools, network telemetry systems, and data observability tools.
• Develop AI Agents that can autonomously triage issues, recommend corrective actions, and initiate automated remediation workflows to reduce recovery time and improve system resilience.
• Implement self-healing automation using AI-driven decisioning, integrating with orchestration frameworks, service APIs, and infrastructure automation pipelines.
• Engineer and maintain real-time and batch data pipelines using Snowflake ML Jobs, Snowflake Cortex, streams, tasks, and UDFs.
• Implement and manage OpenTelemetry-based telemetry ingestion for logs, metrics, traces, and spans across distributed systems.
• Build asynchronous Python APIs and services for model inferencing and operational integration.
• Enhance observability intelligence with AI-powered capabilities such as root cause acceleration, chatbot/search enablement, and automated insights.
• Contribute to SLO/SLI modeling, Golden Signals instrumentation, and Observability NFR adoption.
• Collaborate across engineering, SRE, platform, and business teams to embed proactive intelligence and Observability standards throughout the ecosystem.

Required Skills & Qualifications
Core Technical Skills
• Strong proficiency in Python and data science/ML libraries: NumPy, Pandas, scikit-learn, TensorFlow, PyTorch, Matplotlib, Seaborn.
• Experience with Generative AI, LLM fine-tuning, prompt engineering, RAG pipelines, and LLM evaluation frameworks.
• Expertise in developing and deploying ML models in production (batch and streaming).
• Strong understanding of statistics, time-series modeling, and anomaly detection.

Observability & Telemetry
• Experience with OpenTelemetry for logs, metrics, traces, and spans.
• Familiarity with Observability concepts: Golden Signals, SLO/SLI design, APM, RUM, Synthetics, event correlation, baselining.
• Experience with Observability tools such as Grafana (Alloy agents, dashboards, ML capabilities), Dynatrace, Monte Carlo (Data Observability), Netscout, ThousandEyes, SolarWinds, NetBrain.

Cloud, Data & Platform
• Hands-on experience with AWS (SageMaker, Bedrock), Snowflake ML, Snowflake Openflow, and Snowflake AI Observability tooling.
• Experience building Snowflake data pipelines (streams, tasks, UDFs), plus Cortex features.
• Strong understanding of distributed systems and microservices telemetry requirements.

Automation & Engineering Quality
• Experience with automation pipelines, CI/CD, and infrastructure-as-code patterns supporting Observability adoption.
• Ability to build asynchronous Python APIs or services for model inference and operational integration.
________________________________________
Preferred Qualifications
• Experience developing agentic AI systems that analyze telemetry, generate action recommendations, or execute automated operational responses.
• Experience building self-healing patterns, including automated rollback, service restarts, configuration corrections, and predictive maintenance.
• Experience with Snowflake ML workflows, Snowflake Cortex Agents, and data pipeline automation.
• Exposure to AI-enabled alerting, RCA automation, and operational self-healing concepts.
• Experience with large-scale operational telemetry and multi-cloud ecosystems.

Soft Skills
• Strong analytical thinking and problem-solving skills.
• Excellent communication skills for cross-functional collaboration with infrastructure, SRE, engineering, business, and leadership teams.
• Curiosity, a continuous-learning mindset, and passion for applied AI and Observability.

EEO:
Mindlance is an Equal Opportunity Employer and does not discriminate in employment on the basis of Minority/Gender/Disability/Religion/LGBTQI/Age/Veterans.
