Amazon Web Services (AWS), Analysis Skills, Application Programming Interface (API), Artificial Intelligence (AI), Automation, Cloud Computing, Compensation and Benefits, Computer Science, Content Filtering Software, Continuous Deployment/Delivery, Continuous Integration, Cost Control, DevOps, Ecosystems, GitHub, Information Technology & Information Systems, Injections, Input/Output, Machine Learning, Metrics, Microsoft Windows Azure, Performance Modeling, Problem Solving Skills, Reliability Engineering, Reporting Dashboards, Risk, Root Cause Analysis, Software Testing, Source Code/Configuration Management (SCM), Team Player, Telemetry
Citrin Cooperman offers a dynamic work environment, fostering professional growth and collaboration. We're continuously seeking talented individuals who bring a problem-solving mindset, fresh perspectives, and sharp technical expertise. We know you have choices, so our team of collaborative, innovative professionals is ready to support your professional development. At Citrin Cooperman, we offer competitive compensation and benefits and, most importantly, the flexibility to manage your personal and professional life to focus on what matters most to you!
We are seeking a Senior - MLOps/LLMOps Engineer, Development, to join our Development team within the Information Technology department. The AI Solutions team is the vanguard of our enterprise AI competency, bridging the gap between rapid generative AI pilots and our enterprise operations. As we industrialize these advanced applications, you'll build the operational backbone for our non-deterministic systems.
In this critical deployment and observability role, you'll define how generative AI and agentic workflows are shipped to production. Working with frontier models (Anthropic, Google, OpenAI) and custom frameworks (LangGraph), you'll transition pilot code into robust solutions automated with CI/CD pipelines. You'll own the infrastructure for prompt versioning, establish automated evaluation gates (e.g., LLM-as-a-judge), and implement the deep telemetry required to monitor token costs, latency, and hallucination rates. The ideal candidate has a strong DevOps foundation, has successfully pivoted into the unique challenges of machine learning and generative AI operations, and views observability as the ultimate defense against model drift.
Responsibilities include, but are not limited to:
- LLMOps CI/CD Pipelines: Design and build automated deployment pipelines specifically for generative AI applications. Ensure that updates to prompts, LangGraph state machines, or RAG retrieval logic can be safely promoted across environments (Dev, Test, Prod).
- Evaluation Infrastructure: Deploy and manage the infrastructure required for continuous AI evaluation (e.g., LangSmith, Braintrust, or custom evaluation harnesses). Embed precision, recall, and toxicity checks directly into the deployment gates.
- Telemetry & Observability: Instrument the AI applications to capture deep operational metrics. Build dashboards to monitor token consumption, end-to-end latency, reasoning traces, and API failure rates across multiple LLM providers.
- Prompt & Model Registry Management: Implement version control for prompts and model configurations, ensuring the enterprise has a strict, auditable history of what instructions are running in production at any given time.
- Guardrails & Content Filtering: Integrate input/output guardrails (e.g., Azure AI Content Safety, NeMo Guardrails) into the application flow to automatically block prompt injection attacks, PII leakage, or off-topic responses.
- Cost Management (FinOps for AI): Actively monitor the financial footprint of our AI solutions. Set up alerting for token usage spikes and work with AI Engineers to optimize embedding and retrieval strategies for cost efficiency.
The ideal candidate must:
- Have a bachelor's degree in computer science, information technology, engineering, or equivalent practical experience.
- Hold the Databricks Certified Machine Learning Professional certification.
- Hold the Microsoft Certified: Azure DevOps Engineer Expert (AZ-400) certification.
- Have completed the DeepLearning.AI Machine Learning Engineering for Production (MLOps) program.
- Have 4+ years of experience in DevOps, MLOps, or Site Reliability Engineering (SRE), with specific, hands-on experience managing generative AI deployments in the last 1-2 years.
- Be deeply proficient in building CI/CD pipelines using enterprise tools (Azure DevOps, GitHub Actions, GitLab CI).
- Have hands-on experience with LLMOps tools and frameworks (e.g., MLflow, LangSmith, PromptFlow, Arize, or similar observability platforms).
- Possess strong Python scripting skills and experience containerizing machine learning or API workloads (Docker, Kubernetes).
- Understand the API ecosystems for frontier models (OpenAI, Anthropic, Google Vertex AI) and multi-agent frameworks (LangChain, LangGraph).
- Be familiar with cloud infrastructure (Azure, AWS) and infrastructure-as-code principles.
- Be automation-obsessed: view manual deployments or manual testing of prompts as an unacceptable operational risk.
- Be financially vigilant: understand that an infinite loop in a LangGraph agent doesn't just crash an app; it burns real money through API token costs.
- Be an analytical defender: be deeply curious about why a model's performance degraded in production, relentlessly tracing logs to find the root cause of hallucinations or latency spikes.
Citrin Cooperman & Company LLP