New York City, NY · 30+ days ago
Agent Evaluation & Observability: Design and implement comprehensive evaluation pipelines for multi-agent orchestration, including visualizations, trace-level analysis of LLM calls and tool invocations, offline evaluation against golden datasets, real-time production monitoring for behavioral drift and outcome correlation, guardrails, and human-in-the-loop annotation workflows. This role requires deep expertise in developing complex multi-agent systems, leveraging Large Language Models (LLMs) for reasoning, planning, and goal setting, and applying Reinforcement Learning (RL) techniques to model and influence human behavior safely and ethically.
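To illustrate the offline-evaluation responsibility above, here is a minimal sketch of scoring an agent against a golden dataset. All names (`GoldenExample`, `echo_agent`, the prompts) are hypothetical stand-ins, not part of any actual pipeline described in the posting:

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    """One entry in a golden dataset: a prompt and its expected answer."""
    prompt: str
    expected: str

def exact_match(prediction: str, expected: str) -> bool:
    # Normalize whitespace and case before comparing.
    return prediction.strip().lower() == expected.strip().lower()

def evaluate_offline(agent, dataset):
    """Run the agent over every golden example and return the pass rate."""
    results = [exact_match(agent(ex.prompt), ex.expected) for ex in dataset]
    return sum(results) / len(results) if results else 0.0

# Stub agent standing in for a real multi-agent pipeline.
def echo_agent(prompt: str) -> str:
    return {"capital of France?": "Paris"}.get(prompt, "unknown")

dataset = [
    GoldenExample("capital of France?", "Paris"),
    GoldenExample("2 + 2?", "4"),
]
score = evaluate_offline(echo_agent, dataset)  # 0.5: one of two examples matches
```

In practice the exact-match scorer would be swapped for task-appropriate metrics (semantic similarity, rubric-based LLM grading), and each run would also capture per-call traces for the trace-level analysis the role describes.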