Job Details:
+ Location: Seattle, WA (Hybrid with 3 days a week onsite)
+ Pay Rate: $60-70/hr (W2)
+ Job Type: Contract
+ Contract Length: 6 months
+ Experience Level: Mid-level to Senior

Key Responsibilities:
+ Design and build LLM-based evaluation frameworks, including automated scoring pipelines and rubric-based grading systems
+ Build and maintain data pipelines for evaluation datasets using Python, SQL, and scalable processing tools
+ Translate complex evaluation results into clear, actionable insights for technical and non-technical stakeholders
+ Implement automation workflows and agentic evaluation systems to improve efficiency and reduce manual effort
+ Develop prompt engineering strategies to evaluate output quality, accuracy, and consistency
+ Create and maintain metrics, KPIs, and dashboards to track and communicate model performance
+ Conduct error analysis, root-cause investigations, and quality deep dives to guide model improvements
+ Partner cross-functionally to define evaluation methodologies and integrate them into production workflows

Must-Have Qualifications:
+ 5+ years of experience in ML engineering, NLP, or AI/ML automation
+ Strong programming skills in Python and SQL
+ Deep understanding of machine learning concepts with a focus on NLP and advanced LLM capabilities (e.g., Chain-of-Thought, agentic workflows)
+ Experience working with large-scale datasets and data pipelines
+ Strong experience with LLM evaluation, prompt engineering, or auto grading systems
+ Experience developing metrics and KPIs to measure model output quality and consistency

Nice-to-Have:
+ Experience with LLM-as-judge systems or human + model evaluation frameworks
+ Background in inter-rater reliability, evaluation calibration, or judged systems design
+ Experience with PySpark or distributed data processing tools
+ Exposure to building dashboards or visualization tools for model performance tracking

Technical Skills:
Python, SQL, NLP, LLM Evaluation, Prompt Engineering, Machine Learning, Data Pipelines, Automation Systems

NOTE: This posting is for an existing vacancy.

This role centers on designing scalable evaluation frameworks, optimizing prompt strategies, and building systems that ensure high-quality, consistent model outputs across product domains.