Research Scientist, AI Evaluation Science

Apple

Seattle, WA

JOB DETAILS
LOCATION
Seattle, WA
POSTED
30+ days ago
**Role Number:** 200649482-3337

**Summary**

AI systems are only as trustworthy as the methods used to evaluate them. At Apple, where AI powers experiences for billions of people, getting evaluation right is not a support function; it is a foundational science. Our team, part of Apple Services Engineering, is building that scientific foundation: rigorous, scalable evaluation methodology for LLMs, agentic systems, and human-AI interaction.

What makes this team unusual is its interdisciplinary core. You will work alongside measurement scientists (psychometrics, validity theory), ML researchers, and platform engineers, bringing together ML research, statistical rigor, and production engineering.

We are looking for a Research Scientist who treats evaluation methodology itself as a first-class research problem: someone with deep technical fluency in preference learning, reward modeling, or calibration theory, and the drive to advance the field while solving real problems at scale.

We're hiring at multiple levels (early-career to senior researchers). What unites all candidates is depth of thinking about evaluation as a research problem.

**Description**

This is primarily a research role. You will formulate open problems in evaluation science, design experiments, publish findings, and drive projects from conception through completion. While you will also partner with platform engineers to ensure your methods are productionized into SDKs and APIs, the focus of the role is original research.

Our research team brings together ML scientists and measurement scientists to tackle evaluation as both a machine learning and a measurement problem, building methods that are technically innovative and scientifically valid. You will also work closely with a platform engineering team that translates research into production-ready SDKs and APIs used across Apple.

The successful candidate will have a strong publication record in evaluation-adjacent ML areas and a demonstrated ability to implement complex methods from recent papers, run large-scale experiments, and communicate results to both technical and non-technical audiences.

**Minimum Qualifications**

+ Ph.D. in Computer Science, Machine Learning, or a closely related field, with a research focus in evaluation-adjacent areas (preference learning, RLHF, human feedback, calibration, automated assessment)
+ Strong publication record at top-tier conferences (NeurIPS, ICML, ICLR, ACL, EMNLP), including first-author publications demonstrating independent research contributions
+ Deep technical expertise in at least one evaluation-adjacent ML area, with strong mathematical foundations: preference learning and reward modeling (RLHF, DPO, reward hacking, specification gaming); OR calibration theory, proper scoring rules, and statistical reliability; OR human-AI interaction methodology (active learning, annotation quality, preference elicitation)
+ Demonstrated ability to implement complex methods from recent papers and run large-scale experiments
+ Track record of translating research into practical systems: prototypes, tools, or methods adopted by others
+ Excellent written and verbal communication skills, including the ability to write clear research papers and explain complex concepts to diverse audiences

**Preferred Qualifications**

+ Publications specifically on evaluation methodology: papers about how to evaluate, not just papers that use evaluation to demonstrate model improvements
+ Strong hands-on experience with modern ML frameworks (PyTorch, JAX, or TensorFlow) and training or fine-tuning large language models
+ Experience with theoretical foundations of evaluation: measurement theory and validity frameworks, statistical learning theory (calibration, reliability, decision theory), or preference elicitation and aggregation
+ Specific research experience in one or more of: reward modeling and RLHF for alignment; LLM-as-judge approaches (calibration, rubric design, bias mitigation); benchmark design and validation (IRT, contamination detection); human evaluation methodology (protocol design, quality control); or agentic and multi-agent system evaluation
+ Demonstrated passion for evaluation as a research area: conference presentations, workshops, or tutorials on evaluation topics; open-source contributions to evaluation tools or benchmarks; active engagement with the evaluation research community
+ Experience with cross-disciplinary research, such as collaboration with social scientists, psychometricians, or domain experts

About the Company

Apple

We bring amazing people together to make amazing things happen.

We’re a diverse collection of thinkers and doers, continually reimagining what’s possible to help us all do what we love in new ways. The people who work here have reinvented entire industries with the Mac, iPhone, iPad, and Apple Watch, as well as with services, including iTunes, the App Store, Apple Music, and Apple Pay. And the same passion for innovation that goes into our products also applies to our practices — strengthening our commitment to leave the world better than we found it.

About Apple

There’s a place here for every kind of brilliant. Everyone here is an innovator, or an innovator-to-be, no matter what your team or your role. So bring your passion, courage, and original thinking and get ready to share it, because every new product, service, or feature we invent is the result of people working together to make each other’s ideas stronger. Innovation at this level depends on people who represent the variety of the human experience and inspire us with their own fresh perspectives. Together, we’ll do amazing work that can make a difference in people’s lives. Including your own. Learn more about working at Apple.

COMPANY SIZE
10,000 employees or more
INDUSTRY
Other/Not Classified
FOUNDED
1976
WEBSITE
https://www.apple.com/jobs