Reality Labs at Meta is seeking a Research Scientist with expertise in multi-modal understanding to advance AI-powered interactions. We're building next-generation capabilities that integrate vision, language, audio, and sensor modalities. This is a unique opportunity to conduct cutting-edge multi-modal research with direct product impact.

Lead the design, development, and optimization of multi-modal models that integrate vision, language, audio, and sensor inputs.
Set technical direction for multi-modal research projects.
Conduct research and experiments to improve cross-modal alignment and fusion strategies.
Collaborate with cross-functional teams (engineering, HCI, product) to transition multi-modal research into production.
Explore and adopt novel model optimization, quantization, and efficiency techniques.
Stay current with state-of-the-art advances in multi-modal learning, vision-language models, and related fields.

Bachelor's degree in Computer Science, Computer Engineering, a relevant technical field, or equivalent practical experience.
Currently has, or is in the process of obtaining, a PhD in Computer Science, Machine Learning, Computer Vision, or a related technical field.
Degree must be completed prior to joining Meta.
Demonstrated expertise in multi-modal learning, including architecture design, training, and cross-modal alignment techniques.
Programming experience in Python and hands-on experience with deep learning frameworks such as PyTorch.
Experience developing machine learning models at scale from inception to impact.
5+ years of research experience working autonomously on ML problems involving multiple modalities (vision, language, audio, or sensor data).
Deep expertise in vision-language models, cross-modal attention mechanisms, or contrastive learning approaches.
First-authored publications at peer-reviewed AI conferences (e.g., CVPR, NeurIPS, ICML, ICLR, ACL, ECCV).
Experience with on-device or edge multi-modal model optimization (quantization, sparsity, distillation).
Demonstrated software engineering experience via an internship, work experience, or widely used contributions to open source repositories.
Experience bringing multi-modal AI products from research to production.
Proven track record of developing multi-modal models that fuse vision, language, and/or audio for real-world applications.

Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram, and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology. People who choose to build their careers by building with us at Meta help shape a future that will take us beyond what digital connection makes possible today: beyond the constraints of screens, the limits of distance, and even the rules of physics.

Meta is proud to be an Equal Employment Opportunity employer.
We do not discriminate based upon race, religion, color, national origin, sex (including pregnancy, childbirth, reproductive health decisions, or related medical conditions), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, genetic information, political views or activity, or other applicable legally protected characteristics. You may view our Equal Employment Opportunity notice here.

Meta is committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures. If you need assistance or an accommodation due to a disability, fill out the Accommodations request form.