Machine Learning Operations (MLOps) Engineer - AWS (with LLM Focus)

Kaav

Hollywood, FL

JOB DETAILS
SKILLS
AWS Lambda, Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (S3), Amazon Web Services (AWS), Application Programming Interface (API), Best Practices, Communication Skills, Computer Programming, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, Cost Control, Cross-Functional, Data Science, Docker, Flask, GPU (Graphics Processing Unit), High Reliability, Identify Issues, Machine Learning, Machining Operations, Management Strategy, Performance Analysis, Performance Modeling, Problem Solving Skills, Process Improvement, Production Systems, Python Programming/Scripting Language, REST (Representational State Transfer), Resource Utilization, Scientific Research, Software Engineering, Strategic Planning, Team Player
LOCATION
Hollywood, FL
POSTED
2 days ago
LLM-Optimized MLOps Infrastructure: Design and implement MLOps infrastructure on AWS tailored for LLMs, leveraging services like SageMaker, EC2 (with GPU instances), S3, ECS/EKS, Lambda, and more.

LLM Deployment Pipelines: Build and manage CI/CD pipelines specifically for LLM deployment, addressing unique challenges like model size, inference optimization, and versioning.

LLMOps Practices: Implement LLMOps best practices for monitoring model performance, drift detection, prompt management, and feedback loops for continuous improvement.

RESTful API Development: Design and develop RESTful APIs to expose LLM capabilities to other applications and services, ensuring scalability, security, and optimal performance.

Model Optimization: Apply techniques like quantization, distillation, and pruning to optimize LLMs for efficient inference on AWS infrastructure.

Monitoring and Observability: Establish comprehensive monitoring and alerting mechanisms to track LLM performance, latency, resource utilization, and potential biases.

Prompt Engineering and Management: Develop strategies for prompt engineering and management to enhance LLM outputs and ensure consistency and safety.

Collaboration: Work closely with data scientists, researchers, and software engineers to integrate LLMs into production systems effectively.

Cost Optimization: Continuously optimize LLMOps processes and infrastructure for cost-efficiency while maintaining high performance and reliability.

Experience: 3+ years of experience in MLOps or a related field, with hands-on experience in deploying and managing LLMs.

AWS Expertise: Strong proficiency in AWS services relevant to MLOps and LLMs, including SageMaker, EC2 (with GPU instances), S3, ECS/EKS, Lambda, and API Gateway.

LLM Knowledge: Deep understanding of LLM architectures (e.g., Transformers), training techniques, and inference optimization strategies.

Programming Skills: Proficiency in Python and experience with infrastructure-as-code tools (e.g., Terraform, CloudFormation), REST API frameworks (e.g., Flask, FastAPI), and LLM libraries (e.g., Hugging Face Transformers).

Monitoring: Familiarity with monitoring and logging tools for LLMs, such as Prometheus, Grafana, and CloudWatch.

Containerization: Experience with Docker and container orchestration (e.g., Kubernetes, ECS) for LLM deployment.

Problem Solving: Excellent problem-solving and troubleshooting skills in the context of LLMs and MLOps.

Communication: Strong communication and collaboration skills to work effectively with cross-functional teams.

About the Company

Kaav