AI Infra Engineer

Artech LLC

Morrisville, NC

JOB DETAILS
SKILLS
Analysis Skills, Artificial Intelligence (AI), Benchmarking, Best Practices, CUDA (Compute Unified Device Architecture), Capacity Management, CentOS, Communication Skills, Computer Firmware, Debugging Skills, DevOps, Device Drivers, Docker, Failure Analysis, GPU (Graphics Processing Unit), IPMI (Intelligent Platform Management Interface), Identify Issues, Kernel Programming, Linux Administration, Linux Drivers, Linux Operating System, Performance Analysis, Problem Solving Skills, RAID Storage, Red Hat Linux Operating System, Root Cause Analysis, Server Hardware, Software Administration, Software Patches, Systems Administration/Management, Ubuntu, Unix Shell Programming
LOCATION
Morrisville, NC
POSTED
30+ days ago
Title: AI Infra Engineer
Duration: 10+ Months
Location: Morrisville, NC, 27560


Short Description:
This role combines IT operations, hardware troubleshooting, and AI infrastructure expertise. Expect to handle day-to-day system administration, diagnose and resolve issues, and ensure optimal performance for ML workloads.


Key Responsibilities
  • Hardware Management and Troubleshooting: Monitor and maintain GPU servers/workstations, including diagnosing and resolving hardware failures (e.g., GPU faults, power issues, cooling problems). Coordinate repairs, replacements, or upgrades as needed to ensure system uptime.
  • Software and Driver Management: Install, update, and configure NVIDIA drivers, the CUDA toolkit, Linux operating systems (e.g., Ubuntu or CentOS), and related dependencies. Ensure compatibility across hardware and software stacks for seamless ML operations.
  • Performance Benchmarking: Run and analyze MLPerf benchmarks to evaluate system performance, identify bottlenecks, and optimize configurations for ML training tasks.
  • System Diagnostics and Problem Resolution: Proactively monitor systems for issues, perform root-cause analysis on failures or performance degradation, and implement fixes. This includes debugging kernel errors, network issues, or resource contention during LLM training.
  • General Infrastructure Ops: Implement best practices for security, backups, logging, and monitoring. Handle routine maintenance, such as firmware updates, patch management, and capacity planning for the GPU cluster.

Required Qualifications
  • Proven experience (3+ years) in managing GPU-accelerated servers or high-performance computing (HPC) environments, preferably in AI/ML contexts.
  • Strong knowledge of Linux system administration, including shell scripting, package management, and networking.
  • Hands-on experience with NVIDIA CUDA toolkit, drivers, and GPU hardware (e.g., A100, H100, or similar).
  • Familiarity with ML benchmarking tools like MLPerf and frameworks such as TensorFlow, PyTorch, or Hugging Face for LLM training.
  • Ability to diagnose hardware and software issues using tools like nvidia-smi, dmesg, top/htop, or Prometheus/Grafana for monitoring.
  • Understanding of AI infrastructure ops, including containerization (Docker/Kubernetes) and orchestration for distributed training.
  • Excellent problem-solving skills with a proactive approach to preventing downtime.

Preferred Skills
  • Experience with cluster management tools like Slurm, Kubernetes, or Ray for scaling ML workloads.
  • Knowledge of hardware diagnostics for servers (e.g., IPMI, BIOS configuration, RAID setups).
  • Background in IT operations with AI focus, such as DevOps for ML (MLOps).
  • Certifications like RHCE (Red Hat Certified Engineer), NVIDIA certifications, or similar.
  • Ability to work independently in a remote or on-site setup, with strong communication skills for reporting issues.

About the Company

Artech LLC