GPU Infrastructure Architect - GPU Cluster Stand-up, Configuration & Operations (AMD MI350)

PeopleNTech LLC

Alexandria, VA

JOB DETAILS
SKILLS
Access Control, Automation, Cloud Computing, Computer Architecture, Computer Networks, GPU (Graphics Processing Unit), Operating Systems, Software Patches, Stock Keeping Unit (SKU), Systems Administration/Management
LOCATION
Alexandria, VA
POSTED
30+ days ago
Indent: SF_OP_202213-2-1 / SF_OP_202213-2-2
Role: GPU Infrastructure Architect - GPU Cluster Stand-up, Configuration & Operations (AMD MI350)
Location: San Jose, CA (remote is also an option)
1st Preference – FTE – Max Salary: $160K/annum
2nd Preference – Contract – Max Rate: $100/hr

Role Summary
The GPU Infrastructure Architect is responsible for designing, provisioning, and operating AMD MI350-based GPU clusters on a cloud platform. The role ensures scalable, secure, and reproducible GPU infrastructure to support distributed training and high-performance workloads.

Key Responsibilities
  • Design end-to-end GPU cluster architecture covering compute, networking, storage, and control services.
  • Provision and operationalize up to 9 AMD MI350 GPU clusters based on confirmed cloud SKU availability.
  • Configure GPU compute nodes including base OS images, GPU drivers, runtime libraries, and distributed training dependencies.
  • Implement automation for node imaging, bootstrapping, lifecycle management, patching, and upgrades.
  • Standardize environments using reproducible builds and Infrastructure-as-Code (IaC).
  • Enable workload portability through containerized environments and documented deployment patterns.
  • Implement OS baseline hardening, restricted administrative access, and secure cluster access controls.
  • Establish monitoring, logging, and operational runbooks to ensure reliability and performance.

About the Company

PeopleNTech LLC