Indent: SF_OP_202213-2-1 / SF_OP_202213-2-2
Role: GPU Infrastructure Architect - GPU Cluster Stand-up, Configuration & Operations (AMD MI350)
Location: San Jose, CA (Remote is also an option)
1st Preference – FTE – Max Salary: $160K/annum
2nd Preference – Contract – Max Rate: $100/hr
Role Summary
The GPU Infrastructure Architect is responsible for designing, provisioning, and operating AMD MI350-based GPU clusters on a cloud platform. The role ensures scalable, secure, and reproducible GPU infrastructure to support distributed training and high-performance workloads.
Key Responsibilities
Design end-to-end GPU cluster architecture covering compute, networking, storage, and control services.
Provision and operationalize up to 9 AMD MI350 GPU clusters based on confirmed cloud SKU availability.
Configure GPU compute nodes including base OS images, GPU drivers, runtime libraries, and distributed training dependencies.
Implement automation for node imaging, bootstrapping, lifecycle management, patching, and upgrades.
Standardize environments using reproducible builds and Infrastructure-as-Code (IaC).
Enable workload portability through containerized environments and documented deployment patterns.
Implement OS baseline hardening, restricted administrative access, and secure cluster access controls.
Establish monitoring, logging, and operational runbooks to ensure reliability and performance.