Key job responsibilities

Fleet Health & Predictive Infrastructure
- Build and own the automation infrastructure responsible for the health of the server fleet across storage and accelerator (AI/ML) compute platforms
- Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
- Drive toward zero-touch operations by building automation that detects, diagnoses, triages, and remediates hardware and software faults without human intervention
- Develop monitoring tools, dashboards, and alerting systems to provide real-time visibility into fleet health across lab and production environments
- Define and track fleet health metrics (failure rates, mean time to detect, mean time to repair, first-time fix rate, predictive accuracy)

Debugging & Troubleshooting
- Debug and resolve complex system-level issues across storage, compute, GPU, and networking in production environments
- Troubleshoot Linux boot and runtime failures across x86 and ARM architectures, including PCIe, power, NIC, NVMe, and GPU subsystems
- Perform root cause analysis on hardware failures, correlating across firmware, kernel, driver, and physical layers to isolate faults
- Build diagnostic tooling that automates root cause identification and reduces reliance on manual triage

Systems Development & Automation
- Lead the definition and development of software, automation, and enabling tools for server hardware programs; track and report progress
- Design and build scalable system-level software with a focus on durability, availability, security, and diagnostics
- Develop and maintain device drivers for Linux on ARM and x86 architectures
- Build automation solutions using modern programming languages (Python, Ruby, Java, C/C++, etc.)
- Work with OS internals, storage subsystems, and accelerator/GPU software stacks in Linux-based environments
- Build, manage, and deploy CI/CD pipelines for rapid deployment of code changes to org-owned and customer-owned systems

Cross-Team Collaboration
- Work across internal HWEng teams to ensure new server hardware addresses the data path and control path functionality needed by dependent service teams
- Work closely with internal customers to identify, early on, any potential problems with onboarding new servers (storage or accelerated compute) into their ecosystem
- Engage with ODMs and design partners on testability, diagnostic, and automation requirements during hardware design and development
- Contribute to server design to improve robustness, testability, diagnosability, and reliability
- Partner with datacenter operations teams to close the loop between field failures and design improvements

A day in the life
Systems Development Engineers in AWS Hardware Engineering wear many hats. You will collaborate with a variety of roles (SDEs, SDETs, Mechanical/Electrical/Hardware Engineers, TPMs, Managers, Principals) and organizations through server conception, test validation, qualification, launch, and operations, driving high quality and reliability into current and future designs for AWS server solutions.