Sr. System Development Engineer, AI/ML/Storage server team

Amazon.com Inc

Cupertino, CA

JOB DETAILS
SKILLS
ARM (Advanced RISC Machine), Amazon Web Services (AWS), Architectural Services, Artificial Intelligence (AI), Automation, Automation System Development, Automation Systems, C Programming Language, C++ Programming Language, Computer Engineering, Computer Firmware, Computer Maintenance, Computer Systems, Continuous Deployment/Delivery, Continuous Integration, Cross-Functional, Debugging Skills, Device Drivers, Diagnostics Solutions/Software, Ecosystems, Electrical Engineering, Failure Analysis, GPU (Graphics Processing Unit), Hardware Design, Hardware Development, High Reliability, Home Automation, Identify Issues, Java, Kernel Programming, Linux Drivers, Linux Operating System, Machine Tool, Manufacturing, Metrics, Microprocessor Architecture, Network Interface Card (NIC), Network Operations Center, Onboarding, Operating Systems, Original Design Manufacturer (ODM), PCI Express (PCI-E), Problem Solving Skills, Product Lifecycle, Product/Service Launch, Production Systems, Programming Languages, Progress Reports, Python Programming/Scripting Language, Reporting Dashboards, Root Cause Analysis, Ruby, Scalable System Development, Server Hardware, Server Programming/Applications, Software Design, Software Development, Systems Engineering, Team Player, Technical Leadership, Telemetry, Testability, Validation Testing, Vehicle Fleets, x86 Processors
LOCATION
Cupertino, CA
POSTED
30+ days ago

Application deadline: May 26, 2026

We are seeking an experienced Senior Systems Development Engineer to lead the development of automation software, diagnostic tooling, and fleet health infrastructure for our server platforms. You will work across multiple teams and organizations to build scalable, reliable systems that keep our storage and accelerated (AI/ML) compute fleet healthy - with a vision toward zero-touch operations where automation detects, diagnoses, and resolves issues without human intervention.

You will be a technical leader solving complex architectural problems that may not be well-defined in advance. You will own your team's systems, proactively identify deficiencies, and write scalable and robust code to solve issues before they impact customers. You will decompose large, difficult server testability, reliability, and diagnosis problems into straightforward tasks and components - leading delivery yourself and through others in parallel - using a combination of hardware, software, system design, processor architecture, diagnostics, and operations knowledge.

You will collaborate with a variety of roles (SDEs, SDETs, Mechanical/Electrical/Hardware Engineers, TPMs, Managers, Principals) and organizations through server conception, test validation, qualification, launch, and operations - driving high quality and reliability into current and future designs for AWS server solutions. You will also work closely with ODMs and Design Partners to ensure our tooling, diagnostics, and automation requirements are met throughout the hardware development lifecycle (NPI).

Key job responsibilities

Fleet Health & Predictive Infrastructure

  • Build and own the automation infrastructure responsible for the health of the server fleet across storage and accelerator (AI/ML) compute platforms
  • Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
  • Drive toward zero-touch operations - building automation that detects, diagnoses, triages, and remediates hardware and software faults without human intervention
  • Develop monitoring tools, dashboards, and alerting systems to provide real-time visibility into fleet health across lab and production environments
  • Define and track fleet health metrics (failure rates, mean time to detect, mean time to repair, first-time fix rate, predictive accuracy)

Debugging & Troubleshooting

  • Debug and resolve complex system-level issues across storage, compute, GPU, and networking in production environments
  • Troubleshoot Linux boot and runtime failures across x86 and ARM architectures, including PCIe, power, NIC, NVMe, and GPU subsystems
  • Perform root cause analysis on hardware failures - correlating across firmware, kernel, driver, and physical layer to isolate faults
  • Build diagnostic tooling that automates root cause identification and reduces reliance on manual triage
  • Improve manufacturing throughput and yield through test optimization

Systems Development & Automation

  • Lead the definition and development of software, automation, and enabling tools for server hardware programs; track and report progress
  • Design and build scalable system-level software with focus on durability, availability, security, and diagnostics
  • Develop and maintain device drivers for Linux on ARM and x86 architectures
  • Build automation solutions using modern programming languages (Python, Ruby, Java, C/C++, etc.)
  • Work with OS internals, storage subsystems, and accelerator/GPU software stacks in Linux-based environments
  • Build, manage, and deploy CI/CD pipelines for rapid deployment of code changes to org-owned and customer-owned systems

Cross-Team Collaboration

  • Work across internal HWEng teams to ensure new server hardware addresses data path and control path functionality needed by dependent service teams
  • Work closely with internal customers to identify potential problems early when onboarding new servers - storage or accelerated compute - into their ecosystem
  • Engage with ODMs and design partners on testability, diagnostic, and automation requirements during hardware design and development
  • Contribute to server design to improve robustness, testability, diagnosability, and reliability
  • Partner with datacenter operations teams to close the loop between field failures and design improvements

A day in the life

Systems Development Engineers in AWS Hardware Engineering wear many hats. From orchestration tooling development to hardware integration to kernel driver debugging, we dive deep into problems across the breadth of AWS. Our teams are directly responsible for launching and maintaining server hardware in the fleet - including storage servers powering distributed storage platforms and AI/ML accelerator servers with GPUs. Located in Seattle and Cupertino, we work with internal development teams, ODMs, and design partners to deliver servers deployed in datacenters worldwide.

About the Company

Amazon.com Inc

At Amazon, we don’t wait for the next big idea to present itself. We envision the shape of impossible things and then we boldly make them reality. So far, this mindset has helped us achieve some incredible things. Let’s build new systems, challenge the status quo, and design the world we want to live in. We believe the work you do here will be the best work of your life.

Wherever you are in your career exploration, Amazon likely has an opportunity for you. Our research scientists and engineers shape the future of natural language understanding with Alexa. Fulfillment center associates around the globe send customer orders from our warehouses to doorsteps. Product managers set feature requirements, strategy, and marketing messages for brand new customer experiences. And as we grow, we’ll add jobs that haven’t been invented yet.

It’s Always Day 1
At Amazon, it’s always “Day 1.” Now, what does this mean and why does it matter? It means that our approach remains the same as it was on Amazon’s very first day – to make smart, fast decisions, stay nimble, invent, and stay focused on delighting our customers. In our 2016 shareholder letter, Amazon CEO Jeff Bezos shared his thoughts on how to keep up a Day 1 company mindset. “Staying in Day 1 requires you to experiment patiently, accept failures, plant seeds, protect saplings, and double down when you see customer delight,” he wrote. “A customer-obsessed culture best creates the conditions where all of that can happen.”

Our Leadership Principles
Our Leadership Principles help us keep a Day 1 mentality. They aren’t just a pretty inspirational wall hanging. Amazonians use them, every day, whether they’re discussing ideas for new projects, deciding on the best solution for a customer’s problem, or interviewing candidates. To read through our Leadership Principles from Customer Obsession to Bias for Action, visit https://www.amazon.jobs/principles
COMPANY SIZE
10,000 employees or more
INDUSTRY
Retail
FOUNDED
1994
WEBSITE
http://Amazon.com/militaryroles
