AI and Systems Software Intern, At Scale AI - Fall 2026

NVIDIA Corp

Santa Clara, CA

JOB DETAILS
SKILLS
Algorithms, Analysis Skills, Artificial Intelligence (AI), Autonomous Driving Systems, Bash Scripting, Benchmarking, CPU (Central Processing Unit), Communication Skills, Computer Engineering, Computer Graphics, Computer Science, Debugging Skills, Debugging Tools, Deep Learning, Distributed Computing, Electrical Engineering, Environmental Management, GDB (Gnu Debugger), GPU (Graphics Processing Unit), Hardware Architecture, Human Intelligence (HUMINT), Industry Standards, Industry/Trade Analysis, Large-Scale Systems, Linux Operating System, Mentoring, Metrics, Multitasking, Network Operations Center, Operating Systems, PCI Express (PCI-E), Performance Tuning/Optimization, Process Improvement, Python Programming/Scripting Language, Root Cause Analysis, Scientific Research, Scripting (Scripting Languages), Server Architecture, Software Configuration Management, System Architecture, Team Player, Telemetry, Unix Shell Programming
LOCATION
Santa Clara, CA
POSTED
23 days ago

Our work at NVIDIA is dedicated to a computing model focused on visual and AI computing. For two decades, NVIDIA has pioneered visual computing, the art and science of computer graphics, with our invention of the GPU. The GPU has also proven to be spectacularly effective at solving some of the most complex problems in computer science. Today, NVIDIA's GPU simulates human intelligence, running deep learning algorithms and acting as the brain of computers, robots and self-driving cars that can perceive and understand the world. We are looking to grow our company and teams with the smartest people in the world and there has never been a more exciting time to join our team!

NVIDIA is looking for an intern for an exciting role in AI and Systems Software for datacenter applications. You will be deeply involved in system-level debugging, analyzing the reliability of our large-scale infrastructure, and correlating complex failure modes to underlying hardware or system issues. We are working with the latest Accelerated Computing and Deep Learning software and hardware platforms, along with many scientific researchers, developers, and customers, to craft improved workflows and develop new, differentiated solutions. Our team interacts with OS, container technology, GPU compute, and systems specialists to architect, develop, and bring up large-scale software components and optimize their performance.

What you'll be doing:

  • Investigate and triage failures within large-scale compute clusters, performing deep-dive analysis to distinguish between software glitches, configuration errors, and hardware faults.

  • Analyze logs and telemetry to correlate specific job failures to system-level issues and diagnostic test failures, helping to reduce noise and identify root causes.

  • Assist with tracking, calculating, and reporting key reliability metrics, specifically Mean Time Between Failures (MTBF) and Mean Time Between Interruptions (MTBI), to drive infrastructure improvements.

  • Assist in analyzing large-scale workload issues, searching for application and infrastructure improvement opportunities to ensure jobs run as fast and reliably as possible.

  • Work closely with a mentor to learn about hardware validation suite architecture, document debugging methodologies, and help the team make intelligent, data-backed engineering decisions.
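
For readers unfamiliar with the reliability metrics named above, here is a minimal sketch of how MTBF and MTBI are commonly computed: total operating time divided by the number of failures or interruptions in the same window. The function name, observation window, downtime, and event counts are all hypothetical values chosen for illustration. They do not describe NVIDIA's tooling or data.

```python
from datetime import datetime, timedelta
from typing import Optional


def mean_time_between(total_uptime: timedelta, event_count: int) -> Optional[timedelta]:
    """Return total_uptime / event_count, or None when no events occurred."""
    if event_count <= 0:
        return None  # the metric is undefined without at least one event
    return total_uptime / event_count


# Hypothetical 30-day observation window for one cluster (illustrative only).
window_start = datetime(2026, 1, 1)
window_end = datetime(2026, 1, 31)
downtime = timedelta(hours=10)  # assumed time spent out of service
failures = 5                    # assumed hardware/system failures
interruptions = 12              # assumed job interruptions from any cause

uptime = (window_end - window_start) - downtime
mtbf = mean_time_between(uptime, failures)
mtbi = mean_time_between(uptime, interruptions)

print(f"MTBF: {mtbf}")
print(f"MTBI: {mtbi}")
```

Returning None for a window with zero events avoids a division by zero and keeps "no failures observed" distinct from a very large MTBF.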

What we need to see from you:

  • Pursuing a BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related field.

  • Proficiency in Python and Bash/Shell scripting for automation and tool development.

  • Proven debugging skills with an ability to isolate issues in complex, distributed systems.

  • Exposure to high-performance computing (HPC) environments, cluster managers (e.g., Slurm, Kubernetes), or large-scale distributed systems.

Ways to stand out:

  • Familiarity with server architecture (PCIe, NVLink, CPU/GPU interactions) and hardware diagnostics.

  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).

  • Familiarity with system profiling and debugging tools (e.g., strace, gdb, perf).

  • Experience running and analyzing standard industry benchmarks on Linux systems.

  • Desire to learn and be part of a committed and hardworking team with excellent collaboration and communication skills.

  • Ability to multitask effectively in a dynamic, high-performance environment.

NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

Our internship hourly rates are standardized based on the position, your location, year in school, degree, and experience. The hourly rate for our interns is 20 USD - 71 USD.

You will also be eligible for Intern benefits.

Applications for this job will be accepted at least until May 31, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About the Company

NVIDIA Corp

Visualize your future . . . We Do
NVIDIA is the world leader in graphics processing technologies, creating innovative, industry-changing products for computing, consumer electronics, and mobile devices. NVIDIA products are transforming visually-rich applications such as video games, film production, broadcasting, industrial design, space exploration, and medical imaging. We invest in our people and our technologies, support and fund industry research around the world, and consistently deliver high-quality products. NVIDIA's culture promotes and inspires a team of world-class employees to be at the top of their game. We've created an environment where talents are recognized and collaboration is valued. Our employees are shaping the world of tomorrow. . . today. We invite you to explore the opportunities available at NVIDIA to see what your future may hold.

COMPANY SIZE
10,000 employees or more
INDUSTRY
Computer Software
FOUNDED
1993
WEBSITE
http://www.nvidia.com