Site Reliability Engineer - Cloud

NVIDIA

Software Engineering
Santa Clara, CA, USA
USD 136,000-212,750 / year + Equity
Posted on May 6, 2025

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing — with the GPU acting as the brains of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company.” We're looking to grow our company and build our teams with the smartest people in the world. Join us at the forefront of technological advancement.

NVIDIA is looking for an outstanding Site Reliability Engineer to join its Digital Marketing Organization. SREs are responsible for ensuring that all of NVIDIA’s Digital Marketing Services are reliable, fast, and efficient, all the time. The person in this position will be responsible for leading and improving our AWS infrastructure and scripting. This individual will have all existing tools and software at their fingertips to help monitor the platform and services. In addition, the SRE will assist in developing or improving monitoring and alerting tools, test plans, and updates to existing infrastructure, and will further automate our deployment pipeline. The ideal candidate will be encouraged to act on and respond to urgent issues and outages (globally and regionally), validate the deployment process, tackle issues in the field, and, most significantly, communicate current status to both internal and external stakeholders. Want to make a difference? Come join us!

What you will be doing:

  • Rapidly debug and triage user-reported issues across the Digital Marketing Organization's services.

  • Onboard new applications and services onto AWS infrastructure.

  • Make valuable contributions to the overall health, performance, and uptime of our services running on Linux and Windows.

  • Implement monitors, alerts, and SOPs to ensure early detection of, and accurate response to, service-impacting issues.

  • Take ownership of automation, scripting, and tooling for new and existing scripts to help the team achieve 100% automation of daily tasks.

What we need to see:

  • MS or BS in Computer Science/Engineering or a related field or equivalent experience.

  • You will need 5+ years of experience supporting technical operations in a live-site production environment with a real passion for automation and tooling.

  • Experience building and running critical production services, packaged or custom (Python/Java), on Windows or Linux.

  • Strong knowledge of the Kubernetes platform, deployments, and automation.

  • Ability to make valuable contributions to the incident management process: early detection of all service-impacting issues, accurate triage, partner communication, impact containment, service restoration, and post-incident follow-up. SRE on-call experience is a must.

  • Advanced experience with scripting and development in Python, including fully automating multi-step tasks into rapid “one-click” solutions.

  • Demonstrated strengths in problem-solving and root-cause analysis, while continuously seeking ways to drive optimization, efficiency, and the bottom line.

Ways to stand out from the crowd:

  • Strong experience with the AWS cloud platform and with Kubernetes as a platform, plus SRE on-call experience.

  • Excellent communication, presentation, social, and analytical skills; the ability to communicate sophisticated interaction concepts clearly and persuasively across different audiences and varying levels of the organization.

With competitive salaries and a generous benefits package, we are widely considered to be one of the technology world’s most desirable employers, and we have some of the most forward-thinking and hardworking people in the world working for us. Our best-in-class teams are growing rapidly, so if you're creative and autonomous with a real passion for technology, we want to hear from you.

The base salary range is 136,000 USD - 212,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.