Software Development Engineer, AWS Resilience, Health Guardian
Amazon
Software Engineering
Seattle, WA, USA
Description
AWS Infrastructure Services owns the design, planning, delivery, and operation of all AWS global infrastructure. In other words, we're the people who keep the cloud running. We support all AWS data centers and all of the servers, storage, networking, power, and cooling equipment that ensure our customers have continual access to the innovation they rely on. We work on the most challenging problems, with thousands of variables impacting the supply chain, and we're looking for talented people who want to help.
You'll join a diverse team of software, hardware, and network engineers, supply chain specialists, security experts, operations managers, and other vital roles. You'll collaborate with people across AWS to help us deliver the highest standards for safety and security while providing seemingly infinite capacity at the lowest possible cost for our customers. And you'll experience an inclusive culture that welcomes bold ideas and empowers you to own them to completion.
The HealthGuardian team is looking for a software engineer who is excited about building automated detection and mitigation systems that protect AWS infrastructure at scale. We detect subtle failures that evade traditional health checks and automatically remove affected resources from service before customers are impacted. Our systems run across every AWS region, and we're scaling coverage from hundreds of services to thousands. This is a hands-on position where you will design and deliver significant software components, drive cross-team technical alignment, and mentor other engineers. You need to be a strong software developer with a track record of delivering, and you must also excel in communication, technical leadership, and customer focus. You'll leverage generative AI tools as part of your daily workflow to accelerate design, development, and validation. This is an opportunity to join a small, high-impact team solving hard reliability problems and help shape both the technology and the direction of automated failure protection across AWS.
Key job responsibilities
Our engineers collaborate across diverse teams, projects, and environments to have a firsthand impact on AWS reliability. You'll bring a passion for distributed systems, safety engineering, and data-driven detection. You'll also:
- Design and deliver systems that span multiple AWS teams and organizational boundaries.
- Build detection algorithms and experimentation frameworks that validate changes at scale.
- Architect safety mechanisms (circuit breakers, throttling, validation) that let automation scale without unintended customer impact.
- Own ambiguous problems end-to-end from design through operations.
- Mentor other engineers and lead technical design reviews.
- Use AI-assisted development tools to prototype, test, and validate faster.
About the team
We are a small team with outsized impact on AWS reliability. We operate what we build, and every engineer has direct visibility into how their code performs during real infrastructure events. We solve complex distributed systems challenges to ensure automated protection works reliably even during the failures it's designed to detect. We value operational rigor, building systems that are safe by default, and solving hard problems with simple designs.