Systems Development Engineer, Elastic Disaster Recovery, AWS Elastic Disaster Recovery
Santa Clara, CA, USA
Description
We are looking for a Systems Development Engineer to build the automation, tooling, and operational infrastructure that keep AWS Elastic Disaster Recovery (DRS), a large-scale, mission-critical service, reliable, secure, and efficient. In this role you will treat operations as a software problem: eliminating manual toil, hardening our deployment and monitoring systems, and keeping our replication and recovery fleet running reliably across a broad, heterogeneous environment. A key dimension of this role is breadth. DRS supports a wide range of operating systems (multiple Linux distributions and Windows versions) and both x86-64 and ARM64 (Graviton) architectures, so your automation and tooling must be robust across diverse OS and hardware combinations.
Key job responsibilities
* Operational automation: Design and build software that automates infrastructure provisioning, deployments, and recurring operational workflows, reducing manual effort and on-call burden across the DRS fleet.
* CI/CD and deployment safety: Build and improve pipelines, deployment guardrails, and rollback mechanisms to ship changes safely across all regions and platform variants.
* Cross-platform support: Develop and maintain tooling that works reliably across a wide range of operating systems (multiple Linux distributions and Windows versions) and both x86-64 and ARM64 (Graviton) architectures.
* Monitoring and resilience: Implement monitoring, alarming, and self-healing systems to detect and remediate issues before they impact customers' replication and recovery operations.
* Scaling and performance: Tune and scale the systems behind continuous replication, capacity management, and recovery orchestration to handle growth gracefully.
* Operational excellence: Drive down ticket and incident volume through durable, programmatic fixes; lead root-cause analysis and contribute to runbooks and operational best practices.
* Security and compliance: Partner with security teams to harden the service and remediate findings, ensuring fixes are deployed consistently across the fleet.
* Cross-team leverage: Build automation and tooling that serves multiple teams and raises the operational bar across DRS.
About the team
AWS Elastic Disaster Recovery (DRS) enables organizations to minimize downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications. DRS uses cost-effective AWS resources to maintain an up-to-date copy of source servers on AWS, allowing point-in-time recovery and failback to the primary site after an issue is resolved.