Lead Site Reliability Engineer
The Walt Disney Company
Job Posting Title:
Lead Site Reliability Engineer
Req ID:
10129163
Job Description:
“We Power the Magic!” That’s our motto at Disney Experiences (DX). Our team creates world-class immersive digital experiences for the Company’s premier vacation brands including Disney’s Parks & Resorts worldwide, Disney Cruise Line, Aulani, a Disney Resort & Spa, and Disney Vacation Club.
We are responsible for the end-to-end digital and physical Guest experience for all technology & digital-led initiatives across the Attractions & Entertainment, Food & Beverage, Resorts & Transportation and Merchandise lines of business as well as other initiatives including MyDisneyExperience and Hey, Disney!
This role sits in the Commerce Shared Services organization within Technology & Digital for Disney Experiences. It works closely with Technical Operations and Product Delivery from across the company.
The Lead Site Reliability Engineer will report to the Manager-Site Reliability Engineer.
About The Role & Team:
This is a team lead role that focuses on engineering and reliability with a team of site reliability engineers. You will be responsible for coordinating the team's efforts across the portfolio of applications it supports. This team needs a strong mentor who can help develop and execute specific reliability plans in line with the business strategy of DX Tech and Digital.
What You'll Do:
Lead the evolution of DevOps practices within the broader team framework, guiding others in leveraging this culture to enhance observability practices.
Consult, design, build, and support development pipelines, automate infrastructure and operations, create telemetry for monitoring, engineer high reliability, and reinforce best practices to secure company data.
Apply expert systems administration skills on AWS Cloud, Docker, and Kubernetes, along with extensive experience in web technologies and source control management using Nimbus, ECS, Tomcat, Harness, GitHub and GitLab.
Develop and advocate strategic directions for reliability, observability and recovery and bring practical knowledge on systems, network, operational excellence and application stability, security, performance, and capacity management.
Plan and coordinate larger efforts for the team of site reliability engineers.
You will be expected to stay up to date with emerging technologies so you can make informed recommendations.
Drive teams to consult, design, build, and support development pipelines, automate infrastructure and operations, build telemetry for monitoring, engineer high reliability, and reinforce best practices to secure company data.
Required Qualifications:
Minimum 7 years of related work experience
Demonstrated leadership in implementing observability principles across complex systems and environments, fostering a culture of reliability and resilience
Extensive experience with modern software delivery tools, including GitHub, GitLab, Harness.io, LaunchDarkly, Nimbus, and Kubernetes, and with optimizing workflows and ensuring seamless deployment processes
Outstanding communication and leadership abilities, to ensure effective growth and development of the team
A visionary who motivates teams to excel and fosters creativity, consistently driving excellence in all endeavors
An advocate for a diverse and inclusive culture that encourages innovation and ensures every team member feels a sense of belonging
Proficient in implementing observability principles and advanced tools for system enhancement, applying expertise in major APM tools
Fluent in core scripting languages, with advanced programming skills (Python, NodeJS, Golang); experienced with Linux, CLIs, and code editors like VS Code
Skilled in Source Control Management systems like GitHub and GitLab, including managing users and repos; proficient in networking protocols, distributed systems, and container platforms (e.g., Docker, ECS)
Experience in cloud hosting services (AWS, Google Cloud, Azure), databases, tools, and security, with experience in CI pipelines, build tools like Jenkins, RESTful web service calls, and JSON
Outstanding troubleshooting methodology, including teaching new methodologies to the team and evaluating new systems and infrastructure solutions for technical feasibility against standards
Preferred Qualifications:
Leveraging AI for predictive insights, driving measurable continuous improvement in system reliability
Required Education:
Bachelor’s degree in Computer Science, Information Systems, Software, Electrical or Electronics Engineering, or comparable field of study, and/or equivalent work experience
#DISNEYTECH
Job Posting Segment:
Technology & Digital
Job Posting Primary Business:
Commerce
Primary Job Posting Category:
Site/System Reliability Engineer
Employment Type:
Full time
Primary City, State, Region, Postal Code:
Orlando, FL, USA
Alternate City, State, Region, Postal Code:
Date Posted:
2025-08-21