Lead Site Reliability Engineer, Cloud Technology
JPMorganChase
Job Information
- Job Identification: 210643109
- Job Category: Software Engineering
- Business Unit: Corporate Sector
- Posting Date: 2025-07-10, 12:35 a.m.
- Locations: One@Changi 1 Changi Business Park Central 1, One @ Changi City, Singapore, 486036, SG
- Job Schedule: Full time
Job Description
Public Cloud SRE is responsible for engineering and operating the cloud infrastructure and platforms of JPMC, ensuring reliability, resiliency, and security. We have an open Senior Software Engineer, Site Reliability position to build the infrastructure and tooling for JPMC’s Public Cloud Platform.
As a Lead Site Reliability Engineer at JPMorgan Chase within Cloud Reliability Services, you hold a leadership role in your team, demonstrate strong knowledge across multiple technical domains, and advise others on the technical and business issues facing them. You take the lead in conducting resiliency design reviews, break down complex problems into digestible work for other engineers, act as a technical lead for medium to large-sized products, and provide advice and mentoring to other engineers.
Job responsibilities
- Engage in and improve the lifecycle of cloud services, from inception and design through deployment and operation
- Automate repeated manual tasks, and develop tools and automation to improve the efficiency of the platform and infrastructure.
- Analyze defects, propose improvements and drive efficiencies in systems and processes.
- Help develop new cloud engineering strategies and implementations for the firm
- Ensure the reliability, availability, and performance of the cloud infrastructure and platform as part of Site Reliability.
- Demonstrate site reliability principles and practices every day and champion the adoption of site reliability throughout your team
- Develop observability and telemetry tools.
- Author and improve the quality of technical engineering documentation
- Debug and solve issues in a production environment
- Participate in SRE on-call rotations and escalation workflows.
Required qualifications, capabilities, and skills
- Formal training or certification in software engineering or site reliability engineering and 5+ years of applied experience
- Bachelor’s Degree in Computer Science or equivalent
- Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform
- Expertise in building solutions with AWS cloud services, knowledge of Infrastructure as Code tools such as Terraform, and fluency in at least one programming language such as Python or Java
- Proficiency and experience in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform)
- Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker) and with troubleshooting common networking technologies and issues
- Ability to identify and solve problems related to complex data structures and algorithms
- Drive to self-educate and evaluate new technology, and the ability to teach team members
- Ability to collaborate across different levels and stakeholder groups, with excellent communication skills for working with stakeholders and domain experts across the company to design solutions to user problems
- Self-disciplined, self-managed, and self-motivated, with a strong sense of ownership, urgency, and drive
Preferred qualifications, capabilities, and skills
- AWS certifications will be a bonus.