Senior Software Engineer
Microsoft
Multiple Locations, United States
Overview
The Azure Kubernetes Service (AKS) team is responsible for running Kubernetes at global cloud scale. On AKS, millions of containers are started, healed, and routed to serve production traffic every day. The team delivers essential control-plane and data-plane capabilities, and the work directly impacts reliability, performance, and developer productivity for customers around the world.
As a Senior Software Engineer on Azure Kubernetes Service, you will design, build, and operate cloud services that provision, upgrade, secure, and monitor Kubernetes clusters across global infrastructure. This role involves working across distributed systems, networking, storage, and platform automation to deliver resilient customer experiences. It offers opportunities to grow your expertise in large-scale systems, deepen your knowledge of Kubernetes and cloud engineering, and strengthen your skills in Site Reliability Engineering (SRE) practices. Flexible work arrangements are supported, including hybrid and partial remote options.
This position is ideal for individuals interested in building scalable, secure, and reliable cloud-native solutions. You will collaborate with a diverse team to solve complex technical challenges and contribute to the evolution of Microsoft Azure’s container orchestration capabilities. The work is impactful, fast-paced, and aligned with the needs of developers and enterprises worldwide.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Qualifications
Required Qualifications:
- Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, Golang, C, C++, C#, Java, JavaScript, or Python
- OR equivalent experience.
- 1+ year(s) experience building or operating distributed systems or cloud services in production environments, including:
- Microservices architecture
- Remote Procedure Call (RPC) frameworks
- Messaging systems
- Data store technologies
- 1+ year(s) experience working with containerization and orchestration technologies such as Docker and Kubernetes, along with foundational Linux knowledge in:
- Networking
- Process management
- Storage systems
- 1+ year(s) experience owning services in production environments, including:
- On-call responsibilities or Designated Responsible Individual (DRI) duties
- Monitoring and incident response
- Post-incident analysis and continuous improvement
Other Requirements:
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Preferred Qualifications:
- Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, Golang, C, C++, C#, Java, JavaScript, or Python
- OR Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, Golang, C, C++, C#, Java, JavaScript, or Python
- OR equivalent experience.
- 1+ year(s) experience with systems programming and container orchestration, including:
- Proficiency in Go (Golang) and/or C# for cloud services
- Familiarity with Kubernetes internals such as controllers, webhooks, Custom Resource Definitions (CRDs), scheduler, and kubelet
- Knowledge of cloud networking and storage technologies including Container Network Interface (CNI), load balancers, virtual networks (VNETs), Domain Name System (DNS), Ingress, Container Storage Interface (CSI), disks/files, and snapshots
- Experience with infrastructure-as-code tools such as Azure Resource Manager (ARM), Bicep, and Terraform, and continuous integration/continuous delivery (CI/CD) pipelines
- 1+ year(s) experience applying reliability engineering practices, including:
- Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
- Chaos and upgrade testing
- Capacity and performance tuning
- Telemetry pipelines and observability tools such as Kusto, Prometheus, and Grafana
Software Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800 - $234,700 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $158,400 - $258,000 per year.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
Microsoft will accept applications for the role until October 28, 2025.
Responsibilities
- Collaborate with product managers, architects, and partner teams to clarify scenarios and user requirements for AKS features and platform investments.
- Drive design for new or improved AKS components (e.g., cluster lifecycle, upgrades, networking/CNI, storage/CSI, policy, security, observability) including dependency mapping, design docs, and API contracts.
- Create, implement, optimize, and refactor production code and automation to improve reliability, performance, maintainability, and cost efficiency across control-plane and data-plane services.
- Leverage subject-matter expertise in Kubernetes and Azure to plan releases, break down work, and lead execution across a workgroup; provide technical mentorship and code reviews.
- Act as a Designated Responsible Individual (DRI): participate in on-call, follow runbooks/playbooks, monitor for degradation, triage incidents, communicate status, and drive mitigations/RCAs for complex issues.
- Proactively adopt new patterns and technologies to improve availability, reliability, efficiency, observability, and performance; champion consistency in telemetry, alerting, and operations at scale.
- Uphold security and compliance best practices (least privilege, secrets management, supply-chain security, vulnerability remediation) across services and CI/CD.