Sr. Software Engineer--M365 Foundation
Microsoft
The Substrate Fleet Health team is engineering the future of cloud reliability and of efficient health and capacity management for the Substrate Fleet. We are a high-impact team driving innovation in hardware health, fleet lifecycle management, intelligent repair systems, and proactive capacity optimization to ensure Microsoft’s hyperscale infrastructure operates at peak performance.
Our mission is bold:
- Maximize fleet availability through proactive detection and mitigation of hardware issues.
- Accelerate repair intelligence with AI-driven insights and automation, reducing repair times from hours to seconds.
- Optimize spare machine utilization and capacity forecasting across global datacenters, unlocking millions in cost savings and enabling sustainable growth.
- Enhance fleet lifecycle management by predicting failures, improving component health, and reducing stranded capacity.
We are building next-generation solutions like RepairBox vNext, Fleet Health Copilot, Unified Spare Pool, and Smart Recovery Services—systems that integrate telemetry, predictive analytics, and automation to transform how cloud infrastructure is managed and scaled.
Our culture values:
- Innovation: We challenge the status quo and pioneer AI-driven solutions for hardware health and capacity optimization.
- Collaboration: We work across Substrate, Azure, and vendor ecosystems to solve complex global challenges.
- Ownership: We take pride in delivering resilient, scalable systems that power Microsoft’s cloud.
Joining this team means shaping the backbone of Microsoft’s cloud reliability and capacity strategy. You’ll be part of a group that doesn’t just respond to issues—we anticipate them, solve them, and set new standards for operational excellence.
As a Senior Software Engineer, you will play a key technical leadership role in building Hardware Health & Repair Intelligence—a transformative initiative focused on predictive hardware health, intelligent repair workflows, and proactive fleet capacity management.
Why Join Us
- Lead high-impact work at the intersection of cloud reliability, AI, and operational excellence.
- Tackle some of the most critical challenges in fleet health and capacity optimization at hyperscale.
- Be part of a mission-driven team that values innovation, collaboration, and bold execution.
- Influence how Microsoft builds and operates its cloud—smarter, faster, and more sustainably.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities
- Lead architecture and design for intelligent repair and fleet optimization systems, including RepairBox vNext and Fleet Health Copilot.
- Drive development of AI-powered telemetry pipelines and automation frameworks for predictive diagnostics and lifecycle management.
- Establish capacity forecasting and spare pool optimization strategies across global datacenters.
- Ensure security, scalability, and operational excellence across all solutions, including live-site readiness and DRI pathways.
- Collaborate with Azure, vendor, and platform teams to align technical solutions with business goals.
Qualifications
Required Qualifications:
- Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR equivalent experience.
Other Requirements:
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
Preferred Qualifications:
- Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR equivalent experience.
- Expertise in distributed systems, cloud infrastructure, and large-scale automation.
- Solid background in AI/ML-driven telemetry, anomaly detection, and predictive analytics.
- Experience with capacity planning, hardware lifecycle management, and hyperscale reliability preferred.
- Excellent communication and collaboration skills.
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.