(USA) Distinguished, Software Engineer-AI/ML Engineer - Agentic Systems & Site Reliability Engineering

Walmart

Software Engineering, Data Science

United States · California, USA · Texas, USA · Tampa, FL, USA · Sunnyvale, CA, USA · Remote

Posted on Jun 26, 2026

Position Summary...

As a Distinguished AI/ML Engineer within Walmart Global Tech's Site Reliability Engineering organization, you will lead the technical development of next-generation agentic AI systems and intelligent automation solutions that ensure mission-critical reliability, scalability, and operational excellence across Walmart's entire technology ecosystem. You will architect and implement cutting-edge machine learning platforms and autonomous agents that revolutionize how we monitor, predict, and automatically resolve issues across all of Walmart's systems, supporting millions of Associates and customers globally.

Walmart Global Tech's Site Reliability Engineering organization is made up of hybrid systems and software engineers who take technical ownership of reliability, scalability, automation, and mission-critical issues related to the uptime, availability, and rapid improvement of Walmart's e-commerce, stores, and omni-channel platforms. As a technical expert in this domain, you'll drive the transformation of traditional SRE practices into AI-powered, self-healing, and autonomous systems built on modern tech stacks with intelligent capacity management and predictive performance optimization.

You'll be responsible for designing and building Tier 0 high-availability, resilient agentic platforms that serve as the backbone for reliability engineering across all of Walmart's systems, stores, and facilities in US and international markets. You'll also define and implement unified, intelligent, operationally robust technical solutions and tools for all Walmart Technology organizations across all channels and geographies.

About Team:

The Site Reliability Engineering organization at Walmart Global Tech is responsible for ensuring the reliability, availability, and performance of all systems that power the world's largest retailer. Because Walmart is a Fortune #1 company, our work impacts hundreds of millions of customers and associates globally through every transaction, every search, and every interaction across Walmart's digital and physical ecosystem.

We are the guardians of system reliability for Walmart's e-commerce platform, supply chain systems, in-store technology, financial services, and all critical business operations. Our SRE organization is at the forefront of applying cutting-edge AI/ML technologies to traditional reliability engineering challenges, building autonomous systems that can predict, prevent, and resolve issues before they impact customers or business operations.

The SRE team is one of the core engineering organizations within Walmart Global Tech, working closely with all product and engineering teams across the enterprise to ensure that every system meets the highest standards of reliability, scalability, and performance. We're invested in building a robust, intelligent, and highly automated infrastructure that supports Walmart's mission to help people live better through technology innovation and operational excellence.

What you'll do...

AI/ML & Agentic Systems Technical Leadership:

  • Architect and develop advanced agentic AI systems that can autonomously handle complex reliability engineering workflows, predictive failure analysis, and self-optimization across all Walmart technology systems.
  • Design and implement multi-agent orchestration platforms that coordinate between different AI agents for automated incident response, capacity planning, and performance optimization across e-commerce, supply chain, and in-store systems.
  • Build intelligent observability and monitoring systems using ML-driven anomaly detection, predictive analytics, and autonomous incident resolution capabilities that span all of Walmart's technology ecosystem.
  • Develop self-healing infrastructure platforms that leverage AI to predict, prevent, and automatically resolve system issues before they impact customers, associates, or business operations across any Walmart system.

Site Reliability Engineering Technical Excellence:

  • Design, write, and build advanced tools to improve the reliability, latency, availability, and scalability of all Walmart Tech systems, including:
    • Engineering reliability and availability, starting with metrics and measurements across all domains
    • Enabling scaling by providing technical solutions, developing automation, and/or optimizing processes for all engineering teams
    • Building tools and automation to prevent recurrence of problems across all mission-critical Walmart services
    • Augmenting existing instrumentation to build a cohesive picture of system characteristics across the entire Walmart technology landscape, with special attention to points of failure
  • Architect and implement fault-tolerant systems and services across Walmart's hybrid cloud infrastructure with focus on autonomous recovery and intelligent failure prediction for e-commerce, supply chain, financial services, and in-store technology.
  • Collaborate with engineering teams and leadership across all Walmart technology organizations to establish technical strategies and solutions to improve mean time to detect (MTTD) and mean time to restore (MTTR) through intelligent automation and predictive capabilities.
  • Work with service owners across all domains (e-commerce, supply chain, stores, fintech, etc.) to define SLOs and build SLIs to ensure all critical systems are meeting SLAs while maintaining optimal performance and user experience.
  • Perform complex troubleshooting and analysis of large-scale distributed systems across Walmart's entire technology stack, using expertise in coding, algorithms, and distributed system design.

Strategic Technical Innovation:

  • Partner closely with all engineering organizations including E-commerce, Supply Chain, Store Technology, Fintech, and Data Platform teams to deliver autonomous reliability solutions through advanced machine learning, natural language processing, and computer vision technologies.
  • Drive the development of MLOps and AIOps platforms that enable continuous learning, model deployment, monitoring, and autonomous optimization of reliability engineering systems across all Walmart domains.
  • Innovate in agentic AI technologies for SRE including large language models (LLMs) for automated incident response, reinforcement learning agents for capacity optimization, multi-modal AI for infrastructure monitoring, and federated learning for cross-domain reliability insights.
  • Implement advanced CI/CD pipelines for reliability systems including automated deployment, validation, and rollback mechanisms for SRE tools and monitoring systems with built-in observability and performance monitoring.
  • Establish platform engineering excellence by building reusable SRE infrastructure, intelligent monitoring platforms, and developer productivity tools that serve all Walmart engineering teams.
  • Provide technical mentorship and guidance to engineering teams across all Walmart organizations on advanced SRE concepts, AI/ML for reliability, platform engineering best practices, and autonomous system design through code reviews, technical discussions, and knowledge sharing.

What you'll bring:

Education & Experience:

  • Bachelor's/Master's degree in Engineering, Computer Science, or related field with 12+ years of hands-on experience in Site Reliability Engineering, AI/ML Engineering, or Platform Engineering.
  • Proven track record as a senior individual contributor in SRE, AI/ML, or Platform Engineering with experience influencing technical decisions and driving technical excellence across teams.
  • Deep experience working with mission-critical systems with KPI expertise in MTTD, MTTR, availability, model performance, and autonomous system reliability.

Must-Have Technical Experience:

  • Expert-level AI/ML engineering experience with deep expertise in machine learning algorithms, deep learning frameworks (TensorFlow, PyTorch), and production ML system deployment at scale.
  • Advanced experience with agentic AI systems including multi-agent frameworks, autonomous decision-making systems, LLM-based agents, and agent orchestration platforms.
  • Comprehensive Site Reliability Engineering expertise including hands-on experience with Service Management (Incident, Problem & Change Management) and Performance and Capacity Engineering for AI/ML systems.
  • Expert-level cloud engineering experience (Azure, GCP, AWS) with deep knowledge of cloud-native AI/ML services, containerization (Kubernetes, Docker), and serverless architectures.
  • Deep observability and monitoring expertise with hands-on experience in:
    • Distributed tracing (Jaeger, Zipkin, OpenTelemetry) for AI/ML pipelines
    • Metrics collection and alerting (Prometheus, Grafana, DataDog) with ML-specific dashboards
    • Log aggregation and analysis (ELK stack, Splunk, Fluentd) for model and system monitoring
    • APM tools and performance monitoring for AI/ML workloads
    • AI-driven anomaly detection and predictive monitoring systems
  • Platform Engineering experience including:
    • Building developer platforms and internal tooling for AI/ML teams
    • Infrastructure as Code (Terraform, CloudFormation, Pulumi)
    • Service mesh architectures (Istio, Linkerd) for AI/ML services
    • API gateway and microservices platform development
    • Self-service ML deployment platforms and developer productivity tools

Industry & Domain Experience:

  • Experience in large-scale retail, e-commerce, or high-traffic consumer-facing systems with strict availability and performance requirements (strongly preferred).
  • Experience with mission-critical distributed systems serving millions of concurrent users across multiple domains (e-commerce, payments, inventory, supply chain, etc.).
  • Experience with enterprise-scale SRE implementations supporting diverse technology stacks and business-critical applications across multiple organizational domains.
  • Experience with complex multi-cloud and hybrid cloud environments supporting diverse workloads with varying reliability and performance requirements.

Technical Leadership & Collaboration Skills:

  • Technical thought leadership and influence in AI/ML architecture decisions, SRE methodologies, and platform engineering strategies across all Walmart technology domains.
  • Strong cross-functional collaboration experience working with diverse engineering teams across E-commerce, Supply Chain, Store Technology, Fintech, Security, and Platform Engineering to deliver enterprise-wide reliability solutions.
  • Excellent technical communication skills with ability to articulate complex SRE and AI/ML concepts to diverse engineering audiences and influence technical decisions across multiple organizations.
  • Mentorship and knowledge sharing experience, providing technical guidance on SRE best practices, AI/ML for reliability, and platform engineering through code reviews, technical discussions, and documentation.
  • High degree of technical ownership and accountability for complex, mission-critical reliability systems with ability to work independently on high-impact projects that span multiple engineering domains.

Preferred Technical Experience:

  • MLOps and model lifecycle management experience with tools like MLflow, Kubeflow, Seldon, or similar platforms for enterprise-scale reliability and monitoring deployments.
  • Natural Language Processing and Computer Vision expertise for building intelligent log analysis, automated incident response, visual infrastructure monitoring, and conversational AI for SRE operations.
  • Edge computing and distributed systems experience for deploying monitoring and reliability solutions across retail stores, distribution centers, and edge infrastructure.
  • Real-time streaming and event-driven architectures using Kafka, Pulsar, or similar technologies for processing high-volume operational data streams across all Walmart systems.
  • Advanced security practices for reliability systems including secure monitoring, data privacy in observability, and secure multi-tenant SRE platforms.
  • Chaos Engineering and fault injection experience across diverse system types including e-commerce, supply chain, financial services, and in-store technology.
  • Performance optimization for large-scale distributed systems including database optimization, network performance tuning, and infrastructure cost optimization.
  • Open source contribution experience in SRE, observability, and infrastructure automation tools and familiarity with industry best practices and emerging technologies.

At Walmart, we offer competitive pay as well as performance-based bonus awards and other great benefits for a happier mind, body, and wallet. Health benefits include medical, vision and dental coverage. Financial benefits include 401(k), stock purchase and company-paid life insurance. Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting. Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.

You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes. The amount you receive depends on your job classification and length of employment. It will meet or exceed the requirements of paid sick leave laws, where applicable. For information about PTO, see https://one.walmart.com/notices.

Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities. Programs range from high school completion to bachelor's degrees, including English Language Learning and short-form certificates. Tuition, books, and fees are completely paid for by Walmart.

Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to a specific plan or program terms.

For information about benefits and eligibility, see One.Walmart.
The annual salary range for this position is $169,000.00 - $338,000.00. Additional compensation includes annual or quarterly performance bonuses. Additional compensation for certain positions may also include:
- Stock

Minimum Qualifications...

Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.

Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 6 years’ experience in software engineering or related area.
Option 2: 8 years’ experience in software engineering or related area.

Preferred Qualifications...

Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.

Master’s degree in computer science, computer engineering, computer information systems, software engineering, or related area and 4 years’ experience in software engineering or related area.
We value candidates with a background in creating inclusive digital experiences, demonstrating knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, assistive technologies, and integrating digital accessibility seamlessly. The ideal candidate would have knowledge of accessibility best practices and join us as we continue to create accessible products and services following Walmart’s accessibility standards and guidelines for supporting an inclusive culture.

Primary Location...

1345 Crossman Ave, Sunnyvale, CA 94089-1114, United States of America

Walmart and its subsidiaries are committed to maintaining a drug-free workplace and have a no-tolerance policy regarding the use of illegal drugs and alcohol on the job. This policy applies to all employees and aims to create a safe and productive work environment.