Senior Software Engineer - AI Observability - AI, Search & Knowledge Platform

Apple

Software Engineering, Data Science

Cupertino, CA, USA

USD 212,000-318,400 / year + Equity

Posted on May 12, 2026
Do you want to build the future of AI-enabled observability at Apple? We're looking for an experienced AI observability engineer to design and build the observability solutions behind Apple Intelligence, Search, and the AI infrastructure that powers Apple's intelligent products. We're at the forefront of building AI-first observability services, blending AI, cloud-first engineering, and industry standards to deliver smart, scalable solutions. Your work will directly impact the experience of billions of users on their favorite Apple devices. If you are a seasoned principal or senior software engineer with a proven track record of building AI-enabled observability solutions and a deep passion for observability, AI, cloud-native technologies, and large-scale distributed systems, we want to talk with you.
The AI, Search & Knowledge Platform Cloud Infrastructure Team within Apple’s Services organization designs, builds, and scales the foundational systems that power Search and next-generation machine learning workloads. We're pioneering the next generation of AI-powered observability solutions. While we innovate to build new solutions, we also leverage industry-standard open-source technologies. In this role, you will collaborate with a team of engineers to lead the design and development of user-facing observability features for AI/ML products and infrastructure. You will also provide technical guidance, share observability best practices and know-how, leverage AI pipelines, and mentor the team to develop and deliver best-in-class features and a delightful experience for all users.
  • 7+ years of software engineering experience building and operating large-scale, cloud-native, distributed systems and microservices in public cloud infrastructure and/or "private cloud" environments
  • 7+ years of software engineering experience and a strong background in computer science: distributed systems, algorithms and data structures, APIs, and highly scalable, reliable systems and microservices
  • Demonstrated experience using LLMs and ML models for AIOps and model observability
  • Hands-on experience building ML pipelines and portable workflows, and tuning models, to deploy ML models and LLMs in production for customer-facing features
  • Hands-on experience using LLMs, ML frameworks (e.g. TensorFlow, PyTorch), and libraries like scikit-learn, NumPy, LangChain, MLflow, Kubeflow
  • Experience building services for Observability Analysis, including anomaly detection, incident detection, automated remediation, and root-cause analysis
  • Excellent verbal and written communication, problem-solving, and cross-team collaboration skills, including with open-source communities
  • Knowledge of current Gen AI research and techniques: MCPs, RAG systems, Agentic AI (multi-agent orchestration, tool calling)
  • Hands-on experience with agentic AI frameworks (e.g. LangGraph, AutoGen, CrewAI) for building multi-step reasoning and tool-using agents
  • Experience designing multi-agent orchestration, tool-calling, or RAG systems for operational/diagnostic workflows
  • Demonstrated proficiency operating workloads on public and/or private cloud platforms, Kubernetes, object storage, networking, databases, and observability services
  • Demonstrated experience building observability systems for metrics, distributed tracing, logs, and profiling
  • Experience with large-scale observability visualization tools like Grafana, Datadog, and ELK
  • Experience building large-scale incident management, alert management, and notification systems
  • Active contributions to CNCF or open-source projects (e.g., K8sGPT, HolmesGPT, kagent, OpenTelemetry, Prometheus)