ML Compute Efficiency Automation Engineer, Infrastructure & Planning
Software Engineering, Other Engineering, Data Science
Cupertino, CA, USA
USD 181,100-318,400 / year + Equity
Posted on Jun 18, 2026
Apple’s Platform Acceleration & Compute Efficiency (PACE) is a high-leverage team operating at the intersection of our ML organizations, underlying compute infrastructure, and core platform tooling. Our mission is to empower Apple’s software engineering teams with efficient, scalable compute. By driving out operational friction and optimizing the broader machine learning ecosystem, we directly accelerate the pace of development for our Software and AIML organization.
Foundation models are central to Apple's user experiences, and maximizing the efficiency of our ML compute is paramount. Compute efficiency sits at the center of this role, ensuring that Apple’s models run as quickly, reliably, and cost-effectively as possible. In this role you will tackle optimization challenges, from maximizing hardware utilization across GPUs, TPUs, and custom Apple Silicon to shaping workload scheduling and capacity allocation for large model serving.
We are looking for a particular kind of builder: an exceptional engineer who can think through hard problems and code their way past them, especially the ones involving scale and the slow manual work that quietly drains a high-leverage team. The ideal candidate treats every repeated process as a system waiting to be automated, every manual escalation as a system not yet built, and every prioritization request as a problem the right tooling can solve faster. The resulting data forms a foundation for the rapid, high-quality decisions that empower Apple's technical and business leaders.
This is a founding role. The majority of your time goes to AI automation, building the systems that turn manual operations into tooling that runs and corrects itself. Your remaining time will go toward hands-on ML compute efficiency, working alongside senior ML efficiency engineers directly on the optimization problems behind the numbers.
You will share ownership of PACE's governance and operations with our tools team, which is actively building solutions with AI. The work a traditional operations team would grind through by hand, things like resource requests, allocation tracking, escalations, and efficiency reporting, you will turn into systems that run themselves and watch themselves. When you have done it well, the busywork is gone and PACE moves faster than its size says it should. Your challenging work will result in high development velocity and efficient compute, accelerating not only Apple but also your career.
- Govern compute as code. Build the systems of record for resource requests, allocations, and utilization, accurate and at scale, so leadership can trust the numbers.
- Hunt down ML inefficiency. Dig into inference and training workloads across GPUs, TPUs, and custom Apple Silicon, find where compute is wasted, trace it to a cause, and drive the fix.
- Work the real optimization problems: scheduling, capacity allocation, and serving cost, alongside the engineers who own those systems.
- Get rid of the toil. Replace the time-sink workflows (triage, reporting, reconciliation) with systems that handle the routine and pull a person in only when judgment matters. Drive manual escalations toward zero instead of standing up a tiered on-call org.
- Make the data useful. Build the telemetry, schemas, and anomaly detection that surface efficiency and cost opportunities, then wire them into tooling that acts rather than just files a report.
- Rebuild what breaks at scale. When a process buckles under Apple-scale ML demand, re-architect it so it grows with usage instead of headcount.
- Make a lasting impact. Turn what you build into reusable tooling so the rest of the team benefits without coming back to you each time.
- BS in Computer Science, Computer Engineering, or equivalent practical experience.
- 6 or more years building production software, automation, tooling, or data and infrastructure systems.
- A problem solver who builds first. You have designed things from scratch to wipe out manual work or get past a scale ceiling, and you can show us something you built.
- Fluent with AI tooling. Coding assistants are part of how you already work, not something you have only read about.
- Strong programming skills, in Python or similar, for automation, pipelines, and tooling.
- SQL, plus dashboards or data products in something like Tableau, Looker, or Grafana.
- Experience designing data models or telemetry schemas for infrastructure, capacity, or utilization data.
- Experience running complex systems in a large-scale compute, cloud, or infrastructure environment.
- Good judgment about where not to automate, and experience putting guardrails on systems that act on their own.
- Strong cross-team collaborator who moves work forward through influence rather than authority, and is comfortable owning systems others rely on daily.
- Production experience shipping automated or autonomous workflows
- Understanding of ML training and inference infrastructure, GPU and TPU utilization, training throughput, scheduling efficiency, and foundation model serving
- Experience building automated alerting or anomaly detection for infrastructure metrics
- Experience with FinOps, capacity planning, cloud cost management, or IT governance
- Knowledge of Django/Postgres
- Love for open-ended "go figure it out and build it" projects