Neuron Collectives Software Engineer, Trainium Collectives

Amazon

Software Engineering

Cupertino, CA, USA

Posted on Apr 8, 2026

Description

As a Neuron Collectives Software Engineer, you will:

* Enhance collective algorithms and topologies for optimal training performance
* Use tools like Neuron Explorer to identify bottlenecks in compute and bus bandwidth utilization
* Monitor and analyze processor, DMA, firmware, and workload metrics
* Optimize collective operations to scale AI compute across the data center
* Work closely with the hardware team to co-optimize software and Trainium silicon
* Develop and optimize C/C++ implementations of collective communication patterns
* Investigate and implement improvements for specific training topologies used by modern LLMs
* Build and maintain analysis frameworks and automation solutions

The role offers opportunities to work on cutting-edge AI training hardware while contributing to one of Amazon's most critical initiatives.


A day in the life
Annapurna Labs, a crucial part of AWS, is responsible for developing hardware and software components for EC2 infrastructure. Our team focuses on building networking solutions for Machine Learning (ML) and High-Performance Computing (HPC) workloads on AWS.

We have mixed-discipline orgs: you’d be working side by side with infrastructure experts, hardware engineers, RTL engineers, scientists, and architects. Our workforce spans the globe and is truly international, so you’ll find yourself working with individuals from numerous countries. We take mentorship seriously: you can expect mentorship from senior engineers, and you will be expected to mentor new and junior engineers.

The pace is fast as we work on the latest advancements in AI/ML, but we take the time to bond as a team and enjoy the successes. We offer flexibility in working hours and respect work-life balance (WLB) as a core org tenet. The team works with numerous principal-level engineers and closely with directors, so career growth opportunities are available. This is a role where you will always be encouraged to keep learning, because the AI/ML field is fast-moving and constantly evolving.

About the team
Annapurna Labs, part of AWS, created Trainium as a purpose-built AI training chip to revolutionize machine learning at Amazon scale. The Neuron Collectives team owns the software stack that enables collective operations — the communication primitives that allow AI training to scale across thousands of chips in the data center. Our work is essential to training the frontier models that power AI today. We work closely with hardware teams to extract maximum performance from Trainium, ensuring that compute and interconnect bandwidth are fully utilized. Our team sits at the intersection of hardware, firmware, and distributed systems.