Member of Technical Staff - Research Engineer, Frontier AI Robotics
Amazon
Software Engineering, IT, Data Science
San Francisco, CA, USA
Description
At Frontier AI & Robotics, we're not just advancing robotics – we're reimagining it from the ground up. Our team is building the future of intelligent robotics through frontier foundation models and end-to-end learned systems. We tackle some of the most challenging problems in AI and robotics, from developing sophisticated perception systems to creating adaptive manipulation strategies that work in complex, real-world scenarios.
What sets us apart is our unique combination of ambitious research vision and practical impact. We leverage Amazon's computational infrastructure and rich real-world datasets to train and deploy state-of-the-art foundation models. Our work spans the full spectrum of robotics intelligence – from multimodal perception using images, videos, and sensor data, to sophisticated manipulation strategies that can handle diverse real-world scenarios. We're building systems that don't just work in the lab, but scale to meet the demands of Amazon's global operations.
Join us if you're excited about pushing the boundaries of what's possible in robotics, working with world-class researchers, and seeing your innovations deployed at unprecedented scale.
As a Senior Research Engineer embedded in our science team, you'll be instrumental in transforming innovative research into high-performance production systems. You'll collaborate directly with scientists to build and optimize large-scale transformer models for robotics applications.
In this role, you'll balance deep technical optimization work with strategic input on model architecture decisions, ensuring our robotics models are designed with performance in mind from the ground up. You'll leverage PyTorch, NVIDIA's acceleration stack, and other compilation techniques to tackle ambitious performance targets, working at the intersection of large language models and real-world robotics applications.
Key job responsibilities
- Design, implement, and optimize distributed training systems that scale across thousands of GPUs and nodes for large-scale training workloads.
- Develop high-performance optimizations to maximize throughput and efficiency.
- Develop reusable frameworks and libraries to improve training reproducibility, reliability, and scalability for new model architectures.
- Establish standards for reliability, maintainability, and security, ensuring systems are robust under rapid iteration.
- Collaborate with researchers to influence model architectures for optimal hardware utilization.
- Develop comprehensive benchmarking frameworks to measure and optimize model performance.
A day in the life
In this role, you will:
- Optimize transformer blocks using custom CUDA kernels and TensorRT optimization techniques
- Partner with scientists to analyze model architectures and propose efficiency improvements
- Implement and benchmark various optimization strategies for large-scale models
- Debug performance bottlenecks using NVIDIA profiling tools
- Participate in technical discussions about new model architectures with the science team
- Manage pre- and post-training runs and continuously improve system stability and throughput
- Prototype new acceleration approaches using emerging compilation frameworks