Senior Software Engineer - AI Inference
Bloomberg
Software Engineering, Data Science
New York, NY, USA
Posted on Apr 30, 2026
Our team:
Join the team building the core infrastructure for AI at Bloomberg. The Bloomberg AI Inference Platform provides production-grade managed infrastructure for hosting, deploying, and serving all machine learning models, from predictive models to cutting-edge generative ones. We abstract away infrastructure complexity so that engineering teams can focus on creating intelligent applications with guaranteed scalability, performance, and governance. Our platform is built on the open-source KServe project, and the CNCS AI Inference team is a primary contributor to its development.
We'll trust you to:
- Design and build scalable infrastructure for both online and offline inference workloads.
- Lead integration of high-performance inference runtimes and serving frameworks, including TensorRT, vLLM, ONNX, and Triton.
- Drive architecture and technical decisions across Bloomberg’s inference platform, balancing latency, throughput, reliability, and cost.
- Partner across engineering teams to improve model deployment, observability, and production performance.
- Mentor junior engineers on system design, debugging, and performance optimization.
You'll need to have:
- 5+ years of professional software engineering experience.
- Experience designing, building, and operating production distributed systems.
- Strong systems intuition and a track record of debugging and optimizing performance-critical services.
- Ability to own problems end-to-end and quickly ramp up in unfamiliar technical areas.
- 4+ years of demonstrated experience working with an object-oriented programming language.
- A degree in Computer Science, Electrical Engineering, or equivalent practical experience.
We'd love to see:
- Experience deploying and operating machine learning systems at scale.
- Experience with inference optimization techniques such as batching, caching, request scheduling, or memory-aware serving.
- Familiarity with PyTorch and GPU software stacks such as CUDA and NCCL.
- Exposure to high-performance interconnects and distributed computing technologies such as NVLink, InfiniBand, or MPI.
- Experience with Kubernetes and cloud-native infrastructure.
- Experience with load balancing, request routing, or traffic management systems.
Representative projects:
- Autoscaling a heterogeneous compute fleet to match supply and demand across diverse inference workloads.
- Building production-grade deployment pipelines to safely roll out new models to millions of users.
- Developing new inference capabilities such as structured sampling, prompt caching, and advanced serving optimizations.
- Analyzing observability data from real production workloads to improve latency, throughput, and resource efficiency.