Senior Software Engineer / Reliability Engineering - Real-time Data
Bloomberg
Software Engineering
London, UK
Posted on Apr 27, 2026
Our department is responsible for efficiently distributing financial data, such as stock prices and foreign exchange rates, from its source to interested users all around the world. Data can either be served in response to a request or streamed in real time.
The group owns:
- The distribution software and infrastructure
- A range of different sources of data
- Supporting services to administer and manage the system, including permissioning and metering
The team is also responsible for the Enterprise endpoint (“B-PIPE”), which allows end-users to programmatically consume data via our SDK. Data is also available through the Bloomberg Terminal and Microsoft Excel.
The main challenge faced by the group is one of scale. Data is sourced from more than 370 global exchanges, with a combined volume in excess of 60 billion messages each day. We deliver this data to hundreds of thousands of terminals and thousands of B-PIPEs. Handling this volume requires significant infrastructure: we manage multiple clusters in our main data centres, as well as a network of many thousands of servers around the world.
Group Overview
The RD Reliability Engineering group comprises three sub-teams located in Tokyo, London, and New York, providing follow-the-sun support.
Our mission is to ensure systems are reliable, scalable, and observable through software engineering, while continuously improving how systems behave under load and failure conditions. We work in an outcome-driven model, focusing on measurable improvements in availability, latency, capacity, and recovery. Our goal is to ensure systems meet defined service level objectives while minimising manual operational effort through automation and software solutions.
The systems we support must behave predictably under extreme load, recover quickly from failures, and continue to evolve without compromising stability - these are the core challenges we solve.
London Team Focus – Availability & Resiliency
The London team plays a key role in ensuring the availability and resiliency of RD infrastructure globally.
We focus on:
- Detecting and preventing failures across large-scale distributed systems
- Ensuring infrastructure demonstrates sufficient capacity and failover capability during site-loss scenarios
- Reducing time to detect, diagnose, and recover from incidents
- Ensuring systems behave predictably under both normal and adverse conditions
This role provides the opportunity to influence how reliability is engineered across the platform, working closely with teams globally to improve system behaviour and design.
What You’ll Do
- Build and maintain production-grade software supporting Bloomberg’s global distribution infrastructure
- Design and implement scalable, fault-tolerant systems with a focus on observability, performance, and automation
- Analyse system behaviour under real-world and failure scenarios to validate that capacity, failover, and recovery meet resilience objectives
- Identify bottlenecks, scaling limits, and reliability risks across distributed systems
- Improve detection, diagnosis, and prevention of production issues
- Build tools and frameworks to increase system visibility and reduce time to detect and resolve incidents
- Automate operational workflows to reduce manual effort and improve system reliability
- Partner with application and infrastructure teams to improve system design, resilience, and performance
- Contribute to design discussions, incident reviews, and reliability improvements across the platform
Systems You’ll Work With
- Configuration systems serving thousands of servers across the global network
- Service discovery and clustering systems for distributed infrastructure
- Monitoring and observability frameworks for large-scale server estates
- Tooling for diagnosing data quality and distribution issues
Ownership of systems may evolve over time as the team focuses on areas of highest impact.
What Success Looks Like
- Systems consistently meet defined reliability, latency, and capacity objectives
- Issues are detected and mitigated before significant customer impact
- Systems are demonstrably resilient, with proven failover capability and sufficient capacity under failure conditions
- Operational processes are automated and scalable
- Reliability is achieved through engineering improvements rather than manual intervention
What We’re Looking For
We're not a traditional SRE team. We engineer reliability through software, building solutions that automate operations and improve system resilience by design.
- Experience with an object-oriented programming language (preferably Python or C++)
- Strong focus on building reliable, observable distributed systems
- Experience working with SLOs, SLIs, and production reliability metrics
- Proven ability to triage and resolve live production problems
- A mindset focused on automation and reducing operational toil
- Strength in collaborating within an inclusive team environment
- The ability to work across departments and build strong relationships with both technical and non-technical partners
Why Join Us
You’ll work on systems that sit at the core of Bloomberg’s real-time data platform, operating at global scale and under demanding performance and reliability requirements.
This is an opportunity to:
- Solve complex distributed systems problems with real-world impact
- Influence how reliability is engineered across a critical platform
- Work with teams across multiple regions and technical domains
- Build systems that are resilient by design and operate at massive scale