Senior Software Engineer/SRE - BQL Reliability Engineering
Bloomberg
Software Engineering
London, UK
Posted on Mar 19, 2026
Bloomberg runs on data. It’s our business and our product.
BQL is the single API for all client-facing structured data at Bloomberg. As Bloomberg Terminal is reimagined for the age of AI, products like Bloomberg Assistant primarily use BQL to power AI’s data retrieval capabilities. In a similar way, BQL is used by products like BQuant for financial markets analysis, research, and modeling. In terms of scale, BQL handles ~100 million requests hourly from ~100K active firms.
The BQL Platform Observability team owns the reliability of the BQL platform, using observability as its primary lever. We ensure that the BQL ecosystem—spanning workload management layers, the query engine, and hundreds of data providers—operates predictably and at scale across diverse workloads, from low-latency queries powering Terminal screens to long-running quantitative analytics. Our mission is to make platform behavior transparent and measurable by building robust telemetry, instrumentation, and diagnostic capabilities that provide real-time and historical insight into system performance and usage.
A core focus of the team is reducing Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) by improving signal quality, strengthening alerting, and enabling rapid root cause analysis. We also embed reliability as a first-class concern throughout the Software Development Lifecycle—driving instrumentation standards, defining SLIs and SLOs, influencing design reviews, and ensuring production learnings continuously improve engineering practices. By institutionalizing observability and reliability across the platform, we help teams build and operate BQL with confidence at scale.
What You’ll Do:
As part of the BQL Platform Observability team, you will focus on improving the reliability, resilience, and transparency of BQL and the services it depends on. You’ll be trusted to:
- Build and evolve solutions to monitor infrastructure and process health, identifying trends and anomalies at scale
- Work with stakeholders to formulate Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for BQL services and their dependencies, monitor error budgets, and provide insights for reliability improvements
- Reduce SEV MTTR by building tools that quickly show incident owners what is not working
- Drive resiliency across the org by contributing to system design and increasing adoption of experimentation practices such as load testing, A/B testing, and canary releases
- Partner closely with engineering teams to help build observable systems from day one, and close gaps in what cannot currently be seen or measured
What You’ll Need:
- Proven experience in a relevant role (Reliability Engineering or Software Development)
- Strong knowledge of UNIX or Linux systems running distributed application platforms
- Hands-on experience with at least one programming language (e.g., Python, Java, JavaScript) beyond basic scripting
- Demonstrated experience managing the performance, availability, and scalability of mid- to large-sized systems
- Hands-on experience with production deployment and release management
- Energy, self-motivation, and the ability to manage multiple tasks in a global, collaborative environment
- BA, BS, MS, or PhD in Computer Science, Engineering, or a related technical field
We’d Love to See Experience With:
- Operating in regulated or highly controlled environments
- Applying statistical methods to solve real-world business problems
- Querying and analyzing large-scale datasets within enterprise data warehouse environments
- Querying and manipulating time series data
- Designing, executing, and analyzing A/B tests and other controlled experiments