Senior DevOps Engineer (AI + Azure)
EY
About Us
At EY wavespace Madrid - AI & Data Hub, we are a diverse, multicultural team at the forefront of technological innovation, working with cutting-edge technologies such as generative AI, data analytics, and robotics. Our center is dedicated to exploring the future of AI and data.
Overview
We’re looking for a Senior DevOps Engineer to build and run cloud and AI infrastructure at scale. You’ll own IaC with Terraform, CI/CD, Kubernetes, and Linux. You’ll also help run LLM workloads both in Azure and locally (Ollama/vLLM/llama.cpp). Your work will enable fast, secure, repeatable delivery.
Key responsibilities
- Build and maintain Azure infrastructure with Terraform (modules, workspaces, pipelines, policies).
- Design and operate CI/CD with GitHub Actions and/or Azure DevOps (multi-stage, approvals, environments).
- Run containers and Kubernetes/AKS (Helm, ingress, autoscaling, node pools, storage).
- Manage AI/LLM runtimes: local model runners (Ollama, vLLM, llama.cpp) and GPU/CPU configurations.
- Support RAG: embeddings pipelines, vector DBs (Azure AI Search/Cognitive Search, pgvector, Milvus), data sync, retention.
- Automate platform tasks with Python (tooling, CLI utilities, API glue, ops scripts).
- Implement observability (Azure Monitor, Prometheus/Grafana, logs/traces/metrics, alerts, runbooks, SLOs).
- Apply Zero Trust security: enforce least privilege, role-based access control (RBAC), and identity-based segmentation (Azure AD, Conditional Access, MFA).
- Implement policy-as-code (OPA, Azure Policy) for compliance.
- Rotate secrets and certificates via Key Vault; integrate with pipelines.
- Add continuous security scanning (SAST/DAST, container image scanning).
- Handle reliability: rollout strategies, health probes, incident response, postmortems.
- Optimize costs: right-sizing, autoscaling, budgets, tags, reporting.
Key requirements
- 4+ years in DevOps/SRE/Platform Engineering.
- Strong Linux (shell, systemd, networking, performance troubleshooting).
- Terraform at scale (modules, state backends, CI/CD integration).
- Deep Azure experience (AKS, VNets, Key Vault, Storage, Monitor, Identity, Networking).
- CI/CD expertise (GitHub Actions and/or Azure DevOps).
- Containers and Kubernetes in production.
- Python or scripting for automation (solid scripting and tooling; not full-time app dev).
- Hands-on with LLM setups (local runners or Azure OpenAI), embeddings, vector indexes, and RAG basics.
Nice to have
- Multi-cloud exposure (AWS / GCP).
- Azure AI services (Azure OpenAI, Azure AI Search/Cognitive Search).
- GitOps (Argo CD/Flux), Helm packaging, OCI registries.
- Eventing/queues (Event Grid, Service Bus, Kafka).
- Security/compliance in cloud (CIS, NIST, Microsoft CAF).
- Certifications: AZ-104, AZ-204, AZ-400, AI-900, HashiCorp Terraform Associate, CKA/CKAD.
- Experience with GPU nodes, drivers, CUDA/ROCm, or CPU-only optimizations for LLMs.
How we work
- Everything as code. PRs, reviews, and tests.
- Small batches. Trunk-based or short-lived branches.
- Clear runbooks and on-call rotation where needed.
- Measure, alert, fix, and improve.
Our commitment to diversity & inclusion
We are genuinely passionate about inclusion and support individuals of all groups. We do not discriminate on the basis of race, religion, gender, sexual orientation, or disability status.