Senior Cloud Site Reliability Engineer (AI/ML Platform & GPU Compute) in London

London | Full-Time | 70000 - 90000 € / year (est.) | Home office (partial)

At a Glance

Tasks: Build and scale reliability for our AI cloud platform and GPU Compute environments.
Company: Join Wayve, a pioneering tech company at the forefront of AI and cloud infrastructure.
Benefits: Competitive salary, flexible working hours, and opportunities for professional growth.
Other info: Be part of a founding team, with potential leadership opportunities as you grow.
Why this job: Shape the future of AI by ensuring robust and efficient cloud systems.
Qualifications: Experience in SRE roles, strong Kubernetes skills, and proficiency in scripting languages.

The predicted salary is between 70000 and 90000 € per year.

In order to set you up for success as a Cloud Site Reliability Engineer at Wayve, we’re looking for the following skills and experience:

Proven experience in an SRE, Production Engineer, or Cloud Reliability role supporting large-scale cloud systems
Strong Kubernetes experience, including operating production clusters
Hands-on experience running production workloads in AWS, GCP, or Azure
Experience operating complex distributed systems in production, ideally including compute-heavy or high-performance workloads
Experience working with large compute clusters; exposure to AI/ML training or inference workloads strongly preferred
Strong Linux fundamentals and proficiency in at least one scripting or systems language (e.g. Python, Go, C++) with a bias toward automation
Deep troubleshooting skills across networking, storage, distributed systems, and performance at scale
Experience designing and operating observability stacks (e.g. Datadog, Prometheus, Grafana, OpenTelemetry)
Clear communication skills, including leading incidents, writing postmortems, and influencing teams to prioritise reliability improvements
(Desirable) Experience operating GPU-backed environments or large-scale ML infrastructure
(Desirable) Experience running model training or inference pipelines in production (MLOps)
(Desirable) Familiarity with infrastructure-as-code (e.g. Terraform) and secure cloud production environments
(Desirable) Experience defining and running SLOs/SLIs and building reliability programs across multiple teams
(Desirable) Experience as an early or founding SRE hire establishing processes from scratch
(Desirable) Interest in helping shape and grow a Cloud SRE function, with potential to take on leadership responsibilities over time

As a Cloud Site Reliability Engineer at Wayve, you will build and scale the reliability foundations of our AI cloud platform. This includes our Model Development Platform (powering end-to-end model development from raw data to on-road experimentation) and our GPU Compute platform (large-scale, multi-tenant GPU fleets and scheduling systems driving model training and inference at scale).

This is a founding Cloud SRE role. You won’t inherit a mature SRE function; you’ll help create it. You will define the frameworks, automation, and operational standards that ensure our model development infrastructure, distributed systems, and large compute clusters operate predictably, efficiently, and at scale.

This role sits at the intersection of AI research, large-scale cloud infrastructure, and production operations. Your work will directly enable faster model training, reliable experimentation, and scalable AI deployment by ensuring our cloud infrastructure is resilient and performant.

Reliability & Platform Ownership

Own the reliability, availability, and performance of the Model Dev Platform and GPU Compute environments
Define and operationalise SLOs, SLIs, and error budgets across platform services
Improve capacity planning, scaling strategies, and resource efficiency across large GPU-backed clusters
Partner with ML, platform, and software teams to establish clear production readiness standards

Incident Response & On-Call

Participate in a 24/7 on-call rotation as first-line response for cloud and cluster-related incidents
Lead incident triage, escalation, communications, and root cause analysis
Translate post-incident learning into durable architectural or automation improvements
Continuously reduce alert noise and recurring operational burden

Observability & Operational Excellence

Design and operate monitoring, logging, tracing, and alerting systems that enable rapid detection and recovery
Build dashboards that reflect real user-centric platform health (not just infrastructure metrics)
Improve deployment safety through better change management, validation, and rollback mechanisms

Automation & Tooling

Build automation for cluster operations, training workflows, remediation, and scaling tasks
Implement self-healing patterns and resilient recovery workflows
Harden CI/CD and release processes to improve deployment safety and velocity
Support infrastructure-as-code and policy-driven guardrails to ensure secure, reliable cloud environments

Employer: Deepstreamtech

At Wayve, we pride ourselves on being an innovative employer that fosters a collaborative and dynamic work culture, particularly for our Senior Cloud Site Reliability Engineers. Located in a vibrant tech hub, we offer competitive benefits, opportunities for professional growth, and the chance to shape the future of AI infrastructure while working alongside industry leaders. Join us to not only advance your career but also contribute to groundbreaking projects that redefine the capabilities of cloud technology.

Contact Details:

Deepstreamtech Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Senior Cloud Site Reliability Engineer (AI/ML Platform & GPU Compute) role in London

✨Tip Number 1

Network, network, network! Get out there and connect with folks in the industry. Attend meetups, webinars, or even local tech events. You never know who might have a lead on that perfect Cloud SRE role!

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to Kubernetes, AWS, or AI/ML. This gives potential employers a taste of what you can do and sets you apart from the crowd.

✨Tip Number 3

Prepare for interviews by brushing up on your troubleshooting skills. Be ready to discuss real-world scenarios where you've tackled complex distributed systems issues. Practice explaining your thought process clearly; communication is key in this role!

✨Tip Number 4

Don’t forget to apply through our website! We’re always on the lookout for passionate individuals who want to help shape our Cloud SRE function. Your next big opportunity could be just a click away!

We think you need these skills to ace the Senior Cloud Site Reliability Engineer (AI/ML Platform & GPU Compute) role in London

Site Reliability Engineering (SRE)
Kubernetes
AWS
GCP
Azure
Distributed Systems
AI/ML Workloads
Linux Fundamentals
Python
C++
Troubleshooting Skills
Observability Stacks (Datadog, Prometheus, Grafana, OpenTelemetry)
Communication Skills
Infrastructure-as-Code (Terraform)
MLOps

Some tips for your application 🫡

Tailor Your CV: Make sure your CV highlights your experience in SRE or Production Engineering, especially with large-scale cloud systems. We want to see your Kubernetes skills and any hands-on work you've done with AWS, GCP, or Azure!

Showcase Your Projects: Include specific examples of projects where you operated complex distributed systems or worked with AI/ML workloads. This is your chance to shine, so let us know how you’ve tackled challenges in production environments.

Communicate Clearly: Since clear communication is key for this role, make sure your application reflects that. Whether it’s leading incidents or writing postmortems, we want to see how you’ve influenced teams to prioritise reliability improvements.

Apply Through Our Website: Don’t forget to apply through our website! It’s the best way for us to receive your application and get you into our system. We can’t wait to see what you bring to the table!

How to prepare for a job interview at Deepstreamtech

✨Know Your Cloud Platforms

Make sure you brush up on your knowledge of AWS, GCP, and Azure. Be ready to discuss your hands-on experience with these platforms, especially in relation to running production workloads and managing large compute clusters.

✨Show Off Your Kubernetes Skills

Kubernetes is a big deal for this role, so be prepared to talk about your experience operating production clusters. Think of specific examples where you've tackled challenges or optimised performance in a Kubernetes environment.

✨Demonstrate Your Troubleshooting Prowess

Prepare to showcase your deep troubleshooting skills. Have examples ready that highlight how you've resolved issues across networking, storage, and distributed systems, particularly in high-performance workloads.

✨Communicate Clearly and Confidently

Since clear communication is key, practice articulating your thoughts on incident response and postmortems. Be ready to explain how you've influenced teams to prioritise reliability improvements in past roles.
