At a Glance
- Tasks: Build and scale reliability for our AI cloud platform and GPU Compute environments.
- Company: Join Wayve, a pioneering tech company at the forefront of AI and cloud infrastructure.
- Benefits: Competitive salary, flexible working hours, and opportunities for professional growth.
- Other info: Be part of a founding team, with potential leadership opportunities as you grow.
- Why this job: Shape the future of AI by ensuring robust and efficient cloud systems.
- Qualifications: Experience in SRE roles, strong Kubernetes skills, and proficiency in scripting languages.
The predicted salary is between €70,000 and €90,000 per year.
In order to set you up for success as a Cloud Site Reliability Engineer at Wayve, we’re looking for the following skills and experience:
- Proven experience in an SRE, Production Engineer, or Cloud Reliability role supporting large-scale cloud systems
- Strong Kubernetes experience, including operating production clusters
- Hands-on experience running production workloads in AWS, GCP, or Azure
- Experience operating complex distributed systems in production, ideally including compute-heavy or high-performance workloads
- Experience working with large compute clusters; exposure to AI/ML training or inference workloads strongly preferred
- Strong Linux fundamentals and proficiency in at least one scripting or systems language (e.g. Python, Go, C++) with a bias toward automation
- Deep troubleshooting skills across networking, storage, distributed systems, and performance at scale
- Experience designing and operating observability stacks (e.g. Datadog, Prometheus, Grafana, OpenTelemetry)
- Clear communication skills, including leading incidents, writing postmortems, and influencing teams to prioritise reliability improvements
- (Desirable) Experience operating GPU-backed environments or large-scale ML infrastructure
- (Desirable) Experience running model training or inference pipelines in production (MLOps)
- (Desirable) Familiarity with infrastructure-as-code (e.g. Terraform) and secure cloud production environments
- (Desirable) Experience defining and running SLOs/SLIs and building reliability programs across multiple teams
- (Desirable) Experience as an early or founding SRE hire establishing processes from scratch
- (Desirable) Interest in helping shape and grow a Cloud SRE function, with potential to take on leadership responsibilities over time
As a Cloud Site Reliability Engineer at Wayve, you will build and scale the reliability foundations of our AI cloud platform. This includes our Model Development Platform (powering end-to-end model development from raw data to on-road experimentation) and our GPU Compute platform (large-scale, multi-tenant GPU fleets and scheduling systems driving model training and inference at scale).
This is a founding Cloud SRE role. You won’t inherit a mature SRE function; you’ll help create it. You will define the frameworks, automation, and operational standards that ensure our model development infrastructure, distributed systems, and large compute clusters operate predictably, efficiently, and at scale.
This role sits at the intersection of AI research, large-scale cloud infrastructure, and production operations. Your work will directly enable faster model training, reliable experimentation, and scalable AI deployment by ensuring our cloud infrastructure is resilient and performant.
Reliability & Platform Ownership
- Own the reliability, availability, and performance of the Model Dev Platform and GPU Compute environments
- Define and operationalise SLOs, SLIs, and error budgets across platform services
- Improve capacity planning, scaling strategies, and resource efficiency across large GPU-backed clusters
- Partner with ML, platform, and software teams to establish clear production readiness standards
Incident Response & On-Call
- Participate in a 24/7 on-call rotation as first-line response for cloud and cluster-related incidents
- Lead incident triage, escalation, communications, and root cause analysis
- Translate post-incident learning into durable architectural or automation improvements
- Continuously reduce alert noise and recurring operational burden
Observability & Operational Excellence
- Design and operate monitoring, logging, tracing, and alerting systems that enable rapid detection and recovery
- Build dashboards that reflect real user-centric platform health (not just infrastructure metrics)
- Improve deployment safety through better change management, validation, and rollback mechanisms
Automation & Tooling
- Build automation for cluster operations, training workflows, remediation, and scaling tasks
- Implement self-healing patterns and resilient recovery workflows
- Harden CI/CD and release processes to improve deployment safety and velocity
- Support infrastructure-as-code and policy-driven guardrails to ensure secure, reliable cloud environments
Senior Cloud Site Reliability Engineer (AI/ML Platform & GPU Compute) in London employer: Deepstreamtech
At Wayve, we pride ourselves on being an innovative employer that fosters a collaborative and dynamic work culture, particularly for our Senior Cloud Site Reliability Engineers. Located in a vibrant tech hub, we offer competitive benefits, opportunities for professional growth, and the chance to shape the future of AI infrastructure while working alongside industry leaders. Join us to not only advance your career but also contribute to groundbreaking projects that redefine the capabilities of cloud technology.
StudySmarter Expert Advice🤫
We think this is how you could land the Senior Cloud Site Reliability Engineer (AI/ML Platform & GPU Compute) role in London
✨Tip Number 1
Network, network, network! Get out there and connect with folks in the industry. Attend meetups, webinars, or even local tech events. You never know who might have a lead on that perfect Cloud SRE role!
✨Tip Number 2
Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to Kubernetes, AWS, or AI/ML. This gives potential employers a taste of what you can do and sets you apart from the crowd.
✨Tip Number 3
Prepare for interviews by brushing up on your troubleshooting skills. Be ready to discuss real-world scenarios where you've tackled complex distributed systems issues. Practice explaining your thought process clearly; communication is key in this role!
✨Tip Number 4
Don’t forget to apply through our website! We’re always on the lookout for passionate individuals who want to help shape our Cloud SRE function. Your next big opportunity could be just a click away!
Some tips for your application 🫡
Tailor Your CV: Make sure your CV highlights your experience in SRE or Production Engineering, especially with large-scale cloud systems. We want to see your Kubernetes skills and any hands-on work you've done with AWS, GCP, or Azure!
Showcase Your Projects: Include specific examples of projects where you operated complex distributed systems or worked with AI/ML workloads. This is your chance to shine, so let us know how you’ve tackled challenges in production environments.
Communicate Clearly: Since clear communication is key for this role, make sure your application reflects that. Whether it’s leading incidents or writing postmortems, we want to see how you’ve influenced teams to prioritise reliability improvements.
Apply Through Our Website: Don’t forget to apply through our website! It’s the best way for us to receive your application and get you into our system. We can’t wait to see what you bring to the table!
How to prepare for a job interview at Deepstreamtech
✨Know Your Cloud Platforms
Make sure you brush up on your knowledge of AWS, GCP, and Azure. Be ready to discuss your hands-on experience with these platforms, especially in relation to running production workloads and managing large compute clusters.
✨Show Off Your Kubernetes Skills
Kubernetes is a big deal for this role, so be prepared to talk about your experience operating production clusters. Think of specific examples where you've tackled challenges or optimised performance in a Kubernetes environment.
✨Demonstrate Your Troubleshooting Prowess
Prepare to showcase your deep troubleshooting skills. Have examples ready that highlight how you've resolved issues across networking, storage, and distributed systems, particularly in high-performance workloads.
✨Communicate Clearly and Confidently
Since clear communication is key, practice articulating your thoughts on incident response and postmortems. Be ready to explain how you've influenced teams to prioritise reliability improvements in past roles.