Senior Cloud SRE - AI/ML Platform & GPU Compute

Full-Time | £60,000 - £80,000 / year (est.) | Hybrid (2 days a week in the London office)

At a Glance

Tasks: Build and scale the reliability of our cutting-edge AI cloud platform.
Company: Join Wayve, a leader in Embodied AI technology with a diverse and inclusive culture.
Benefits: Enjoy a competitive salary, hybrid work model, and opportunities for professional growth.
Other info: Be part of a dynamic team shaping the future of AI and cloud infrastructure.
Why this job: Make a real impact on the future of automated driving with innovative AI solutions.
Qualifications: Experience in SRE or Cloud Reliability roles, strong Kubernetes skills, and a passion for automation.

The predicted salary is between £60,000 and £80,000 per year.

At Wayve we’re committed to creating a diverse, fair and respectful culture that is inclusive of everyone based on their unique skills and perspectives. Founded in 2017, Wayve is the leading developer of Embodied AI technology. Our advanced AI software and foundation models enable vehicles to perceive, understand, and navigate any complex environment, enhancing the usability and safety of automated driving systems. Our vision is to create autonomy that propels the world forward.

The role

As a Cloud Site Reliability Engineer at Wayve, you will build and scale the reliability foundations of our AI cloud platform. This includes our Model Development Platform and our GPU Compute platform. This is a founding Cloud SRE role where you will define the frameworks, automation, and operational standards that ensure our model development infrastructure operates predictably, efficiently, and at scale.

Key responsibilities:

Reliability & Platform Ownership: Own the reliability, availability, and performance of the Model Dev Platform and GPU Compute environments. Define and operationalise SLOs, SLIs, and error budgets across platform services.
Incident Response & On-Call: Participate in a 24/7 on-call rotation as first-line response for cloud and cluster-related incidents. Lead incident triage, escalation, communications, and root cause analysis.
Observability & Operational Excellence: Design and operate monitoring, logging, tracing, and alerting systems that enable rapid detection and recovery.
Automation & Tooling: Build automation for cluster operations, training workflows, remediation, and scaling tasks.

About you

In order to set you up for success as a Cloud Site Reliability Engineer at Wayve, we’re looking for the following skills and experience.

Essential skills:

Proven experience in an SRE, Production Engineer, or Cloud Reliability role supporting large-scale cloud systems.
Strong Kubernetes experience, including operating production clusters.
Hands-on experience running production workloads in AWS, GCP, or Azure.
Experience operating complex distributed systems in production.
Strong Linux fundamentals and proficiency in at least one scripting or systems language (e.g., Python, Go, C++).
Deep troubleshooting skills across networking, storage, distributed systems, and performance at scale.
Experience designing and operating observability stacks.
Clear communication skills, including leading incidents and writing post-mortems.

Desirable skills:

Experience operating GPU-backed environments or large-scale ML infrastructure.
Experience running model training or inference pipelines in production.
Familiarity with infrastructure-as-code.
Interest in helping shape and grow a Cloud SRE function.

This is a full-time role based in our office in London (2 days a week in the office). At Wayve we want the best of all worlds so we operate a hybrid working policy that combines time together in our offices and workshops to fuel innovation, culture, relationships and learning, and time spent working from home.

Contact Details:

Icehouseventures Recruitment Team

We think you need these skills to ace Senior Cloud SRE - AI/ML Platform & GPU Compute

Site Reliability Engineering (SRE)

Kubernetes

AWS

GCP

Azure

Distributed Systems

AI/ML Workloads

Linux Fundamentals

Python

C++

Troubleshooting Skills

Observability Stacks (e.g., Datadog, Prometheus, Grafana, OpenTelemetry)

Communication Skills

Infrastructure-as-Code (e.g., Terraform)
