Staff Cloud Site Reliability Engineer

Full-Time 60000 - 80000 € / year (est.) Hybrid (2 days a week in the office)

At a Glance

  • Tasks: Build and scale the reliability foundations of our AI cloud platform.
  • Company: Wayve, a leader in Embodied AI technology with a diverse and inclusive culture.
  • Benefits: Hybrid working policy, competitive salary, and opportunities for professional growth.
  • Other info: Dynamic environment with potential for leadership roles and impactful contributions.
  • Why this job: Join us to shape the future of automated driving with cutting-edge AI technology.
  • Qualifications: Experience in SRE or Cloud Reliability roles, strong Kubernetes skills, and a passion for automation.

The predicted salary is between 60000 and 80000 € per year.

At Wayve we’re committed to creating a diverse, fair and respectful culture that is inclusive of everyone based on their unique skills and perspectives. Founded in 2017, Wayve is the leading developer of Embodied AI technology. Our advanced AI software and foundation models enable vehicles to perceive, understand, and navigate any complex environment, enhancing the usability and safety of automated driving systems. Our vision is to create autonomy that propels the world forward.

The role

As a Cloud Site Reliability Engineer at Wayve, you will build and scale the reliability foundations of our AI cloud platform. This includes our Model Development Platform and our GPU Compute platform. This is a founding Cloud SRE role where you will define the frameworks, automation, and operational standards that ensure our model development infrastructure operates predictably, efficiently, and at scale.

Key responsibilities:

  • Reliability & Platform Ownership: Own the reliability, availability, and performance of the Model Dev Platform and GPU Compute environments. Define and operationalise SLOs, SLIs, and error budgets across platform services.
  • Incident Response & On-Call: Participate in a 24/7 on-call rotation as first-line response for cloud and cluster-related incidents. Lead incident triage, escalation, communications, and root cause analysis.
  • Observability & Operational Excellence: Design and operate monitoring, logging, tracing, and alerting systems that enable rapid detection and recovery.
  • Automation & Tooling: Build automation for cluster operations, training workflows, remediation, and scaling tasks.

About you

To set you up for success as a Cloud Site Reliability Engineer at Wayve, we’re looking for the following skills and experience:

Essential skills:

  • Proven experience in an SRE, Production Engineer, or Cloud Reliability role supporting large-scale cloud systems.
  • Strong Kubernetes experience, including operating production clusters.
  • Hands-on experience running production workloads in AWS, GCP, or Azure.
  • Experience operating complex distributed systems in production.
  • Strong Linux fundamentals and proficiency in at least one scripting or systems language (e.g., Python, Go, C++).
  • Deep troubleshooting skills across networking, storage, distributed systems, and performance at scale.
  • Clear communication skills, including leading incidents and writing post-mortems.

Desirable skills:

  • Experience operating GPU-backed environments or large-scale ML infrastructure.
  • Experience running model training or inference pipelines in production.
  • Familiarity with infrastructure-as-code (e.g., Terraform).

This is a full-time role based in our office in London (2 days a week in the office). At Wayve we want the best of all worlds so we operate a hybrid working policy that combines time together in our offices and workshops to fuel innovation, culture, relationships and learning, and time spent working from home.

Staff Cloud Site Reliability Engineer employer: Icehouseventures

Wayve is an exceptional employer that champions a diverse and inclusive culture, fostering an environment where every employee's unique skills and perspectives are valued. With a commitment to innovation in AI technology, employees have the opportunity to work on groundbreaking projects while enjoying a hybrid working model that balances collaboration in our vibrant London office with the flexibility of remote work. At Wayve, your contributions directly impact the future of automated driving, and we prioritise continuous learning and professional growth, making it a rewarding place to advance your career.

Contact Detail:

Icehouseventures Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Staff Cloud Site Reliability Engineer role

Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with Wayve employees on LinkedIn. A personal touch can make all the difference when it comes to landing that interview.

Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to cloud systems or AI. This gives you a chance to demonstrate your expertise beyond just a CV.

Tip Number 3

Prepare for the interview by diving deep into Wayve’s tech stack and recent projects. Familiarise yourself with their AI cloud platform and think about how your experience aligns with their goals. It’ll show you’re genuinely interested!

Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen. Plus, it shows you’re keen on being part of the Wayve team right from the start.

We think you need these skills to ace the Staff Cloud Site Reliability Engineer role

Site Reliability Engineering (SRE)
Kubernetes
AWS
GCP
Azure
Distributed Systems
Linux Fundamentals

Some tips for your application 🫡

Tailor Your Application: Make sure to customise your CV and cover letter for the Cloud Site Reliability Engineer role. Highlight your relevant experience with cloud systems, Kubernetes, and any AI/ML projects you've worked on. We want to see how your unique skills align with our mission!

Showcase Your Problem-Solving Skills: In your application, share examples of how you've tackled complex challenges in previous roles. We love candidates who can demonstrate their troubleshooting abilities and innovative thinking, especially in high-pressure situations.

Be Clear and Concise: When writing your application, keep it straightforward and to the point. Use clear language to describe your experiences and achievements. We appreciate a well-structured application that makes it easy for us to see your potential!

Apply Through Our Website: We encourage you to submit your application directly through our website. This way, you’ll ensure it reaches the right people and you’ll get a feel for our culture and values. Plus, it’s super easy to do!

How to prepare for a job interview at Icehouseventures

Know Your Stuff

Make sure you brush up on your SRE fundamentals, especially around Kubernetes and cloud platforms like AWS, GCP, or Azure. Be ready to discuss your hands-on experience with production workloads and how you've tackled complex distributed systems.
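
As a quick refresher on one of those fundamentals, the error budgets named in the key responsibilities come down to simple arithmetic: the budget is the share of a time window that an availability SLO allows you to be down. The sketch below is a minimal illustration, not Wayve's method. The 99.9% target, the 30-day window, and the function names `error_budget` and `budget_remaining` are all assumptions chosen for the example.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO permits over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> float:
    """Return the unspent fraction of the budget (negative once overspent)."""
    budget = error_budget(slo, window)
    if budget == timedelta(0):
        # A 100% SLO leaves no budget: any downtime overspends it.
        return 0.0 if downtime == timedelta(0) else float("-inf")
    return 1.0 - downtime / budget


if __name__ == "__main__":
    window = timedelta(days=30)  # illustrative window, not from the posting
    slo = 0.999                  # illustrative target, not from the posting
    print(error_budget(slo, window))
    print(budget_remaining(slo, window, timedelta(minutes=10)))
```

With these assumed values the budget works out to roughly 43 minutes of downtime per 30 days, so a single 10-minute incident spends a little under a quarter of it. Being able to walk through this kind of calculation makes it easier to explain how you would trade release velocity against reliability.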

Showcase Your Problem-Solving Skills

Prepare to share specific examples of incidents you've managed, including how you led triage and root cause analysis. Highlight any improvements you've made post-incident to show your proactive approach to reliability.

Demonstrate Your Automation Mindset

Talk about your experience with automation in cluster operations and CI/CD processes. Be ready to discuss how you've implemented self-healing patterns or improved deployment safety, as this is crucial for the role.

Communicate Clearly

Since clear communication is key, practice explaining technical concepts in a way that's easy to understand. Be prepared to discuss how you've influenced teams to prioritise reliability improvements and how you handle incident communications.