Senior Cloud Site Reliability Engineer (AI/ML Platform & GPU Compute)

Full-Time, 70000 - 90000 € / year (est.), no home office possible
Deepstreamtech

At a Glance

  • Tasks: Build and scale reliability for our AI cloud platform and GPU compute environments.
  • Company: Wayve, a pioneering tech company at the forefront of AI and cloud infrastructure.
  • Benefits: Competitive salary, flexible working hours, and opportunities for professional growth.
  • Other info: Join a dynamic team and take on leadership responsibilities as you grow.
  • Why this job: Shape the future of AI reliability while working with cutting-edge technology.
  • Qualifications: Experience in SRE roles, strong Kubernetes skills, and proficiency in scripting languages.

The predicted salary is between 70000 and 90000 € per year.

In order to set you up for success as a Cloud Site Reliability Engineer at Wayve, we’re looking for the following skills and experience:

  • Proven experience in an SRE, Production Engineer, or Cloud Reliability role supporting large-scale cloud systems
  • Strong Kubernetes experience, including operating production clusters
  • Hands-on experience running production workloads in AWS, GCP, or Azure
  • Experience operating complex distributed systems in production, ideally including compute-heavy or high-performance workloads
  • Experience working with large compute clusters; exposure to AI/ML training or inference workloads strongly preferred
  • Strong Linux fundamentals and proficiency in at least one scripting or systems language (e.g. Python, Go, C++) with a bias toward automation
  • Deep troubleshooting skills across networking, storage, distributed systems, and performance at scale
  • Experience designing and operating observability stacks (e.g. Datadog, Prometheus, Grafana, OpenTelemetry)
  • Clear communication skills, including leading incidents, writing postmortems, and influencing teams to prioritise reliability improvements

Desirable:

  • Experience operating GPU-backed environments or large-scale ML infrastructure
  • Experience running model training or inference pipelines in production (MLOps)
  • Familiarity with infrastructure-as-code (e.g. Terraform) and secure cloud production environments
  • Experience defining and running SLOs/SLIs and building reliability programs across multiple teams
  • Experience as an early or founding SRE hire establishing processes from scratch
  • Interest in helping shape and grow a Cloud SRE function, with potential to take on leadership responsibilities over time

What the job involves:

As a Cloud Site Reliability Engineer at Wayve, you will build and scale the reliability foundations of our AI cloud platform. This includes our Model Development Platform (powering end-to-end model development from raw data to on-road experimentation) and our GPU Compute platform (large-scale, multi-tenant GPU fleets and scheduling systems driving model training and inference at scale).

This is a founding Cloud SRE role. You won’t inherit a mature SRE function; you’ll help create it. You will define the frameworks, automation, and operational standards that ensure our model development infrastructure, distributed systems, and large compute clusters operate predictably, efficiently, and at scale.

This role sits at the intersection of AI research, large-scale cloud infrastructure, and production operations. Your work will directly enable faster model training, reliable experimentation, and scalable AI deployment by ensuring our cloud infrastructure is resilient and performant.

Reliability & Platform Ownership:

  • Own the reliability, availability, and performance of the Model Dev Platform and GPU Compute environments
  • Define and operationalise SLOs, SLIs, and error budgets across platform services
  • Improve capacity planning, scaling strategies, and resource efficiency across large GPU-backed clusters
  • Partner with ML, platform, and software teams to establish clear production readiness standards

Incident Response & On-Call:

  • Participate in a 24/7 on-call rotation as first-line response for cloud and cluster-related incidents
  • Lead incident triage, escalation, communications, and root cause analysis
  • Translate post-incident learning into durable architectural or automation improvements
  • Continuously reduce alert noise and recurring operational burden

Observability & Operational Excellence:

  • Design and operate monitoring, logging, tracing, and alerting systems that enable rapid detection and recovery
  • Build dashboards that reflect real user-centric platform health (not just infrastructure metrics)
  • Improve deployment safety through better change management, validation, and rollback mechanisms

Automation & Tooling:

  • Build automation for cluster operations, training workflows, remediation, and scaling tasks
  • Implement self-healing patterns and resilient recovery workflows
  • Harden CI/CD and release processes to improve deployment safety and velocity
  • Support infrastructure-as-code and policy-driven guardrails to ensure secure, reliable cloud environments

Senior Cloud Site Reliability Engineer (AI/ML Platform & GPU Compute) employer: Deepstreamtech

At Wayve, we pride ourselves on being an innovative employer that fosters a collaborative and dynamic work culture, particularly for our Senior Cloud Site Reliability Engineers. Located in a vibrant tech hub, we offer competitive benefits, opportunities for professional growth, and the chance to shape the future of AI cloud infrastructure. Join us to be part of a pioneering team where your contributions directly impact cutting-edge technology and drive meaningful advancements in AI and machine learning.

Contact Detail:

Deepstreamtech Recruiting Team

StudySmarter Expert Advice🤫

We think this is how you could land the Senior Cloud Site Reliability Engineer (AI/ML Platform & GPU Compute) role

Tip Number 1

Network, network, network! Get out there and connect with folks in the industry. Attend meetups, webinars, or even just grab a coffee with someone who’s already in the role you want. You never know who might have the inside scoop on job openings!

Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to cloud systems, Kubernetes, or AI/ML. This gives potential employers a tangible look at what you can do and sets you apart from the crowd.

Tip Number 3

Prepare for interviews by brushing up on your troubleshooting skills. Be ready to discuss real-world scenarios where you’ve solved complex problems in production environments. Practice articulating your thought process clearly; communication is key in these roles!

Tip Number 4

Don’t forget to apply through our website! We’re always on the lookout for passionate individuals who want to help shape our Cloud SRE function. Your application will get the attention it deserves when you come directly to us!

We think you need these skills to ace the Senior Cloud Site Reliability Engineer (AI/ML Platform & GPU Compute) role

Cloud Reliability Engineering
Kubernetes
AWS
GCP
Azure
Distributed Systems
AI/ML Workloads

Some tips for your application 🫡

Tailor Your CV: Make sure your CV highlights your experience in SRE or Production Engineering roles. We want to see how your skills align with our needs, especially in Kubernetes and cloud systems.

Showcase Your Projects: Include specific examples of projects where you've operated large-scale cloud systems or worked with AI/ML workloads. This helps us understand your hands-on experience and problem-solving skills.

Communicate Clearly: When writing your cover letter, be clear and concise about your achievements. We love candidates who can communicate effectively, especially when it comes to leading incidents or writing postmortems.

Apply Through Our Website: Don’t forget to apply through our website! It’s the best way for us to receive your application and ensures you’re considered for this exciting opportunity.

How to prepare for a job interview at Deepstreamtech

Know Your Cloud Platforms

Make sure you brush up on your knowledge of AWS, GCP, and Azure. Be ready to discuss your hands-on experience with these platforms, especially in relation to running production workloads and managing large-scale cloud systems.

Show Off Your Kubernetes Skills

Kubernetes is a big deal for this role, so be prepared to talk about your experience operating production clusters. Share specific examples of challenges you've faced and how you overcame them to demonstrate your expertise.

Demonstrate Your Troubleshooting Prowess

Be ready to demonstrate deep troubleshooting skills. Prepare to discuss scenarios where you've tackled issues across networking, storage, and distributed systems, especially in high-performance environments. Real-life examples will make your case stronger!

Communicate Clearly and Confidently

Since clear communication is key, practice articulating your thoughts on incident response and postmortems. Think about how you've influenced teams to prioritise reliability improvements and be ready to share those experiences during the interview.