Site Reliability Engineer in London

London Full-Time 36000 - 60000 £ / year (est.) Home office (partial)

At a Glance

  • Tasks: Join our team to ensure system reliability and improve operational processes.
  • Company: Heidi, an innovative AI Care Partner transforming healthcare.
  • Benefits: Equity from day one, personal development budget, and wellness days.
  • Why this job: Make a real impact in healthcare while working with world-class talent.
  • Qualifications: 3-6+ years in SRE or operations-heavy roles, cloud infrastructure experience.
  • Other info: Flexible hybrid work environment with opportunities for growth.

The predicted salary is between £36,000 and £60,000 per year.

Healthcare needs a better rhythm: one that keeps care continuous and deeply human. Heidi is building an AI Care Partner that works alongside clinicians to make that possible. We’re a team of doctors, engineers, designers, researchers, and creatives building tools that help clinicians stay focused on what matters most: their patients. In just 18 months, Heidi has given back more than 18 million hours to healthcare professionals - supporting 73 million patient visits in 116 countries. Today, more than two million patient visits each week are powered by Heidi worldwide. Backed by nearly $100 million in funding, we’re growing in the US, UK, Canada, and Europe, partnering with leading health systems including the NHS, Beth Israel Lahey Health, and Monash Health.

This role sits in the core Platform/SRE team that owns production. You’ll work directly on incident response, on-call, system reliability, and day-to-day operations for Heidi’s platform. We’re open to candidates who are strong mid-level SREs ready to take on more ownership, as well as senior SREs who enjoy being hands‑on in operations. The role is intentionally ops‑heavy and focused on keeping real systems healthy in production.

What You’ll Do

  • Participate in on‑call and incident response: Respond to production incidents, contribute to service restoration, and support clear communication during incidents. Over time, take increasing responsibility for leading incidents end‑to‑end.
  • Improve operational reliability: Identify recurring issues and reliability risks, and drive fixes through better alerting, automation, system changes, or process improvements.
  • Own parts of the production environment: Operate and improve Kubernetes clusters, cloud infrastructure, and core platform services, with growing ownership as familiarity increases.
  • Strengthen observability: Improve dashboards, alerts, logs, and traces so issues are detected earlier and diagnosed faster, with a strong focus on actionable signals.
  • Reduce operational toil: Automate repetitive tasks, simplify runbooks, and improve tooling to make on‑call and day‑to‑day operations easier and safer.
  • Support safe change: Improve deployments, rollback mechanisms, and operational readiness to reduce the risk of incidents caused by change.
  • Contribute to operational practices: Write and maintain runbooks, participate in blameless post‑mortems, and help improve incident response processes over time.
  • Collaborate closely with engineers: Work with product and feature teams to improve production readiness, service ownership, and reliability expectations.

What We’re Looking For

  • 3–6+ years in SRE, DevOps, Platform, or operations‑heavy engineering roles.
  • Experience supporting production systems and participating in on‑call rotations.
  • Comfortable debugging live systems under pressure.
  • Experience operating cloud infrastructure (AWS preferred).
  • Working knowledge of Kubernetes and containerised workloads.
  • Infrastructure as Code experience (Terraform or similar).
  • Familiarity with monitoring and alerting tools (Datadog, Prometheus, etc).
  • Scripting or automation experience (Python, Bash, or similar).

Nice To Have

  • Experience leading incidents or mentoring others during on‑call.
  • Experience in regulated or security‑sensitive environments.
  • Familiarity with databases, queues, and caches in production.
  • Interest in reliability practices such as SLOs, error budgets, and capacity planning.

How We Work

  • We own production: The Platform/SRE team is responsible for reliability and incident response.
  • Incidents are blameless: We focus on learning and improving systems, not assigning fault.
  • Practical over perfect: We prioritise improvements that reduce real operational pain.
  • Calm under pressure: Clear thinking and communication matter during incidents.

What do we believe in?

  • Live Forever - Every release moves care forward: measured, safe, and built to last. Data guides us, but patients define the truth that matters.
  • Practice Ownership - Decisions follow logic and proof, not hierarchy. Exceptional care demands exceptional standards in our work, our thinking, and our character.
  • Small Cuts Heal Faster - Stability earns trust, speed delivers impact. Progress is about learning fast without breaking what people depend on.
  • Make others better - Feedback is direct, kindness is constant, and excellence lifts everyone. Our success is measured by collective growth, not individual output.

Our mission is clear: expand the world’s capacity to care, and do it without losing the humanity that makes care worth delivering.

Why you should join Heidi

  • Real product momentum. We’re not trying to generate interest, we’re channeling it.
  • Equity from day one. When Heidi wins, you win. You’ll share directly in the success you help create.
  • Unmatched impact. Play a pivotal role in defining and scaling customer success at a critical growth moment - all while working on a product that delivers tangible value to clinicians and patients every day.
  • Work alongside world‑class talent. Join a team of operators and builders who’ve scaled unicorns.
  • Global reach. Help shape our international expansion as we bring Heidi to key international markets.
  • Growth and balance. Enjoy a personal development budget, work from anywhere for a month, dedicated wellness days, and your birthday off to recharge.
  • Flexibility that works. A hybrid environment, with 3 days in the office.

Heidi’s commitment to Diversity, Equity and Inclusion: Heidi is dedicated to creating an equitable, inclusive, and supportive work environment that brings people together from diverse backgrounds, experiences, and perspectives. Our strength is in our differences. We’re proud to be an equal opportunity employer and welcome all applicants, as we’re committed to promoting a culture of opportunity for all.

Site Reliability Engineer in London employer: Heidi

Heidi is an exceptional employer that prioritises a culture of collaboration and innovation, empowering employees to make a meaningful impact in healthcare. With a strong focus on personal development, flexible working arrangements, and a commitment to diversity and inclusion, team members enjoy unparalleled growth opportunities while contributing to a mission that enhances patient care globally. Join us to work alongside world-class talent in a supportive environment that values your contributions and well-being.

Contact Detail:

Heidi Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Site Reliability Engineer role in London

✨Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with current employees at Heidi. A friendly chat can sometimes lead to opportunities that aren’t even advertised!

✨Tip Number 2

Show off your skills! If you’ve got a GitHub or personal project that showcases your SRE expertise, make sure to highlight it during interviews. It’s a great way to demonstrate your hands-on experience.

✨Tip Number 3

Prepare for those tricky technical questions! Brush up on your knowledge of Kubernetes, cloud infrastructure, and incident response strategies. We want to see how you think under pressure, so practice makes perfect!

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are genuinely interested in joining our mission.

We think you need these skills to ace the Site Reliability Engineer role in London

Incident Response
System Reliability
Kubernetes
Cloud Infrastructure (AWS preferred)
Infrastructure as Code (Terraform or similar)
Monitoring and Alerting Tools (Datadog, Prometheus, etc)
Scripting or Automation (Python, Bash, or similar)
Debugging Live Systems
Operational Practices
Runbook Maintenance
Collaboration with Engineers
Automation of Repetitive Tasks
Observability Improvement
Change Management
Calm Under Pressure

Some tips for your application 🫡

Show Your Passion for Reliability: When you're writing your application, let us see your enthusiasm for keeping systems healthy and reliable. Share specific examples of how you've tackled incidents or improved operational reliability in your past roles.

Be Clear and Concise: We appreciate straightforward communication, especially in the tech world. Make sure your application is easy to read and gets straight to the point. Highlight your relevant experience without fluff!

Tailor Your Application: Don’t just send a generic application! Take the time to align your skills and experiences with what we’re looking for in the job description. Show us why you’re the perfect fit for our team at Heidi.

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you don’t miss out on any important updates from our team!

How to prepare for a job interview at Heidi

✨Know Your Stuff

Make sure you brush up on your technical skills, especially around Kubernetes, cloud infrastructure, and incident response. Be ready to discuss your past experiences with production systems and how you've handled on-call situations.

✨Show Your Problem-Solving Skills

Prepare to share specific examples of how you've improved operational reliability or reduced operational toil in previous roles. Highlight any automation or tooling improvements you've implemented that made a real difference.

✨Communicate Clearly

During the interview, practice clear and concise communication. Since the role involves incident response, demonstrating your ability to communicate effectively under pressure will be key. Think about how you would explain complex issues simply.

✨Emphasise Team Collaboration

Heidi values collaboration, so be ready to discuss how you've worked with engineers and product teams in the past. Share examples of how you've contributed to improving service ownership and reliability expectations within a team setting.
