At a Glance
- Tasks: Respond to production incidents and improve operational reliability through automation and system changes.
- Company: Heidi is developing an AI Care Partner to support clinicians in delivering care effectively.
- Benefits: Enjoy comprehensive private medical cover, a £700 learning budget, and global parental leave.
- Other info: This position is hybrid, requiring 3 days in the office.
- Why this job: Join a hands-on role focused on maintaining real systems in production with significant ownership.
- Qualifications: 3–6+ years in SRE or operations-heavy engineering roles, with experience in cloud infrastructure and Kubernetes.
The predicted salary is between £60,000 and £80,000 per year.
About Heidi
Heidi is building an AI Care Partner that supports clinicians every step of the way, from documentation to delivery of care.
The Role
This role sits in the core Platform/SRE team that owns production. You’ll work directly on incident response, on-call duties, system reliability, and day-to-day operations for Heidi’s platform. We’re open to candidates who are strong mid-level SREs ready to take on more ownership, as well as senior SREs who enjoy being hands-on in operations. The role is intentionally ops-heavy and focused on keeping real systems healthy in production.
What You’ll Do
- Participate in on-call and incident response: Respond to production incidents, contribute to service restoration, and support clear communication during incidents. Over time, take increasing responsibility for leading incidents end-to-end.
- Improve operational reliability: Identify recurring issues and reliability risks, and drive fixes through better alerting, automation, system changes, or process improvements.
- Own parts of the production environment: Operate and improve Kubernetes clusters, cloud infrastructure, and core platform services, with growing ownership as familiarity increases.
- Strengthen observability: Improve dashboards, alerts, logs, and traces so issues are detected earlier and diagnosed faster, with a strong focus on actionable signals.
- Reduce operational toil: Automate repetitive tasks, simplify runbooks, and improve tooling to make on-call and day-to-day operations easier and safer.
- Support safe change: Improve deployments, rollback mechanisms, and operational readiness to reduce the risk of incidents caused by change.
- Contribute to operational practices: Write and maintain runbooks, participate in blameless post-mortems, and help improve incident response processes over time.
- Collaborate closely with engineers: Work with product and feature teams to improve production readiness, service ownership, and reliability expectations.
What We’re Looking For
- 3–6+ years in SRE, DevOps, Platform, or operations-heavy engineering roles.
- Experience supporting production systems and participating in on-call rotations.
- Comfortable debugging live systems under pressure.
- Experience operating cloud infrastructure (AWS preferred).
- Working knowledge of Kubernetes and containerised workloads.
- Infrastructure as Code experience (Terraform or similar).
- Familiarity with monitoring and alerting tools (Datadog, Prometheus, etc).
- Scripting or automation experience (Python, Bash, or similar).
Nice to Have
- Experience leading incidents or mentoring others during on-call.
- Experience in regulated or security-sensitive environments.
- Familiarity with databases, queues, and caches in production.
- Interest in reliability practices such as SLOs, error budgets, and capacity planning.
Benefits
- Real product momentum. We’re not trying to generate interest; we’re channelling it.
- Equity from day one.
- Unmatched impact.
- Work alongside world-class talent.
- Your health, covered. Comprehensive private medical and dental cover through Bupa, plus 24/7 mental health, coaching and wellbeing support through Sonder and a £100/month Healthy Heidi’s stipend.
- Global parental leave. 26 weeks paid for primary carers and 18 weeks for secondary carers, subject to eligibility.
- Fertility support. £7,000 one-off payment, subject to eligibility.
- Learning & development. £700 per year for courses, books, memberships, conferences and more.
- Home office budget. £500 one-off to set up a workspace you actually want to work in.
- Recharge days after major milestones and busy periods so you can reset and come back strong.
- Work from anywhere for up to 4 weeks per year, wherever the world takes you.
- Clinical leave. 10 days per year for eligible clinical roles to maintain accreditation and requirements.
- Flexibility that works. A hybrid environment, with 3 days in the office.
Heidi’s Commitment to Diversity, Equity and Inclusion
Heidi is dedicated to creating an equitable, inclusive, and supportive work environment that brings people together from diverse backgrounds, experiences, and perspectives. Our strength is in our differences. We’re proud to be an equal opportunity employer and welcome all applicants, as we’re committed to promoting a culture of opportunity for all.
Senior Site Reliability Engineer - UK employer: Heidi
Heidi offers a unique opportunity to work on an impactful AI platform in the UK. Employees benefit from comprehensive health coverage and a generous learning budget. The team values diversity and inclusion, fostering a supportive environment for all backgrounds.