At a Glance
- Tasks: Design and evolve systems for reliability, scalability, and security at Lyrebird Health.
- Company: Lyrebird Health is revolutionising healthcare by automating clinicians' tasks.
- Benefits: High ownership role with direct impact on product reliability and meaningful problem-solving.
- Why this job: Shape the future of healthcare technology while ensuring fast, secure, and reliable services.
- Qualifications: 8+ years in engineering, strong SRE experience, and leadership in cross-team initiatives.
- Other info: Diverse team culture encouraging applicants from underrepresented backgrounds.
The predicted salary is between £48,000 and £72,000 per year.
We’re looking for a Staff Site Reliability Engineer (SRE) to raise the reliability, scalability, and security bar across the Lyrebird platform. This is a senior, high-impact role focused on designing and evolving the systems and practices that keep Lyrebird fast, safe, and available. You’ll work across infrastructure, application reliability, observability, incident response, and platform enablement - partnering closely with Engineering, Security, and Product. This is not a “keep the lights on” role. You’ll drive meaningful improvements to how we build, deploy, and operate our services in production - with real autonomy and ownership.
About Lyrebird Health: Lyrebird Health is transforming the quality and accessibility of healthcare by automating clinicians’ most time-consuming tasks. Thousands of clinicians across many disciplines already use Lyrebird, and that number is growing every day. They trust us to deliver a fast, reliable, and secure experience. We value that trust above all else and strive to earn it while continuing to amaze our users.
What You’ll Do
Reliability & Production Engineering
- Own reliability outcomes across core services and customer-facing systems
- Define, implement, and evolve SLOs/SLIs, alerting strategy, and error budgets
- Lead initiatives to improve uptime, latency, and overall system resilience
- Proactively identify reliability risks and drive mitigation plans to completion
Observability & Incident Response
- Improve end-to-end observability (metrics, logs, traces) so issues are detected early and diagnosed quickly
- Lead incident response for high-severity events and guide teams through calm, effective mitigation
- Drive post-incident reviews that result in measurable, lasting improvements
- Build a culture of operational excellence: fewer incidents, faster recovery, better learning
Platform Enablement
- Develop internal tooling and paved paths that make “doing the right thing” the easiest option
- Improve the developer experience around deployments, rollbacks, environment consistency, and service ownership
- Partner with engineers to uplift production-readiness across new and existing services
Infrastructure & Automation
- Improve infrastructure reliability and maintainability using Infrastructure as Code
- Strengthen deployment workflows and reduce operational toil through automation
- Help shape architecture decisions with a reliability and scalability lens
Security & Compliance Support
- Embed security and compliance principles into platform practices (access controls, auditability, safe-by-default designs)
- Work closely with Security and Engineering leadership to support regulatory and enterprise requirements without slowing down delivery
What We’re Looking For
- 8+ years of engineering experience, with significant depth in SRE, platform, or production systems
- Strong experience operating and improving systems in production (including incident response)
- Proven ability to lead cross-team initiatives and influence engineering standards
Technical Strength
You don’t need to tick every box, but you should be strong across most of the following:
Cloud/Infrastructure
- AWS (ECS, EC2, VPC, IAM, RDS/Aurora, S3, CloudWatch)
- Infrastructure as Code (Terraform)
Observability
- Strong grasp of monitoring and alerting principles
- Experience with logs, metrics, and tracing, and with building meaningful dashboards
- Familiarity with OpenTelemetry and modern observability tooling
Systems & Operational Excellence
- Knowledge of reliability patterns: graceful degradation, retries, backoff, timeouts, load shedding, capacity planning
- Strong debugging instincts across distributed systems
- Practical approach to risk management and tradeoffs
Software Engineering
- Ability to build tools and automation (TypeScript, Go, Python, or similar)
- Familiarity with CI/CD and safe rollout strategies (feature flags, canary, blue/green)
Bonus Skills (Nice to Have)
- Experience supporting security frameworks (SOC 2, ISO 27001, HIPAA-style environments)
- Experience with service mesh patterns, multi-account AWS environments, or multi-region design
- Experience working with healthcare or regulated domains
- Experience scaling engineering org practices as the company grows
Who You Are
- You’re deeply accountable - you take ownership of outcomes, not just tasks
- You value simplicity and reliability over cleverness
- You’re calm and effective in incidents, and you raise the quality bar afterward
- You communicate clearly across engineering and non-engineering stakeholders
- You’re pragmatic: you know when to move fast, and when to slow down to reduce risk
Why This Role Is Different
- Staff-level scope with real influence across engineering
- Direct impact on reliability for a product clinicians depend on every day
- Work on meaningful problems where security, performance, and trust matter
- High ownership environment with room to shape how the company operates at scale
At Lyrebird, you won’t just respond to incidents - you’ll design the systems and standards that prevent them. We’re building a team that reflects the diversity of the people who’ll benefit from our work. If you’re from an underrepresented background in tech, we especially encourage you to apply - even if you don’t meet every single requirement.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
Staff Site Reliability Engineer employer: Lyrebird Health
Contact Details:
Lyrebird Health Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Staff Site Reliability Engineer role
✨Tip Number 1
Network like a pro! Reach out to folks in your industry on LinkedIn or at meetups. A friendly chat can lead to opportunities that aren’t even advertised yet.
✨Tip Number 2
Show off your skills! Create a portfolio or GitHub repo showcasing your projects and contributions. This gives potential employers a taste of what you can do beyond your CV.
✨Tip Number 3
Prepare for interviews by practising common SRE scenarios. Think about how you’d handle incidents or improve system reliability. We want to see your thought process!
✨Tip Number 4
Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are proactive!
Some tips for your application 🫡
Show Your Passion for Reliability: When writing your application, let us see your enthusiasm for reliability and system performance. Share specific examples of how you've improved uptime or resilience in past roles – we love to hear about your hands-on experience!
Tailor Your Application: Make sure to customise your application to highlight the skills and experiences that align with our job description. We want to see how your background in SRE or production systems makes you a perfect fit for our team at Lyrebird.
Be Clear and Concise: Keep your application straightforward and to the point. Use clear language to describe your achievements and avoid jargon unless it’s relevant. We appreciate clarity as much as you do!
Apply Through Our Website: Don’t forget to submit your application through our website! It’s the best way for us to receive your details and ensures you’re considered for the role. We can’t wait to see what you bring to the table!
How to prepare for a job interview at Lyrebird Health
✨Know Your Stuff
Make sure you brush up on your technical skills, especially around SRE principles and the tools mentioned in the job description. Be ready to discuss your experience with AWS, Infrastructure as Code, and observability tools. This is your chance to show how your background aligns with what they need!
✨Showcase Your Problem-Solving Skills
Prepare to share specific examples of how you've tackled reliability issues or improved system performance in the past. Use the STAR method (Situation, Task, Action, Result) to structure your answers. They want to see your thought process and how you handle challenges.
✨Communicate Clearly
Since this role involves working closely with various teams, practice explaining complex technical concepts in simple terms. Think about how you would communicate with non-engineering stakeholders. Clear communication can set you apart from other candidates.
✨Demonstrate Ownership and Accountability
Be prepared to discuss times when you took ownership of a project or incident. Highlight how you ensured successful outcomes and learned from any mistakes. This role values accountability, so showing that you take responsibility for your work will resonate well with them.