Site Reliability Engineer

Full-Time | £60,000 - £75,000 / year (est.) | Remote-first
Amber Labs

At a Glance

  • Tasks: Join us as a Site Reliability Engineer to enhance data platform reliability and performance.
  • Company: Amber Labs, a fast-growing digital transformation consultancy in the public sector.
  • Benefits: Enjoy 25 days annual leave, private medical insurance, and a personal training budget.
  • Other info: Remote-first work culture with excellent career growth opportunities and team socials.
  • Why this job: Make a real impact on critical data services while working with cutting-edge technology.
  • Qualifications: Experience in Site Reliability Engineering and strong collaboration skills required.

The predicted salary is between £60,000 and £75,000 per year.

Amber Labs is a fast-growing digital transformation consultancy delivering complex data and technology solutions across the public sector. We are looking for an experienced Site Reliability Engineer (SRE) to join our team and support a high-profile, security-cleared programme focused on critical data and platform services. This role will focus on improving the reliability, observability and performance of large-scale data platforms and services. You'll work closely with architects, developers, platform engineers and stakeholders to define reliability objectives, drive automation, and ensure services operate effectively at scale.

As an SRE, you will be responsible for embedding reliability engineering principles across a complex data and platform landscape. You will help establish and measure service reliability targets, improve observability, lead root cause investigations, and identify opportunities to automate operational activities. This is an excellent opportunity for someone who enjoys solving complex operational challenges, working across multiple teams, and driving continuous service improvement within a highly secure environment.

Key Responsibilities

  • Define, implement and champion Site Reliability Engineering practices across critical services and platforms
  • Collaborate with architects, developers and platform teams to design and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs) and Service Level Agreements (SLAs)
  • Monitor and manage error budgets, ensuring reliability targets are understood and achieved across engineering teams
  • Improve service observability through effective monitoring, alerting and reporting capabilities
  • Conduct and facilitate Root Cause Analysis (RCA) and Post-Incident Reviews (PIRs), driving meaningful improvements and preventative actions
  • Build and manage an SRE improvement backlog, identifying opportunities for automation and operational excellence
  • Support the reliability, availability and performance of data platforms, infrastructure and data pipelines
  • Drive continuous improvement initiatives that reduce operational overhead and improve service resilience
  • Work with technical and business stakeholders to establish meaningful service health metrics and reporting
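For readers less familiar with error budgets, the relationship between an SLO and its budget can be sketched in a few lines. This is an illustrative example only, not Amber Labs tooling: the function name error_budget_remaining, its parameters and the sample figures are all hypothetical, and it assumes a simple request-based SLI (good events divided by total events).

```python
def error_budget_remaining(slo_target: float, total_events: int, failed_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    Hypothetical helper for illustration. A negative result means the
    budget is exhausted and the SLO has been missed for the window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        # No traffic in the window, so none of the budget has been spent.
        return 1.0
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_events
    return 1.0 - failed_events / allowed_failures


if __name__ == "__main__":
    # Assumed figures: a 99.9% SLO, one million requests, 250 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.1%}")
```

In this sketch a 99.9% target over one million requests tolerates roughly 1,000 failures, so 250 failures leaves about three quarters of the budget. Teams typically use a figure like this to decide whether to prioritise feature delivery or reliability work.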

Requirements

  • Strong experience applying Site Reliability Engineering principles within complex production environments
  • Experience with observability and monitoring practices
  • Experience conducting Root Cause Analysis (RCA)
  • Experience designing and implementing reliability frameworks and operational excellence practices
  • Hands-on experience with: Dynatrace, Kubernetes, Helm
  • Experience developing automation solutions to improve reliability and reduce manual operational effort
  • Strong stakeholder management and collaboration skills, with the ability to work effectively across engineering and architecture teams
  • Experience supporting cloud-native platforms and services
  • Active SC Clearance that has been used within the last 6–12 months
  • Ability to operate at SFIA Level 4
  • Experience working with data platforms, data engineering teams or large-scale data ecosystems
  • Understanding of data pipelines and their operational challenges
  • Experience supporting platform engineering or infrastructure teams
  • Experience working within secure public sector or regulated environments
  • Familiarity with cloud platforms such as AWS, Azure or GCP

What We Offer

  • 25 days annual leave plus public holidays, giving you time to properly switch off and recharge
  • Private medical insurance with Bupa
  • Remote-first working, with access to our Liverpool Street office when you want to collaborate in person
  • Personal training budget to support your professional development
  • Perkbox membership, with access to discounts across retail, travel, dining, wellness and entertainment
  • Electric Vehicle Scheme after one year of service
  • Regular team socials and opportunities to connect across the business
  • Employer pension contributions
  • Referral scheme offering up to £3,000 for successful hires
  • The opportunity to join a growing consultancy early and genuinely influence its direction and success

Amber Labs is a specialist digital consultancy delivering technology, data and transformation services across complex and highly regulated environments. We work with organisations tackling some of the UK's most challenging digital programmes, helping them deliver reliable, scalable and user-focused solutions.

Site Reliability Engineer employer: Amber Labs

Amber Labs is an exceptional employer, offering a dynamic work culture that prioritises collaboration and innovation within the digital transformation space. With a strong focus on employee growth, we provide a personal training budget, remote-first working options, and regular team socials, ensuring a supportive environment where you can thrive. Join us in shaping the future of technology solutions in the public sector while enjoying competitive benefits like private medical insurance and an electric vehicle scheme.

Contact Details:

Amber Labs Recruitment Team

Skills we think you need to succeed as a Site Reliability Engineer

Site Reliability Engineering (SRE)
Service Level Indicators (SLIs)
Service Level Objectives (SLOs)
Service Level Agreements (SLAs)
Root Cause Analysis (RCA)
Post-Incident Reviews (PIRs)
Automation Solutions