At a Glance
- Tasks: Join us to enhance the reliability and performance of large-scale data platforms.
- Company: Amber Labs, a fast-growing digital transformation consultancy.
- Benefits: Enjoy 25 days of annual leave, private health insurance, and remote-first working.
- Other info: Be part of a dynamic team with excellent career growth opportunities.
- Why this job: Make a real impact on critical data services while collaborating with top professionals.
- Qualifications: Experience in Site Reliability Engineering and strong collaboration skills required.
The predicted salary is between £60,000 and £80,000 per year.
Amber Labs is a fast-growing digital transformation consultancy delivering complex data and technology solutions across the public sector. We are looking for an experienced Site Reliability Engineer (SRE) to join our team and support a high-profile, security-cleared programme focused on critical data and platform services. This role will focus on improving the reliability, observability and performance of large-scale data platforms and services. You'll work closely with architects, developers, platform engineers and stakeholders to define reliability objectives, drive automation, and ensure services operate effectively at scale.
As an SRE, you will be responsible for embedding reliability engineering principles across a complex data and platform landscape. You will help establish and measure service reliability targets, improve observability, lead root cause investigations, and identify opportunities to automate operational activities. This is an excellent opportunity for someone who enjoys solving complex operational challenges, working across multiple teams, and driving continuous service improvement within a highly secure environment.
Key Responsibilities
- Define, implement and champion Site Reliability Engineering practices across critical services and platforms
- Collaborate with architects, developers and platform teams to design and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs) and Service Level Agreements (SLAs)
- Monitor and manage error budgets, ensuring reliability targets are understood and achieved across engineering teams
- Improve service observability through effective monitoring, alerting and reporting capabilities
- Conduct and facilitate Root Cause Analysis (RCA) and Post-Incident Reviews (PIRs), driving meaningful improvements and preventative actions
- Build and manage an SRE improvement backlog, identifying opportunities for automation and operational excellence
- Support the reliability, availability and performance of data platforms, infrastructure and data pipelines
- Drive continuous improvement initiatives that reduce operational overhead and improve service resilience
- Work with technical and business stakeholders to establish meaningful service health metrics and reporting
Requirements
- Strong experience applying Site Reliability Engineering principles within complex production environments
- Experience with observability and monitoring practices
- Experience conducting Root Cause Analysis (RCA)
- Experience designing and implementing reliability frameworks and operational excellence practices
- Hands-on experience with Dynatrace, Kubernetes and Helm
- Experience developing automation solutions to improve reliability and reduce manual operational effort
- Strong stakeholder management and collaboration skills, with the ability to work effectively across engineering and architecture teams
- Experience supporting cloud-native platforms and services
- Active SC Clearance that has been used within the last 6–12 months
- Ability to operate at SFIA Level 4
- Experience working with data platforms, data engineering teams or large-scale data ecosystems
- Understanding of data pipelines and their operational challenges
- Experience supporting platform engineering or infrastructure teams
- Experience working within secure public sector or regulated environments
- Familiarity with cloud platforms such as AWS, Azure or GCP
What We Offer
- 25 days annual leave plus public holidays, giving you time to properly switch off and recharge
- Private medical insurance with Bupa
- Remote-first working, with access to our Liverpool Street office when you want to collaborate in person
- Personal training budget to support your professional development
- Perkbox membership, with access to discounts across retail, travel, dining, wellness and entertainment
- Electric Vehicle Scheme after one year of service
- Regular team socials and opportunities to connect across the business
- Employer pension contributions
- Referral scheme offering up to £3,000 for successful hires
- The opportunity to join a growing consultancy early and genuinely influence its direction and success
Amber Labs is a specialist digital consultancy delivering technology, data and transformation services across complex and highly regulated environments. We work with organisations tackling some of the UK's most challenging digital programmes, helping them deliver reliable, scalable and user-focused solutions.
Site Reliability Engineer in London. Employer: Amber Labs
Amber Labs is an exceptional employer, offering a dynamic work culture that prioritises collaboration and innovation within the digital transformation space. With a strong focus on employee growth, we provide a personal training budget, remote-first working options, and regular team socials, ensuring a supportive environment where you can thrive. Join us to make a meaningful impact on critical data platforms while enjoying competitive benefits like private medical insurance and an electric vehicle scheme.