Site Reliability Engineer

Full-Time | £36,000 - £60,000 / year (est.) | Home office (partial)

At a Glance

  • Tasks: Join us to enhance service reliability and automate operational tasks in a dynamic tech environment.
  • Company: Xceptor, a leader in data manipulation for financial services.
  • Benefits: Competitive salary, inclusive culture, and opportunities for professional growth.
  • Why this job: Be part of an innovative team using AI to revolutionise data reliability.
  • Qualifications: Experience in SRE or DevOps, with a focus on cloud services and automation.
  • Other info: Diverse and inclusive workplace that values unique perspectives.

The predicted salary is between £36,000 and £60,000 per year.

About Xceptor

Data is at the heart of everything we do: Xceptor has been designed around data manipulation in its broadest sense. We source data from wherever it flows. We curate, normalise, validate, repair, and enrich that data so it reaches its destination in a reliable and consistent format. Data coming out of Xceptor is data our clients can trust. We are recognised as an expert in the Financial Services vertical, which strongly aligns with Business Users in Middle and Back-Office teams. We enable these users to solve their data challenges by themselves, rather than through a technology-led project. Our mission is to empower business users within financial institutions to build automated processes that deliver trusted data.

Our values are:

  • Client Centricity
  • One Team
  • Impactful

Your Role

Site Reliability Engineering (SRE) is a cross-cutting function that partners with tribes across Xceptor to make our services reliable, performant, secure, and operable in production. We set and evolve standards for SLOs/SLIs, observability, incident response, and operational controls, and we build automation that reduces toil and enables teams to ship safely at pace across cloud and on-prem deployments.

Xceptor operates with an AI-first PDLC. AI agents are a digital delivery partner and a member of the team, accelerating how we design, build, test, document, deploy, and operate our services. Reliability is engineered in through standards, automation, and measurable signals, with humans providing intent, constraints, verification, and accountability.

Who we’re looking for

Reliability Engineering (Build reliability into the system)

  • Contributes to defining and improving SLIs/SLOs and service health signals, aligned to customer outcomes.
  • Implements reliability improvements within established patterns (timeouts, retries, graceful degradation, safe failure modes).
  • Supports capacity and performance work: basic baselining, load investigation, and scaling hygiene.
  • Helps maintain operational quality across production and staging, and improves environment consistency where possible.

Incident Management & Operational Excellence

  • Participates in incident response and on-call (as applicable), contributing to triage, mitigation, and recovery.
  • Produces clear post-incident notes and supports root cause analysis, focusing on actions that prevent recurrence.
  • Creates and improves runbooks/playbooks so incidents are faster and more consistent to resolve.
  • Helps improve change safety through practical release/readiness checks and operational guardrails.

Observability & Production Signals

  • Implements and improves observability for services: logs, metrics, traces, dashboards, and alerting aligned to standards.
  • Tunes alerts to reduce noise and improve actionability; helps manage flakiness and false positives.
  • Builds and maintains service health dashboards that support quick diagnosis and release confidence.
  • Works with QA and Engineering to align operational signals with end-to-end journey health.

Automation & Tooling

  • Automates repetitive operational tasks and reduces toil through scripts, tooling, and pipeline improvements.
  • Contributes to deployment automation and reliability guardrails in CI/CD, working with Platform Engineering.
  • Implements and maintains IaC changes under guidance, ensuring changes are safe, reviewed, and measured.
  • Improves diagnostics and “day 2” operations to make support and troubleshooting easier.

AI-First Operations

  • Uses AI routinely to accelerate operational tasks (investigation, diagnostics, runbooks, automation drafts) with explicit verification.
  • Works effectively in an “agents draft, humans verify” model for operational artefacts (scripts, dashboards, alerts, incident summaries).
  • Applies safe operational controls when using AI (no unsafe remediation; careful handling of sensitive data).
  • Learns from production outcomes and improves automation and guardrails based on real incidents and trends.

Collaboration & Enablement

  • Partners effectively with engineering teams to embed reliability into delivery without becoming a bottleneck.
  • Communicates reliability risks and operational impacts clearly, escalating early when needed.
  • Contributes to shared platform practices and standards across tribes (templates, runbooks, alerting patterns).
  • Builds strong working relationships with stakeholders to support customer outcomes.

Key Competencies

Technical

  • Experience supporting and improving production services with reliability and performance expectations.
  • Working knowledge of cloud and cloud-native operations (Azure preferred), and the fundamentals of running services safely.
  • Experience with IaC and automation (tooling/framework aligned to your stack), with good review and change discipline.
  • Familiarity with CI/CD and deployment practices; able to improve pipelines and release safety under guidance.
  • Practical observability skills: logs/metrics/traces, dashboards, and alert tuning.
  • Comfortable with scripting and automation (e.g., PowerShell, CLI tooling).

AI-First SRE

  • Uses AI to accelerate investigation, automation drafts, and runbook creation, and verifies outputs before use.
  • Can follow and contribute to repeatable operational workflows and templates that improve reliability over time.
  • Understands and mitigates AI risks in operations (unsafe actions, false confidence, confidentiality).

Non-Technical

  • Calm, pragmatic, and reliable; communicates clearly during incidents and operational issues.
  • Outcome-focused with a bias for automation and systemic fixes over manual effort.
  • Collaborative and receptive to feedback; grows quickly in a high-tempo environment.
  • Customer-aware mindset suitable for regulated, mission-critical environments.

Required Education & Experience

  • Experience as an SRE / DevOps / Production Engineer (typically 2–5 years).
  • Experience supporting cloud services and operational automation in production environments; Azure experience beneficial.
  • Experience contributing to CI/CD, IaC, and observability practices in a delivery team.
  • Strong academic background, including a degree in a STEM discipline, or equivalent experience.

How Success Will Be Measured

This role is measured on outcomes and how they’re achieved: improving reliability and operational signal quality, reducing toil through automation, and supporting controlled change in an AI-first operating model.

  • Reliability: SLO attainment, availability/performance trends, incident frequency/severity trend, and MTTR improvements.
  • Change safety: change failure rate and rollback rate improve; releases become safer and more predictable.
  • Observability: alert signal-to-noise improves (flake/noise down), coverage of key services/journeys increases, and diagnosis from logs/metrics/traces gets faster.
  • Toil reduction: automation increases, manual operational overhead reduces, runbooks/playbooks drive consistent response.
  • Cost & capacity: capacity planning maturity improves; costs are optimised without risking SLOs.

Behaviours: AI-first by default (agents draft, humans verify); strong verification discipline; reliable incident participation; automation mindset; control-aware and security-conscious decisions.

Associated Values and Behaviours

  • Collaboration: Encourage teamwork and knowledge sharing.
  • Innovation: Support the exploration of new ideas and technologies.
  • Integrity: Maintain transparency and ethical behaviour in all decisions.
  • Accountability: Take ownership of responsibilities and results.
  • Respect: Value diverse perspectives and contributions.
  • Continuous Improvement: Strive for excellence in processes and outcomes.
  • Customer Focus: Prioritise solutions that meet or exceed customer expectations.

Diversity & Inclusion at Xceptor

We believe great ideas come from everywhere — and that the best teams are made up of people with different backgrounds, experiences, and perspectives. At Xceptor, we’re committed to building a workplace where everyone feels welcome, valued, and empowered to be themselves. We know that not everyone ticks every single box in a job description — and that’s okay. If you’re excited about this role and think you could make a difference, we’d love to hear from you. Your unique skills and experiences might be just what we need, even if you don’t meet every requirement. We celebrate diversity and are dedicated to creating an inclusive environment for all employees — regardless of race, gender identity, sexual orientation, age, disability, religion, or background.

Please note: Xceptor works with clients in financial services, and our offers of employment are subject to the satisfactory completion of background checks, which include criminal record checks and credit reference checks. If you have any employment gaps exceeding three months within the last six years, we will request additional information and evidence to clarify those periods.

Site Reliability Engineer employer: Xceptor

Xceptor is an exceptional employer that prioritises employee growth and collaboration within a dynamic work culture. As a Site Reliability Engineer, you will thrive in an AI-first environment that encourages innovation and continuous improvement, while benefiting from a commitment to diversity and inclusion. With opportunities for professional development and a focus on impactful contributions, Xceptor empowers its employees to make a meaningful difference in the financial services sector.

Contact Detail:

Xceptor Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Site Reliability Engineer role

Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with current employees at Xceptor. A friendly chat can sometimes lead to opportunities that aren’t even advertised!

Tip Number 2

Show off your skills! If you’ve got a portfolio or GitHub with projects related to Site Reliability Engineering, make sure to highlight them during interviews. It’s a great way to demonstrate your hands-on experience.

Tip Number 3

Prepare for those tricky questions! Brush up on your incident management scenarios and be ready to discuss how you’ve improved reliability in past roles. We love seeing candidates who can think on their feet!

Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re genuinely interested in joining our team at Xceptor!

We think you need these skills to ace the Site Reliability Engineer role

Site Reliability Engineering (SRE)
Service Level Objectives (SLOs)
Service Level Indicators (SLIs)
Incident Management
Root Cause Analysis
Observability (logs, metrics, traces)
Automation and Tooling
Infrastructure as Code (IaC)
Continuous Integration/Continuous Deployment (CI/CD)
Cloud Operations (Azure preferred)
Scripting (e.g., PowerShell, CLI tooling)
Collaboration
Customer Focus
Performance Tuning
Capacity Planning

Some tips for your application 🫡

Tailor Your CV: Make sure your CV is tailored to the Site Reliability Engineer role. Highlight your experience with cloud services, automation, and incident management. We want to see how your skills align with our mission of delivering reliable data.

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about reliability engineering and how you can contribute to our team. Be sure to mention any relevant projects or experiences that showcase your skills.

Showcase Your Technical Skills: Don’t forget to highlight your technical competencies, especially in areas like CI/CD, IaC, and observability. We love seeing practical examples of how you've improved service reliability or automated processes in your previous roles.

Apply Through Our Website: We encourage you to apply through our website for a smoother application process. It helps us keep track of your application and ensures you’re considered for the role. Plus, it’s super easy!

How to prepare for a job interview at Xceptor

Know Your Stuff

Make sure you brush up on your knowledge of Site Reliability Engineering principles, especially around SLIs and SLOs. Be ready to discuss how you've implemented reliability improvements in past roles, as this will show you're aligned with what the company values.

Showcase Your Automation Skills

Prepare examples of how you've automated operational tasks in previous positions. Highlight any experience with CI/CD pipelines and IaC, as these are crucial for the role. Being able to demonstrate your ability to reduce toil through automation will definitely impress them.

Be Ready for Incident Management Scenarios

Expect questions about incident response and how you've handled past incidents. Think of specific examples where you contributed to triage or recovery efforts, and be prepared to discuss what you learned from those experiences.

Emphasise Collaboration

Since the role involves working closely with engineering teams, be ready to talk about how you've collaborated in the past. Share instances where you communicated reliability risks or worked together to improve service health, as this shows you can be a team player.
