Site Reliability Engineer in London


London | Full-Time | £50,000 - £70,000 / year (est.) | No home office possible

At a Glance

  • Tasks: Enhance system reliability and manage monitoring solutions using Prometheus or VictoriaMetrics.
  • Company: Dynamic tech company focused on innovation and operational excellence.
  • Benefits: Competitive salary, flexible working hours, and opportunities for professional growth.
  • Other info: Participate in a rotating on-call schedule and contribute to continuous improvement.
  • Why this job: Join a team that values your input and helps you grow in a fast-paced environment.
  • Qualifications: Experience in SRE/DevOps and strong troubleshooting skills across Linux and Windows.

The predicted salary is between £50,000 and £70,000 per year.

ALL CANDIDATES MUST BE LOCATED IN THE UK

We are looking for an SRE to improve reliability and operational readiness with a strong focus on metrics, alerting, and event management. The role involves building and maintaining monitoring solutions using Prometheus or VictoriaMetrics, integrating alerts and events with BigPanda, and participating in on‑call rotations to drive fast incident response and continuous improvement across Windows and Linux environments.

Key Responsibilities

  • Build and operate metrics/monitoring platforms: Prometheus and/or VictoriaMetrics (scrape configs, exporters, recording rules)
  • Design and maintain alerting strategy: thresholds, anomaly detection, alert routing, deduplication, and noise reduction
  • Integrate monitoring/alerting and events with BigPanda (correlation, enrichment, routing, incident workflows)
  • Create and maintain dashboards and operational visibility (Grafana or equivalent)
  • Develop and maintain runbooks, operational playbooks, and incident response procedures
  • Participate in on‑call shifts: triage alerts, manage incidents, coordinate response, and lead communication during outages
  • Perform root‑cause analysis, post‑mortems, and implement corrective/preventive actions
  • Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
  • Support monitoring for core infrastructure and services on Windows and Linux, including HA components and clusters
  • Collaborate with DevOps/Engineering to instrument applications and standardize telemetry (metrics, logs, traces where applicable)

Skills, Knowledge & Expertise

  • Experience in SRE / Operations / DevOps with production incident ownership
  • Hands‑on experience with Prometheus and/or VictoriaMetrics (exporters, alert rules, recording rules, troubleshooting)
  • Experience integrating alerting/event pipelines with BigPanda (or similar event correlation tools)
  • Strong troubleshooting skills across Linux and Windows systems (networking, OS, services)
  • Ability to build reliable alerting with minimal noise (correlation, grouping, suppression, maintenance windows)
  • Experience with Git‑based workflows for monitoring‑as‑code and configuration management

Nice to Have

  • Grafana administration and dashboard design
  • Log management (ELK/EFK, Loki) and/or tracing (OpenTelemetry)
  • Automation skills (Python, PowerShell, Bash) and configuration tools (Ansible)
  • Messaging/cache/proxy operations: RabbitMQ, Redis, NGINX
  • Experience with Windows clustering or HA environments
  • Experience defining SLOs/SLIs and operational KPIs
  • Experience managing VoIP components and protocols (SIP, FreeSWITCH, OpenSIPS, session border controllers)
  • Experience with load‑balancing components (F5 LTM, F5 GTM)
  • Experience with virtualization platforms such as VMware or Hyper-V
  • Experience administering AWS or Azure tenants

On‑call Expectations

  • Participation in a rotating on‑call schedule (including nights/weekends as needed)
  • Ownership of incident response: rapid triage, escalation, mitigation, and follow‑up improvements
  • Commitment to improving monitoring quality to reduce alert fatigue and improve MTTR

Diversity, Inclusion, and Equal Opportunity

We hire, promote, and compensate employees based on their ability to perform their job responsibilities, without regard to race, color, creed, religion, sex, gender, marital status, national origin, ancestry, age, citizenship, physical or mental disability, sexual orientation, or other bases protected by applicable law. We are an equal‑opportunity employer and value diversity at our company.

Site Reliability Engineer in London employer: Intermedia Intelligent Communications

As a Site Reliability Engineer with us, you'll thrive in a dynamic and inclusive work environment that prioritises employee growth and development. We offer competitive benefits, a strong focus on work-life balance, and opportunities to engage in innovative projects that enhance your skills in monitoring and incident management. Join our team in the UK and be part of a culture that values collaboration, diversity, and continuous improvement.

Contact Details:

Intermedia Intelligent Communications Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Site Reliability Engineer role in London

✨Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with other SREs on LinkedIn. You never know who might have the inside scoop on job openings or can refer you directly.

✨Tip Number 2

Show off your skills! Create a portfolio showcasing your projects, especially those involving Prometheus, VictoriaMetrics, or any monitoring solutions. This gives potential employers a taste of what you can bring to the table.

✨Tip Number 3

Prepare for interviews by brushing up on your troubleshooting skills. Be ready to discuss real-life incidents you've managed, how you approached them, and what you learned. This will demonstrate your hands-on experience and problem-solving abilities.

✨Tip Number 4

Don't forget to apply through our website! It’s the best way to ensure your application gets seen. Plus, we love seeing candidates who are proactive about their job search!

We think you need these skills to ace the Site Reliability Engineer role in London

Site Reliability Engineering (SRE)
Prometheus
VictoriaMetrics
BigPanda
Grafana
Linux Systems Administration
Windows Systems Administration
Incident Management
Root Cause Analysis
Automation (Python, PowerShell, Bash)
Configuration Management (Ansible)
Git-based Workflows
SLOs/SLIs Definition
Load Balancing (F5 LTM, F5 GTM)
Cloud Administration (AWS, Azure)

Some tips for your application 🫡

Tailor Your CV: Make sure your CV is tailored to the Site Reliability Engineer role. Highlight your experience with Prometheus, VictoriaMetrics, and any relevant incident management tools like BigPanda. We want to see how your skills match what we're looking for!

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about SRE and how your background makes you a great fit for our team. Don't forget to mention your troubleshooting skills and experience with both Linux and Windows systems.

Showcase Your Projects: If you've worked on any projects related to monitoring solutions or alerting strategies, be sure to include them. We love seeing practical examples of your work, especially if they involve automation or improving service reliability.

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it shows us you’re keen on joining the StudySmarter family!

How to prepare for a job interview at Intermedia Intelligent Communications

✨Know Your Tools Inside Out

Make sure you’re well-versed in Prometheus, VictoriaMetrics, and BigPanda. Brush up on how to build and operate metrics platforms, as well as your experience with alerting strategies. Being able to discuss specific configurations or troubleshooting scenarios will show your hands-on expertise.

✨Demonstrate Your Troubleshooting Skills

Prepare to share examples of past incidents you've managed, especially in Linux and Windows environments. Highlight your approach to root-cause analysis and how you’ve implemented corrective actions. This will showcase your ability to handle real-world challenges effectively.

✨Showcase Your Collaboration Experience

Since the role involves working closely with DevOps and Engineering teams, be ready to discuss how you’ve collaborated in the past. Talk about any projects where you’ve standardised telemetry or improved service reliability through teamwork. This will demonstrate your ability to work well in a team setting.

✨Prepare for On-Call Scenarios

Expect questions about your experience with on-call duties and incident response. Be prepared to explain how you triage alerts and manage incidents, including any specific tools or processes you’ve used. This will help convey your readiness for the responsibilities that come with the role.

