At a Glance
- Tasks: Enhance system reliability and manage monitoring solutions using Prometheus or VictoriaMetrics.
- Company: Dynamic tech company focused on innovation and operational excellence.
- Benefits: Competitive salary, flexible working hours, and opportunities for professional growth.
- Other info: Participate in a rotating on-call schedule and contribute to continuous improvement.
- Why this job: Join a team that values your input and helps you grow in a fast-paced environment.
- Qualifications: Experience in SRE/DevOps and strong troubleshooting skills across Linux and Windows.
The predicted salary is between £50,000 and £70,000 per year.
ALL CANDIDATES MUST BE LOCATED IN THE UK
We are looking for an SRE to improve reliability and operational readiness with a strong focus on metrics, alerting, and event management. The role involves building and maintaining monitoring solutions using Prometheus or VictoriaMetrics, integrating alerts and events with BigPanda, and participating in on‑call rotations to drive fast incident response and continuous improvement across Windows and Linux environments.
Key Responsibilities
- Build and operate metrics/monitoring platforms: Prometheus and/or VictoriaMetrics (scrape configs, exporters, recording rules)
- Design and maintain alerting strategy: thresholds, anomaly detection, alert routing, deduplication, and noise reduction
- Integrate monitoring/alerting and events with BigPanda (correlation, enrichment, routing, incident workflows)
- Create and maintain dashboards and operational visibility (Grafana or equivalent)
- Develop and maintain runbooks, operational playbooks, and incident response procedures
- Participate in on‑call shifts: triage alerts, manage incidents, coordinate response, and lead communication during outages
- Perform root‑cause analysis, post‑mortems, and implement corrective/preventive actions
- Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
- Support monitoring for core infrastructure and services on Windows and Linux, including HA components and clusters
- Collaborate with DevOps/Engineering to instrument applications and standardize telemetry (metrics, logs, traces where applicable)
Skills, Knowledge & Expertise
- Experience in SRE / Operations / DevOps with production incident ownership
- Hands‑on experience with Prometheus and/or VictoriaMetrics (exporters, alert rules, recording rules, troubleshooting)
- Experience integrating alerting/event pipelines with BigPanda (or similar event correlation tools)
- Strong troubleshooting skills across Linux and Windows systems (networking, OS, services)
- Ability to build reliable alerting with minimal noise (correlation, grouping, suppression, maintenance windows)
- Experience with Git‑based workflows for monitoring‑as‑code and configuration management
Nice to Have
- Grafana administration and dashboard design
- Log management (ELK/EFK, Loki) and/or tracing (OpenTelemetry)
- Automation skills (Python, PowerShell, Bash) and configuration tools (Ansible)
- Messaging/cache/proxy operations: RabbitMQ, Redis, NGINX
- Experience with Windows clustering or HA environments
- Experience defining SLOs/SLIs and operational KPIs
- Experience managing VoIP components and protocols (SIP, FreeSWITCH, OpenSIPS, session border controllers)
- Experience with load‑balancing components (F5 LTM, F5 GTM)
- Experience with virtualization platforms such as VMware or Hyper-V
- Experience administering AWS or Azure tenants
On‑call Expectations
- Participation in a rotating on‑call schedule (including nights/weekends as needed)
- Ownership of incident response: rapid triage, escalation, mitigation, and follow‑up improvements
- Commitment to improving monitoring quality to reduce alert fatigue and improve MTTR
Diversity, Inclusion, and Equal Opportunity
We hire, promote, and compensate employees based on their ability to perform their job responsibilities, without regard to race, color, creed, religion, sex, gender, marital status, national origin, ancestry, age, citizenship, physical or mental disability, sexual orientation, or other bases protected by applicable law. We are an equal‑opportunity employer and value diversity at our company.
Site Reliability Engineer in London employer: Intermedia Intelligent Communications
Contact Details:
Intermedia Intelligent Communications Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Site Reliability Engineer role in London
✨Tip Number 1
Network like a pro! Reach out to folks in the industry, attend meetups, and connect with other SREs on LinkedIn. You never know who might have the inside scoop on job openings or can refer you directly.
✨Tip Number 2
Show off your skills! Create a portfolio showcasing your projects, especially those involving Prometheus, VictoriaMetrics, or any monitoring solutions. This gives potential employers a taste of what you can bring to the table.
✨Tip Number 3
Prepare for interviews by brushing up on your troubleshooting skills. Be ready to discuss real-life incidents you've managed, how you approached them, and what you learned. This will demonstrate your hands-on experience and problem-solving abilities.
✨Tip Number 4
Don't forget to apply through our website! It’s the best way to ensure your application gets seen. Plus, we love seeing candidates who are proactive about their job search!
Some tips for your application 🫡
Tailor Your CV: Make sure your CV is tailored to the Site Reliability Engineer role. Highlight your experience with Prometheus, VictoriaMetrics, and any relevant incident management tools like BigPanda. We want to see how your skills match what we're looking for!
Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about SRE and how your background makes you a great fit for our team. Don't forget to mention your troubleshooting skills and experience with both Linux and Windows systems.
Showcase Your Projects: If you've worked on any projects related to monitoring solutions or alerting strategies, be sure to include them. We love seeing practical examples of your work, especially if they involve automation or improving service reliability.
Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it shows us you’re keen on joining the StudySmarter family!
How to prepare for a job interview at Intermedia Intelligent Communications
✨Know Your Tools Inside Out
Make sure you’re well-versed in Prometheus, VictoriaMetrics, and BigPanda. Brush up on how to build and operate metrics platforms, as well as your experience with alerting strategies. Being able to discuss specific configurations or troubleshooting scenarios will show your hands-on expertise.
✨Demonstrate Your Troubleshooting Skills
Prepare to share examples of past incidents you've managed, especially in Linux and Windows environments. Highlight your approach to root-cause analysis and how you’ve implemented corrective actions. This will showcase your ability to handle real-world challenges effectively.
✨Showcase Your Collaboration Experience
Since the role involves working closely with DevOps and Engineering teams, be ready to discuss how you’ve collaborated in the past. Talk about any projects where you’ve standardised telemetry or improved service reliability through teamwork. This will demonstrate your ability to work well in a team setting.
✨Prepare for On-Call Scenarios
Expect questions about your experience with on-call duties and incident response. Be prepared to explain how you triage alerts and manage incidents, including any specific tools or processes you’ve used. This will help convey your readiness for the responsibilities that come with the role.