Service Reliability Engineer in London

London | Full-Time | £60,000 - £80,000 / year (est.) | No working from home possible
Universal Music Group

At a Glance

  • Tasks: Ensure services connecting artists and fans are always on and optimised.
  • Company: Join a global leader in connecting music lovers and creators.
  • Benefits: Competitive salary, flexible hours, and opportunities for professional growth.
  • Other info: Diverse and inclusive workplace committed to continuous learning and operational excellence.
  • Why this job: Make a real impact in the music industry while honing your tech skills.
  • Qualifications: Experience in systems administration and proficiency in programming languages required.

The predicted salary is between £60,000 and £80,000 per year.

Role Overview

As a Site Reliability Engineer, you won’t just be supporting systems; you’ll be ensuring the services that connect artists and fans around the globe are always on.

Responsibilities

  • Design, build, and maintain the availability, scalability, and performance of critical services.
  • Develop and maintain robust monitoring, alerting, and observability systems (e.g., using AWS CloudWatch, Dynatrace) to ensure rapid issue detection and resolution.
  • Monitor infrastructure capacity and performance, providing analysis and suggestions for service delivery improvement.
  • Drive the automation of repetitive operational tasks, including infrastructure provisioning, deployments, and scaling.
  • Create and maintain scripts and custom code to support and enhance our operational toolset.
  • Support and optimise CI/CD pipelines to improve deployment speed and reliability.
  • Participate in an on‑call rotation to troubleshoot and mitigate production incidents.
  • Lead post‑incident reviews and root cause analyses to implement lasting solutions.
  • Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle.
  • Act as the final escalation point for SRE operations and coordinate cross‑functional teams during high‑severity events.
  • Develop and refine escalation frameworks for the Global Technical Operations Centre.
  • Conduct deep‑dive root cause analysis for recurring, complex problems and develop long‑term solutions.
  • Mentor and elevate the team, serving as a technical leader and providing training on advanced security concepts, threat landscapes, and best practices.
  • Collaborate with DevOps and applications architects to enforce standards and promote IaC and toil reduction.
  • Identify opportunities for network automation, scripting, and tool development to streamline operational tasks.
  • Create and maintain comprehensive documentation for configurations, SOPs, and incident response protocols.
  • Communicate effectively with technical and non‑technical stakeholders, including senior management, regarding incident status, resolution plans, and security issues.
  • Foster a culture of continuous learning and operational excellence within the team.
  • Work outside standard business hours will occasionally be required.

Qualifications

  • Strong background in systems administration (Linux/Windows) in a large‑scale environment.
  • Proficiency in at least one programming language (e.g., Python, Go, Java).
  • Hands‑on experience with a major cloud platform (AWS, GCP, or Azure), with a strong preference for AWS.
  • Solid understanding of networking, containers (Docker, Kubernetes), and Infrastructure as Code (e.g., Terraform, Ansible).
  • Experience with modern monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, Dynatrace).
  • Proven analytical and problem‑solving abilities in a high‑pressure environment.
  • Excellent communication skills and the ability to foster a collaborative team environment.

Preferred Experience & Skills

  • Bachelor’s degree in an IT‑related field.
  • Experience managing large‑scale, distributed systems for a global organisation.
  • Familiarity with IT governance standards like ITIL.
  • Direct experience with ServiceNow for IT service management.
  • Knowledge of chaos engineering, resilience testing, and advanced capacity planning.

Everyone is welcome to apply for our roles, and we are determined to ensure that no applicant or employee receives less favourable treatment because of gender, race, disability, sexual orientation, religion, belief, age, marital status, background, pregnancy, or caring responsibilities. We also recognise the importance of diversity of thought within our teams and are fully committed to embracing the talents of people with autism, dyslexia, ADHD, and other forms of neurocognitive variation.

Employer: Universal Music Group

As a Service Reliability Engineer, you will thrive in a dynamic and inclusive work environment that prioritises innovation and collaboration. Our commitment to employee growth is evident through continuous learning opportunities and mentorship, ensuring you can advance your skills while contributing to services that connect artists and fans globally. With a focus on operational excellence and a culture that embraces diversity, we offer a unique chance to make a meaningful impact in the tech industry.

Contact Details:

Universal Music Group Recruitment Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Service Reliability Engineer role in London

Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with current employees at companies you're eyeing. A friendly chat can sometimes lead to opportunities that aren't even advertised!

Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to SRE practices. This gives potential employers a taste of what you can do and sets you apart from the crowd.

Tip Number 3

Prepare for interviews by brushing up on common SRE scenarios and problem-solving questions. Practice explaining your thought process clearly, as communication is key in this role. We want to see how you tackle challenges!

Tip Number 4

Don't forget to apply through our website! It’s the best way to ensure your application gets the attention it deserves. Plus, we love seeing candidates who are proactive about their job search!

We think you need these skills to ace the Service Reliability Engineer role in London

Systems Administration (Linux/Windows)
AWS Cloud Platform
Monitoring and Observability Tools (e.g., AWS CloudWatch, Dynatrace, Prometheus, Grafana, Datadog, Splunk)
Infrastructure as Code (e.g., Terraform, Ansible)
Containerisation (Docker, Kubernetes)
Programming and Scripting (e.g., Python, Go, Java)
CI/CD Pipeline Management

Some tips for your application 🫡

Tailor Your CV: Make sure your CV is tailored to the Service Reliability Engineer role. Highlight your experience with systems administration, cloud platforms, and any relevant programming languages. We want to see how your skills match what we're looking for!

Showcase Your Projects: If you've worked on any projects that involved monitoring, automation, or CI/CD pipelines, be sure to include them! We love seeing practical examples of your work that demonstrate your problem-solving abilities and technical expertise.

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about the role and how you can contribute to our mission. We appreciate a personal touch, so let your personality come through!

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it’s super easy – just follow the prompts!

How to prepare for a job interview at Universal Music Group

Know Your Tech Stack

Make sure you’re well-versed in the technologies mentioned in the job description, like AWS, Python, and Docker. Brush up on your knowledge of monitoring tools like Dynatrace and Prometheus, as these will likely come up during the interview.

Showcase Problem-Solving Skills

Prepare to discuss specific examples where you've tackled complex issues in high-pressure environments. Think about times when you’ve implemented long-term solutions after a root cause analysis, as this aligns perfectly with the role's responsibilities.

Communicate Clearly

Practice explaining technical concepts in simple terms, as you’ll need to communicate effectively with both technical and non-technical stakeholders. Being able to convey your thoughts clearly can set you apart from other candidates.

Emphasise Collaboration

Highlight your experience working in cross-functional teams and how you’ve fostered a collaborative environment. Mention any mentoring roles you've taken on, as they show leadership and a commitment to team growth, which is crucial for this position.