Site Reliability Engineer (SRE) in London

London | Full-Time | 60000 - 80000 € / year (est.) | No home office possible
Monstro

At a Glance

  • Tasks: Own the reliability and observability of our secure platform on Google Cloud.
  • Company: Join Monstro, a pioneering tech company focused on innovative cloud solutions.
  • Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
  • Other info: Be part of a diverse team shaping the future of tech for institutional clients.
  • Why this job: Make a real impact by ensuring our platform runs smoothly and efficiently.
  • Qualifications: Experience with GCP, incident management, and strong coding skills required.

The predicted salary is between 60000 and 80000 € per year.

Monstro is building a secure, multi-tenant platform on Google Cloud, and we’re hiring a Site Reliability Engineer to own the reliability and observability of that platform end-to-end. This is a hands-on role for someone who wants to do real SRE work, not a rebrand of L1 support. You’ll write the dashboards, define the SLOs, build the automation that kills toil, and take your turn on the on-call rotation that proves it all works. When something breaks at 2 AM, you’re the person who keeps it running; when nothing’s breaking, you’re the person making sure the next break is smaller, shorter, or doesn’t happen at all.

What You’ll Do

  • Observability and reliability engineering
    • Define and maintain SLOs and SLIs for our tier-1 services: API gateway, application services, identity, and edge availability
    • Build canonical dashboards and alerts in Google Cloud Monitoring, backed by structured logs and BigQuery log analytics
    • Tune alert routing so every page is actionable, and remove the alerts that are not
    • Instrument services for distributed tracing and structured logging; push back on services that ship without it
    • Own error budgets and use them to prioritize reliability work over feature work when the budget is burned
    • Reduce toil: automate the top recurring page from the previous quarter
    • Maintain runbooks so every page maps to a runbook within one cycle of its first occurrence
  • On-call rotation and incident response
    • First responder for production alerts across monitoring, API gateway, edge defense, and CI
    • Triage severity, run the incident bridge, drive mitigation (revision rollback, traffic shift, scaling, edge block, credential rotation)
    • Own internal and external incident comms during your shift
    • Drive postmortems to closure with action items tracked as audit evidence
    • Write clean handoffs at the end of each shift

Our stack

  • Google Cloud Platform across multiple environments
  • Apigee X for API management
  • Cloud Run, GKE Autopilot, Cloud SQL
  • Identity Platform for customer identity
  • Cloud Armor, Cloud IDS, Security Command Center for edge and posture
  • BigQuery-backed log analytics from an org-level log sink
  • OpenTofu / Terraform for everything; GitHub Actions for CI/CD
  • Linear for work tracking

What You Bring

  • Required:
    • Solid production experience on GCP (or comparable AWS/Azure depth with willingness to ramp on GCP fast)
    • Comfortable on-call: you’ve run incidents, written postmortems, and shipped the action items
    • Strong observability fundamentals: SLOs, log-based metrics, alert hygiene, dashboard discipline
    • Working knowledge of Kubernetes, API gateways, identity systems, and at least one IaC tool
    • Scripting / coding fluency (Python, Go, Bash) for automation and tooling
    • Good written communication — handoffs, postmortems, and runbooks are part of the job
    • Bias toward fixing the system, not the symptom
  • Nice to Have:
    • Apigee or another enterprise API gateway in production
    • BigQuery for log analytics or audit
    • Experience standing up observability from scratch, not just maintaining inherited dashboards
    • SOC 2 or similar compliance environments

Why Join Us

You’ll be at the centre of how we bring Monstro to life for our institutional clients. Your work directly shapes the success of every implementation—getting requirements right means we deliver faster, smoother, and with fewer surprises. You’ll be joining at a foundational moment, helping to build the delivery practice from the ground up alongside a Delivery Manager who will rely on you as a critical partner from day one. If you enjoy the puzzle of understanding complex environments, the satisfaction of a well-organised document, and the energy of working directly with clients, this is your role.

We are an equal opportunity employer and value diversity. We do not discriminate on the basis of race, religion, colour, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Site Reliability Engineer (SRE) in London employer: Monstro

Monstro is an exceptional employer that fosters a dynamic and inclusive work culture, where Site Reliability Engineers play a pivotal role in shaping the reliability of our innovative platform on Google Cloud. With a strong emphasis on employee growth, you will have the opportunity to engage in meaningful projects, collaborate closely with clients, and contribute to the foundational development of our delivery practice. Our commitment to diversity and equal opportunity ensures a supportive environment where your contributions are valued and recognised.

Contact Detail:

Monstro Recruiting Team

StudySmarter Expert Advice🤫

We think this is how you could land the Site Reliability Engineer (SRE) role in London

Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with current SREs on LinkedIn. You never know who might have the inside scoop on job openings or can refer you directly.

Tip Number 2

Show off your skills! Create a portfolio showcasing your projects, especially those involving GCP, Kubernetes, or automation. This gives potential employers a taste of what you can do and sets you apart from the crowd.

Tip Number 3

Prepare for technical interviews by brushing up on your SRE fundamentals. Be ready to discuss SLOs, incident response, and your experience with observability tools. Practice common interview questions to boost your confidence.
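If you want something concrete to practise with, the relationship between an SLO and its error budget can be worked through in a few lines. The sketch below is only an illustration for interview prep: the function name `error_budget_report`, the request-based availability SLI, and the sample numbers are all assumptions, not anything Monstro has published about its own SLOs.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarise error budget usage for a request-based availability SLO.

    slo_target      -- target success ratio, e.g. 0.999 (hypothetical value)
    total_requests  -- requests served in the SLO window
    failed_requests -- requests that counted as failures in the same window
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # The error budget is the share of requests the SLO allows to fail.
    allowed_failures = (1 - slo_target) * total_requests
    availability = 1 - failed_requests / total_requests
    budget_consumed = failed_requests / allowed_failures

    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,        # 1.0 means fully burned
        "budget_remaining": 1 - budget_consumed,   # negative means overspent
    }


# Illustrative numbers only: a 99.9% target over one million requests.
report = error_budget_report(0.999, 1_000_000, 250)
```

With these sample numbers the budget allows about 1,000 failed requests, so 250 failures use roughly a quarter of it. Being able to explain that arithmetic, and what you would do once `budget_consumed` passes 1.0, is a reasonable way to show you understand how error budgets drive prioritisation.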

Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re genuinely interested in joining our team at Monstro.

We think you need these skills to ace the Site Reliability Engineer (SRE) role in London

Google Cloud Platform (GCP)
API Management (Apigee X)
Kubernetes
Infrastructure as Code (IaC) tools
Scripting (Python, Go, Bash)
Observability Fundamentals (SLOs, log-based metrics)
Incident Response

Some tips for your application 🫡

Tailor Your Application: Make sure to customise your CV and cover letter for the Site Reliability Engineer role. Highlight your experience with GCP, observability, and incident response. We want to see how your skills align with what we’re looking for!

Show Off Your Communication Skills: Since good written communication is key in this role, ensure your application is clear and concise. Use bullet points where necessary and keep it professional yet approachable. We love a well-organised document!

Demonstrate Your Problem-Solving Mindset: In your application, share examples of how you've tackled incidents or improved reliability in past roles. We’re looking for candidates who focus on fixing the system, not just the symptoms, so let that shine through!

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you don’t miss any important updates from our team. We can’t wait to hear from you!

How to prepare for a job interview at Monstro

Know Your Stack

Familiarise yourself with Google Cloud Platform and the specific tools mentioned in the job description, like Apigee X and BigQuery. Be ready to discuss your hands-on experience with these technologies and how you've used them to enhance observability and reliability.

Demonstrate Incident Management Skills

Prepare to share examples of incidents you've managed in the past. Highlight your role in triaging alerts, driving incident response, and writing postmortems. This will show that you’re not just familiar with the process but have actively contributed to improving it.

Showcase Your Automation Mindset

Think of specific instances where you've reduced toil through automation. Be ready to discuss the scripts or tools you've developed, especially in Python or Go, and how they improved system reliability or efficiency.
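As a talking point, here is a minimal sketch of the kind of small tooling that supports toil reduction: counting pages by alert name to find the most frequent one, which is the obvious first candidate for automation. The record shape (a dict with an `alert` key), the function name `top_recurring_pages`, and the sample alert names are hypothetical; a real version would read from whatever paging or monitoring export you actually have.

```python
from collections import Counter


def top_recurring_pages(pages, n=1):
    """Return the n most frequent alert names as (name, count) pairs.

    pages -- iterable of dicts, each with an 'alert' key (assumed shape)
    """
    counts = Counter(page["alert"] for page in pages)
    return counts.most_common(n)


# Hypothetical page history for one quarter.
sample_pages = [
    {"alert": "cloud-sql-connections", "severity": "high"},
    {"alert": "gateway-5xx-rate", "severity": "high"},
    {"alert": "cloud-sql-connections", "severity": "high"},
    {"alert": "cert-expiry", "severity": "low"},
    {"alert": "cloud-sql-connections", "severity": "medium"},
    {"alert": "gateway-5xx-rate", "severity": "high"},
]

worst = top_recurring_pages(sample_pages)
```

In an interview, the script matters less than the follow-through: what the top alert turned out to be, what you automated or fixed, and how the page count changed afterwards.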

Communicate Clearly

Since good written communication is crucial for this role, practice articulating your thoughts clearly. Prepare to explain complex concepts simply, as you might need to write handoffs or runbooks. This will demonstrate your ability to convey important information effectively.