Site Reliability Engineer

Full-time · £36,000 to £60,000 per year (est.) · Remote

At a Glance

  • Tasks: Own and enhance our observability stack while leading incident response and release management.
  • Company: Join Albatross, a pioneering tech company shaping the future of AI.
  • Benefits: Enjoy a remote-first culture with ownership, autonomy, and a supportive team.
  • Why this job: Make a real impact on reliability in a cutting-edge AI platform.
  • Qualifications: 5-7+ years in SRE or similar roles with strong Kubernetes experience.
  • Other info: Dynamic environment with opportunities for professional growth and innovation.

The predicted salary is between £36,000 and £60,000 per year.

Location: Remote. Candidates need the right to work and travel in Europe.

At Albatross, we’re building the second pillar of AI: a perception layer that understands how users actually experience content, in real time. Trained on live user interactions, Albatross learns and reasons on the fly. Our technology powers in-session discovery by adapting to evolving user interests as they change. We have raised significant funding and our platform already operates at scale, with billions of events processed and hundreds of millions of predictions served.

The Role

We’re looking for a Site Reliability Engineer to own the reliability and observability of our platform. This is a hands-on leadership role where you’ll design, build, and maintain our observability stack, lead incident response, oversee releases, and establish the processes and standards that allow the team to ship quickly and confidently.

More specifically, you will:

  • Observability & Monitoring: Own and evolve our observability stack (Prometheus, Grafana, Loki, Jaeger), including dashboards, alerts, and SLOs. Instrument services for meaningful metrics and tracing, reducing noise and improving signal.
  • Reliability & Incident Response: Lead incident response and establish blameless postmortems, runbooks, and automated remediation. Define, track, and improve SLIs/SLOs to proactively reduce reliability risk.
  • Release Management: Own the release process end-to-end, improving deployment speed, safety, and recovery. Implement progressive rollouts, feature flags, and rollback strategies.
  • Platform & Tooling: Embed observability into the development lifecycle in close collaboration with engineering. Maintain and evolve our Kubernetes-based platform, adopting new tools when they add real value.

Requirements

  • 5–7+ years in SRE, platform engineering, DevOps, or similar roles.
  • Strong production experience with Kubernetes and modern observability stacks (Prometheus, Grafana, Loki, Jaeger/OpenTelemetry).
  • Proven track record leading incident response and building monitoring systems teams actually use.
  • Deep distributed systems knowledge and production debugging experience.
  • Pragmatic approach to tooling and alerting that teams trust.
  • Clear communicator across engineering, product, and leadership.
  • STEM degree (Computer Science, Engineering, Mathematics, or similar).
  • Plus: contributions to open-source observability projects and background in high-scale or high-availability environments.

Benefits

  • Remote-first, async-friendly culture.
  • Ownership and autonomy: you’ll shape how we do reliability.
  • A team that cares about building things right.

Site Reliability Engineer employer: Albatross

At Albatross, we pride ourselves on being an exceptional employer, offering a remote-first and asynchronous-friendly culture that empowers our Site Reliability Engineers to take ownership and shape the future of our reliability practices. With significant funding and a commitment to building a team that values quality and innovation, we provide ample opportunities for professional growth and collaboration in a dynamic environment where your contributions directly impact our cutting-edge AI technology.

Contact Details:

Albatross Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Site Reliability Engineer role

✨Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and join online communities. You never know who might have the inside scoop on job openings or can refer you directly.

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to observability and reliability. This gives potential employers a taste of what you can bring to the table.

✨Tip Number 3

Prepare for interviews by brushing up on your technical knowledge and incident response strategies. Practice common SRE scenarios and be ready to discuss how you've handled past incidents.

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen. Plus, we love seeing candidates who are proactive about their job search.

We think you need these skills to ace the Site Reliability Engineer role

Observability Stack Management
Prometheus
Grafana
Loki
Jaeger
Incident Response Leadership
Blameless Postmortems
SLI/SLO Definition and Tracking
Release Management
Kubernetes
Production Debugging
Distributed Systems Knowledge
Clear Communication
Tooling Pragmatism
Collaboration with Engineering Teams

Some tips for your application 🫡

Tailor Your CV: Make sure your CV is tailored to the Site Reliability Engineer role. Highlight your experience with Kubernetes and observability stacks like Prometheus and Grafana. We want to see how your skills match what we're looking for!

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Share your passion for reliability engineering and how you’ve led incident responses in the past. Let us know why you’re excited about joining our team at Albatross.

Showcase Your Projects: If you've contributed to open-source projects or have personal projects that demonstrate your skills, include them! We love seeing practical examples of your work and how you approach problem-solving.

Apply Through Our Website: Don’t forget to apply through our website! It’s the best way for us to receive your application and ensures you’re considered for the role. We can’t wait to see what you bring to the table!

How to prepare for a job interview at Albatross

✨Know Your Tools Inside Out

Make sure you’re well-versed in the observability stack mentioned in the job description, like Prometheus, Grafana, and Jaeger. Be ready to discuss how you've used these tools in past roles, including specific examples of how they helped improve system reliability.

✨Showcase Your Incident Response Skills

Prepare to talk about your experience leading incident responses. Share specific instances where you established blameless postmortems or created runbooks. This will demonstrate your ability to handle high-pressure situations and improve team processes.

✨Communicate Clearly and Confidently

Since clear communication is key for this role, practice explaining complex technical concepts in simple terms. Think about how you would describe your past projects to someone without a technical background, as this will show your ability to bridge gaps between engineering and leadership.

✨Emphasise Your Pragmatic Approach

Be prepared to discuss your approach to tooling and alerting. Highlight how you’ve implemented solutions that teams trust and use regularly. This will show that you understand the importance of practical, effective systems in a fast-paced environment.
