At a Glance
- Tasks: Own the reliability of our platform and lead incident response.
- Company: Join Albatross, a cutting-edge AI company transforming user experience.
- Benefits: Enjoy remote work, autonomy, and a supportive team culture.
- Why this job: Make a real impact on innovative technology in a dynamic environment.
- Qualifications: 5-7+ years in SRE or similar roles with strong Kubernetes experience.
- Other info: Contributions to open-source observability projects are a plus.
The predicted salary is between £36,000 and £60,000 per year.
Location: Remote; you must have the right to work and travel in Europe.
At Albatross, we are building the second pillar of AI: a perception layer that understands how users actually experience content, in real time. Trained on live user interactions, Albatross learns and reasons on the fly. Our technology powers real-time, in-session discovery by adapting to evolving user interests. We have raised significant funding and our platform already operates at scale, processing billions of events and serving hundreds of millions of predictions.
The Role
We are looking for a Site Reliability Engineer to own the reliability and observability of our platform. This is a hands-on leadership role where you will design, build, and maintain our observability stack, lead incident response, oversee releases, and establish the processes and standards that allow the team to ship quickly and confidently.
More specifically, your responsibilities will cover:
- Observability & Monitoring: Own and evolve our observability stack (Prometheus, Grafana, Loki, Jaeger), including dashboards, alerts, and SLOs. Instrument services for meaningful metrics and tracing, reducing noise and improving signal.
- Reliability & Incident Response: Lead incident response and establish blameless postmortems, runbooks, and automated remediation. Define, track, and improve SLIs/SLOs to proactively reduce reliability risk.
- Release Management: Own the release process end-to-end, improving deployment speed, safety, and recovery. Implement progressive rollouts, feature flags, and rollback strategies.
- Platform & Tooling: Embed observability into the development lifecycle in close collaboration with engineering. Maintain and evolve our Kubernetes-based platform, adopting new tools when they add real value.
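For candidates less familiar with the SLI/SLO vocabulary used above, here is a minimal sketch of the arithmetic behind an availability SLO and its error budget. It is an illustration only, not a description of how Albatross measures reliability: the `SloWindow` type, the function names, and the 99.9% target are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts observed over one SLO window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def availability_sli(window: SloWindow) -> float:
    """Fraction of requests that succeeded; treated as 1.0 when there was no traffic."""
    if window.total_requests == 0:
        return 1.0
    return 1 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Share of the error budget still unspent.

    The budget is the number of failures the SLO tolerates in the window.
    The result goes negative once the SLO has been breached.
    """
    allowed_failures = (1 - slo_target) * window.total_requests
    if allowed_failures == 0:
        # No tolerated failures (100% target or no traffic): any failure breaches.
        return 1.0 if window.failed_requests == 0 else float("-inf")
    return 1 - window.failed_requests / allowed_failures


# Example: 250 failures out of 1,000,000 requests against an assumed 99.9% target.
window = SloWindow(total_requests=1_000_000, failed_requests=250)
sli = availability_sli(window)
remaining = error_budget_remaining(window, slo_target=0.999)
print(f"SLI: {sli:.5f}, error budget remaining: {remaining:.0%}")
```

In this example the target tolerates roughly 1,000 failed requests, so 250 failures leave about three quarters of the budget unspent. Tracking the remaining budget, rather than raw error counts, is one common way to decide when to slow releases down.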
Requirements
- 5–7+ years in SRE, platform engineering, DevOps, or similar roles.
- Strong production experience with Kubernetes and modern observability stacks (Prometheus, Grafana, Loki, Jaeger/OpenTelemetry).
- Proven track record of leading incident response and building monitoring systems that teams actually use.
- Deep distributed systems knowledge and production debugging experience.
- Pragmatic approach to tooling and alerting that teams trust.
- Clear communicator across engineering, product, and leadership.
- STEM degree (Computer Science, Engineering, Mathematics, or similar).
- Plus: contributions to open-source observability projects and background in high-scale or high-availability environments.
Benefits
- Remote-first, async-friendly culture.
- Ownership and autonomy: you will shape how we do reliability.
- A team that cares about building things right.
Site Reliability Engineer in London employer: Albatross
Contact Detail:
Albatross Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Site Reliability Engineer role in London
✨Tip Number 1
Network like a pro! Reach out to folks in the industry on LinkedIn or at meetups. A friendly chat can lead to opportunities that aren’t even advertised yet.
✨Tip Number 2
Show off your skills! Create a portfolio or GitHub repo showcasing your projects, especially those related to observability and reliability. This gives potential employers a taste of what you can do.
✨Tip Number 3
Prepare for interviews by brushing up on your incident response strategies and monitoring tools. Be ready to discuss real-life scenarios where you’ve made a difference in reliability.
✨Tip Number 4
Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are proactive!
Some tips for your application 🫡
Tailor Your CV: Make sure your CV reflects the skills and experiences that align with the Site Reliability Engineer role. Highlight your experience with Kubernetes and observability stacks like Prometheus and Grafana, as these are key to what we do at Albatross.
Craft a Compelling Cover Letter: Use your cover letter to tell us why you're passionate about reliability and observability. Share specific examples of how you've led incident responses or improved monitoring systems in your previous roles. We love hearing your story!
Showcase Your Technical Skills: Don’t shy away from getting technical! Include details about your hands-on experience with tools and technologies relevant to the role. This is your chance to show us you know your stuff and can hit the ground running.
Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it shows us you’re keen on joining our team!
How to prepare for a job interview at Albatross
✨Know Your Tools Inside Out
Make sure you’re well-versed in the observability stack mentioned in the job description, like Prometheus, Grafana, and Jaeger. Be ready to discuss how you've used these tools in past roles, including specific examples of dashboards you've created or metrics you've tracked.
✨Showcase Your Incident Response Skills
Prepare to talk about your experience leading incident responses. Share a story where you established a blameless postmortem or improved SLIs/SLOs. This will demonstrate your ability to handle pressure and improve reliability.
✨Communicate Clearly and Confidently
Since clear communication is key for this role, practice explaining complex technical concepts in simple terms. You might be asked to explain your approach to embedding observability into the development lifecycle, so make sure you can articulate your thoughts clearly.
✨Emphasise Your Pragmatic Approach
Be prepared to discuss your pragmatic approach to tooling and alerting. Share examples of how you’ve implemented solutions that teams trust and use regularly. This shows that you understand the balance between functionality and usability.