Staff Site Reliability Engineer - Site Experience

Full-Time | £80,000 - £100,000 / year (est.) | Home office (partial)

At a Glance

Tasks: Lead reliability engineering for user experience and drive operational excellence.
Company: Join Reddit, a vibrant community platform with millions of daily users.
Benefits: Enjoy flexible vacation, health benefits, and professional development opportunities.
Other info: Mentorship opportunities and a dynamic work environment await you.
Why this job: Shape the future of reliability engineering on one of the internet's largest platforms.
Qualifications: 8+ years in Site Reliability Engineering and strong programming skills required.

The predicted salary is between £80,000 and £100,000 per year.

Reddit is a community of communities built on shared interests, passion, and trust. It is home to the most open and authentic conversations on the internet. As Reddit continues to scale globally, reliability and performance are more critical than ever. The Site Experience SRE team sits at the intersection of infrastructure, product engineering, and user experience, ensuring that every interaction across web, mobile, APIs, feeds, media delivery, and real-time systems is fast, reliable, and resilient.

We are looking for a Staff Site Reliability Engineer to lead reliability engineering initiatives for critical user-facing systems at internet scale. In this role, you will partner closely with product and infrastructure teams to improve availability, latency, scalability, and operational excellence across Reddit’s most business-critical experiences. This is a highly technical leadership role for someone who thrives in large-scale distributed systems, enjoys solving complex reliability challenges, and can influence engineering culture across the organization.

What you’ll do:

Lead Reliability Engineering for User Experience: Drive reliability, scalability, and operational excellence for critical user-facing systems and services. Improve performance and resiliency across APIs, content delivery, feed generation, search, messaging, and real-time experiences.
Architect for Scale: Partner with product and infrastructure engineering teams to design systems that remain highly available and performant under massive global load. Guide architectural decisions around failover, redundancy, graceful degradation, traffic management, and capacity planning.
Reduce Operational Risk: Identify systemic risks and reliability bottlenecks across services, dependencies, deployments, and infrastructure. Build proactive mitigation strategies and drive engineering improvements that reduce incidents and improve service health.
Drive Automation: Eliminate repetitive operational work through automation and tooling. Build systems that improve deployment safety, incident response, remediation workflows, and reliability guardrails.
Incident Management: Lead complex incident response efforts across engineering teams. Drive blameless postmortems, identify root causes, and ensure sustainable long-term fixes are implemented.
Influence Engineering Standards: Define and champion best practices around reliability engineering, SLIs/SLOs, capacity management, release engineering, and operational maturity across the company.
Mentor and Multiply Impact: Provide technical leadership and mentorship to engineers across SRE and software engineering teams. Help shape reliability culture and raise the operational excellence bar across the organization.

What We’re Looking For:

8+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large-scale distributed systems.
Strong collaboration and communication skills with the ability to influence technical direction across teams.
Strong experience supporting high-traffic, user-facing production environments.
Deep understanding of one or more of the following: distributed systems, networking, Linux systems, cloud-native architectures.
Experience designing highly available systems with strong operational and reliability practices.
Strong programming skills in languages such as Go, Python, or similar.
Strong understanding of observability systems including metrics, logging, tracing, and alerting.
Experience improving reliability through SLOs, automation, incident management, and performance optimization.
Demonstrated ability to troubleshoot complex issues across applications, infrastructure, networking, and services.

Nice to Have:

Experience operating systems at internet-scale traffic volumes.
Experience with Kubernetes, containers, cloud infrastructure, and modern deployment platforms.
Familiarity with technologies such as Prometheus, Grafana, OpenTelemetry, Envoy, Kafka, ClickHouse, Cassandra, Redis, or similar distributed infrastructure technologies.
Experience with CDN optimization, edge reliability, traffic engineering, or global infrastructure.
Contributions to open source software or participation in technical communities.
Experience leading large-scale incident response and operational transformation initiatives.

Why Join Reddit? You’ll help shape the reliability and performance of one of the internet’s largest platforms, influencing experiences used by millions of people every day. This is an opportunity to solve deeply complex engineering problems at massive scale while helping define the future of reliability engineering for a modern consumer platform.

Benefits:

Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support.
Family Planning Support.
Gender-Affirming Care.
Mental Health & Coaching Benefits.
Group Personal Pension Scheme with Employer match.
Private Medical and Dental Scheme.
Income Replacement Programs.
Bike to Work scheme.
Flexible Vacation & Paid Volunteer Time Off.
Generous Paid Parental Leave.

Reddit is proud to be an equal opportunity employer, and is committed to building a workforce representative of the diverse communities we serve. Reddit is committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures.

Staff Site Reliability Engineer - Site Experience employer: Reddit

Reddit is an exceptional employer that fosters a culture of collaboration and innovation, making it an ideal place for a Staff Site Reliability Engineer to thrive. With a commitment to employee growth through mentorship and professional development, along with comprehensive benefits like flexible vacation and family planning support, Reddit ensures that its team members are well-supported both personally and professionally. Working at Reddit means being part of a dynamic environment where you can influence the reliability and performance of one of the internet's largest platforms, all while enjoying a healthy work-life balance in a diverse and inclusive community.

Contact Details:

Reddit Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land Staff Site Reliability Engineer - Site Experience

✨Tip Number 1

Network like a pro! Reach out to folks in your industry on LinkedIn or Reddit itself. Join relevant communities and engage in discussions. You never know who might have the inside scoop on job openings!

✨Tip Number 2

Prepare for those interviews! Research common SRE interview questions and practice your answers. Make sure you can talk about your experience with distributed systems and incident management confidently.

✨Tip Number 3

Show off your skills! If you’ve worked on any cool projects, consider sharing them on GitHub or even writing a blog post. This not only showcases your expertise but also gives potential employers a glimpse of what you can bring to the table.

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re genuinely interested in joining the Reddit team!

We think you need these skills to ace Staff Site Reliability Engineer - Site Experience

Site Reliability Engineering

Infrastructure Engineering

Distributed Systems

Networking

Linux Systems

Cloud Native Architectures

Programming in Go or Python

Observability Systems

Metrics, Logging, Tracing, and Alerting

SLOs and Automation

Incident Management

Performance Optimisation

Kubernetes

Cloud Infrastructure

Traffic Engineering

Some tips for your application 🫡

Tailor Your Application: Make sure to customise your CV and cover letter for the Staff Site Reliability Engineer role. Highlight your experience with large-scale distributed systems and any relevant projects that showcase your skills in reliability engineering.

Showcase Your Technical Skills: Don’t hold back on your technical prowess! Mention your programming skills, especially in languages like Go or Python, and any experience you have with observability systems. We want to see how you can contribute to our mission of improving performance and resiliency.

Be Clear and Concise: When writing your application, keep it straightforward. Use clear language and avoid jargon where possible. We appreciate a well-structured application that gets straight to the point about your qualifications and experiences.

Apply Through Our Website: We encourage you to submit your application through our website. It’s the best way to ensure your application gets into the right hands. Plus, it shows us you’re serious about joining our team!

How to prepare for a job interview at Reddit

✨Know Your Stuff

Make sure you brush up on your knowledge of distributed systems, cloud architectures, and the specific technologies mentioned in the job description. Reddit is looking for someone who can hit the ground running, so being well-versed in tools like Kubernetes, Prometheus, and Grafana will definitely give you an edge.

✨Show Your Problem-Solving Skills

Prepare to discuss complex reliability challenges you've faced in the past. Think about specific incidents where you led a response or implemented a solution that improved system performance. This will showcase your ability to troubleshoot and think critically under pressure.

✨Communicate Clearly

Since collaboration is key in this role, practice articulating your thoughts clearly and concisely. Be ready to explain technical concepts in a way that non-technical team members can understand. This will demonstrate your strong communication skills and ability to influence others.

✨Cultural Fit Matters

Reddit values community and trust, so be prepared to discuss how you align with their culture. Share examples of how you've contributed to team dynamics or mentored others in your previous roles. This will help show that you're not just a technical fit, but also a cultural one.
