Site Reliability Engineer

Full-Time, no home office possible
Sphere Digital Recruitment Group

At a Glance

  • Tasks: Enhance system reliability and performance for high-traffic digital services.
  • Company: Leading tech firm focused on innovative solutions and collaboration.
  • Benefits: Competitive daily rate, hybrid working, and opportunities for professional growth.
  • Why this job: Join a dynamic team to shape the future of digital reliability and scalability.
  • Qualifications: 7+ years in SRE or systems engineering with strong cloud platform experience.
  • Other info: Mentorship opportunities and a chance to work with cutting-edge technologies.

My client is seeking a Senior Site Reliability Engineer to join a centralized Technical Operations function and play a key role in improving the reliability, scalability, and operational performance of production systems across a range of large-scale, customer-facing digital services.

Operating within a centralized SRE model, you will lead reliability initiatives and partner with product and engineering teams to improve architecture, deployment safety, and observability, while sharing responsibility for production reliability, resilience, and scalability. The role includes participation in an on-call rotation supporting critical services, with shared ownership of overall system health.

You will be responsible for defining reliability standards, influencing architectural improvements, managing complex incidents, and building automation to improve deployment safety and operational efficiency. Your work will directly support high-traffic systems used by a global audience.

Key Responsibilities
  • Reliability & Risk Engineering: Identify systemic reliability risks and drive long-term preventative improvements. Define and refine SLIs, SLOs, and error budgets aligned with business and customer outcomes. Lead complex incident management, post-incident reviews, and remediation planning.
  • Architecture & Resilience: Review and influence system architecture to improve scalability, availability, and fault isolation. Design strategies for high availability, graceful degradation, and disaster recovery. Evaluate trade-offs between performance, cost, and operational risk.
  • CI/CD & Deployment Safety: Improve deployment pipelines and implement automation to reduce risk and accelerate delivery. Implement safe deployment strategies such as canary releases and blue/green deployments. Ensure strong rollback and recovery mechanisms.
  • Observability & Performance: Build and enhance observability solutions including metrics, logging, and tracing. Work with teams to reduce alert fatigue and improve signal quality. Diagnose performance bottlenecks across infrastructure and applications.
  • Infrastructure & Automation: Design and operate cloud-native, containerised workloads at scale. Use Infrastructure as Code to build and manage resilient platforms. Develop automation to reduce manual effort and operational risk.
  • Cross-Functional Leadership: Mentor engineers and promote SRE best practices across teams. Collaborate with engineering, product, and security stakeholders to improve system reliability.
Required Qualifications
  • A degree in Computer Science, Engineering, or equivalent practical experience.
  • Strong experience designing and operating CI/CD systems with deployment safety practices.
  • Excellent communication skills with the ability to influence cross-functional teams.
  • 7+ years of experience in SRE, production engineering, or systems engineering roles.
  • Strong knowledge of distributed systems concepts, including consistency and failure handling.
  • Hands-on experience with major cloud platforms (e.g., AWS, GCP, Azure), including multi-region environments.
  • Strong experience with Kubernetes and container orchestration at scale.
  • Proficiency in at least one programming language such as Go, Python, or Java.
  • Proven experience managing high-severity incidents and leading remediation efforts.
Preferred Qualifications
  • Experience with multi-region or multi-cloud architectures.
  • Familiarity with observability tools such as Prometheus, Grafana, or Datadog.
  • Previous mentoring or technical leadership experience.
  • Experience with Infrastructure as Code tools such as Terraform or CloudFormation.
  • Exposure to AI-assisted tooling for incident analysis or operational efficiency.

Site Reliability Engineer employer: Sphere Digital Recruitment Group

Join a forward-thinking company as a Senior Site Reliability Engineer, where you will play a pivotal role in enhancing the reliability and performance of high-traffic digital services. With a hybrid working model, competitive daily rates, and a culture that fosters collaboration and innovation, this is an excellent opportunity for professional growth and mentorship within a centralized Technical Operations team. The company prioritises employee development and offers a dynamic work environment that encourages the sharing of best practices and continuous improvement.

Contact Details:

Sphere Digital Recruitment Group Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Site Reliability Engineer role

✨Tip Number 1

Network with industry professionals! Attend meetups, webinars, or online forums related to Site Reliability Engineering. Engaging with others in the field can lead to job opportunities and valuable insights.

✨Tip Number 2

Showcase your skills through personal projects or contributions to open-source. This not only demonstrates your expertise but also gives you something tangible to discuss during interviews.

✨Tip Number 3

Prepare for technical interviews by practicing common SRE scenarios and problems. Use platforms like StudySmarter to brush up on your knowledge and get comfortable with potential questions.

✨Tip Number 4

Apply directly through our website! It’s a great way to ensure your application gets seen by the right people. Plus, it shows your enthusiasm for the role and the company.

We think you need these skills to ace the Site Reliability Engineer role

Site Reliability Engineering
AWS
CI/CD Systems
Incident Management
Observability Solutions
Kubernetes
Container Orchestration
Infrastructure as Code
Automation
Network Troubleshooting
Communication Skills
Cross-Functional Collaboration
Performance Diagnosis
Disaster Recovery Strategies
Mentoring

Some tips for your application 🫡

Tailor Your CV: Make sure your CV is tailored to the Site Reliability Engineer role. Highlight your experience with AWS, CI/CD systems, and any relevant projects that showcase your skills in reliability and automation.

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about SRE and how your background aligns with the responsibilities outlined in the job description. Be genuine and let your personality come through.

Showcase Your Technical Skills: Don’t forget to list your technical skills clearly. Mention your proficiency in programming languages like Go or Python, and your experience with Kubernetes and cloud platforms. This will help us see your fit for the role at a glance.

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it makes the process smoother for everyone involved!

How to prepare for a job interview at Sphere Digital Recruitment Group

✨Know Your Stuff

Make sure you brush up on your knowledge of distributed systems, CI/CD practices, and cloud platforms like AWS. Be ready to discuss your hands-on experience with Kubernetes and how you've tackled high-severity incidents in the past.

✨Showcase Your Problem-Solving Skills

Prepare to share specific examples of how you've identified reliability risks and implemented preventative measures. Think about times when you led incident management or post-incident reviews, and be ready to explain your thought process.

✨Communicate Effectively

Since this role involves cross-functional collaboration, practice articulating your ideas clearly. Be prepared to discuss how you've influenced teams in the past and how you can mentor others in SRE best practices.

✨Demonstrate Your Automation Know-How

Be ready to talk about your experience with Infrastructure as Code and automation tools. Highlight any projects where you've improved deployment safety or reduced manual effort, and explain the impact it had on operational efficiency.
