Senior Site Reliability Engineer in City of London

City of London, Full-Time, £70,000 - £90,000 / year (est.), no home office possible
Realm

At a Glance

  • Tasks: Ensure reliability and performance of a global compute platform while collaborating with cross-functional teams.
  • Company: High-growth infrastructure company focused on advanced machine learning workloads.
  • Benefits: Competitive salary, equity package, health coverage, and generous paid time off.
  • Other info: Dynamic role with opportunities for growth and hands-on engineering challenges.
  • Why this job: Join a fast-paced environment and make a real impact on cutting-edge technology.
  • Qualifications: 5+ years in site reliability engineering or DevOps, strong communication skills, and systems expertise.

The predicted salary is between £70,000 and £90,000 per year.

High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality. Culture centred on pragmatic problem-solving, cross-functional collaboration, and full lifecycle responsibility.

Role Overview: Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform. Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads. Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.

Responsibilities:

  • Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements.
  • Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors.
  • Execution of large-scale data migrations between cloud and on-premise environments with focus on efficiency and cost.
  • Troubleshooting across the full stack, including hardware, networking, and distributed systems.
  • Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency.
  • Required participation in an on-call rotation (approximately one week per month).

Key Attributes:

  • Strong ownership mindset with focus on delivery and accountability.
  • Experience building maintainable, well-documented systems in complex environments.
  • Ability to operate effectively in ambiguous and rapidly evolving contexts.
  • Clear and effective communication skills with collaborative, low-ego approach.

Minimum Requirements:

  • 5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing.
  • Strong written and verbal communication skills in English.
  • Experience deploying and operating container orchestration or workload scheduling systems (e.g. Kubernetes or similar).
  • Programming or scripting experience in Go, Python, or Bash.
  • Familiarity with infrastructure automation and infrastructure-as-code tools.
  • Strong technical foundation in computing or related discipline.

Preferred Experience:

  • Experience operating large-scale machine learning or AI-compute workloads.
  • Background in multi-tenant distributed systems at scale.
  • Hands-on experience with data centre or bare-metal infrastructure.
  • Knowledge of high-performance networking technologies.
  • Experience managing large-scale storage systems (commercial or open-source).

Compensation & Benefits:

  • Competitive salary and equity package.
  • Retirement or pension contributions aligned with local standards.
  • Health coverage including medical, dental, and vision.
  • Generous paid time off policy.

Senior Site Reliability Engineer in City of London employer: Realm

Join a high-growth infrastructure company that prioritises innovation and collaboration, offering a dynamic work environment where your contributions directly impact the success of advanced machine learning workloads. With a strong focus on employee development, competitive compensation, and a culture that values ownership and pragmatic problem-solving, this role provides an excellent opportunity for growth in a fast-paced setting. Enjoy generous benefits including health coverage and a robust paid time off policy, all while working alongside leading research and industry teams.

Contact Details:

Realm Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Senior Site Reliability Engineer role in City of London

✨Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with potential colleagues on LinkedIn. We all know that sometimes it’s not just what you know, but who you know that can help you land that dream job.

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to site reliability engineering or high-performance computing. We want to see your hands-on experience and how you tackle real-world problems.

✨Tip Number 3

Prepare for the interview like it’s a big game day! Research the company, understand their tech stack, and be ready to discuss how your experience aligns with their needs. We love candidates who can demonstrate their knowledge and passion for our work.

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we’re always on the lookout for talent that fits our culture of ownership and collaboration.

We think you need these skills to ace the Senior Site Reliability Engineer role in City of London

Site Reliability Engineering
DevOps
Systems Administration
High-Performance Computing
Container Orchestration (e.g. Kubernetes)
Programming in Go, Python, or Bash
Infrastructure Automation
Infrastructure-as-Code Tools
Troubleshooting Distributed Systems
Data Centre Management
High-Performance Networking Technologies
Large-Scale Data Migrations
Collaboration and Communication Skills

Some tips for your application 🫡

Tailor Your CV: Make sure your CV reflects the skills and experiences that match the Senior Site Reliability Engineer role. Highlight your experience with large-scale compute clusters, automation tooling, and any relevant programming languages like Go or Python.

Craft a Compelling Cover Letter: Use your cover letter to tell us why you're passionate about site reliability engineering and how your background aligns with our fast-paced environment. Share specific examples of your problem-solving skills and collaborative projects.

Showcase Your Technical Skills: Don’t shy away from detailing your technical expertise! Mention your experience with container orchestration systems like Kubernetes, and any hands-on work with data centre infrastructure. We love seeing your technical chops in action!

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it shows us you’re keen on joining our team!

How to prepare for a job interview at Realm

✨Know Your Tech Inside Out

Make sure you brush up on your technical skills, especially around site reliability engineering and high-performance computing. Be ready to discuss your experience with container orchestration systems like Kubernetes, and have examples of how you've tackled complex production issues.

✨Showcase Your Problem-Solving Skills

Prepare to share specific instances where you've demonstrated pragmatic problem-solving in a fast-paced environment. Highlight your ownership mindset and how you've taken accountability for delivering results, especially in ambiguous situations.

✨Communicate Clearly and Collaboratively

Practice articulating your thoughts clearly, as effective communication is key in this role. Be ready to discuss how you've collaborated with cross-functional teams, and showcase your low-ego approach to working with others.

✨Demonstrate Your Automation Expertise

Since the role involves developing internal tooling and automation, come prepared with examples of how you've improved deployment speed and operational efficiency in previous roles. Discuss your experience with infrastructure-as-code tools and any scripting you've done in Go, Python, or Bash.

