At a Glance
- Tasks: Ensure reliability and performance of a global compute platform while resolving complex production issues.
- Company: Join a high-growth infrastructure company at the forefront of machine learning workloads.
- Benefits: Enjoy competitive salary, equity, health coverage, and generous paid time off.
- Other info: Collaborate closely with networking, platform engineering, and physical infrastructure teams to design and operate high-demand computational systems.
- Why this job: Make a real impact in a fast-paced environment with cutting-edge technology.
- Qualifications: 5+ years in site reliability engineering or DevOps, with strong systems expertise.
The predicted salary is between £70,000 and £90,000 per year.
High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality.
Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform. Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads.
Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.
Responsibilities:
- Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements.
- Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors.
- Execution of large-scale data migrations between cloud and on-premise environments with focus on efficiency and cost.
- Troubleshooting across the full stack, including hardware, networking, and distributed systems.
- Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency.
Qualifications:
- Experience building maintainable, well-documented systems in complex environments.
- 5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing.
- Strong written and verbal communication skills in English.
- Programming or scripting experience in Go, Python, or Bash.
- Strong technical foundation in computing or related discipline.
- Experience operating large-scale machine learning or AI-compute workloads.
- Hands-on experience with data centre or bare-metal infrastructure.
- Knowledge of high-performance networking technologies.
- Experience managing large-scale storage systems (commercial or open-source).
Compensation & Benefits:
- Competitive salary and equity package.
- Retirement or pension contributions aligned with local standards.
- Health coverage including medical, dental, and vision.
- Generous paid time off policy.
Senior Site Reliability Engineer - DevOps (Remote) in City of London
Employer: Realm
Contact Details:
Realm Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Senior Site Reliability Engineer - DevOps (Remote) role in City of London
✨Tip Number 1
Network like a pro! Reach out to folks in the industry, attend meetups, and connect with potential colleagues on LinkedIn. We all know that sometimes it’s not just what you know, but who you know that can land you that dream job.
✨Tip Number 2
Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to site reliability engineering or DevOps. We want to see your hands-on experience and how you tackle real-world problems.
✨Tip Number 3
Prepare for the interview like it’s a high-stakes game! Research the company, understand their tech stack, and be ready to discuss how your experience aligns with their needs. We’re looking for candidates who can hit the ground running!
✨Tip Number 4
Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are proactive about their job search!
Some tips for your application 🫡
Tailor Your CV: Make sure your CV reflects the skills and experiences that match the job description. Highlight your expertise in site reliability engineering, DevOps, and any relevant programming languages like Go or Python. We want to see how you can contribute to our fast-paced environment!
Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about high-performance computing and how your background aligns with our mission. Be sure to mention any hands-on experience you've had with large-scale compute clusters or data migrations.
Showcase Your Problem-Solving Skills: In your application, don’t shy away from sharing examples of complex production issues you've resolved or how you've improved system resilience. We love seeing candidates who can demonstrate ownership and execution speed in their previous roles!
Apply Through Our Website: We encourage you to apply directly through our website for the best chance of getting noticed. It’s super easy, and you’ll be able to keep track of your application status. Plus, we’re excited to see what you bring to the table!
How to prepare for a job interview at Realm
✨Know Your Tech Inside Out
Make sure you brush up on your technical skills, especially in areas like systems administration, high-performance computing, and the programming languages mentioned in the job description. Be ready to discuss your hands-on experience with large-scale compute clusters and any complex production issues you've resolved.
✨Showcase Your Problem-Solving Skills
Prepare examples of how you've tackled challenging problems in previous roles. Think about specific instances where you improved system resilience or enhanced platform observability. This will demonstrate your ability to think critically and act decisively in a fast-paced environment.
✨Collaboration is Key
Since this role involves close collaboration with various teams, be prepared to discuss your experience working with networking, platform engineering, and physical infrastructure teams. Highlight any successful projects where teamwork played a crucial role in achieving results.
✨Ask Insightful Questions
At the end of the interview, don’t forget to ask questions that show your interest in the company and the role. Inquire about their current challenges with large-scale data migrations or how they ensure the reliability of their globally distributed compute platform. This shows you're not just interested in the job, but also in contributing to their success.