At a Glance
- Tasks: Ensure reliability and performance of a global compute platform while resolving complex production issues.
- Company: Join a high-growth infrastructure company at the forefront of machine learning workloads.
- Benefits: Enjoy competitive salary, equity, health coverage, and generous paid time off.
- Other info: Collaborate closely with teams to design and operate high-demand computational systems.
- Why this job: Make a real impact in a fast-paced environment with cutting-edge technology.
- Qualifications: 5+ years in site reliability engineering or DevOps, with strong systems expertise.
The predicted salary is between €70,000 and €90,000 per year.
High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality.
Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform. Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads.
Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.
- Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements.
- Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors.
- Execution of large-scale data migrations between cloud and on-premise environments with focus on efficiency and cost.
- Troubleshooting across the full stack, including hardware, networking, and distributed systems.
- Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency.
- Experience building maintainable, well-documented systems in complex environments.
- 5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing.
- Strong written and verbal communication skills in English.
- Programming or scripting experience in Go, Python, or Bash.
- Strong technical foundation in computing or related discipline.
- Experience operating large-scale machine learning or AI-compute workloads.
- Hands-on experience with data centre or bare-metal infrastructure.
- Knowledge of high-performance networking technologies.
- Experience managing large-scale storage systems (commercial or open-source).
Compensation & Benefits:
- Competitive salary and equity package.
- Retirement or pension contributions aligned with local standards.
- Health coverage including medical, dental, and vision.
- Generous paid time off policy.
Senior Site Reliability Engineer - DevOps (Remote) in London
Employer: Realm
Join a high-growth infrastructure company that prioritises innovation and collaboration, offering a dynamic work culture where your contributions directly impact the success of advanced machine learning workloads. With competitive salaries, generous benefits including health coverage and retirement contributions, and ample opportunities for professional growth, this remote role as a Senior Site Reliability Engineer allows you to thrive in a fast-paced environment while working alongside top-tier teams in the industry.
StudySmarter Expert Advice🤫
We think this is how you could land Senior Site Reliability Engineer - DevOps (Remote) in London
✨Tip Number 1
Network like a pro! Reach out to folks in the industry, attend meetups, and connect with potential colleagues on LinkedIn. We all know that sometimes it’s not just what you know, but who you know that can help you land that dream job.
✨Tip Number 2
Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to site reliability engineering or DevOps. We want to see your hands-on experience and how you tackle real-world problems.
✨Tip Number 3
Prepare for technical interviews by brushing up on your systems expertise and troubleshooting skills. We recommend practicing common scenarios you might face in a high-performance computing environment. The more prepared you are, the more confident you'll feel!
✨Tip Number 4
Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who take the initiative to engage directly with us.
Some tips for your application 🫡
Tailor Your CV: Make sure your CV reflects the skills and experiences that match the job description. Highlight your experience in site reliability engineering, DevOps, and any relevant programming languages like Go or Python. We want to see how you can contribute to our fast-paced environment!
Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about high-performance computing and how your background aligns with our mission. Keep it concise but engaging – we love a good story!
Showcase Your Projects: If you've worked on any large-scale compute projects or have experience with automation tooling, make sure to mention them. We’re interested in seeing how you’ve tackled complex production issues and improved system resilience in your previous roles.
Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you don’t miss out on any important updates. Plus, we love seeing applications come in through our own platform!
How to prepare for a job interview at Realm
✨Know Your Tech Inside Out
Make sure you brush up on your technical skills, especially in Go, Python, or Bash. Be ready to discuss your experience with large-scale compute clusters and how you've tackled complex production issues in the past.
✨Showcase Your Problem-Solving Skills
Prepare examples of how you've resolved tricky problems in high-performance computing environments. Highlight your hands-on experience with troubleshooting across hardware, networking, and distributed systems.
✨Demonstrate Collaboration
Since this role involves close collaboration with various teams, be ready to share instances where you've worked effectively with others. Discuss how you’ve partnered with networking, platform engineering, or infrastructure teams to achieve common goals.
✨Ask Insightful Questions
Prepare thoughtful questions about the company's approach to system resilience and observability. This shows your genuine interest in the role and helps you understand how you can contribute to their fast-paced environment.