Platform Site Reliability Engineer in London

London | Full-Time | £36,000 - £60,000 / year (est.) | Home office (partial)
ORI

At a Glance

  • Tasks: Manage and optimise Kubernetes clusters for AI workloads in a dynamic environment.
  • Company: Join Ori Industries, a leader in AI infrastructure innovation.
  • Benefits: Enjoy 30 days annual leave, private medical insurance, and a supportive work culture.
  • Other info: Embrace a culture of learning, open communication, and mentorship.
  • Why this job: Make a real impact in the AI era while developing your skills.
  • Qualifications: 5+ years in SRE roles with strong Kubernetes and Linux expertise.

The predicted salary is between £36,000 and £60,000 per year.

About: Ori Industries is at the forefront of AI infrastructure, revolutionising the connection between software and hardware for the AI era. Our mission is to empower AI teams with scalable, secure, and efficient infrastructure solutions that support seamless model training, deployment, and scaling.

Role Responsibilities:

  • Deploy and manage Kubernetes clusters at scale to support AI-centric workloads, both on our bare-metal clusters and on trusted partner infrastructure.
  • Develop Kubernetes Manifests and Operators: Facilitate application deployments and maintain Kubernetes-native services for networking, storage, security, identity and infrastructure management.
  • Optimise Linux system configuration, including kernel, driver, filesystem and services, to support workloads running via our orchestration layer.
  • Build and maintain automation scripts and infrastructure as code to support the platform lifecycle, simplify troubleshooting during incident resolution, and provide tooling for our support organisation.
  • Apply ITSM frameworks: Incident, Major Incident, Change Management, and service improvement.
  • Maintain and enhance ORI's observability stack: Prometheus, Grafana, and custom monitoring integrations.
  • Operate and support services in 24x7 production environments, including on-call rotation.
  • Contribute to incident postmortems and root cause analysis, document learnings, and automate remediations.
  • Mentor junior engineers and act as an operational requirements consultant to other departments.
  • Communicate technical decisions clearly to non-technical stakeholders and customers.
  • Uphold a culture of: do, document, automate.
  • Be willing to cross-train with Platform Engineering/Platform SRE to fully support both our infrastructure and platform stacks.
  • Be willing to cross-train with HPC Engineering, supported by NVIDIA, to enhance our HPC supportability offering.

Requirements:

  • 5+ years of proven experience in an SRE or equivalent role, in globally scaled, performance-intensive environments operating to a 24/7 support model.
  • 3+ years of experience running, deploying and optimising orchestration platforms, with a strong emphasis on Kubernetes.
  • Expert-level Linux administration, especially Ubuntu distributions.
  • Proficiency in system tuning, disk I/O optimization, and hardware-level performance tweaks.
  • Strong networking fundamentals: TCP/IP, DNS, DHCP, VLANs, routing, switching.
  • Strong experience with API interrogation.
  • Strong experience with infrastructure scripting and automation (Bash, Python, Ansible).
  • Deep understanding of observability principles and tools (Prometheus, Grafana preferred).
  • Strong grasp of ITSM and service operation best practices.
  • Excellent communication and mentorship skills.
  • Comfortable interfacing with internal stakeholders and external customers.

Bonus:

  • Knowledge of running AI workloads via orchestration platforms.

Bonus Requirements:

  • Bachelor's or Master's degree in Computer Science, Engineering or a related field, or equivalent experience.
  • LPIC Certifications.
  • ITIL Foundation level qualification or equivalent experience.
  • Certified Kubernetes Administrator (CKA).

Qualities we look for:

  • You approach problems with a systems mindset - balancing practical execution with long-term scalability.
  • You elevate the team, setting high standards for technical quality and engineering excellence.
  • You hold yourself and others accountable - giving direct feedback and expecting the same.
  • You take initiative, owning challenges end-to-end and proactively driving solutions.
  • You invest in others, mentoring to build both capability and confidence.

Why should you join us?

What sets us apart is our blend of modern technology, competitive benefits, and an open, welcoming work culture that enables our people to thrive.

Here are just some of the great things you can expect from us:

  • 30 days of annual leave: we value your peace of mind. With 30 days off (excluding public holidays) and access to mental health resources, we make sure you're as strong mentally as you are professionally.
  • A culture that emphasises results over hierarchy, process & ego: we place great emphasis on the quality, ingenuity and creativity of work.
  • Open communication, regular feedback: we value smooth collaboration, direct and actionable feedback, and believe that leading with empathy and a growth mindset makes us better together.
  • Learning Time: we all have dedicated learning time to focus on new skills, projects or interests that lie outside our day-to-day jobs.
  • Health & Wellbeing: we want everyone to feel healthy and happy, so we offer private medical insurance via Bupa.
  • Cycle to Work Scheme: we're committed to building a sustainable business, so we encourage cycling to work.
  • Gympass subscription to a variety of gyms and wellbeing apps.
  • Participation in the company shares program.
  • Enhanced parental pay & leave.

Diversity, Equity, Inclusion and Belonging: We are an equal opportunity employer and we strive to reduce unconscious bias throughout our hiring process. All applicants will be considered for employment without attention to ethnicity, religion, sexual orientation, gender identity, family or parental status, national origin, veteran, neurodiversity status or disability status. To ensure our recruitment processes provide an equal opportunity for all applicants to succeed, we encourage you to let us know if there are any adjustments that we can make.

Platform Site Reliability Engineer in London employer: ORI

At Ori Industries, we pride ourselves on being a leading employer in the AI infrastructure sector, offering a dynamic work environment that fosters innovation and collaboration. Our commitment to employee well-being is reflected in our generous benefits package, including 30 days of annual leave, private medical insurance, and a culture that prioritises open communication and continuous learning. With ample opportunities for professional growth and a focus on diversity and inclusion, Ori Industries is the ideal place for talented individuals looking to make a meaningful impact in the tech industry.

Contact Detail:

ORI Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land Platform Site Reliability Engineer in London

✨Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have the inside scoop on job openings or can refer you directly.

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to Kubernetes and AI infrastructure. This gives potential employers a taste of what you can do.

✨Tip Number 3

Prepare for interviews by practising common SRE questions and scenarios. Think about how you’d handle incidents or optimise systems. The more you rehearse, the more confident you’ll feel when it’s showtime!

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen. Plus, we love seeing candidates who are genuinely interested in joining our team at Ori Industries.

We think you need these skills to ace Platform Site Reliability Engineer in London

Kubernetes Management
Linux Administration
System Tuning
Disk I/O Optimization
Networking Fundamentals
API Interrogation
Infrastructure Scripting
Automation (Bash, Python, Ansible)
Observability Tools (Prometheus, Grafana)
ITSM Frameworks
Incident Management
Communication Skills
Mentorship
Problem-Solving Skills
Cross-Training Willingness

Some tips for your application 🫡

Tailor Your Application: Make sure to customise your CV and cover letter for the Platform Site Reliability Engineer role. Highlight your experience with Kubernetes, Linux administration, and any relevant projects that showcase your skills in AI infrastructure.

Showcase Your Technical Skills: Don’t hold back on detailing your technical expertise! Include specific examples of how you've optimised orchestration platforms or improved system performance. This is your chance to shine, so let us see what you can do!

Communicate Clearly: Remember, we value clear communication! When writing your application, make sure to explain your technical decisions in a way that’s easy to understand. This will show us that you can bridge the gap between tech and non-tech stakeholders.

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way to ensure your application gets into the right hands. Plus, it shows us you’re keen on joining our team at Ori Industries!

How to prepare for a job interview at ORI

✨Know Your Kubernetes Inside Out

Make sure you brush up on your Kubernetes knowledge before the interview. Be ready to discuss your experience with deploying and managing clusters, as well as developing manifests and operators. Prepare examples of how you've optimised Kubernetes for AI workloads.

✨Show Off Your Linux Skills

Since expert-level Linux administration is crucial for this role, be prepared to talk about your experience with Ubuntu distributions. Highlight any system tuning or performance optimisation you've done, and be ready to answer technical questions that test your Linux knowledge.

✨Demonstrate Your Automation Expertise

Automation is key in this role, so come armed with examples of scripts or infrastructure as code you've built. Discuss your experience with Bash, Python, or Ansible, and how these tools have helped streamline processes or troubleshoot issues in your previous roles.

✨Communicate Clearly and Confidently

You'll need to communicate technical decisions to non-technical stakeholders, so practice explaining complex concepts in simple terms. Think of scenarios where you've had to mentor junior engineers or collaborate with other departments, and be ready to share those experiences.
