Platform Site Reliability Engineer

Full-time | £70,000 - £90,000 per year (est.) | Home office (partial)
Radiant

At a Glance

  • Tasks: Manage and optimise Kubernetes clusters for cutting-edge AI workloads.
  • Company: Join Radiant, a leader in AI-native cloud infrastructure.
  • Benefits: 30 days annual leave, private medical insurance, and a supportive work culture.
  • Other info: Mentorship opportunities and a focus on personal growth.
  • Why this job: Be part of a team redefining AI infrastructure with innovative technology.
  • Qualifications: 5+ years in SRE roles and expert-level Linux skills required.

The predicted salary is between £70,000 and £90,000 per year.

About Radiant

Radiant is redefining how AI infrastructure is built. We design and operate AI-native cloud platforms engineered for sovereignty, performance, and scale. Our infrastructure powers GPU-native workloads, multi-tenant control planes, and high-performance AI systems designed for the most demanding environments. We are not building a generic cloud. We are building purpose-built AI infrastructure - from powered land, to compute, to software.

As we scale our platform and expand our engineering organisation, we are looking for leaders who can build strong teams, uphold high standards, and deliver reliably at pace.

Role Responsibilities

  • Deploy and manage Kubernetes clusters: operate clusters at scale to support AI-centric workloads, both on our bare-metal infrastructure and on trusted partner infrastructure.
  • Develop Kubernetes manifests and operators: facilitate application deployments and maintain Kubernetes-native services for networking, storage, security, identity and infrastructure management.
  • Optimise Linux system configuration, including the kernel, drivers, filesystems and services, to support workloads running on our orchestration layer.
  • Build and maintain automation scripts and infrastructure as code to support the platform lifecycle, simplify troubleshooting during incident resolution, and provide tooling for our support organisation.
  • Apply ITSM frameworks: Incident, Major Incident, Change Management, and service improvement.
  • Maintain and enhance Radiant’s observability stack: Prometheus, Grafana, and custom monitoring integrations.
  • Operate and support services in 24/7 production environments, including participation in an on-call rotation.
  • Contribute to incident postmortems and root cause analyses, document learnings, and automate remediations.
  • Mentor junior engineers and act as a consultant on operational requirements to other departments.
  • Communicate technical decisions clearly to non-technical stakeholders and customers.
  • Uphold a culture of "do, document, automate".
  • Willingness to cross-train with Platform Engineering/Platform SRE to fully support both our infrastructure and platform stacks.
  • Willingness to cross-train with HPC Engineering, supported by NVIDIA, to enhance our HPC supportability offering.

Requirements

  • 5+ years of proven experience in an SRE or equivalent role, in globally scaled, performance-intensive environments operating a 24/7 support model.
  • 3+ years of experience running, deploying and optimising orchestration platforms, with a strong emphasis on Kubernetes.
  • Expert-level Linux administration, especially Ubuntu distributions.
  • Proficiency in system tuning, disk I/O optimisation, and hardware-level performance tweaks.
  • Strong networking fundamentals: TCP/IP, DNS, DHCP, VLANs, routing, switching.
  • Strong experience with API interrogation.
  • Strong experience with infrastructure scripting and automation (Bash, Python, Ansible).
  • Deep understanding of observability principles and tools (Prometheus, Grafana preferred).
  • Strong grasp of ITSM and service operation best practices.
  • Excellent communication and mentorship skills.
  • Comfortable interfacing with internal stakeholders and external customers.

Bonus: Knowledge of running AI workloads via orchestration platforms.

Bonus Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering or a related field, or equivalent experience.
  • LPIC certifications.
  • ITIL Foundation level qualification or equivalent experience.
  • Certified Kubernetes Administrator (CKA).

Qualities we look for

  • You approach problems with a systems mindset - balancing practical execution with long-term scalability.
  • You elevate the team, setting high standards for technical quality and engineering excellence.
  • You hold yourself and others accountable - giving direct feedback and expecting the same.
  • You take initiative, owning challenges end-to-end and proactively driving solutions.
  • You invest in others, mentoring to build both capability and confidence.

Why should you join us?

What sets us apart is our blend of modern technology, competitive benefits, and an open, welcoming work culture that enables our people to thrive.

  • 30 days of annual leave: we value your peace of mind. With 30 days off (excluding public holidays) and access to mental health resources, we make sure you're as strong mentally as you are professionally.
  • A culture that emphasises results over hierarchy, process & ego: we place great emphasis on the quality, ingenuity and creativity of work.
  • Open communication, regular feedback: we value smooth collaboration, direct and actionable feedback, and believe that leading with empathy and a growth mindset makes us better together.
  • Learning Time: everyone has dedicated learning time to focus on new skills, projects or interests that lie outside their day-to-day job.
  • Health & Wellbeing: we want everyone to feel healthy and happy, so we offer private medical insurance via Bupa.
  • Cycle to Work Scheme: we're committed to building a sustainable business, so we encourage cycling to work.
  • Gympass subscription to a variety of gyms and wellbeing apps.
  • Participation in the company shares program.
  • Enhanced parental pay & leave.

Diversity, Equality, Inclusion and Belonging

We are an equal opportunity employer and we strive to reduce unconscious bias throughout our hiring process. All applicants will be considered for employment without regard to ethnicity, religion, sexual orientation, gender identity, family or parental status, national origin, veteran status, neurodiversity status or disability status. To ensure our recruitment processes give all applicants an equal opportunity to succeed, we encourage you to let us know of any adjustments we can make.

Platform Site Reliability Engineer employer: Radiant

Radiant is an exceptional employer that fosters a culture of innovation and collaboration, offering a competitive benefits package including 30 days of annual leave and private medical insurance. With a strong emphasis on employee growth through dedicated learning time and mentorship opportunities, Radiant provides a dynamic work environment where your contributions are valued and your professional development is supported.

Contact Detail:

Radiant Recruiting Team

StudySmarter Expert Advice 🤫

Here's how we think you could land the Platform Site Reliability Engineer role

✨Tip Number 1

Network like a pro! Reach out to current employees at Radiant on LinkedIn or other platforms. Ask them about their experiences and any tips they might have for landing the job. Personal connections can make a huge difference!

✨Tip Number 2

Prepare for the technical interview by brushing up on your Kubernetes and Linux skills. We recommend setting up a mini-project at home to showcase your expertise. This hands-on experience will not only boost your confidence but also impress the interviewers.

✨Tip Number 3

Don’t forget to highlight your problem-solving skills during interviews. Radiant values a systems mindset, so be ready to discuss how you've tackled challenges in past roles. Share specific examples that demonstrate your ability to think critically and act decisively.

✨Tip Number 4

Finally, apply through our website! It’s the best way to ensure your application gets seen. Plus, it shows you’re genuinely interested in being part of our team. Don’t miss out on this opportunity!

We think you need these skills to ace the Platform Site Reliability Engineer role

Kubernetes Management
Linux Administration
System Tuning
Infrastructure as Code
Automation Scripting (Bash, Python, Ansible)
Observability Tools (Prometheus, Grafana)
Networking Fundamentals (TCP/IP, DNS, DHCP, VLANs)
API Interrogation
ITSM Frameworks
Incident Management
Root Cause Analysis
Mentorship
Communication Skills
Problem-Solving Skills

Some tips for your application 🫡

Tailor Your Application: Make sure to customise your CV and cover letter to highlight your experience with Kubernetes and Linux systems. We want to see how your skills align with our needs, so don’t hold back on showcasing your relevant projects!

Show Your Passion for AI Infrastructure: In your application, let us know why you’re excited about working in AI-native cloud platforms. Share any personal projects or experiences that demonstrate your enthusiasm for this field – it really helps us get to know you better!

Be Clear and Concise: When writing your application, keep it straightforward and to the point. Use clear language to explain your technical skills and experiences, especially when discussing complex topics like observability tools or automation scripts.

Apply Through Our Website: We encourage you to submit your application through our website. It’s the best way for us to receive your details and ensures you’re considered for the role. Plus, it’s super easy to do!

How to prepare for a job interview at Radiant

✨Know Your Kubernetes Inside Out

Make sure you brush up on your Kubernetes knowledge before the interview. Be ready to discuss your experience with deploying and managing clusters, as well as developing manifests and operators. Radiant is looking for someone who can hit the ground running, so showcasing your hands-on experience will definitely give you an edge.

✨Show Off Your Linux Skills

Since expert-level Linux administration is a must-have, be prepared to talk about your experience with Ubuntu distributions. Highlight any system tuning or performance optimisation you've done in the past. If you can share specific examples of how you've improved system performance, that’ll really impress them!

✨Communicate Like a Pro

Radiant values clear communication, especially when it comes to technical decisions. Practice explaining complex concepts in simple terms, as you may need to communicate with non-technical stakeholders. Being able to bridge that gap will show that you’re not just technically savvy but also a great team player.

✨Prepare for Incident Management Questions

Since the role involves ITSM frameworks and incident management, be ready to discuss your experience with these processes. Think of specific incidents you've managed, what you learned from them, and how you automated remediations. This will demonstrate your proactive approach and problem-solving skills, which are key qualities they’re looking for.
