Infrastructure Site Reliability Engineer

Full-Time, £60,000 - £80,000 / year (est.), home office (partial)
Radiant

At a Glance

  • Tasks: Run and evolve AI infrastructure, ensuring stability and security 24/7.
  • Company: Radiant, a leader in AI-native cloud platforms.
  • Benefits: 30 days annual leave, private medical insurance, and learning time.
  • Other info: Emphasis on results, open communication, and a culture of mentorship.
  • Why this job: Join a cutting-edge team and make a real impact in AI infrastructure.
  • Qualifications: 5+ years in performance-intensive environments and expert-level Linux skills.

The predicted salary is between £60,000 and £80,000 per year.

About Radiant

Radiant is redefining how AI infrastructure is built. We design and operate AI-native cloud platforms engineered for sovereignty, performance, and scale. Our infrastructure powers GPU-native workloads, multi-tenant control planes, and high-performance AI systems designed for the most demanding environments. We are not building a generic cloud. We are building purpose-built AI infrastructure - from powered land, to compute, to software.

As we scale our platform and expand our engineering organisation, we are looking for leaders who can build strong teams, uphold high standards, and deliver reliably at pace.

Job Summary:

We’re looking for an experienced Infrastructure Site Reliability Engineer to run and evolve our infrastructure stack. You’ll contribute across bare-metal, virtualization, and orchestration layers, keeping things stable and secure 24/7, 365 days a year, all while mentoring teammates, improving processes and automation, and helping translate deep technical concepts for a wide range of collaborators and customers.

What You’ll Do:

  • Deploy and operate resilient, scalable infrastructure supporting AI/HPC workloads
  • Optimize Linux system configuration, BIOS/firmware, kernel, and disk subsystem for performance
  • Configure, monitor and manage bare-metal infrastructure using IPMI, Redfish, etc.
  • Build and maintain automation scripts and infrastructure as code to support the platform lifecycle, simplify troubleshooting during incident resolution, and provide tooling for our support organisation
  • Apply ITSM frameworks: Incident, Major Incident, Change Management, and service improvement.
  • Maintain and enhance the observability stack: Prometheus, Grafana, and custom monitoring integrations
  • Operate and support services in 24x7 production environments, including on-call rotation
  • Contribute to incident postmortems and root cause analysis, document learnings, and automate remediations
  • Mentor junior engineers and act as an operational requirements consultant to other departments
  • Communicate technical decisions clearly to non-technical stakeholders and customers
  • Uphold a culture of: do, document, automate
  • Cross-train with Platform Engineering/Platform SRE to fully support both our infrastructure and platform stacks
  • Cross-train with HPC Engineering, supported by NVIDIA, to enhance our HPC supportability offering

What you bring:

  • 5+ years of proven experience in globally scaled, performance-intensive environments operating to a 24/7 support model
  • Expert-level Linux administration, especially Ubuntu distributions
  • Proficiency in system tuning, disk I/O optimization, and hardware-level performance tweaks
  • Familiarity with out-of-band management tools (IPMI, Redfish, PXE, etc.)
  • Strong networking fundamentals: TCP/IP, DNS, DHCP, VLANs, routing, switching
  • Strong experience with infrastructure scripting and automation (Bash, Python, Ansible)
  • Deep understanding of observability principles and tools (Prometheus, Grafana)
  • Hands-on experience operating orchestration platforms (Kubernetes, MAAS, Tinkerbell)
  • Strong grasp of ITSM and service operation best practices
  • Excellent communication and mentorship skills
  • Comfortable interfacing with internal stakeholders and external customers

Bonus:

  • Knowledge of HPC workloads and GPU-based infrastructure
  • Experience with InfiniBand networks and HPC performance tuning

Nice to have:

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent experience
  • LPIC certifications
  • ITIL Foundation level qualification or equivalent experience

How you work:

  • You approach problems with a systems mindset - balancing practical execution with long‑term scalability
  • You elevate the team, setting high standards for technical quality and engineering excellence.
  • You hold yourself and others accountable - giving direct feedback and expecting the same
  • You take initiative, owning challenges end-to-end and proactively driving solutions.
  • You invest in others, mentoring to build both capability and confidence.
  • You communicate clearly - translating complexity into clarity across engineering and business audiences

Why should you join us?

What sets us apart is our blend of modern technology, competitive benefits, and an open, welcoming work culture that enables our people to thrive. Here are just some of the great things you can expect from us:

  • 30 days of annual leave: we value your peace of mind. With 30 days off (excluding public holidays) and access to mental health resources, we make sure you're as strong mentally as you are professionally.
  • A culture that emphasises results over hierarchy, process & ego: we place great emphasis on the quality, ingenuity and creativity of work.
  • Open communication, regular feedback: we value smooth collaboration, direct and actionable feedback, and believe that leading with empathy and a growth mindset makes us better together.
  • Learning Time: we all have dedicated learning time to focus on new skills, projects or interests that lie outside of our day‑to‑day jobs.
  • Health & Wellbeing: we want everyone to feel healthy and happy, so we offer private medical insurance via Bupa.
  • Cycle to Work Scheme: we're committed to building a sustainable business, so we encourage cycling to work.
  • Gympass subscription to a variety of gyms and wellbeing apps
  • Participation in the company shares program
  • Enhanced parental pay & leave

Diversity, Equality, Inclusion and Belonging

We are an equal opportunity employer and we strive to reduce unconscious bias throughout our hiring process. All applicants will be considered for employment without regard to ethnicity, religion, sexual orientation, gender identity, family or parental status, national origin, veteran status, neurodiversity status, or disability status. To ensure our recruitment processes provide an equal opportunity for all applicants to succeed, we encourage you to let us know if there are any adjustments that we can make.

Infrastructure Site Reliability Engineer employer: Radiant

Radiant is an exceptional employer that fosters a culture of innovation and collaboration, making it an ideal place for Infrastructure Site Reliability Engineers to thrive. With 30 days of annual leave, a strong emphasis on mental health, and dedicated learning time, employees are encouraged to grow both personally and professionally. The company's commitment to open communication and diversity ensures a welcoming environment where every team member can contribute meaningfully to cutting-edge AI infrastructure projects.

Contact Details:

Radiant Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Infrastructure Site Reliability Engineer role

✨Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have the inside scoop on job openings or can put in a good word for you.

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects and contributions. This is a great way to demonstrate your expertise in infrastructure and automation, making you stand out to potential employers.

✨Tip Number 3

Prepare for interviews by practising common technical questions and scenarios related to site reliability engineering. Mock interviews with friends or using online platforms can help you feel more confident and ready to impress.

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re genuinely interested in joining our awesome team at Radiant.

We think you need these skills to ace the Infrastructure Site Reliability Engineer role

Linux Administration
Ubuntu Distributions
System Tuning
Disk I/O Optimization
Out of Band Management Tools (IPMI, Redfish, PXE)
Networking Fundamentals (TCP/IP, DNS, DHCP, VLANs, Routing, Switching)
Infrastructure Scripting and Automation (Bash, Python, Ansible)
Observability Principles and Tools (Prometheus, Grafana)
Orchestration Platforms (Kubernetes, MAAS, Tinkerbell)
ITSM and Service Operation Best Practices
Communication Skills
Mentorship Skills
Incident Management
Root Cause Analysis
HPC Workloads Knowledge

Some tips for your application 🫡

Tailor Your Application: Make sure to customise your CV and cover letter for the Infrastructure Site Reliability Engineer role. Highlight your experience with Linux administration, automation, and any relevant projects that showcase your skills in AI infrastructure.

Showcase Your Technical Skills: Don’t hold back on detailing your technical expertise! Mention specific tools and technologies you’ve worked with, like Prometheus, Grafana, or Kubernetes. We want to see how you can contribute to our high-performance AI systems.

Communicate Clearly: Remember, we value clear communication! When describing your past experiences, aim to translate complex technical concepts into simple terms. This will show us that you can effectively communicate with both technical and non-technical stakeholders.

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it gives you a chance to explore more about our culture and values!

How to prepare for a job interview at Radiant

✨Know Your Tech Inside Out

Make sure you brush up on your Linux administration skills, especially with Ubuntu. Be ready to discuss system tuning and performance optimisation techniques, as well as your experience with tools like IPMI and Redfish. The more you can demonstrate your technical expertise, the better!

✨Showcase Your Automation Skills

Prepare to talk about your experience with scripting and automation tools like Bash, Python, and Ansible. Have examples ready of how you've built automation scripts or used infrastructure as code to improve processes. This will show that you can contribute to their goal of simplifying troubleshooting and enhancing operational efficiency.

✨Communicate Clearly

Since you'll be interfacing with both technical and non-technical stakeholders, practice explaining complex concepts in simple terms. Think of scenarios where you've had to translate technical jargon for a non-technical audience and be ready to share those experiences during the interview.

✨Emphasise Teamwork and Mentorship

Radiant values collaboration and mentorship, so be prepared to discuss how you've supported junior engineers in the past. Share specific examples of how you've elevated your team and contributed to a culture of learning and accountability. This will align perfectly with their emphasis on building strong teams.
