Infrastructure Site Reliability Engineer

Full-Time | £60,000 - £80,000 per year (est.) | No working from home possible

At a Glance

Tasks: Run and evolve AI infrastructure, ensuring stability and security 24/7.
Company: Radiant, a leader in AI-native cloud platforms.
Benefits: 30 days annual leave, private medical insurance, and learning time.
Other info: Emphasis on results, open communication, and a culture of mentorship.
Why this job: Join a cutting-edge team and make a real impact in AI infrastructure.
Qualifications: 5+ years in performance-intensive environments and expert-level Linux skills.

About Radiant

Radiant is redefining how AI infrastructure is built. We design and operate AI-native cloud platforms engineered for sovereignty, performance, and scale. Our infrastructure powers GPU-native workloads, multi-tenant control planes, and high-performance AI systems designed for the most demanding environments. We are not building a generic cloud. We are building purpose-built AI infrastructure - from powered land, to compute, to software.

As we scale our platform and expand our engineering organisation, we are looking for leaders who can build strong teams, uphold high standards, and deliver reliably at pace.

Job Summary:

We’re looking for an experienced Infrastructure Site Reliability Engineer to run and evolve our infrastructure stack. You’ll contribute across bare-metal, virtualization, and orchestration layers, keeping things stable and secure 24/7/365, all while mentoring teammates, improving processes and automation, and helping translate deep technical concepts for a wide range of collaborators and customers.

What You’ll Do:

Deploy and operate resilient, scalable infrastructure supporting AI/HPC workloads
Optimize Linux system configuration, BIOS/firmware, kernel, and disk subsystem for performance
Configure, monitor and manage bare-metal infrastructure using IPMI, Redfish, etc.
Build and maintain automation scripts and infrastructure as code to support the platform lifecycle, simplify troubleshooting during incident resolution, and provide tooling for our support organisation
Apply ITSM frameworks: Incident, Major Incident, Change Management, and service improvement
Maintain and enhance observability stack: Prometheus, Grafana, and custom monitoring integrations
Operate and support services in 24x7 production environments, including on-call rotation
Contribute to incident postmortems and root cause analysis, document learnings, and automate remediations
Mentor junior engineers and act as an operational requirements consultant to other departments
Communicate technical decisions clearly to non-technical stakeholders and customers
Uphold a culture of: do, document, automate
Be willing to cross-train with Platform Engineering/Platform SRE to fully support both our infrastructure and platform stacks
Be willing to cross-train with HPC Engineering, supported by NVIDIA, to enhance our HPC supportability offering

What you bring:

5+ years of proven experience in globally scaled, performance-intensive environments operating to a 24/7 support model
Expert-level Linux administration, especially Ubuntu distributions
Proficiency in system tuning, disk I/O optimization, and hardware-level performance tweaks
Familiarity with out-of-band management and provisioning tools (IPMI, Redfish, PXE, etc.)
Strong networking fundamentals: TCP/IP, DNS, DHCP, VLANs, routing, switching
Strong experience with infrastructure scripting and automation (Bash, Python, Ansible)
Deep understanding of observability principles and tools (Prometheus, Grafana)
Hands-on experience operating orchestration platforms (Kubernetes, MAAS, Tinkerbell)
Strong grasp of ITSM and service operation best practices
Excellent communication and mentorship skills
Comfortable interfacing with internal stakeholders and external customers

Bonus:

Knowledge of HPC workloads and GPU-based infrastructure
Experience with InfiniBand networks and HPC performance tuning

Nice to have:

Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent experience
LPIC certifications
ITIL Foundation level qualification or equivalent experience

How you work:

You approach problems with a systems mindset - balancing practical execution with long‑term scalability
You elevate the team, setting high standards for technical quality and engineering excellence.
You hold yourself and others accountable - giving direct feedback and expecting the same
You take initiative, owning challenges end-to-end and proactively driving solutions.
You invest in others, mentoring to build both capability and confidence.
You communicate clearly - translating complexity into clarity across engineering and business audiences

Why should you join us?

What sets us apart is our blend of modern technology, competitive benefits, and an open, welcoming work culture that enables our people to thrive. Here are just some of the great things you can expect from us:

30 days of annual leave: we value your peace of mind. With 30 days off (excluding public holidays) and access to mental health resources, we make sure you're as strong mentally as you are professionally.
A culture that emphasises results over hierarchy, process & ego: we place great emphasis on the quality, ingenuity and creativity of work.
Open communication, regular feedback: we value smooth collaboration, direct and actionable feedback, and believe that leading with empathy and a growth mindset makes us better together.
Learning Time: we all have dedicated learning time to focus on new skills, projects or interests that lie outside our day-to-day jobs.
Health & Wellbeing: we want everyone to feel healthy and happy, so we offer private medical insurance via Bupa.
Cycle to Work Scheme: we're committed to building a sustainable business, so we encourage cycling to work.
Gympass subscription to a variety of gyms and wellbeing apps
Participation in the company shares program
Enhanced parental pay & leave

Diversity, Equality, Inclusion and Belonging

We are an equal opportunity employer and we strive to reduce unconscious bias throughout our hiring process. All applicants will be considered for employment without regard to ethnicity, religion, sexual orientation, gender identity, family or parental status, national origin, veteran status, neurodiversity status or disability status. To ensure our recruitment processes provide an equal opportunity for all applicants to succeed, we encourage you to let us know if there are any adjustments that we can make.

Infrastructure Site Reliability Engineer employer: Radiant

Radiant is an exceptional employer that fosters a culture of innovation and collaboration, making it an ideal place for Infrastructure Site Reliability Engineers to thrive. With 30 days of annual leave, a strong emphasis on mental health, and dedicated learning time, employees are encouraged to grow both personally and professionally. The company's commitment to open communication and diversity ensures a welcoming environment where every team member can contribute meaningfully to cutting-edge AI infrastructure projects.

Contact Details:

Radiant Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land the Infrastructure Site Reliability Engineer role

✨Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have the inside scoop on job openings or can put in a good word for you.

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects and contributions. This is a great way to demonstrate your expertise in infrastructure and automation, making you stand out to potential employers.

✨Tip Number 3

Prepare for interviews by practising common technical questions and scenarios related to site reliability engineering. Mock interviews with friends or using online platforms can help you feel more confident and ready to impress.

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re genuinely interested in joining our awesome team at Radiant.

We think you need these skills to ace the Infrastructure Site Reliability Engineer role

Linux Administration

Ubuntu Distributions

System Tuning

Disk I/O Optimization

Out of Band Management Tools (IPMI, Redfish, PXE)

Networking Fundamentals (TCP/IP, DNS, DHCP, VLANs, Routing, Switching)

Infrastructure Scripting and Automation (Bash, Python, Ansible)

Observability Principles and Tools (Prometheus, Grafana)

Orchestration Platforms (Kubernetes, MAAS, Tinkerbell)

ITSM and Service Operation Best Practices

Communication Skills

Mentorship Skills

Incident Management

Root Cause Analysis

HPC Workloads Knowledge

Some tips for your application 🫡

Tailor Your Application: Make sure to customise your CV and cover letter for the Infrastructure Site Reliability Engineer role. Highlight your experience with Linux administration, automation, and any relevant projects that showcase your skills in AI infrastructure.

Showcase Your Technical Skills: Don’t hold back on detailing your technical expertise! Mention specific tools and technologies you’ve worked with, like Prometheus, Grafana, or Kubernetes. We want to see how you can contribute to our high-performance AI systems.

Communicate Clearly: Remember, we value clear communication! When describing your past experiences, aim to translate complex technical concepts into simple terms. This will show us that you can effectively communicate with both technical and non-technical stakeholders.

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it gives you a chance to explore more about our culture and values!

How to prepare for a job interview at Radiant

✨Know Your Tech Inside Out

Make sure you brush up on your Linux administration skills, especially with Ubuntu. Be ready to discuss system tuning and performance optimisation techniques, as well as your experience with tools like IPMI and Redfish. The more you can demonstrate your technical expertise, the better!

✨Showcase Your Automation Skills

Prepare to talk about your experience with scripting and automation tools like Bash, Python, and Ansible. Have examples ready of how you've built automation scripts or used infrastructure as code to improve processes. This will show that you can contribute to their goal of simplifying troubleshooting and enhancing operational efficiency.

✨Communicate Clearly

Since you'll be interfacing with both technical and non-technical stakeholders, practice explaining complex concepts in simple terms. Think of scenarios where you've had to translate technical jargon for a non-technical audience and be ready to share those experiences during the interview.

✨Emphasise Teamwork and Mentorship

Radiant values collaboration and mentorship, so be prepared to discuss how you've supported junior engineers in the past. Share specific examples of how you've elevated your team and contributed to a culture of learning and accountability. This will align perfectly with their emphasis on building strong teams.
