HPC Infrastructure Site Reliability Engineer

Full-Time | £70,000 - £90,000 per year (est.) | Home office (partial)
Radiant

At a Glance

  • Tasks: Ensure reliability and performance of cutting-edge HPC infrastructure in a 24/7 environment.
  • Company: Fast-growing GPU-as-a-Service provider with a focus on AI and HPC workloads.
  • Benefits: Competitive salary, flexible work environment, and opportunities for professional growth.
  • Other info: Collaborative culture that values innovation, inclusion, and continuous learning.
  • Why this job: Join a team at the forefront of high-performance computing and make a real impact.
  • Qualifications: 8+ years in SRE or Infrastructure Engineering, with strong Linux and HPC experience.

The predicted salary is between £70,000 and £90,000 per year.

We’re a fast-growing GPU-as-a-Service provider, delivering scalable, high-performance compute infrastructure purpose-built for AI and HPC workloads. Operating across global data centres, we run mission-critical environments where uptime, throughput, and ultra-low latency are non-negotiable.

We are looking for a senior Infrastructure Site Reliability Engineer with deep experience operating large-scale distributed systems and recent hands-on expertise in high-performance computing (HPC) and AI infrastructure. This is an operations-first SRE role, working in a 24/7/365 on-call environment, responsible for ensuring reliability, performance, and continuous improvement of mission-critical infrastructure. This role sits within a cross-functional organisation spanning network engineering, infrastructure SRE, Platform SRE, infrastructure tooling engineers (software) and data centre operations.

The ideal candidate has progressed through large-scale, globally distributed or multi-site infrastructure environments and has more recently specialised in GPU-accelerated HPC systems. This role provides exposure to the latest high-density AI compute platforms, including next-generation GPU infrastructure at significant scale. You will bring strong breadth across bare metal, networking, storage, virtualisation, and orchestration, alongside deep HPC experience including NVIDIA GPU ecosystems, RDMA networking (RoCE and InfiniBand), and performance validation and benchmarking. Strong Linux and distributed systems expertise is essential.

Alongside operational ownership, this is a deeply technical Infrastructure SRE role centred on advanced operational troubleshooting and performance evaluation across large-scale HPC systems. You will investigate complex, cross-layer issues spanning GPU compute, networking, storage, and orchestration, building a clear understanding of system behaviour under real production AI and HPC workloads.

A key responsibility is performance evaluation, testing, and operational acceptance of new HPC environments, ensuring platforms meet defined reliability, scalability, and performance expectations before entering production. You will work across hardware, network, and software layers to validate readiness of high-density GPU infrastructure and support safe, predictable deployment at scale.

You will also play a central role in continuous service improvement (CSI)—reducing operational toil, increasing automation, and improving reliability, consistency, and operational efficiency across the platform. This includes strengthening observability, refining operational workflows, and eliminating repetitive or failure-prone processes.

Over time, you will help shape future infrastructure design and deployment approaches, feeding operational insight back into infrastructure engineering decisions and ensuring production learnings directly influence next-generation HPC platform evolution.

Join a team operating some of the world’s most advanced high-performance computing infrastructure. As an HPC Infrastructure SRE, you’ll work hands-on with cutting-edge GPU and CPU platforms, including the latest NVIDIA architectures, powering dense, large-scale compute environments used for AI, machine learning, and next-generation workloads.

This is an opportunity to build expertise at the forefront of modern infrastructure, where reliability, scale, and performance matter every day. You’ll collaborate with experienced engineers across a globally distributed organisation that values openness, inclusion, technical excellence, and continuous learning.

We move quickly, solve meaningful challenges, and give people the space to make an impact. If you thrive in fast-paced environments, enjoy working with advanced technology, and want to help shape the future of high-performance compute, you’ll find both challenge and opportunity here.

You can also expect:

  • Exposure to industry-leading GPU and AI infrastructure
  • Opportunities to grow alongside a rapidly scaling global business
  • A collaborative, inclusive, and supportive engineering culture
  • Real ownership and the ability to influence operational excellence
  • Work that sits at the intersection of people, performance, and technology
  • A modern, flexible, globally connected workplace with ambitious goals

Key Responsibilities:

  • Operate and improve high-density AI/HPC infrastructure in a 24/7 production environment
  • Participate in a 24/7/365 on-call rotation, supporting mission-critical systems and incident response
  • Troubleshoot complex issues across compute, networking, storage, and orchestration layers in GPU-accelerated environments
  • Lead performance evaluation, testing, and operational acceptance of new HPC infrastructure before production release
  • Drive continuous service improvement (CSI), reducing toil through automation, tooling, and process refinement
  • Build and maintain infrastructure automation and tooling (IaC and scripting) to improve reliability and operational efficiency
  • Optimise Linux systems for performance, including kernel, BIOS/firmware, and storage tuning for HPC workloads
  • Configure and operate bare-metal infrastructure using IPMI, iLO, iDRAC, Redfish, and related tooling
  • Partner with infrastructure tooling and observability teams to improve telemetry, alerting, and system visibility at scale
  • Own ITIL-aligned processes across Incident, Major Incident, Problem, and Change Management, ensuring strong execution and continuous improvement
  • Lead root cause analysis and ensure corrective actions are implemented and automated where possible
  • Play a key role in designing and delivering future HPC cluster and site builds, shaping global consistency and operational standards
  • Collaborate closely with Platform Engineering, Network Engineering, Infrastructure Tooling, and Data Centre Operations to improve reliability and deployment quality
  • Feed operational insight back into infrastructure design to influence next-generation HPC platform evolution
  • Mentor engineers and act as a technical authority for operational best practices across teams
  • Communicate clearly with technical and non-technical stakeholders, translating complex issues into actionable outcomes
  • Uphold a culture of: do, document, automate

Essential Skills & Experience:

  • 8+ years’ experience in Site Reliability Engineering, Infrastructure Engineering, or similar roles in large-scale distributed production environments operating a 24/7 support model
  • 2–3+ years of recent experience in HPC and/or AI infrastructure, including GPU-based compute environments at scale
  • Strong Linux expertise (preferably Ubuntu), including deep systems administration and production troubleshooting
  • Proven experience in performance tuning across compute systems, including kernel, BIOS/firmware, and storage subsystem optimisation
  • Strong hands-on experience with bare-metal infrastructure and out-of-band management tooling (IPMI, iLO, iDRAC, Redfish, or equivalent)
  • Solid networking fundamentals including TCP/IP, DNS, DHCP, VLANs, routing, and switching, with exposure to high-performance networking environments
  • Exposure to NVIDIA GPU ecosystems, including CUDA-based workloads, GPU-accelerated compute environments, and the NVIDIA AI reference architecture
  • Familiarity with high-performance networking technologies such as InfiniBand and RoCE
  • Strong experience with infrastructure automation and scripting (e.g. Bash, Python, Ansible or similar IaC/tooling approaches)
  • Understanding of observability principles and practical use of monitoring and telemetry systems (e.g. Prometheus, Grafana or equivalents)
  • Understanding of workload schedulers and of running workloads in parallel across multiple systems
  • Practical experience with at least one parallel storage platform
  • Experience working in ITIL-aligned environments, including Incident, Major Incident, Problem, and Change Management
  • Strong troubleshooting skills in high-pressure operational environments, with a track record of incident ownership and resolution
  • Strong communication skills with the ability to work across engineering teams and interface with non-technical stakeholders
  • Ability and willingness to collaborate closely with Platform SRE teams, including exposure to and learning of Kubernetes-based orchestration environments (not a core requirement)
  • Experience contributing to or influencing infrastructure design, reliability improvements, or operational best practices

Bonus / Highly Desirable:

  • Deep experience with HPC workloads and GPU-accelerated infrastructure at scale
  • Experience with InfiniBand, RoCE, or other HPC-grade networking fabrics in production environments
  • Experience with HPC benchmarking, validation, or performance testing (e.g. LINPACK, fio, NCCL, ibdiagnet)
  • Exposure to large-scale multi-site or global infrastructure deployments

Preferred Qualifications:

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent on-the-job experience
  • LPIC certifications
  • ITIL Foundation level qualification or equivalent experience

HPC Infrastructure Site Reliability Engineer employer: Radiant

Join a pioneering GPU-as-a-Service provider that champions innovation and excellence in high-performance computing. As an HPC Infrastructure Site Reliability Engineer, you'll thrive in a dynamic, inclusive work culture that prioritises collaboration and continuous learning, while having the opportunity to influence operational excellence and shape the future of cutting-edge technology. With exposure to industry-leading infrastructure and a commitment to employee growth, this role offers a unique chance to make a meaningful impact in a fast-paced environment.

Contact Details:

Radiant Recruitment Team

StudySmarter Expert Advice 🤫

We think this is how you could land the HPC Infrastructure Site Reliability Engineer role

Tip Number 1

Network, network, network! Get out there and connect with folks in the HPC and AI space. Attend meetups, webinars, or conferences where you can chat with industry professionals. You never know who might have a lead on your dream job!

Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects related to HPC and AI infrastructure. This gives potential employers a taste of what you can do and sets you apart from the crowd.

Tip Number 3

Don’t be shy about reaching out directly to companies you’re interested in. A quick email or LinkedIn message expressing your enthusiasm for their work can go a long way. Plus, applying through our website shows you're serious about joining the team!

Tip Number 4

Prepare for technical interviews by brushing up on your troubleshooting skills and understanding of distributed systems. Practice explaining complex concepts clearly, as communication is key in this role. We want to see how you think on your feet!

We think you need these skills to ace the HPC Infrastructure Site Reliability Engineer role

Site Reliability Engineering
High-Performance Computing (HPC)
AI Infrastructure
Linux Systems Administration
Performance Tuning
Bare-Metal Infrastructure Management
Out-of-Band Management Tooling (IPMI, iLO, iDRAC, Redfish)

Some tips for your application 🫡

Tailor Your CV: Make sure your CV reflects the skills and experiences that align with the HPC Infrastructure Site Reliability Engineer role. Highlight your hands-on expertise in HPC, AI infrastructure, and any relevant projects you've worked on.

Craft a Compelling Cover Letter: Use your cover letter to tell us why you're passionate about high-performance computing and how your background makes you a great fit for our team. Be sure to mention specific technologies or experiences that relate to the job description.

Showcase Your Problem-Solving Skills: In your application, include examples of complex issues you've tackled in previous roles, especially in high-pressure environments. We want to see how you approach troubleshooting and performance evaluation in HPC systems.

Apply Through Our Website: We encourage you to submit your application through our website. It’s the best way for us to receive your details and ensures you’re considered for the role. Plus, it shows you’re keen on joining our team!

How to prepare for a job interview at Radiant

Know Your HPC Stuff

Make sure you brush up on your high-performance computing knowledge, especially around GPU-accelerated environments. Be ready to discuss specific technologies like NVIDIA GPUs, RDMA networking, and performance validation techniques. Showing that you can talk the talk will impress the interviewers.

Demonstrate Problem-Solving Skills

Prepare to tackle some real-world scenarios during your interview. Think about complex issues you've resolved in previous roles, particularly in high-pressure situations. Being able to articulate your troubleshooting process and the steps you took to resolve incidents will showcase your operational expertise.

Showcase Your Automation Experience

Since this role involves continuous service improvement, highlight your experience with infrastructure automation and scripting. Be ready to discuss tools like Ansible or Python, and how you've used them to improve reliability and efficiency in past projects. This will show that you're proactive about reducing operational toil.

Communicate Clearly

Strong communication skills are key, especially when working across teams. Practice explaining complex technical concepts in simple terms, as you'll need to interact with both technical and non-technical stakeholders. This will demonstrate your ability to bridge gaps and ensure everyone is on the same page.