HPC Infrastructure Site Reliability Engineer

United States Digital Space LLC

Full-Time | £80,000 - £100,000 per year (est.) | Home office (partial)

At a Glance

Tasks: Ensure reliability and performance of high-density AI/HPC infrastructure in a 24/7 environment.
Company: Fast-growing GPU-as-a-Service provider with cutting-edge technology.
Benefits: Competitive salary, flexible work, and opportunities for professional growth.
Other info: Collaborative culture that values innovation, inclusion, and continuous learning.
Why this job: Work hands-on with advanced GPU and AI platforms, shaping the future of high-performance computing.
Qualifications: 8+ years in SRE or Infrastructure Engineering, including 2–3+ years of recent HPC and/or AI infrastructure experience.

The predicted salary is between £80,000 and £100,000 per year.

We’re a fast-growing GPU-as-a-Service provider, delivering scalable, high-performance compute infrastructure purpose-built for AI and HPC workloads. Operating across global data centres, we run mission-critical environments where uptime, throughput, and ultra-low latency are non-negotiable.

We are looking for a senior Infrastructure Site Reliability Engineer with deep experience operating large-scale distributed systems and recent hands-on expertise in high-performance computing (HPC) and AI infrastructure. This is an operations-first SRE role, working in a 24/7/365 on-call environment, responsible for ensuring reliability, performance, and continuous improvement of mission-critical infrastructure.

This role sits within a cross-functional organisation spanning network engineering, infrastructure SRE, platform SRE, infrastructure tooling (software) engineering, and data centre operations. The ideal candidate has progressed through large-scale, globally distributed or multi-site infrastructure environments and has more recently specialised in GPU-accelerated HPC systems.

This role provides exposure to the latest high-density AI compute platforms, including next-generation GPU infrastructure at significant scale. You will bring strong breadth across bare metal, networking, storage, virtualisation, and orchestration, alongside deep HPC experience including NVIDIA GPU ecosystems, RDMA networking (RoCE and InfiniBand), and performance validation and benchmarking. Strong Linux and distributed systems expertise is essential.

Alongside operational ownership, this is a deeply technical Infrastructure SRE role centred on advanced operational troubleshooting and performance evaluation across large-scale HPC systems. You will investigate complex, cross-layer issues spanning GPU compute, networking, storage, and orchestration, building a clear understanding of system behaviour under real production AI and HPC workloads.

A key responsibility is performance evaluation, testing, and operational acceptance of new HPC environments, ensuring platforms meet defined reliability, scalability, and performance expectations before entering production. You will work across hardware, network, and software layers to validate readiness of high-density GPU infrastructure and support safe, predictable deployment at scale.

You will also play a central role in continuous service improvement (CSI)—reducing operational toil, increasing automation, and improving reliability, consistency, and operational efficiency across the platform. This includes strengthening observability, refining operational workflows, and eliminating repetitive or failure-prone processes.

Over time, you will help shape future infrastructure design and deployment approaches, feeding operational insight back into infrastructure engineering decisions and ensuring production learnings directly influence next-generation HPC platform evolution.

Join a team operating some of the world’s most advanced high-performance computing infrastructure. As an HPC Infrastructure SRE, you’ll work hands-on with cutting-edge GPU and CPU platforms, including the latest NVIDIA architectures, powering dense, large-scale compute environments used for AI, machine learning, and next-generation workloads.

This is an opportunity to build expertise at the forefront of modern infrastructure, where reliability, scale, and performance matter every day. You’ll collaborate with experienced engineers across a globally distributed organisation that values openness, inclusion, technical excellence, and continuous learning.

If you thrive in fast-paced environments, enjoy working with advanced technology, and want to help shape the future of high-performance compute, you’ll find both challenge and opportunity here.

You can also expect:

Exposure to industry-leading GPU and AI infrastructure
Opportunities to grow alongside a rapidly scaling global business
A collaborative, inclusive, and supportive engineering culture
Real ownership and the ability to influence operational excellence
Work that sits at the intersection of people, performance, and technology
A modern, flexible, globally connected workplace with ambitious goals

Key Responsibilities:

Operate and improve high-density AI/HPC infrastructure in a 24/7 production environment
Participate in a 24/7/365 on-call rotation, supporting mission-critical systems and incident response
Troubleshoot complex issues across compute, networking, storage, and orchestration layers in GPU-accelerated environments
Lead performance evaluation, testing, and operational acceptance of new HPC infrastructure before production release
Drive continuous service improvement (CSI), reducing toil through automation, tooling, and process refinement
Build and maintain infrastructure automation and tooling (IaC and scripting) to improve reliability and operational efficiency
Optimise Linux systems for performance, including kernel, BIOS/firmware, and storage tuning for HPC workloads
Configure and operate bare-metal infrastructure using IPMI, iLO, iDRAC, Redfish, and related tooling
Partner with infrastructure tooling and observability teams to improve telemetry, alerting, and system visibility at scale
Own ITIL-aligned processes across Incident, Major Incident, Problem, and Change Management, ensuring strong execution and continuous improvement
Lead root cause analysis and ensure corrective actions are implemented and automated where possible
Play a key role in designing and delivering future HPC cluster and site builds, shaping global consistency and operational standards
Collaborate closely with Platform Engineering, Network Engineering, Infrastructure Tooling, and Data Centre Operations to improve reliability and deployment quality
Feed operational insight back into infrastructure design to influence next-generation HPC platform evolution
Mentor engineers and act as a technical authority for operational best practices across teams
Communicate clearly with technical and non-technical stakeholders, translating complex issues into actionable outcomes

Uphold a culture of: do, document, automate

Essential Skills & Experience:

8+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or similar roles in large-scale distributed production environments operating a 24/7 support model
2–3+ years of recent experience in HPC and/or AI infrastructure, including GPU-based compute environments at scale
Strong Linux expertise (preferably Ubuntu), including deep systems administration and production troubleshooting
Proven experience in performance tuning across compute systems, including kernel, BIOS/firmware, and storage subsystem optimisation
Strong hands-on experience with bare-metal infrastructure and out-of-band management tooling (IPMI, iLO, iDRAC, Redfish, or equivalent)
Solid networking fundamentals including TCP/IP, DNS, DHCP, VLANs, routing, and switching, with exposure to high-performance networking environments
Exposure to NVIDIA GPU ecosystems, including CUDA-based workloads, GPU-accelerated compute environments, and the NVIDIA AI reference architecture
Familiarity with high-performance networking technologies such as InfiniBand and RoCE
Strong experience with infrastructure automation and scripting (e.g. Bash, Python, Ansible or similar IaC/tooling approaches)
Understanding of observability principles and practical use of monitoring and telemetry systems (e.g. Prometheus, Grafana or equivalents)
Understanding of workload schedulers and of running workloads in parallel across multiple systems
Practical experience with at least one parallel storage platform
Experience working in ITIL-aligned environments, including Incident, Major Incident, Problem, and Change Management
Strong troubleshooting skills in high-pressure operational environments, with a track record of incident ownership and resolution
Strong communication skills with the ability to work across engineering teams and interface with non-technical stakeholders
Ability and willingness to collaborate closely with Platform SRE teams, including exposure to and learning of Kubernetes-based orchestration environments (not a core requirement)
Experience contributing to or influencing infrastructure design, reliability improvements, or operational best practices

Bonus / Highly Desirable:

Deep experience with HPC workloads and GPU-accelerated infrastructure at scale
Experience with InfiniBand, RoCE, or other HPC-grade networking fabrics in production environments
Experience with HPC benchmarking, validation, or performance testing (e.g. LINPACK, fio, NCCL, ibdiagnet)
Exposure to large-scale multi-site or global infrastructure deployments

Preferred Qualifications:

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent on-the-job experience
LPIC certifications
ITIL Foundation level qualification or equivalent experience

HPC Infrastructure Site Reliability Engineer employer: United States Digital Space LLC

Join a pioneering GPU-as-a-Service provider that is at the forefront of high-performance computing and AI infrastructure. Our collaborative and inclusive work culture fosters continuous learning and technical excellence, offering you the chance to work with cutting-edge technology while making a significant impact in a fast-paced environment. With ample opportunities for professional growth and a commitment to operational excellence, you'll thrive as part of a team dedicated to shaping the future of compute infrastructure.

Contact Details:

United States Digital Space LLC Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land the HPC Infrastructure Site Reliability Engineer role

✨Join Local Tech Meetups

Get out there and mingle with fellow developers by joining local tech meetups. It’s a fantastic way to meet people who might be working at United States Digital Space LLC or know someone who does. Plus, you can pick up new tech skills and keep up with current trends while you're at it!

✨Contribute to Open Source Projects

Show off your coding chops by jumping into open-source projects. Not only does this give you practical experience, but it also gets you noticed in the dev community. You'll create a killer portfolio that speaks volumes about your skills to United States Digital Space LLC.

✨Tap into Online Developer Communities

Don’t underestimate the power of online developer communities like GitHub, Stack Overflow, and even Reddit. Participate in discussions, share your projects, and build your visibility. You can often find opportunities through these channels that can lead to a full-time role at companies like United States Digital Space LLC.

✨Explore Job Boards Specifically for Tech Roles

Keep your eyes peeled on job boards that focus on tech roles. Sites like TechCareers or Stack Overflow Jobs can often have listings for companies like United States Digital Space LLC that might not show up on broader job sites. Make it a habit to check these regularly, and don’t hesitate to apply directly through our website!

We think you need these skills to ace the HPC Infrastructure Site Reliability Engineer role

Site Reliability Engineering

High-Performance Computing (HPC)

AI Infrastructure

Linux Systems Administration

Performance Tuning

Bare-Metal Infrastructure Management

Networking Fundamentals

NVIDIA GPU Ecosystems

Infrastructure Automation

Scripting (Bash, Python, Ansible)

Observability Principles

Incident Management

Troubleshooting Skills

Collaboration with Engineering Teams

Some tips for your application 🫡

Show off your coding skills: When applying for a software engineering role, it's super important to showcase your coding skills. Make sure your CV includes your tech stack, any relevant programming languages you’re comfortable with, and examples of projects you've worked on. If you have a GitHub profile, link it up! We love to see code in action.

Tailor your portfolio: For a full-time role, we’d expect to see some solid examples of your work in your portfolio. Make sure to include at least two or three projects that highlight your problem-solving skills and your ability to work with different technologies. Focus on the projects that are most relevant to the position at United States Digital Space LLC.

Craft a killer cover letter: Your cover letter is your chance to stand out, so make it personal! Explain why you want to work at United States Digital Space LLC and how your skills align with the role. Show us your passion for software development. We dig enthusiastic candidates who understand the value of collaboration and continuous learning!

Be clear and concise: When it comes to writing your CV and cover letter, clarity is key. Avoid jargon that could confuse us and stick to simple, direct language. Highlight your achievements with quantifiable results where possible, and keep everything easy to read. A well-organised application goes a long way!

How to prepare for a job interview at United States Digital Space LLC

✨Brush Up on Your Coding Skills

For a full-time software engineering role, it's crucial to stay sharp with your coding abilities. Expect technical questions that might involve solving problems on the spot or discussing algorithms. Practise on platforms like LeetCode or HackerRank to get comfortable with the types of questions that often come up.

✨Know Your Tools and Frameworks

Make sure you’re well-acquainted with the tools and technologies listed in the job description. Familiarise yourself with any specific frameworks or programming languages mentioned. If United States Digital Space LLC uses React or Node.js, for instance, be ready to discuss how you’ve used them in previous projects or coursework.

✨Showcase Your Projects

Bring along a portfolio that highlights your best work. This could be code samples, GitHub repositories, or any side projects you’ve built. Make sure you can talk through your thought process for each project, especially the challenges you faced and how you solved them, as this shows your problem-solving skills in action.

✨Prepare for Behavioural Questions

While technical skills are key, full-time positions also require cultural fit. Be ready to discuss your previous experiences and how you handle teamwork, conflict, and deadlines. Brush up on the STAR method (Situation, Task, Action, Result) to clearly articulate your past experiences when discussing how you've contributed to a team.
