At a Glance
- Tasks: Support and optimise HPC and AI workloads on GPU infrastructure, solving complex customer issues.
- Company: NScale is a cutting-edge GPU cloud provider focused on AI innovation and high-performance infrastructure.
- Benefits: Enjoy remote work flexibility, a collaborative culture, and opportunities for personal growth.
- Why this job: Join a dynamic team driving AI breakthroughs with hands-on problem-solving in a fast-paced environment.
- Qualifications: Experience in HPC/AI support, strong technical skills, and a proactive, customer-first mindset required.
- Other info: Transitioning to a hybrid model in 2025; inclusive workplace welcoming diverse applicants.
The predicted salary is between £43,200 and £72,000 per year.
Join NScale as a Senior HPC Support Engineer. NScale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI startups and enterprises. Our platform reduces the complexity of AI development, empowering our customers to achieve faster innovation and better outcomes. Our mission is to enable the AI breakthroughs of tomorrow by delivering exceptional infrastructure today. At NScale, we’re builders at heart — driven by ownership, innovation, and urgency.
About the Role
We’re looking for a Senior HPC Support Engineer to join our fast-growing team, focused on enabling and optimising HPC and AI workloads on GPU-accelerated infrastructure. You’ll work directly with customers solving some of the most complex problems in AI, helping them troubleshoot and optimise performance in compute-intensive, distributed environments. This is a hands-on role requiring deep technical acumen, exceptional problem-solving ability, and comfort working across a diverse set of technologies including GPUs (NVIDIA and AMD), InfiniBand networking, and orchestration systems like Slurm.
What You’ll Be Doing
- Provide expert-level support for customer HPC and AI workloads running in production.
- Troubleshoot complex system-level issues across networking, storage, containers, and GPUs.
- Collaborate with engineering and vendor partners to resolve hardware/software compatibility and performance issues.
- Analyse distributed workloads and assist with tuning of MPI-based applications.
- Develop internal tools and automation to improve support workflows.
- Contribute to documentation and knowledge-sharing initiatives.
- Participate in on-call rotations to support high-priority incidents and escalations.
About You
Skills & Experience
- Proven experience supporting HPC and/or AI workloads in production environments.
- Strong expertise with Slurm workload manager, including tuning and troubleshooting.
- Proficiency with system-level debugging, including kernel modules and network interfaces.
- Experience with GPU compute platforms (NVIDIA and/or AMD) and associated libraries.
- Familiarity with MPI libraries (e.g., OpenMPI), InfiniBand, and high-speed Ethernet networking.
- Solid Linux administration skills and troubleshooting experience.
- Working knowledge of HPC container runtimes (e.g., Singularity, Apptainer).
- Exposure to provisioning and automation tools (e.g., Ansible, PXE, Terraform).
- Experience with monitoring tools such as Prometheus, Grafana, and DCGM.
- Understanding of GPU/accelerator toolchains like CUDA or ROCm.
- A proactive, customer-first mindset with strong communication skills.
- Ability to work effectively in both individual and team settings.
- Comfort operating in fast-paced, ambiguous, high-growth environments.
Nice to have
- Experience with OpenStack and troubleshooting infrastructure in cloud environments.
- Kubernetes expertise, particularly in HPC or AI workload contexts.
- Familiarity with distributed file systems and advanced storage configurations.
- Understanding of GPU virtualisation and multi-tenant HPC architecture.
- Exposure to machine learning frameworks and AI optimisation workflows.
- Scripting skills in Python, Bash, or similar for automation and tooling.
Personal Attributes
- Proactive and self-motivated, with a strong sense of ownership.
- Thrives in a fast-paced, dynamic, and high-growth environment.
- Collaborative team player with a passion for delivering outstanding customer and stakeholder experiences.
- Strong attention to detail and documentation skills.
- Excellent communication skills, both written and verbal.
- A self-starter mindset with a “see a problem, fix a problem” mentality.
- Experience in designing and implementing processes to optimise deployment workflows.
Please Note: We’re currently working remotely, but plan to transition to a hybrid working model in 2025 as we look to secure a modern office space in London.
In all we do, our core values guide us:
- Relentless Innovation
- Ownership and Accountability
- Openness and Transparency
- Customer-Centric Focus
- Sustainability
- Full-Speed Collaboration
Equal Opportunities Statement
At NScale, we are committed to fostering an inclusive, diverse, and equitable workplace. We believe that a variety of perspectives enriches our work environment, and we warmly welcome applications from individuals of all backgrounds, experiences, and perspectives. We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.
Senior HPC Support Engineer (United Kingdom) employer: Nscale
Contact Details:
Nscale Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Senior HPC Support Engineer role (United Kingdom)
✨Tip Number 1
Familiarise yourself with the specific technologies mentioned in the job description, such as Slurm, NVIDIA and AMD GPUs, and InfiniBand networking. Having hands-on experience or projects that showcase your skills with these technologies can set you apart from other candidates.
✨Tip Number 2
Engage with the HPC and AI community through forums, webinars, or local meetups. Networking with professionals in the field can provide insights into the latest trends and challenges, and may even lead to referrals or recommendations for the role.
✨Tip Number 3
Prepare to discuss real-world scenarios where you've successfully troubleshot complex system-level issues. Be ready to explain your thought process and the steps you took to resolve these problems, as this will demonstrate your problem-solving abilities.
✨Tip Number 4
Showcase your proactive mindset by thinking of ways to improve support workflows or automation processes. Consider developing a small project or tool that highlights your ability to innovate and streamline operations, which aligns with NScale's values.
Some tips for your application 🫡
Tailor Your CV: Make sure your CV highlights relevant experience in HPC and AI workloads. Emphasise your expertise with Slurm, GPU platforms, and any troubleshooting skills that align with the job description.
Craft a Compelling Cover Letter: Write a cover letter that showcases your passion for AI and HPC. Mention specific projects or experiences where you solved complex problems, demonstrating your proactive mindset and customer-first approach.
Highlight Technical Skills: In your application, clearly list your technical skills related to the role, such as proficiency with InfiniBand networking, MPI libraries, and Linux administration. Use bullet points for clarity.
Showcase Problem-Solving Examples: Include examples of how you've tackled challenging issues in previous roles. This could involve debugging system-level problems or optimising performance in distributed environments, which are key aspects of the position.
How to prepare for a job interview at Nscale
✨Showcase Your Technical Expertise
Be prepared to discuss your experience with HPC and AI workloads in detail. Highlight specific projects where you've optimised performance or solved complex issues, especially involving GPUs, Slurm, and networking.
✨Demonstrate Problem-Solving Skills
Expect to face technical scenarios during the interview. Practice articulating your thought process when troubleshooting system-level issues, and be ready to explain how you would approach resolving them.
✨Emphasise Collaboration and Communication
NScale values teamwork and customer-centric approaches. Share examples of how you've worked effectively with engineering teams or customers to resolve issues, and highlight your communication skills.
✨Prepare for a Fast-Paced Environment
Given the dynamic nature of NScale, convey your adaptability and comfort in high-growth settings. Discuss experiences where you've thrived under pressure or adapted quickly to changing circumstances.