HPC Infrastructure and Support Engineer in Northampton

Northampton | Full-Time | £36,000 - £60,000 / year (est.) | No home office possible
asobbi

At a Glance

  • Tasks: Maintain and optimise high-performance computing environments for cutting-edge AI solutions.
  • Company: Rapidly growing cloud provider redefining high-performance computing with innovative GPUaaS.
  • Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
  • Other info: Collaborative environment with excellent career advancement opportunities.
  • Why this job: Join a dynamic team and make an impact in the AI and ML ecosystem.
  • Qualifications: Experience with HPC systems, Linux, and networking; strong problem-solving skills.

The predicted salary is between £36,000 and £60,000 per year.

A rapidly growing cloud provider is redefining high-performance computing with cutting-edge GPUaaS, delivering scalable, enterprise-grade AI infrastructure at unmatched efficiency. With deep ties to Nvidia, they are quickly becoming a powerhouse in the US and European AI and ML ecosystems, providing solutions for HPC, AI, and deep learning workloads.

As the Principal HPC Support Engineer, you will play a pivotal role in maintaining and supporting high-performance computing environments on bare-metal infrastructure, primarily serving clients in research, higher education, and enterprise AI sectors. You will focus on both the software and networking aspects of HPC deployments, ensuring that large-scale GPU clusters remain operational, secure, and optimized for client needs.

Key Responsibilities
  • System Maintenance and Performance Optimization
    • Manage, maintain, and tune bare-metal HPC clusters running Linux-based operating systems (e.g., Fedora, Debian, Ubuntu).
    • Optimize Nvidia GPU compute environments, including CUDA, NCCL, and GPU resource management in multi-node HPC clusters.
    • Oversee high-speed networking configurations, including InfiniBand (Mellanox), RDMA, and Ethernet fabric tuning for low-latency HPC workloads.
    • Configure and fine-tune HPC schedulers (e.g., Slurm, OpenPBS, SGE) for optimal GPU workload distribution.
    • Implement containerization strategies (Podman, Docker) and orchestration platforms (K3s, Kubernetes) for managing distributed AI/ML workloads.
  • Networking and Infrastructure Support
    • Configure, monitor, and troubleshoot high-performance network fabrics, ensuring low-latency, high-throughput communication between GPU nodes.
    • Deploy and maintain InfiniBand, RoCE, and high-speed Ethernet for HPC and AI clusters.
    • Collaborate with networking teams to optimize routing, switching, and load balancing for distributed computing environments.
    • Work closely with Nvidia engineers and system architects to implement GPUDirect Storage, NVLink, and Magnum IO for accelerated workloads.
  • Security, Automation, and Monitoring
    • Maintain authentication and authorization systems such as Active Directory, OpenLDAP, and Keycloak.
    • Automate system provisioning and configuration using Ansible, Terraform, or other Infrastructure-as-Code tools.
    • Monitor system performance using Prometheus, Grafana, and ELK Stack, identifying and resolving bottlenecks in GPU workloads.
    • Implement security best practices for multi-tenant HPC clusters, ensuring compliance with industry standards.
  • Troubleshooting and Client Support
    • Serve as the lead technical resource for diagnosing and resolving complex software, networking, and hardware issues in large-scale GPU clusters.
    • Analyze logs, conduct performance profiling, and debug CUDA, MPI, and RDMA-related issues.
    • Work closely with AI/ML research teams, cloud engineers, and enterprise clients to optimize workload performance.
  • Collaboration and Process Improvement
    • Support the ongoing development of internal HPC test environments and customer POCs.
    • Work cross-functionally with Service Desk, Operations, and Service Delivery Management to ensure seamless service.
    • Provide technical documentation, training, and mentorship to junior team members.

    About the employer: asobbi

    As a rapidly growing cloud provider at the forefront of high-performance computing, we offer an exceptional work environment that fosters innovation and collaboration. Our commitment to employee growth is evident through continuous training opportunities and a culture that values teamwork and knowledge sharing, particularly in our cutting-edge AI infrastructure projects. Located in a dynamic tech hub, we provide our employees with access to industry leaders and the chance to work on transformative technologies that shape the future of AI and machine learning.

    Contact Details:

    asobbi Recruiting Team

    StudySmarter Expert Advice 🤫

    How we think you could land the HPC Infrastructure and Support Engineer role in Northampton

    ✨Network Like a Pro

    Get out there and connect with people in the HPC and AI sectors! Attend industry events, webinars, or local meetups. You never know who might have the inside scoop on job openings or can put in a good word for you.

    ✨Show Off Your Skills

    When you get the chance to chat with potential employers, don’t hold back! Share your experiences with managing HPC clusters, optimising GPU environments, and troubleshooting complex issues. Let them see how you can add value to their team.

    ✨Tailor Your Approach

    Before any interview, do your homework! Research the company’s projects and challenges in HPC and AI. Tailor your answers to show how your skills in system maintenance and performance optimisation can help them achieve their goals.

    ✨Apply Through Us!

    Don’t forget to check out our website for job openings! Applying through us not only gives you access to exclusive roles but also shows you’re serious about joining the team. Let’s land that dream job together!

    Skills we think you need to succeed as an HPC Infrastructure and Support Engineer in Northampton

    HPC Cluster Management
    Linux Operating Systems (Fedora, Debian, Ubuntu)
    Nvidia GPU Optimization (CUDA, NCCL)
    High-Speed Networking (InfiniBand, RDMA, Ethernet)
    HPC Scheduler Configuration (Slurm, OpenPBS, SGE)
    Containerization (Podman, Docker)
    Orchestration Platforms (K3s, Kubernetes)
    Network Fabric Configuration and Troubleshooting
    Authentication and Authorization Systems (Active Directory, OpenLDAP, Keycloak)
    Infrastructure-as-Code (Ansible, Terraform)
    System Monitoring (Prometheus, Grafana, ELK Stack)
    Performance Profiling and Debugging (CUDA, MPI, RDMA)
    Technical Documentation and Mentorship
    Collaboration with Cross-Functional Teams

    Some tips for your application 🫡

    Tailor Your CV: Make sure your CV is tailored to the HPC Infrastructure and Support Engineer role. Highlight your experience with Linux-based systems, GPU environments, and any relevant networking skills. We want to see how your background aligns with what we’re looking for!

    Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you’re passionate about high-performance computing and how your skills can contribute to our mission. Keep it engaging and personal – we love to see your personality come through!

    Showcase Relevant Projects: If you've worked on any projects related to HPC, AI, or deep learning, make sure to mention them! Whether it's optimising GPU clusters or automating processes, we want to know how you've applied your skills in real-world scenarios.

    Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you don’t miss out on any important updates. Plus, it shows you’re keen to join our team!

    How to prepare for a job interview at asobbi

    ✨Know Your HPC Stuff

    Make sure you brush up on your knowledge of high-performance computing, especially around Linux-based systems and Nvidia GPU environments. Be ready to discuss specific tools like CUDA and Slurm, as well as your experience with bare-metal clusters.

    ✨Showcase Your Troubleshooting Skills

    Prepare to share examples of how you've diagnosed and resolved complex issues in previous roles. Think about specific instances where you tackled software or networking problems, and be ready to explain your thought process.

    ✨Familiarise Yourself with Networking

    Since this role involves high-speed networking configurations, it’s crucial to understand InfiniBand and RDMA. Brush up on your knowledge of network fabrics and be prepared to discuss how you’ve optimised communication between nodes in past projects.

    ✨Demonstrate Collaboration Experience

    This position requires working closely with various teams, so be ready to talk about your experience collaborating with others. Highlight any cross-functional projects you've been involved in and how you contributed to their success.
