Operations & Support Engineer (HPC)

Telford | Full-Time | No home office possible

About the Company

A rapidly growing cloud provider is redefining high-performance computing with cutting-edge GPU-as-a-Service (GPUaaS), delivering scalable, enterprise-grade AI infrastructure at unmatched efficiency. With deep ties to Nvidia, it is quickly becoming a powerhouse in the US and European AI and ML ecosystems, providing solutions for HPC, AI, and deep learning workloads.

Role Overview

As the Principal HPC Support Engineer, you will play a pivotal role in maintaining and supporting high-performance computing environments on bare-metal infrastructure, primarily serving clients in research, higher education, and enterprise AI sectors. You will focus on both the software and networking aspects of HPC deployments, ensuring that large-scale GPU clusters remain operational, secure, and optimized for client needs.

Key Responsibilities

System Maintenance and Performance Optimization

• Manage, maintain, and tune bare-metal HPC clusters running Linux-based operating systems (e.g., Fedora, Debian, Ubuntu).

• Optimize Nvidia GPU compute environments, including CUDA, NCCL, and GPU resource management in multi-node HPC clusters.

• Oversee high-speed networking configurations, including InfiniBand (Mellanox), RDMA, and Ethernet fabric tuning for low-latency HPC workloads.

• Configure and fine-tune HPC schedulers (e.g., Slurm, OpenPBS, SGE) for optimal GPU workload distribution.

• Implement containerization strategies (Podman, Docker) and orchestration platforms (K3s, Kubernetes) for managing distributed AI/ML workloads.

Networking and Infrastructure Support

• Configure, monitor, and troubleshoot high-performance network fabrics, ensuring low-latency, high-throughput communication between GPU nodes.

• Deploy and maintain InfiniBand, RoCE, and high-speed Ethernet for HPC and AI clusters.

• Collaborate with networking teams to optimize routing, switching, and load balancing for distributed computing environments.

• Work closely with Nvidia engineers and system architects to implement GPUDirect Storage, NVLink, and Magnum IO for accelerated workloads.

Security, Automation, and Monitoring

• Maintain authentication and authorization systems such as Active Directory, OpenLDAP, and Keycloak.

• Automate system provisioning and configuration using Ansible, Terraform, or other Infrastructure-as-Code tools.

• Monitor system performance using Prometheus, Grafana, and ELK Stack, identifying and resolving bottlenecks in GPU workloads.

• Implement security best practices for multi-tenant HPC clusters, ensuring compliance with industry standards.

Troubleshooting and Client Support

• Serve as the lead technical resource for diagnosing and resolving complex software, networking, and hardware issues in large-scale GPU clusters.

• Analyze logs, conduct performance profiling, and debug CUDA, MPI, and RDMA-related issues.

• Work closely with AI/ML research teams, cloud engineers, and enterprise clients to optimize workload performance.

Collaboration and Process Improvement

• Support the ongoing development of internal HPC test environments and customer POCs.

• Work cross-functionally with Service Desk, Operations, and Service Delivery Management to ensure seamless service.

• Provide technical documentation, training, and mentorship to junior team members.

Contact Details:

asobbi Recruiting Team

Application deadline: 2027-11-05
