High-Performance Computing (HPC) Specialist – AI Training Infrastructure

Full-Time
£43,200 - £72,000 / year (est.)
No home office possible

At a Glance

  • Tasks: Design and manage cutting-edge AI training environments for large-scale machine learning models.
  • Company: Meta builds technologies that connect people and empower communities globally.
  • Benefits: Enjoy flexible work options, competitive pay, and a vibrant company culture.
  • Why this job: Join a team at the forefront of AI technology and make a real impact on social tech.
  • Qualifications: Bachelor’s or Master’s in Computer Science or related field; 3+ years in HPC for AI/ML workloads.
  • Other info: Meta is an equal opportunity employer committed to inclusivity and diversity.

The predicted salary is between £43,200 and £72,000 per year.

We are seeking experienced and passionate High-Performance Computing (HPC) Specialists to join our AI Training Infrastructure team. In this role, you will design, optimize, and manage cutting-edge AI training environments for large-scale machine learning models. You will collaborate with a multidisciplinary team to ensure seamless integration and scalability across heterogeneous hardware platforms.

Responsibilities

  • Design and implement HPC solutions for large-scale AI/ML training workloads, ensuring high performance, scalability, and efficiency.
  • Optimize AI training pipelines and workflows to maximize utilization of GPUs and other specialized accelerators.
  • Analyze and troubleshoot hardware bottlenecks, network issues, and performance inefficiencies in large-scale AI training environments.
  • Collaborate with AI/ML researchers and data scientists to tailor HPC solutions that meet their specific model training requirements.
  • Develop monitoring and profiling systems to ensure efficient utilization of resources across heterogeneous systems.
  • Stay updated with advancements in HPC, AI/ML frameworks, and heterogeneous hardware technologies.
  • Contribute to documentation, best practices, and knowledge sharing within the team.

Minimum Qualifications

  • Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related field.
  • 3+ years of experience in HPC environments, particularly for AI/ML workloads.
  • Proficiency in parallel programming, distributed systems, and HPC-specific libraries (e.g., MPI, OpenMP, CUDA, ROCm).
  • Hands-on experience with at least one hardware platform (e.g., NVIDIA GPUs, AMD GPUs, TPUs, FPGAs, or custom ASICs).
  • Familiarity with PyTorch.
  • Understanding of networked storage solutions, interconnects (e.g., InfiniBand, NVLink), and high-speed networking.
  • Experience optimizing resource utilization in multi-node training environments.
  • Problem-solving, communication, and collaboration skills.

Contact Details:

Job Traffic Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the High-Performance Computing (HPC) Specialist – AI Training Infrastructure role

Tip Number 1

Familiarise yourself with the latest advancements in HPC and AI/ML frameworks. This knowledge will not only help you during interviews but also demonstrate your passion for the field and your commitment to staying updated.

Tip Number 2

Network with professionals in the HPC and AI communities. Attend relevant conferences, webinars, or local meetups to connect with others in the industry. These connections can provide valuable insights and potentially lead to referrals.

Tip Number 3

Prepare to discuss specific projects where you've optimised AI training pipelines or resolved performance issues. Having concrete examples ready will showcase your hands-on experience and problem-solving skills.

Tip Number 4

Collaborate with peers on open-source projects related to HPC or AI. This not only enhances your skills but also builds your portfolio, making you a more attractive candidate for the role.

We think you need these skills to ace the High-Performance Computing (HPC) Specialist – AI Training Infrastructure role

HPC Environment Expertise
AI/ML Workload Optimisation
Parallel Programming
Distributed Systems
HPC Libraries (MPI, OpenMP, CUDA, ROCm)
Hardware Platform Proficiency (NVIDIA GPUs, AMD GPUs, TPUs, FPGAs, ASICs)
Familiarity with PyTorch
Networked Storage Solutions Knowledge
High-Speed Networking Understanding (InfiniBand, NVLink)
Resource Utilisation Optimisation
Problem-Solving Skills
Effective Communication Skills
Collaboration Skills
Monitoring and Profiling System Development

Some tips for your application 🫡

Tailor Your CV: Make sure your CV highlights your experience in High-Performance Computing (HPC) and AI/ML workloads. Include specific projects where you've designed or optimised HPC solutions, and mention any relevant technologies or programming languages you are proficient in.

Craft a Compelling Cover Letter: In your cover letter, express your passion for HPC and AI training infrastructure. Discuss how your background aligns with the responsibilities outlined in the job description, and provide examples of how you've collaborated with multidisciplinary teams in the past.

Showcase Relevant Skills: Clearly list your technical skills that are relevant to the role, such as parallel programming, experience with specific hardware platforms, and familiarity with AI/ML frameworks like PyTorch. Use bullet points for clarity and impact.

Proofread and Edit: Before submitting your application, thoroughly proofread your documents for any spelling or grammatical errors. Ensure that your formatting is consistent and professional, as attention to detail is crucial in technical roles.

How to prepare for a job interview at Job Traffic

Showcase Your Technical Skills

Be prepared to discuss your experience with HPC environments, particularly in AI/ML workloads. Highlight specific projects where you optimised training pipelines or resolved hardware bottlenecks, as this will demonstrate your hands-on expertise.

Understand the Role's Requirements

Familiarise yourself with the key technologies mentioned in the job description, such as MPI, OpenMP, and CUDA. Being able to speak knowledgeably about these tools will show that you are well-prepared and genuinely interested in the role.

Prepare for Problem-Solving Questions

Expect to face technical challenges during the interview. Practice articulating your thought process when troubleshooting issues in large-scale AI training environments, as this will showcase your analytical skills and problem-solving abilities.

Emphasise Collaboration Experience

Since the role involves working with a multidisciplinary team, be ready to share examples of how you've successfully collaborated with AI/ML researchers or data scientists in the past. This will highlight your communication skills and ability to work effectively in a team setting.

Application deadline: 2027-07-12
