Senior HPC AI Cluster Engineer

Full-Time | £70,000 - £90,000 / year (est.) | No working from home possible

At a Glance

  • Tasks: Design and maintain cutting-edge HPC/AI clusters while collaborating with top researchers.
  • Company: Join NVIDIA, a leader in AI and supercomputing technology.
  • Benefits: Competitive salary, inclusive culture, and opportunities for professional growth.
  • Other info: Diverse team environment with a commitment to innovation and excellence.
  • Why this job: Be at the forefront of AI breakthroughs and work with groundbreaking technologies.
  • Qualifications: Degree in Computer Science or Engineering with 8+ years of relevant experience.

The predicted salary is between £70,000 and £90,000 per year.

NVIDIA is looking for an experienced HPC/AI Engineer to join the Networking Clusters Solutions Infrastructure team. We are focused on building supercomputers and AI clusters based on groundbreaking technologies. We are looking for an outstanding engineer to be a key player working on the most exciting computing hardware and software, contributing to the latest breakthroughs in artificial intelligence and GPU computing.

You will provide insights on at-scale system design and tuning mechanisms for large-scale compute runs. You will work with the latest accelerated computing and deep learning software and hardware platforms, and with many scientific researchers, developers, and customers to craft improved workflows and develop new, differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop, and bring up large-scale performance platforms.

What You Will Be Doing

  • Design, implement and maintain large-scale HPC/AI clusters with monitoring, logging and alerting.
  • Manage Linux job/workload schedulers and orchestration tools.
  • Develop and maintain continuous integration and delivery pipelines.
  • Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
  • Deploy monitoring solutions for the servers, network and storage.
  • Perform troubleshooting across the bare-metal, operating system, software stack and application levels.
  • As a technical resource, develop, refine and document standard methodologies to share with internal teams.
  • Support Research & Development activities and engage in POCs/POVs for future improvements.

What We Need To See

  • A degree in Computer Science, Engineering, or a related field and 8+ years of experience.
  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software.
  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s).
  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking and internals, ACLs, OS-level security protection, and common protocols (e.g. TCP, DHCP, DNS).
  • Experience with multiple storage solutions such as Lustre, GPFS, Weka.io.
  • Familiarity with newer and emerging storage technologies.
  • Python programming and bash scripting experience.
  • Comfortable with automation and configuration management tools such as Jenkins, Ansible, Puppet/Chef.
  • Deep knowledge of networking protocols such as InfiniBand and Ethernet.
  • Deep understanding and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix).
  • Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud).

Ways To Stand Out From The Crowd

  • Knowledge of CPU and/or GPU architecture.
  • Knowledge of Kubernetes and container-related microservice technologies.
  • Experience with GPU-focused hardware/software (DGX, CUDA).
  • Experience with RDMA (InfiniBand or RoCE) fabrics.

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, colour, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.

Senior HPC AI Cluster Engineer employer: NVIDIA AI

NVIDIA is an exceptional employer, offering a dynamic work environment where innovation thrives. As a Senior HPC AI Cluster Engineer, you will engage with cutting-edge technologies and collaborate with leading experts in the field, fostering both personal and professional growth. The company promotes a culture of diversity and inclusion, ensuring that every employee has the opportunity to contribute meaningfully while enjoying comprehensive benefits and a supportive atmosphere.

Contact Details:

NVIDIA AI Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land the Senior HPC AI Cluster Engineer role

Tip Number 1

Network like a pro! Attend industry meetups, conferences, or online webinars related to HPC and AI. Connecting with professionals in the field can lead to job opportunities that aren't even advertised yet.

Tip Number 2

Show off your skills! Create a portfolio showcasing your projects, especially those involving HPC and AI technologies. This gives potential employers a tangible look at what you can do and sets you apart from the crowd.

Tip Number 3

Prepare for technical interviews by brushing up on your knowledge of HPC systems, job scheduling tools, and networking protocols. Practising common interview questions can help you feel more confident when it’s time to shine.

Tip Number 4

Don’t forget to apply through our website! We’re always on the lookout for talented individuals like you. Plus, it’s a great way to ensure your application gets the attention it deserves.

We think you need these skills to ace the Senior HPC AI Cluster Engineer role

HPC and AI solution technologies
Job scheduling workloads
Orchestration tools (e.g., Slurm, K8s)
Linux (Redhat/CentOS and Ubuntu) networking
Networking protocols (e.g., TCP, DHCP, DNS)
Storage solutions (e.g., Lustre, GPFS, Weka.io)
Python programming

Some tips for your application 🫡

Tailor Your CV: Make sure your CV is tailored to the Senior HPC AI Cluster Engineer role. Highlight your experience with HPC and AI technologies, job scheduling tools, and any relevant projects you've worked on. We want to see how your skills align with what we're looking for!

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about HPC and AI, and how your background makes you a perfect fit for our team. Be sure to mention specific technologies or experiences that relate to the job description.

Showcase Your Technical Skills: In your application, don't forget to showcase your technical skills, especially in areas like Python programming, Linux systems, and networking protocols. We love seeing candidates who can demonstrate their expertise and problem-solving abilities!

Apply Through Our Website: We encourage you to apply through our website for the best chance of getting noticed. It’s super easy, and you'll be able to keep track of your application status. Plus, we love seeing applications come directly from our site!

How to prepare for a job interview at NVIDIA AI

Know Your Tech Inside Out

Make sure you brush up on your knowledge of HPC and AI technologies, especially around CPU and GPU architectures. Be ready to discuss specific projects where you've designed or maintained large-scale clusters, as this will show your hands-on experience.

Showcase Your Problem-Solving Skills

Prepare to share examples of how you've tackled complex issues in previous roles. Think about times when you had to troubleshoot from the bare metal up to the application level, and be ready to explain your thought process and the tools you used.

Familiarise Yourself with Job Scheduling Tools

Since job scheduling and orchestration tools like Slurm and Kubernetes are key for this role, make sure you can discuss your experience with them. If you’ve implemented or optimised these tools in past projects, have those examples ready to share.

Demonstrate Your Automation Know-How

Highlight your experience with automation and configuration management tools such as Jenkins, Ansible, or Puppet. Be prepared to discuss how you've used these tools to streamline processes or improve efficiency in your previous roles.