Senior HPC-AI Cluster Architect

NVIDIA AI

Full-Time | £70,000 - £90,000 per year (est.) | No working from home possible

At a Glance

Tasks: Design and maintain cutting-edge HPC/AI clusters for groundbreaking tech.
Company: Join a leading tech company revolutionising AI and supercomputing.
Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
Other info: Diverse and inclusive workplace with excellent career advancement opportunities.
Why this job: Be at the forefront of AI innovation and make a real impact.
Qualifications: 8+ years in HPC/AI with strong Linux and networking skills.

The predicted salary is between £70,000 and £90,000 per year.

NVIDIA is looking for an experienced HPC-AI Engineer to join the Networking Clusters Solutions Infrastructure team. We are focused on building supercomputers and AI clusters based on groundbreaking technologies. We are looking for an outstanding engineer to play a key role in the most exciting computing hardware and software and to contribute to the latest breakthroughs in artificial intelligence and GPU computing. You will work with the latest accelerated computing and deep learning software and hardware platforms, and with many scientific researchers, developers, and customers, to craft improved workflows and develop new, leading, differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop and bring up large-scale performance platforms.

What You Will Be Doing

Design, implement and maintain large scale HPC/AI clusters with monitoring, logging and alerting.
Manage Linux job/workload schedulers and orchestration tools.
Develop and maintain continuous integration and delivery pipelines.
Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
Deploy monitoring solutions for the servers, network and storage.
Troubleshoot at every level, from bare metal and operating system to software stack and application.
As a technical resource, develop, redefine and document standard methodologies to share with internal teams.
Support Research & Development activities and engage in POCs/POVs for future improvements.

What We Need To See

A degree in Computer Science, Engineering, or a related field and 8+ years of experience.
Knowledge of HPC and AI solution technologies from CPUs and GPUs to high speed interconnects and supporting software.
Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s).
Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking and internals, ACLs, OS-level security protection, and common protocols (e.g. TCP, DHCP, DNS).
Experience with multiple storage solutions such as Lustre, GPFS, Weka.io.
Familiarity with newer and emerging storage technologies.
Python programming and bash scripting experience.
Comfortable with automation and configuration management tools such as Jenkins, Ansible, Puppet/Chef.
Deep knowledge of networking protocols such as InfiniBand and Ethernet.
Deep understanding and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix).
Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud).

Ways To Stand Out From The Crowd

Knowledge of CPU and/or GPU architecture.
Knowledge of Kubernetes and container-related microservice technologies.
Experience with GPU-focused hardware/software (DGX, CUDA).
Experience with RDMA (InfiniBand or RoCE) fabrics.

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, colour, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.

Senior HPC-AI Cluster Architect employer: NVIDIA AI

NVIDIA is an exceptional employer, offering a dynamic work environment where innovation thrives. As a Senior HPC-AI Cluster Architect, you will be at the forefront of cutting-edge technology, collaborating with top-tier researchers and developers to shape the future of AI and supercomputing. With a strong emphasis on employee growth, diversity, and a culture that fosters creativity, NVIDIA provides unparalleled opportunities for professional development and meaningful contributions in a vibrant location.

Contact Details:

NVIDIA AI Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land the Senior HPC-AI Cluster Architect role

✨Tip Number 1

Network like a pro! Reach out to folks in the HPC and AI space on LinkedIn or at industry events. A friendly chat can open doors that a CV just can't.

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repo showcasing your projects related to HPC, AI, or any relevant tech. This gives potential employers a taste of what you can do.

✨Tip Number 3

Prepare for those interviews! Brush up on your technical knowledge and be ready to discuss your experience with tools like Slurm or Kubernetes. We want to see your passion and expertise shine through!

✨Tip Number 4

Apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you're genuinely interested in joining our team.

We think you need these skills to ace the Senior HPC-AI Cluster Architect role

HPC and AI solution technologies

Job scheduling workloads

Orchestration tools (e.g. Slurm, K8s)

Linux (Red Hat/CentOS and Ubuntu) networking

Networking protocols (TCP, DHCP, DNS)

Storage solutions (Lustre, GPFS, Weka.io)

Python programming

Bash scripting

Automation and configuration management tools (Jenkins, Ansible, Puppet/Chef)

Networking protocols (InfiniBand, Ethernet)

Virtual systems (VMware, Hyper-V, KVM, Citrix)

Cloud computing platforms (AWS, Azure, Google Cloud)

GPU-focused hardware/software (DGX, CUDA)

RDMA fabrics (InfiniBand or RoCE)

Some tips for your application 🫡

Tailor Your CV: Make sure your CV is tailored to the Senior HPC-AI Cluster Architect role. Highlight your experience with HPC and AI technologies, job scheduling tools, and any relevant projects you've worked on. We want to see how your skills align with what we're looking for!

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about HPC and AI, and how your background makes you a perfect fit for our team. Don't forget to mention specific technologies or projects that excite you!

Showcase Your Technical Skills: In your application, be sure to showcase your technical skills clearly. Mention your experience with Linux, Python, and any orchestration tools like Slurm or Kubernetes. We love seeing candidates who can demonstrate their hands-on experience!

Apply Through Our Website: We encourage you to apply through our website for the best chance of getting noticed. It’s super easy, and you'll be able to keep track of your application status. Plus, we love seeing applications come directly from our site!

How to prepare for a job interview at NVIDIA AI

✨Know Your Tech Inside Out

Make sure you’re well-versed in the latest HPC and AI technologies, especially around CPU and GPU architectures. Brush up on your knowledge of job scheduling tools like Slurm and Kubernetes, as well as networking protocols. Being able to discuss these topics confidently will show that you're not just familiar with the basics but are ready to dive deep.

✨Showcase Your Problem-Solving Skills

Prepare to discuss specific challenges you've faced in previous roles, particularly around large-scale system design and troubleshooting. Use the STAR method (Situation, Task, Action, Result) to structure your answers, highlighting how you approached problems and what solutions you implemented.

✨Demonstrate Your Automation Know-How

Since automation is key in this role, be ready to talk about your experience with tools like Jenkins, Ansible, or Puppet. Share examples of how you've automated deployment processes or improved operational efficiency in past projects. This will show that you can contribute to the continuous integration and delivery pipelines they’re looking for.

✨Engage with Questions

Prepare thoughtful questions about the team’s current projects, challenges they face, or future directions in HPC and AI. This not only shows your interest in the role but also gives you insight into whether the company aligns with your career goals. Plus, it makes for a more engaging conversation!
