At a Glance
- Tasks: Design and maintain large-scale HPC/AI clusters while managing workloads and automating processes.
- Company: NVIDIA is a leader in computer graphics and AI, driving innovation for over 25 years.
- Benefits: Enjoy competitive salaries, extensive benefits, and a flexible, inclusive work environment.
- Why this job: Join a team pushing the boundaries of technology and contributing to groundbreaking AI advancements.
- Qualifications: Bachelor's degree or equivalent experience with 5+ years in HPC and AI technologies required.
- Other info: Opportunity to work with cutting-edge hardware and collaborate with top researchers and developers.
The predicted salary is between £48,000 and £84,000 per year.
NVIDIA is looking for an experienced HPC Engineer to join the E2E software verification HPC/AI Infrastructure team. We are building supercomputers and HPC clusters based on groundbreaking technologies. We are looking for an outstanding architect for a senior HPC role: a key player working with the most exciting computing hardware and software and contributing to the latest breakthroughs in artificial intelligence and GPU computing. You will provide insights on at-scale system design and tuning mechanisms for large-scale compute runs.
You will work with the latest accelerated computing and deep learning software and hardware platforms, and with many scientific researchers, developers, and customers, to craft improved workflows and develop new, leading differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop, and bring up large-scale performance platforms. Does this sound like you? If so, we would love to hear from you!
What you will be doing:
- Designing, implementing and maintaining large-scale HPC/AI clusters with monitoring, logging and alerting
- Managing Linux job/workload schedulers and orchestration tools
- Developing and maintaining continuous integration and delivery pipelines
- Developing tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources
- Deploying monitoring solutions for the servers, network and storage
- Troubleshooting and fixing issues bottom-up, from bare metal through the operating system and software stack to the application level
- Being a technical resource: developing, redefining and documenting standard methodologies to share with internal teams
- Supporting Research & Development activities and engaging in POCs/POVs for future improvements
What we need to see:
- Bachelor's Degree in Computer Science, Engineering, or a related field; or equivalent experience
- 5+ years of experience
- Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software
- Experience with job scheduling and orchestration tools such as Slurm and Kubernetes (K8s)
- Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalls, iptables, Wireshark, etc.) and internals, ACLs, OS-level security protection, and common protocols such as TCP, DHCP and DNS
- Experience with multiple storage solutions such as Lustre, GPFS, ZFS and XFS. Familiarity with newer and emerging storage technologies.
- Python programming and bash scripting experience.
- Comfortable with automation and configuration management tools such as Jenkins, Ansible and Puppet/Chef
- Deep knowledge of networking protocols such as InfiniBand and Ethernet
- Deep understanding and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix)
- Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud)
Ways to stand out from the crowd:
- Knowledge of CPU and/or GPU architecture
- Knowledge of Kubernetes and container-related microservice technologies
- Experience with GPU-focused hardware/software (DGX, CUDA)
- Background with RDMA (InfiniBand or RoCE) fabrics
NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. We have a unique legacy of innovation that's fueled by great technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing: an era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what's never been done before takes vision, innovation, and the world's best talent. Our teams are composed of driven, innovative professionals dedicated to pushing the boundaries of technology. We offer highly competitive salaries, an extensive benefits package, and a work environment that promotes diversity, inclusion, and flexibility. As an equal opportunity employer, we are committed to fostering a supportive and empowering workplace for all.
Senior HPC AI Cluster Engineer employer: NVIDIA Corporation
Contact Detail:
NVIDIA Corporation Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Senior HPC AI Cluster Engineer role
✨Tip Number 1
Familiarise yourself with the latest HPC and AI technologies, especially those related to NVIDIA's offerings. Understanding their GPU architecture and how it integrates with AI workloads will give you a significant edge during discussions.
✨Tip Number 2
Engage with the HPC community through forums, webinars, and conferences. Networking with professionals in the field can provide insights into current trends and challenges, which you can leverage in your conversations with us.
✨Tip Number 3
Showcase your hands-on experience with job scheduling tools like Slurm and orchestration platforms such as Kubernetes. Being able to discuss specific projects where you've implemented these tools will demonstrate your practical knowledge.
✨Tip Number 4
Prepare to discuss your experience with automation and configuration management tools like Jenkins and Ansible. Highlighting how you've used these tools to streamline processes in previous roles will resonate well with our team.
Some tips for your application 🫡
Tailor Your CV: Make sure your CV highlights relevant experience in HPC and AI technologies. Focus on your achievements in designing and maintaining large-scale clusters, as well as any specific tools or programming languages mentioned in the job description.
Craft a Compelling Cover Letter: In your cover letter, express your passion for HPC and AI. Mention specific projects or experiences that align with NVIDIA's goals, and explain how your skills can contribute to their innovative work in supercomputing.
Showcase Technical Skills: Clearly outline your technical skills related to job scheduling, orchestration tools, and programming languages like Python and bash scripting. Provide examples of how you've used these skills in previous roles to solve complex problems.
Highlight Collaborative Experience: Since the role involves working with researchers and developers, emphasise any collaborative projects you've been part of. Discuss how you contributed to team success and improved workflows, showcasing your ability to work in a multidisciplinary environment.
How to prepare for a job interview at NVIDIA Corporation
✨Showcase Your Technical Expertise
Be prepared to discuss your experience with HPC and AI technologies in detail. Highlight specific projects where you've designed or maintained large-scale clusters, and be ready to explain the challenges you faced and how you overcame them.
✨Demonstrate Problem-Solving Skills
Expect technical questions that assess your troubleshooting abilities. Prepare examples of how you've resolved issues from bare metal to application level, showcasing your systematic approach to problem-solving.
✨Familiarise Yourself with Relevant Tools
Make sure you know the orchestration tools and job scheduling systems mentioned in the job description, such as Slurm and Kubernetes. Being able to discuss your hands-on experience with these tools will set you apart.
✨Engage with the Interviewers
During the interview, ask insightful questions about the team’s current projects and future goals. This shows your genuine interest in the role and helps you understand how you can contribute to their success.