Senior Networking Solution Test Engineer – AI Cluster Debugging

Full-Time | £48,000 - £84,000 / year (est.) | No home office possible
Nvidia

At a Glance

  • Tasks: Join our team to debug and test cutting-edge AI networking solutions.
  • Company: NVIDIA, a leader in innovative technology and inclusivity.
  • Benefits: Competitive salary, diverse work environment, and opportunities for growth.
  • Why this job: Make a real impact in AI technology while working with top-tier professionals.
  • Qualifications: 8+ years in networking testing, strong Linux skills, and debugging expertise.
  • Other info: Collaborative culture focused on pushing technological boundaries.

The predicted salary is between £48,000 and £84,000 per year.

We are looking for a Senior Networking Test Engineer with strong system‑level debugging skills to join our End‑to‑End Verification team! You will work on pioneering NVLink, Ethernet and InfiniBand‑based AI clusters. Additionally, you will own complex issues across hardware, system software and AI workloads.

What You’ll Be Doing

  • Design and review test and product requirements across the NVLink, Ethernet and InfiniBand / NIC / DPU / Switch portfolio, focusing on large‑scale AI cluster behaviour.
  • Build and maintain realistic customer‑like testbeds, including heterogeneous hardware, OS / driver combinations and complex network fabrics.
  • Own end‑to‑end cluster troubleshooting: reproduce customer scenarios, triage across the stack and drive issues to root cause and fix.
  • Read and understand relevant source code to identify defects, validate fixes and improve logging and instrumentation.
  • Collaborate closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments.
  • Define tests and guide the automation team to implement robust, debuggable suites that produce actionable logs, metrics and traces.
  • Run Regression, Performance, Functional and Scale testing, analyse results and provide clear, data‑driven reports to collaborators.
  • Profile and benchmark deep learning training and inference workloads, correlating model‑level metrics with system and network telemetry to uncover bottlenecks.

What We Need To See

  • B.A./B.Sc. in Computer Science, Electrical Engineering, or equivalent IT/Network/Systems experience.
  • 8+ years of hands‑on networking or system‑level testing and debugging on Linux.
  • Strong Linux networking and debugging skills (for example perf, tcpdump, ethtool, iproute2).
  • Proven production‑grade debugging experience: forming hypotheses, running experiments, and driving issues to root cause under pressure.
  • Expertise in host‑side NIC validation and tuning (offloads, queues, interrupts, firmware/driver interactions).
  • Strong knowledge of AI networking libraries (such as NCCL) and protocols (such as RoCE and RDMA), including performance and correctness debugging.
  • Ability to read and reason about source code (C/C++/Python or similar) and collaborate closely with developers on fixes.
  • Solid scripting and automation skills with Bash / Python / Ansible for setup, log collection, and experiment orchestration.
  • Fast learner, familiar with modern AI tools and workflows, able to adapt quickly.
  • Excellent analytical, problem‑solving and communication skills, with strong ownership and a collaborative approach.

Ways To Stand Out From The Crowd

  • Hands‑on debugging of collective communication libraries (for example NCCL) or large‑scale LLM training / inference clusters.
  • Experience with large cluster environments (tens to thousands of GPUs or nodes), including incident response and post‑mortem analysis.
  • Deep expertise in tuning and debugging congestion control and lossless Ethernet for AI workloads (for example DCQCN, ECN, PFC).
  • Familiarity with NVIDIA networking technologies (for example BlueField / BF3, ConnectX NICs) and their software stack and diagnostics.
  • Experience debugging issues that span multiple layers (L2/L3, transport, AI frameworks) or contributing to open‑source networking / AI systems.

At NVIDIA, we value diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, colour, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We provide reasonable accommodations to ensure all individuals can participate in the job application or interview process, perform essential job functions, and receive other benefits and privileges of employment. Join us and be part of a team that’s pushing the boundaries of technology and making a real impact in the world.

Senior Networking Solution Test Engineer – AI Cluster Debugging employer: Nvidia

At NVIDIA, we pride ourselves on being an exceptional employer, offering a dynamic work culture that fosters innovation and collaboration. Our commitment to employee growth is evident through continuous learning opportunities and the chance to work on cutting-edge AI technologies in a diverse and inclusive environment. Join us in our vibrant location, where you can make a meaningful impact while enjoying competitive benefits and a supportive team atmosphere.

Contact Detail:

Nvidia Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Senior Networking Solution Test Engineer – AI Cluster Debugging role

Tip Number 1

Network, network, network! Get out there and connect with folks in the industry. Attend meetups, webinars, or even just chat with people on LinkedIn. You never know who might have a lead on that perfect Senior Networking Test Engineer role!

Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to AI clusters and networking. This gives potential employers a taste of what you can do and sets you apart from the crowd.

Tip Number 3

Prepare for technical interviews by brushing up on your debugging skills. Practice explaining your thought process while solving problems, as this is key for roles like the one we’re hiring for. Remember, it’s not just about getting the right answer but how you approach the problem!

Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are genuinely interested in joining our team at NVIDIA!

We think you need these skills to ace the Senior Networking Solution Test Engineer – AI Cluster Debugging role

System-Level Debugging
Networking Skills
Linux Networking
Debugging Tools (perf, tcpdump, ethtool, iproute2)
NIC Validation and Tuning
AI Networking Libraries (NCCL)
Protocols (RoCE, RDMA)
Source Code Reading (C/C++/Python)
Scripting and Automation (Bash, Python, Ansible)
Analytical Skills
Problem-Solving Skills
Communication Skills
Adaptability
Collaboration Skills

Some tips for your application 🫡

Tailor Your CV: Make sure your CV highlights your relevant experience in networking and system-level debugging. Use keywords from the job description to show we’re on the same page about what you bring to the table.

Show Off Your Skills: In your application, don’t just list your skills—demonstrate them! Share specific examples of how you've tackled complex issues or improved processes in previous roles. We love seeing real-world applications of your expertise.

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Explain why you’re excited about this role and how your background makes you a perfect fit. Keep it conversational but professional, and let your passion for AI and networking come through.

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way to ensure your application gets into the right hands. Plus, you’ll find all the details you need about the role and our team!

How to prepare for a job interview at Nvidia

Know Your Tech Inside Out

Make sure you’re well-versed in the technologies mentioned in the job description, especially NVLink, Ethernet, and InfiniBand. Brush up on your Linux networking skills and be ready to discuss specific tools like tcpdump and ethtool, as well as your hands-on experience with debugging.

Prepare Real-World Scenarios

Think of complex issues you've tackled in previous roles, particularly those involving AI clusters or large-scale environments. Be prepared to walk through your thought process on how you identified and resolved these issues, showcasing your analytical and problem-solving skills.

Showcase Your Collaboration Skills

Since this role involves working closely with development teams, be ready to share examples of how you’ve collaborated in the past. Highlight any experiences where you’ve worked on debugging projects, especially those that required cross-team communication and cooperation.

Demonstrate Your Passion for AI and Networking

Express your enthusiasm for AI technologies and networking solutions. Discuss any personal projects or continuous learning efforts related to AI workloads or networking protocols. This will show your commitment to staying current in the field and your eagerness to contribute to the team.
