At a Glance
- Tasks: Design and optimise GPU communication kernels for AI/ML workloads in a dynamic team.
- Company: Join Oriole Networks, a leader in photonic networking and AI technology.
- Benefits: Enjoy a hybrid work model, competitive salary, and opportunities for professional growth.
- Why this job: Make a real impact on cutting-edge AI technologies and revolutionise data centres.
- Qualifications: Proficient in C++ and Python with experience in GPU programming and high-performance computing.
- Other info: Collaborative environment focused on innovation and sustainability in tech.
The predicted salary is between £48,000 and £72,000 per year.
Oriole is seeking a talented ML Systems/Infrastructure Engineer to help co-optimize our AI/ML software stack with cutting-edge network hardware. You will be a key contributor to a high-impact, agile team focused on integrating middleware communication libraries and modelling the performance of large-scale AI/ML workloads.
Key Responsibilities
- Design and optimize custom GPU communication kernels to enhance performance and scalability across multi-node environments.
- Develop and maintain distributed communication frameworks for large-scale deep learning models, ensuring efficient parallelization and optimal resource utilization.
- Profile, benchmark, and debug GPU applications to identify and resolve bottlenecks in communication and computation pipelines.
- Collaborate closely with hardware and software teams to integrate optimized kernels with Oriole's next-generation network hardware and software stack.
- Contribute to system-level architecture decisions for large-scale GPU clusters, with a focus on communication efficiency, fault tolerance, and novel architectures for advanced optical network infrastructure.
Required Skills & Experience
- Proficient in C++ and Python, with a strong track record in high-performance computing or machine learning projects.
- Expertise in GPU programming with CUDA, including deep knowledge of GPU memory hierarchies and kernel optimization.
- Hands-on experience debugging GPU kernels with tools such as cuda-gdb, CUDA-MEMCHECK, and Nsight Systems, including at the PTX and SASS level.
- Strong understanding of communication libraries and protocols, including NCCL, NVSHMEM, OpenMPI, UCX, or custom collective communication implementations.
- Familiarity with HPC networking protocols and libraries such as RoCE, InfiniBand, libibverbs, and libfabric.
- Experience with distributed deep learning and mixture-of-experts (MoE) frameworks, such as PyTorch Distributed, vLLM, or DeepEP.
- Solid understanding of deploying and optimizing large-scale distributed deep learning workloads in production, with experience across Linux, Kubernetes, SLURM, OpenMPI, GPU drivers, Docker, and CI/CD automation.
Employer: EPIC Centre (ML Systems/Infrastructure Engineer, London)
Contact Detail:
EPIC Centre Recruiting Team
StudySmarter Expert Advice
We think this is how you could land the ML Systems/Infrastructure Engineer role in London
Tip Number 1
Network, network, network! Get out there and connect with folks in the industry. Attend meetups, webinars, or even just grab a coffee with someone who works in ML or infrastructure. You never know who might have the inside scoop on job openings!
Tip Number 2
Show off your skills! Create a portfolio showcasing your projects, especially those involving GPU programming or distributed deep learning. This will give potential employers a taste of what you can do and set you apart from the crowd.
Tip Number 3
Don't be shy about reaching out directly to companies like Oriole. A quick email or LinkedIn message expressing your interest can go a long way. Plus, applying through our website shows you're serious about joining the team!
Tip Number 4
Prepare for interviews by brushing up on your technical skills and understanding of communication libraries and protocols. Practice explaining your past projects and how they relate to the role. Confidence is key, so get ready to shine!
Some tips for your application
Tailor Your CV: Make sure your CV is tailored to the ML Systems/Infrastructure Engineer role. Highlight your experience with C++, Python, and GPU programming, as well as any relevant projects that showcase your skills in high-performance computing.
Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about AI/ML and how your background makes you a perfect fit for our team. Don't forget to mention specific experiences that relate to the job description.
Showcase Your Projects: If you've worked on any projects involving distributed deep learning or GPU optimization, make sure to include them in your application. We love seeing real-world applications of your skills, so don't hold back!
Apply Through Our Website: We encourage you to apply through our website for the best chance of getting noticed. It's super easy, and you'll be able to keep track of your application status directly with us!
How to prepare for a job interview at EPIC Centre
Know Your Tech Inside Out
Make sure you're well-versed in C++ and Python, as well as GPU programming with CUDA. Brush up on your knowledge of communication libraries like NCCL and OpenMPI, and be ready to discuss your hands-on experience with debugging tools. This will show that you're not just familiar with the tech but can also apply it effectively.
Showcase Your Problem-Solving Skills
Prepare to discuss specific challenges you've faced in previous projects, especially those related to performance bottlenecks in GPU applications. Use examples that highlight your ability to profile, benchmark, and debug effectively. This will demonstrate your analytical skills and how you approach complex problems.
Collaborate Like a Pro
Since this role involves working closely with hardware and software teams, be ready to talk about your collaborative experiences. Share examples of how you've integrated different systems or worked on cross-functional teams. This will illustrate your teamwork skills and your ability to contribute to system-level architecture decisions.
Understand the Bigger Picture
Familiarise yourself with Oriole's mission and how your role as an ML Systems/Infrastructure Engineer fits into their vision of revolutionising data centres. Being able to articulate how your work contributes to accelerating AI in a low-carbon world will set you apart and show your genuine interest in the company.