At a Glance
- Tasks: Design and optimise GPU communication kernels for AI/ML workloads in a dynamic team.
- Company: Join Oriole, a cutting-edge tech company revolutionising AI/ML infrastructure.
- Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
- Why this job: Make a real impact on the future of AI/ML technology with innovative projects.
- Qualifications: Proficient in C++ and Python, with experience in GPU programming and high-performance computing.
- Other info: Collaborative environment with a focus on advanced network infrastructure and career advancement.
The predicted salary is between £36,000 and £60,000 per year.
Oriole is seeking a talented ML Systems/Infrastructure Engineer to help co-optimize our AI/ML software stack with cutting-edge network hardware. You’ll be a key contributor to a high-impact, agile team focused on integrating middleware communication libraries and modelling the performance of large-scale AI/ML workloads.
Key Responsibilities
- Design and optimize custom GPU communication kernels to enhance performance and scalability across multi-node environments.
- Develop and maintain distributed communication frameworks for large-scale deep learning models, ensuring efficient parallelization and optimal resource utilization.
- Profile, benchmark, and debug GPU applications to identify and resolve bottlenecks in communication and computation pipelines.
- Collaborate closely with hardware and software teams to integrate optimized kernels with Oriole’s next-generation network hardware and software stack.
- Contribute to system-level architecture decisions for large-scale GPU clusters, with a focus on communication efficiency, fault tolerance, and novel architectures for advanced optical network infrastructure.
Required Skills & Experience
- Proficient in C++ and Python, with a strong track record in high-performance computing or machine learning projects.
- Expertise in GPU programming with CUDA, including deep knowledge of GPU memory hierarchies and kernel optimization.
- Hands-on experience debugging GPU kernels with tools such as cuda-gdb, cuda-memcheck, and Nsight Systems, including the ability to read generated PTX and SASS.
- Strong understanding of communication libraries and protocols, including NCCL, NVSHMEM, OpenMPI, UCX, or custom collective communication implementations.
- Familiarity with HPC networking protocols/libraries such as RoCE, InfiniBand, libibverbs, and libfabric.
- Experience with distributed deep learning/MoE frameworks, including PyTorch Distributed, vLLM, or DeepEP.
- Solid understanding of deploying and optimizing large-scale distributed deep learning workloads in production environments, including Linux, Kubernetes, SLURM, OpenMPI, GPU drivers, Docker, and CI/CD automation.
ML Systems/Infrastructure Engineer, London. Employer: Oriole
Contact: Oriole Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the ML Systems/Infrastructure Engineer role in London
✨Tip Number 1
Network, network, network! Get out there and connect with folks in the industry. Attend meetups, webinars, or even online forums related to ML and infrastructure. You never know who might have a lead on your dream job!
✨Tip Number 2
Show off your skills! Create a portfolio showcasing your projects, especially those involving GPU programming and distributed systems. This will give potential employers a taste of what you can do and set you apart from the crowd.
✨Tip Number 3
Don’t just apply blindly! Tailor your approach for each application. Research the company, understand their tech stack, and mention how your experience with CUDA or communication libraries can benefit them directly.
✨Tip Number 4
Apply through our website! We love seeing candidates who take the initiative. Plus, it gives you a better chance to stand out in our system. So, don’t hesitate – hit that apply button and show us what you've got!
Some tips for your application 🫡
Tailor Your CV: Make sure your CV is tailored to the ML Systems/Infrastructure Engineer role. Highlight your experience with C++, Python, and GPU programming, as well as any relevant projects that showcase your skills in high-performance computing.
Showcase Your Projects: Include specific examples of your work with distributed communication frameworks and GPU applications. We want to see how you've tackled challenges in previous roles, especially those related to deep learning and performance optimisation.
Be Clear and Concise: When writing your cover letter, keep it clear and to the point. Explain why you're excited about the role at Oriole and how your skills align with the job description. We appreciate straightforward communication!
Apply Through Our Website: Don’t forget to apply through our website! It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it shows you’re keen on joining our team!
How to prepare for a job interview at Oriole
✨Know Your Tech Inside Out
Make sure you’re well-versed in C++ and Python, as well as GPU programming with CUDA. Brush up on your knowledge of communication libraries like NCCL and OpenMPI, and be ready to discuss how you've used these in past projects.
✨Showcase Your Problem-Solving Skills
Prepare to talk about specific challenges you've faced in high-performance computing or machine learning. Think of examples where you debugged GPU kernels or optimised communication frameworks, and be ready to explain your thought process.
✨Collaborate Like a Pro
Since collaboration is key, think of instances where you worked closely with hardware and software teams. Be prepared to discuss how you contributed to system-level architecture decisions and how you ensured efficient resource utilisation.
✨Demonstrate Your Passion for Innovation
Oriole is looking for someone who can contribute to cutting-edge solutions. Share your thoughts on novel architectures for advanced optical network infrastructure and any personal projects that showcase your enthusiasm for the field.