At a Glance
- Tasks: Build and maintain cutting-edge ML training frameworks for large-scale models.
- Company: Join a diverse team at Cohere, shaping the future of AI.
- Benefits: Enjoy competitive perks like remote flexibility, health benefits, and 6 weeks of vacation.
- Why this job: Make a real impact on AI systems and work with world-class talent.
- Qualifications: Strong experience in distributed training and HPC systems required.
- Other info: Inclusive culture that values diversity and encourages applicants from all backgrounds.
The predicted salary is between £48,000 and £72,000 per year.
Who Are We?
Our mission is to scale intelligence to serve humanity. We’re training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.
We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. We like to work hard and move fast to do what’s best for our customers. Cohere is a team of researchers, engineers, designers, and more, who are passionate about their craft, and each person is one of the best in the world at what they do. We believe that a diverse range of perspectives is a requirement for building great products. Join us on our mission and shape the future!
We’re looking for a senior engineer to help build, maintain, and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training, and build the tooling that connects research ideas to thousands of GPUs. If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.
What You’ll Work On
- Build and own the training framework responsible for large‑scale LLM training.
- Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
- Improve training throughput and stability on multi-node clusters (e.g., NVIDIA H100/H200 and GB200/GB300, and AMD accelerators).
- Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
- Collaborate closely with infra teams to ensure Slurm setups, container environments, and hardware configurations support high‑performance training.
- Investigate and resolve performance bottlenecks across the ML systems stack.
- Build robust systems that ensure reproducible, debuggable, large‑scale runs.
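To give a flavour of the checkpointing and reliability work above: a common building block for fault-tolerant, reproducible runs is making checkpoint writes atomic, so a crash mid-save never leaves a corrupt file behind. Here is a minimal sketch of that pattern in Python; the function names and JSON payload are illustrative only, not Cohere's actual API (a real system would serialize sharded tensor state, not JSON).

```python
import json
import os
import tempfile


def save_checkpoint_atomically(state: dict, path: str) -> None:
    """Write `state` to `path` so readers never observe a partial file.

    Pattern: write to a temp file in the same directory, fsync it,
    then os.replace() -- an atomic rename on POSIX filesystems.
    A crash at any point leaves either the old checkpoint or the new
    one on disk, never a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise


def load_checkpoint(path: str) -> dict:
    with open(path) as f:
        return json.load(f)
```

The temp file must live in the same directory as the destination, because `os.replace` is only atomic within a single filesystem.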
You Might Be a Good Fit If You Have
- Strong engineering experience in large‑scale distributed training or HPC systems.
- Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
- Experience with multi‑node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
- Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
- Experience working with containerized environments (Docker, Singularity/Apptainer).
- A track record of building tools that increase developer velocity for ML teams.
- Excellent judgment around trade‑offs: performance vs complexity, research velocity vs maintainability.
- Strong collaboration skills — you’ll work closely with infra, research, and deployment teams.
Nice to Have
- Experience with training LLMs or other large transformer architectures.
- Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.).
- Familiarity with evaluation and serving frameworks (vLLM, TensorRT‑LLM, custom KV caches).
- Experience with data pipeline optimization, sharded datasets, or caching strategies.
- Background in performance engineering, profiling, or low‑level systems.
Bonus: papers at top-tier venues (such as NeurIPS, ICML, ICLR, AIStats, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).
Why Join Us
- You’ll work on some of the most challenging and consequential ML systems problems today.
- You’ll collaborate with a world‑class team working fast and at scale.
- You’ll have end‑to‑end ownership over critical components of the training stack.
- You’ll shape the next generation of infrastructure for frontier‑scale models.
- You’ll build tools and systems that directly accelerate research and model quality.
Sample Projects:
- Build a high‑performance data loading and caching pipeline.
- Implement performance profiling across the ML systems stack.
- Develop internal metrics and monitoring for training runs.
- Build reproducibility and regression testing infrastructure.
- Develop a performant fault‑tolerant distributed checkpointing system.
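As one sketch of the data-loading project above: a standard ingredient in high-throughput input pipelines is background prefetching, so batch preparation overlaps with the training step instead of blocking it. The snippet below is a minimal single-consumer illustration (names are ours, not Cohere's); a production version would add multi-worker loading, sharding, and caching.

```python
import queue
import threading
from typing import Iterable, Iterator

_SENTINEL = object()  # marks end-of-stream on the queue


def prefetch(batches: Iterable, buffer_size: int = 4) -> Iterator:
    """Yield items from `batches`, loading up to `buffer_size` ahead.

    A daemon thread fills a bounded queue while the training loop
    consumes from it, so data loading overlaps with compute.
    """
    q: queue.Queue = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        for item in batches:
            q.put(item)  # blocks when the buffer is full (backpressure)
        q.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item
```

The bounded queue gives natural backpressure: if training is the bottleneck, the loader pauses instead of buffering unboundedly.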
If some of the above doesn’t line up perfectly with your experience, we still encourage you to apply! We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.
Full‑Time Employees at Cohere enjoy these Perks:
- An open and inclusive culture and work environment.
- Work closely with a team on the cutting edge of AI research.
- Weekly lunch stipend, in‑office lunches & snacks.
- Full health and dental benefits, including a separate budget to take care of your mental health.
- 100% Parental Leave top‑up for up to 6 months.
- Personal enrichment benefits towards arts and culture, fitness and well‑being, quality time, and workspace improvement.
- Remote‑flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co‑working stipend.
- 6 weeks of vacation (30 working days!).
Senior ML Systems Engineer, Frameworks & Tooling
Employer: Cohere
Contact: Cohere Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Senior ML Systems Engineer, Frameworks & Tooling role
✨Tip Number 1
Network like a pro! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have the inside scoop on job openings or can put in a good word for you.
✨Tip Number 2
Show off your skills! Create a portfolio or GitHub repository showcasing your projects and contributions. This is your chance to demonstrate your expertise in ML systems and distributed training — make it shine!
✨Tip Number 3
Prepare for interviews by brushing up on technical concepts and problem-solving skills. Practice coding challenges and system design questions relevant to ML frameworks. We want to see how you think and tackle real-world problems!
✨Tip Number 4
Apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are genuinely interested in joining our mission to scale intelligence for humanity.
Some tips for your application 🫡
Show Your Passion: When writing your application, let your enthusiasm for AI and ML shine through! We want to see how your passion aligns with our mission to scale intelligence and serve humanity.
Tailor Your Experience: Make sure to highlight your relevant experience in large-scale distributed training or HPC systems. We love seeing how your skills can contribute to building and evolving our training framework!
Be Clear and Concise: Keep your application clear and to the point. We appreciate straightforward communication, so make sure to articulate your thoughts without unnecessary fluff.
Apply Through Our Website: Don’t forget to apply through our website! It’s the best way for us to receive your application and ensures you’re considered for this exciting opportunity.
How to prepare for a job interview at Cohere
✨Know Your Tech Inside Out
Make sure you’re well-versed in the technologies mentioned in the job description, especially JAX internals and distributed training libraries. Brush up on your knowledge of multi-node cluster orchestration tools like Slurm or Kubernetes, as these will likely come up during technical discussions.
✨Showcase Your Problem-Solving Skills
Prepare to discuss specific examples where you've tackled performance bottlenecks or improved training throughput. Be ready to explain your thought process and the trade-offs you considered, as this will demonstrate your engineering judgment and ability to handle complex challenges.
✨Collaborate Like a Pro
Since this role involves working closely with various teams, think of examples that highlight your collaboration skills. Be prepared to discuss how you’ve successfully worked with infra, research, and deployment teams in the past, and how you can bring that experience to Cohere.
✨Ask Insightful Questions
Prepare thoughtful questions about the team’s current projects, challenges they face, and their vision for the future. This shows your genuine interest in the role and helps you gauge if the company culture aligns with your values, especially around diversity and inclusion.