Senior System Engineer
Who are we?
Our mission is to scale intelligence to serve humanity. We're training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI. We obsess over what we build and insist that each of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. We love working hard, moving fast, and cultivating a diverse range of perspectives.
We're looking for a senior engineer to help build, maintain, and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training, and will build the tooling that connects research ideas to thousands of GPUs. If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.
What You'll Work On
Build and own the training framework responsible for large-scale LLM training.
Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
Improve training throughput and stability on multi-node clusters (e.g., NVIDIA GB200/GB300, H100/H200, and AMD accelerators).
Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
Collaborate closely with infra teams to ensure Slurm setups, container environments, and hardware configurations support high-performance training.
Investigate and resolve performance bottlenecks across the ML systems stack.
Build robust systems that ensure reproducible, debuggable, large-scale runs.
You Might Be a Good Fit If You Have
Strong engineering experience in large-scale distributed training or HPC systems.
Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
Comfort debugging performance issues across CUDA/NCCL, networking, I/O, and data pipelines.
Experience working with containerized environments (Docker, Singularity/Apptainer).
A track record of building tools that increase developer velocity for ML teams.
Excellent judgment around trade-offs: performance vs. complexity, research velocity vs. maintainability.
Strong collaboration skills: you'll work closely with infra, research, and deployment teams.
Nice to Have
Experience with training LLMs or other large transformer architectures.
Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.).
Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches).
Experience with data pipeline optimization, sharded datasets, or caching strategies.
Background in performance engineering, profiling, or low-level systems.
Bonus: publications at top-tier venues such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, or EMNLP.
Why Join Us
You'll work on some of the most challenging and consequential ML systems problems today.
You'll collaborate with a world-class team working fast and at scale.
You'll have end-to-end ownership over critical components of the training stack.
You'll shape the next generation of infrastructure for frontier-scale models.
You'll build tools and systems that directly accelerate research and model quality.
Sample Projects
Build a high-performance data loading and caching pipeline.
Implement performance profiling across the ML systems stack.
Develop internal metrics and monitoring for training runs.
Build reproducibility and regression testing infrastructure.
Develop a performant, fault-tolerant distributed checkpointing system.
We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities. If you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.
Full-Time Employees At Cohere Enjoy These Perks
An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget for mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts, culture, fitness, well-being, and workspace improvement
Remote-flexible, with offices in Toronto, New York, San Francisco, London, and Paris, plus a co-working stipend
6 weeks of vacation (30 working days!)
Contact:
Cohere Recruiting Team