At a Glance
- Tasks: Enhance LLM training performance and develop innovative model architectures.
- Company: Join JetBrains, a leader in developer tools since 2000.
- Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
- Why this job: Be at the forefront of AI technology and make a real impact.
- Qualifications: Strong experience with PyTorch and multi-node GPU setups required.
- Other info: Dynamic team environment with exciting challenges and career advancement.
The predicted salary is between £36,000 and £60,000 per year.
At JetBrains, code is our passion. Ever since we started back in 2000, we have been striving to make the strongest, most effective developer tools on earth. By automating routine checks and corrections, our tools speed up production, freeing developers to grow, discover, and create. We are looking for a Research Engineer who will own the training stack and model architecture for our Mellum LLM family.
Responsibilities
- Be responsible for improving end-to-end performance for multi-node LLM pre-training and post-training pipelines.
- Profile hotspots (Nsight Systems/Compute, NVTX) and fix them using compute/comm overlap, kernel fusion, scheduling, etc.
- Design and evaluate architecture choices (depth/width, attention variants including GQA/MQA/MLA/Flash-style, RoPE scaling/NTK, and MoE routing and load-balancing).
- Implement custom ops (Triton and/or CUDA C++), integrate via PyTorch extensions, and upstream when possible.
- Push memory/perf levers: FSDP/ZeRO, activation checkpointing, FP8/TE, tensor/pipeline/sequence/expert parallelism, NCCL tuning.
- Harden large runs by building elastic and fault-tolerant training setups, ensuring robust checkpointing, strengthening reproducibility, and improving resilience to preemption.
- Keep the data path fast using streaming and sharded data loaders and tokenizer pipelines, as well as improve overall throughput and cache efficiency.
- Define the right metrics, build dashboards, and deliver steady improvements.
- Run both pre-training and post-training (including SFT, RLHF, and GRPO-style methods) efficiently across sizable clusters.
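The data-path bullet above — streaming, sharded data loaders — can be illustrated with a minimal, hypothetical sketch of how shard files are typically partitioned across data-parallel ranks so each worker streams a disjoint subset (the function and names here are illustrative, not JetBrains code):

```python
def shards_for_rank(shards, rank, world_size):
    """Round-robin shard assignment: rank r streams shards r, r + world_size, ...
    so every data-parallel rank reads a disjoint, roughly equal subset."""
    if world_size <= 0 or not (0 <= rank < world_size):
        raise ValueError("rank must be in [0, world_size)")
    return shards[rank::world_size]


# Example: 10 shard files split across 4 data-parallel ranks.
shards = [f"shard-{i:05d}.parquet" for i in range(10)]
assignment = {r: shards_for_rank(shards, r, 4) for r in range(4)}
```

Round-robin (strided) assignment is a common baseline because it keeps per-rank shard counts within one of each other even when shard sizes drift over time.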
Qualifications
- Strong PyTorch and PyTorch Distributed experience, having run multi-node jobs with tens to hundreds of GPUs.
- Hands-on experience with Megatron-LM/Megatron-Core/NeMo or DeepSpeed, or serious FSDP/ZeRO expertise.
- Real profiling expertise (Nsight Systems/Compute, nvprof) and experience with NVTX-instrumented workflows.
- GPU programming skills with Triton and/or CUDA, and the ability to write, test, and debug kernels.
- A solid understanding of NCCL collectives, as well as topology and fabric effects (IB/RoCE), and how they show up in traces.
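As a sense of what "understanding NCCL collectives and fabric effects" means in practice, here is a sketch of the standard ring all-reduce cost model (the textbook model NCCL's ring algorithm follows; the function name and default values are illustrative assumptions, not part of this posting):

```python
def ring_allreduce_time(bytes_per_gpu, n_gpus, bus_bandwidth_bytes_per_s, latency_s=0.0):
    """Textbook ring all-reduce cost model: 2 * (n - 1) steps, with each GPU
    sending and receiving 2 * (n - 1) / n of the buffer over the slowest link."""
    steps = 2 * (n_gpus - 1)
    bandwidth_term = (2 * (n_gpus - 1) / n_gpus) * bytes_per_gpu / bus_bandwidth_bytes_per_s
    return steps * latency_s + bandwidth_term


# Example: all-reducing a 1 GB gradient buffer across 8 GPUs on a
# 100 GB/s link takes roughly 17.5 ms in the bandwidth-bound regime.
t = ring_allreduce_time(1e9, 8, 100e9)
```

The model makes the trace-level effects in the bullet concrete: the bandwidth term is nearly flat in GPU count, while the latency term grows linearly with ring length, which is why small messages on long rings (or a slow inter-node fabric link) dominate profiles.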
Ideal Candidate Experience
- FlashAttention-2 and 3, CUTLASS and CuTe, TransformerEngine and FP8, Inductor, AOTAutograd, and torch.compile.
- MoE at scale (expert parallel, router losses, capacity management) and long-context tricks (ALiBi/YaRN/NTK scaling).
- Kubernetes or SLURM at scale, placement and affinity tuning, as well as AWS, GCP, and Azure GPU fleets.
- Web-scale data plumbing (streaming datasets, Parquet and TFRecord, tokenizer perf), eval harnesses, and benchmarking.
- Safety and post-training methods, such as DPO, ORPO, GRPO, and reward models.
- Inference ecosystems such as vLLM and paged KV.
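The "capacity management" item in the MoE bullet can be sketched with a deliberately simplified top-1 router: each expert accepts at most `capacity` tokens and overflow tokens are dropped, a common baseline in expert-parallel training (this is an illustrative toy, not any particular framework's router):

```python
def route_top1(token_expert_ids, n_experts, capacity):
    """Greedy top-1 MoE routing with a fixed per-expert capacity.
    Returns (kept, dropped): kept is a list of (token_index, expert_index)
    pairs; tokens arriving after an expert is full are dropped."""
    load = [0] * n_experts
    kept, dropped = [], []
    for tok, expert in enumerate(token_expert_ids):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append((tok, expert))
        else:
            dropped.append(tok)
    return kept, dropped


# Example: four tokens, two experts, capacity 2. Expert 0 is over-subscribed,
# so the third token routed to it is dropped.
kept, dropped = route_top1([0, 0, 0, 1], n_experts=2, capacity=2)
```

Dropped-token counts like this are exactly what auxiliary router losses and capacity-factor tuning try to minimise while keeping expert load balanced.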
We process the data provided in your job application in accordance with the Recruitment Privacy Policy.
Research Engineer (LLM Training and Performance) in London employer: JetBrains
Contact Detail:
JetBrains Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Research Engineer (LLM Training and Performance) role in London
✨Tip Number 1
Network like a pro! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have the inside scoop on job openings or can refer you directly.
✨Tip Number 2
Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to LLMs and PyTorch. This gives potential employers a taste of what you can do and sets you apart from the crowd.
✨Tip Number 3
Prepare for interviews by brushing up on technical questions and practical scenarios related to LLM training and performance. Practice coding challenges and be ready to discuss your past experiences in detail.
✨Tip Number 4
Don't forget to apply through our website! It's the best way to ensure your application gets seen. Plus, we love seeing candidates who are genuinely interested in joining our team at StudySmarter.
We think you need these skills to ace the Research Engineer (LLM Training and Performance) role in London
Some tips for your application
Show Your Passion for Code: When you're writing your application, let your love for coding shine through! Share specific examples of projects or tools you've worked on that relate to the role. We want to see your enthusiasm for developing effective solutions.
Tailor Your CV and Cover Letter: Make sure to customise your CV and cover letter for this position. Highlight your experience with PyTorch, GPU programming, and any relevant frameworks. We appreciate when candidates take the time to align their skills with what we're looking for!
Be Clear and Concise: Keep your application straightforward and to the point. Use bullet points where possible to make it easy for us to read. We love a well-structured application that gets straight to the good stuff without unnecessary fluff!
Apply Through Our Website: Don't forget to submit your application through our website! It's the best way for us to receive your details and ensures you're considered for the role. Plus, it helps us keep everything organised on our end!
How to prepare for a job interview at JetBrains
✨Know Your Tech Inside Out
Make sure you're well-versed in the technologies mentioned in the job description, especially PyTorch and GPU programming. Brush up on your experience with multi-node jobs and profiling tools like Nsight Systems. Being able to discuss specific projects where you've applied these skills will really impress.
✨Showcase Your Problem-Solving Skills
Prepare to discuss how you've tackled performance issues in the past. Think about examples where you've optimised training pipelines or improved throughput. Be ready to explain your thought process and the impact of your solutions.
✨Familiarise Yourself with the Company's Tools
Since JetBrains is all about developer tools, it's a good idea to familiarise yourself with their products. Understanding their philosophy and how they approach tool development can give you an edge in the interview.
✨Ask Insightful Questions
Prepare some thoughtful questions about the team's current challenges or future projects. This shows your genuine interest in the role and helps you gauge if the company is the right fit for you. Plus, it opens up a dialogue that can make the interview feel more like a conversation.