At a Glance
- Tasks: Own the training stack and model architecture for JetBrains' Mellum LLM family.
- Company: Join JetBrains, a leader in developer tools since 2000.
- Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
- Why this job: Make a real impact on AI technology and shape the future of LLMs.
- Qualifications: Strong experience in PyTorch, GPU programming, and multi-node job management.
- Other info: Dynamic team environment with exciting challenges and career advancement.
The predicted salary is between £36,000 and £60,000 per year.
At JetBrains, code is our passion. Ever since we started back in 2000, we have been striving to make the strongest, most effective developer tools on earth. By automating routine checks and corrections, our tools speed up production, freeing developers to grow, discover, and create.
We’re looking for a Research Engineer who will own the training stack and model architecture for our Mellum LLM family.
Responsibilities
- Improve end-to-end performance for multi-node LLM pre-training and post-training pipelines.
- Profile hotspots (Nsight Systems/Compute, NVTX) and fix them using compute/comm overlap, kernel fusion, scheduling, etc. (see the NVTX sketch after this list).
- Design and evaluate architecture choices (depth/width, attention variants including GQA/MQA/MLA/Flash-style, RoPE scaling/NTK, and MoE routing and load-balancing).
- Implement custom ops (Triton and/or CUDA C++), integrate via PyTorch extensions, and upstream when possible.
- Push memory/perf levers: FSDP/ZeRO, activation checkpointing, FP8 via Transformer Engine, tensor/pipeline/sequence/expert parallelism, and NCCL tuning (an FSDP plus checkpointing sketch follows this list).
- Harden large runs by building elastic and fault-tolerant training setups, ensuring robust checkpointing, strengthening reproducibility, and improving resilience to preemption.
- Keep the data path fast with streaming and sharded data loaders and tokenizer pipelines, and improve overall throughput and cache efficiency (a sharded streaming-loader sketch follows this list).
- Define the right metrics, build dashboards, and deliver steady improvements.
- Run both pre-training and post-training (including SFT, RLHF, and GRPO-style methods) efficiently across sizable clusters.
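To make the profiling bullet above concrete, here is a minimal sketch of an NVTX-annotated training step that Nsight Systems picks up on its timeline. The `model`, `batch`, and `optimizer` objects and the range names are placeholders for illustration, not part of any existing codebase.

```python
# Minimal NVTX instrumentation sketch; run under
# `nsys profile --trace=cuda,nvtx python train.py` to see the named
# ranges on the timeline. Objects and range names are illustrative.
import torch

def training_step(model, batch, optimizer):
    with torch.cuda.nvtx.range("forward"):
        loss = model(batch).mean()  # placeholder loss computation
    with torch.cuda.nvtx.range("backward"):
        loss.backward()
    with torch.cuda.nvtx.range("optimizer"):
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    return loss.detach()
```

Gaps between the "backward" and "optimizer" ranges, or idle GPU stretches inside them, are the usual starting points for overlap and fusion work.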
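Likewise, a minimal sketch of two of the memory levers named above, FSDP sharding plus activation checkpointing, assuming a toy `Block` module that stands in for a real transformer layer and a NCCL process group launched via torchrun:

```python
# Sketch: ZeRO-3-style sharding via FSDP plus per-block activation
# checkpointing. `Block` is a toy stand-in for a real transformer layer.
# Launch with `torchrun --nproc_per_node=<gpus> this_script.py`.
import functools
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

class Block(nn.Module):
    def __init__(self, d: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.mlp(x)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(*[Block() for _ in range(8)]).cuda()

# Shard parameters, gradients, and optimizer state at Block granularity.
model = FSDP(
    model,
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={Block}
    ),
    device_id=torch.cuda.current_device(),
)

# Recompute each Block's activations in backward: trade compute for memory.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=checkpoint_wrapper,
    check_fn=lambda m: isinstance(m, Block),
)
```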
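And for the data path, a sketch of a streaming dataset that shards files disjointly across DDP ranks and DataLoader workers; the file names and line-oriented format are assumptions for illustration:

```python
# Sketch: a streaming dataset that shards files across ranks and workers,
# so no two processes read the same shard. Paths/format are placeholders.
import torch.distributed as dist
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedTextStream(IterableDataset):
    def __init__(self, files):
        self.files = sorted(files)

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world = dist.get_world_size() if dist.is_initialized() else 1
        info = get_worker_info()
        wid = info.id if info else 0
        workers = info.num_workers if info else 1
        # Each (rank, worker) pair owns a disjoint slice of the file list.
        stride, offset = world * workers, rank * workers + wid
        for path in self.files[offset::stride]:
            with open(path) as f:
                for line in f:
                    yield line.rstrip("\n")

loader = DataLoader(ShardedTextStream(["shard0.txt", "shard1.txt"]),
                    batch_size=32, num_workers=2)
```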
Requirements
- Strong PyTorch and PyTorch Distributed experience, including running multi-node jobs on tens to hundreds of GPUs.
- Hands‑on experience with Megatron‑LM/Megatron‑Core/NeMo, DeepSpeed, or serious FSDP/ZeRO expertise.
- Real profiling expertise (Nsight Systems/Compute, nvprof) and experience with NVTX‑instrumented workflows.
- GPU programming skills with Triton and/or CUDA, and the ability to write, test, and debug kernels (see the Triton sketch after this list).
- A solid understanding of NCCL collectives, topology and fabric effects (IB/RoCE), and how they show up in traces (see the all-reduce sketch after this list).
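As a taste of the kernel work this list describes, here is a minimal Triton sketch that fuses an elementwise add and ReLU into one kernel; real fused kernels are far more involved, and the function names are illustrative:

```python
# Sketch: a fused add + ReLU Triton kernel, checked against eager PyTorch.
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements          # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)       # one program per 1024-element block
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

x = torch.randn(1 << 20, device="cuda")
y = torch.randn_like(x)
torch.testing.assert_close(fused_add_relu(x, y), torch.relu(x + y))
```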
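And a minimal NCCL collective, the kind of call whose ring or tree phases you would then identify in a trace; the script name in the launch command is hypothetical:

```python
# Sketch: NCCL all-reduce across local GPUs.
# Launch with `torchrun --nproc_per_node=8 allreduce_demo.py`.
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

t = torch.full((1 << 20,), float(rank + 1), device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # every rank ends with the same sum
print(f"rank {rank}: t[0] = {t[0].item()}")
dist.destroy_process_group()
```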
Nice to have
- Familiarity with FlashAttention‑2/3, CUTLASS and CuTe, Transformer Engine and FP8, and the torch.compile stack (Inductor, AOTAutograd); a torch.compile sketch follows this list.
- MoE at scale (expert parallel, router losses, capacity management) and long‑context tricks (ALiBi/YaRN/NTK scaling).
- Kubernetes or SLURM at scale, placement and affinity tuning, as well as AWS, GCP, and Azure GPU fleets.
- Web‑scale data plumbing (streaming datasets, Parquet and TFRecord, tokenizer perf), eval harnesses, and benchmarking.
- Safety and post‑training methods, such as DPO, ORPO, GRPO, and reward models.
- Inference ecosystems such as vLLM and paged KV caching.
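For the torch.compile stack mentioned above, a minimal sketch of compiling a small function with the default Inductor backend; the shapes and weights are arbitrary:

```python
# Sketch: torch.compile with the default Inductor backend. The first call
# traces and compiles; subsequent calls run the generated fused kernels.
import torch

@torch.compile
def fused_gelu_mlp(x, w1, w2):
    return torch.nn.functional.gelu(x @ w1) @ w2

x = torch.randn(128, 512, device="cuda")
w1 = torch.randn(512, 2048, device="cuda")
w2 = torch.randn(2048, 512, device="cuda")
out = fused_gelu_mlp(x, w1, w2)
```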
We process the data provided in your job application in accordance with the Recruitment Privacy Policy.
Employer: JetBrains
Contact: JetBrains Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Research Engineer (LLM Training and Performance) role
✨Tip Number 1
Network like a pro! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have the inside scoop on job openings or can put in a good word for you.
✨Tip Number 2
Show off your skills! Create a portfolio showcasing your projects, especially those related to LLM training and performance. This will give potential employers a taste of what you can do and set you apart from the crowd.
✨Tip Number 3
Prepare for interviews by brushing up on your technical knowledge and problem-solving skills. Practice coding challenges and be ready to discuss your past experiences with PyTorch and GPU programming. Confidence is key!
✨Tip Number 4
Don't forget to apply through our website! We love seeing applications come in directly from candidates who are passionate about the role. It shows initiative and enthusiasm, which we really appreciate.
Some tips for your application 🫡
Tailor Your CV: Make sure your CV is tailored to the Research Engineer role. Highlight your experience with PyTorch, GPU programming, and any relevant projects that showcase your skills in LLM training and performance.
Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about the role and how your background aligns with our mission at JetBrains. Be specific about your achievements and how they relate to the job.
Showcase Your Technical Skills: Don’t shy away from showcasing your technical expertise. Mention your hands-on experience with tools like Megatron-LM, DeepSpeed, and any profiling expertise you have. We love seeing real-world applications of your skills!
Apply Through Our Website: We encourage you to apply through our website for a smoother application process. It helps us keep track of your application and ensures you don’t miss out on any important updates from us!
How to prepare for a job interview at JetBrains
✨Know Your Tech Inside Out
Make sure you’re well-versed in the technologies mentioned in the job description, especially PyTorch and GPU programming. Brush up on your experience with multi-node jobs and profiling tools like Nsight Systems. Being able to discuss specific projects where you've implemented these technologies will show your expertise.
✨Prepare for Technical Questions
Expect deep technical questions related to LLM training and performance. Be ready to explain concepts like MoE routing, activation checkpointing, and NCCL tuning. Practising coding problems or discussing past experiences can help you articulate your thought process clearly during the interview.
✨Showcase Your Problem-Solving Skills
During the interview, highlight how you've tackled challenges in previous roles. Discuss specific instances where you improved performance or resolved issues in training pipelines. This will demonstrate your ability to think critically and adapt in a fast-paced environment.
✨Ask Insightful Questions
Prepare thoughtful questions about the team’s current projects, challenges they face, or their approach to model architecture. This not only shows your interest in the role but also helps you gauge if the company aligns with your career goals.