Senior Machine Learning Systems Engineer (Frameworks & Tooling)

Full-Time | 80000 - 100000 € / year (est.) | Home office (partial)
Deepstreamtech

At a Glance

  • Tasks: Design and maintain cutting-edge training frameworks for large-scale language models.
  • Company: Join a leading tech firm at the forefront of machine learning innovation.
  • Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
  • Other info: Work in a dynamic environment with exciting projects and significant career advancement potential.
  • Why this job: Make a real impact on groundbreaking ML systems and collaborate with top-tier talent.
  • Qualifications: Strong engineering background in distributed systems and familiarity with ML frameworks.

The predicted salary is between 80000 and 100000 € per year.

Requirements

  • Strong engineering experience in large-scale distributed training or HPC systems
  • Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops
  • Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar)
  • Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines
  • Experience working with containerized environments (Docker, Singularity/Apptainer)
  • A track record of building tools that increase developer velocity for ML teams
  • Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability
  • Strong collaboration skills — you’ll work closely with infra, research, and deployment teams
  • (Desirable) Experience with training LLMs or other large transformer architectures
  • (Desirable) Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.)
  • (Desirable) Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches)
  • (Desirable) Experience with data pipeline optimization, sharded datasets, or caching strategies
  • (Desirable) Background in performance engineering, profiling, or low-level systems
  • (Desirable) Bonus: papers at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)

If some of the above doesn’t line up perfectly with your experience, we still encourage you to apply!

What the job involves

We’re looking for a senior engineer to help build, maintain and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure.

You will design and maintain the core components that enable fast, reliable, and scalable model training — and build the tooling that connects research ideas to thousands of GPUs.

If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.

  • Build and own the training framework responsible for large-scale LLM training
  • Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing)
  • Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H100/H200)
  • Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics
  • Collaborate closely with infra teams to ensure our cluster, container environments, and hardware configurations support high-performance training
  • Investigate and resolve performance bottlenecks across the ML systems stack
  • Build robust systems that ensure reproducible, debuggable, large-scale runs

You’ll work on some of the most challenging and consequential ML systems problems today. You’ll collaborate with a world-class team working fast and at scale. You’ll have end-to-end ownership over critical components of the training stack. You’ll shape the next generation of infrastructure for frontier-scale models. You’ll build tools and systems that directly accelerate research and model quality.

Sample Projects:

  • Build a high-performance data loading and caching pipeline
  • Implement performance profiling across the ML systems stack
  • Develop internal metrics and monitoring for training runs
  • Build reproducibility and regression testing infrastructure
  • Develop a performant fault-tolerant distributed checkpointing system

Senior Machine Learning Systems Engineer (Frameworks & Tooling) employer: Deepstreamtech

As a Senior Machine Learning Systems Engineer, you will join a dynamic and innovative team dedicated to pushing the boundaries of machine learning technology. Our company fosters a collaborative work culture that values creativity and encourages professional growth, offering opportunities to work on cutting-edge projects in a supportive environment. Located in a vibrant tech hub, we provide access to state-of-the-art resources and a network of industry leaders, making it an ideal place for those looking to make a significant impact in the field of AI.

Contact Details:

Deepstreamtech Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Senior Machine Learning Systems Engineer (Frameworks & Tooling) role

Tip Number 1

Network like a pro! Attend industry meetups, conferences, or online webinars related to machine learning and distributed systems. Engaging with professionals in the field can lead to valuable connections and potential job opportunities.

Tip Number 2

Show off your skills! Create a portfolio showcasing your projects, especially those involving JAX, distributed training, or containerized environments. This will give potential employers a clear view of what you can bring to the table.

Tip Number 3

Don’t hesitate to reach out! If you see a role that excites you on our website, drop us a message. Expressing your enthusiasm and asking questions can set you apart from other candidates.

Tip Number 4

Prepare for technical interviews by brushing up on your debugging skills and understanding performance issues across CUDA/NCCL and data pipelines. Practising common interview questions can help you feel more confident when it’s time to shine.

We think you need these skills to ace the Senior Machine Learning Systems Engineer (Frameworks & Tooling) role

Large-scale Distributed Training
HPC Systems
JAX Internals
Distributed Training Libraries
Custom Kernels/Fused Ops
Multi-node Cluster Orchestration
Slurm

Some tips for your application 🫡

Tailor Your CV: Make sure your CV highlights your experience with large-scale distributed training and HPC systems. We want to see how your skills align with our needs, so don’t be shy about showcasing your familiarity with JAX internals and any relevant projects you’ve worked on.

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you’re excited about the role and how your background in building tools for ML teams can help the team at Deepstreamtech. Keep it conversational but professional, and let your passion show!

Showcase Your Collaboration Skills: Since this role involves working closely with various teams, make sure to highlight your collaboration experiences. Share examples of how you’ve successfully partnered with infra, research, or deployment teams in the past to achieve common goals.

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for your application to be received and ensures you’re considered for the role. Plus, it shows you’re keen on joining the team at Deepstreamtech!

How to prepare for a job interview at Deepstreamtech

Know Your Tech Inside Out

Make sure you’re well-versed in the technologies mentioned in the job description, especially JAX internals and distributed training libraries. Brush up on your knowledge of multi-node cluster orchestration tools like Kubernetes or Slurm, as well as debugging performance issues across CUDA/NCCL.

Showcase Your Collaboration Skills

Since this role involves working closely with infra, research, and deployment teams, be ready to discuss past experiences where you successfully collaborated on projects. Highlight how you navigated trade-offs between performance and complexity, and how you contributed to team success.

Prepare for Technical Challenges

Expect to face technical questions that test your problem-solving skills. Prepare to discuss specific challenges you’ve encountered in building tools for ML teams, and how you approached debugging and optimising data pipelines or containerised environments.

Demonstrate Your Passion for ML Systems

Share your enthusiasm for machine learning systems and any relevant projects you’ve worked on, especially those involving large-scale training or contributions to ML frameworks. If you have any papers published at top-tier venues, don’t hesitate to mention them!