At a Glance
- Tasks: Build and optimise large-scale ML systems with hands-on performance modelling.
- Company: Join a cutting-edge, research-driven organisation at the forefront of ML infrastructure.
- Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
- Other info: Dynamic role with exciting challenges and excellent career advancement potential.
- Why this job: Make a real impact on ML architecture decisions and collaborate with top experts.
- Qualifications: Master’s or PhD in a relevant field and a strong ML systems background required.
The predicted salary is between £70,000 and £90,000 per year.
We’re partnering with a well-funded, research-driven organisation at the frontier of large-scale ML infrastructure. This is a hands‑on technical role for someone who enjoys going deep on performance modelling, distributed systems, and real hardware behaviour — with direct influence over architecture decisions at scale.
What You’ll Do
- Build simulation models for compute, memory, interconnect, and communication behaviour across large-scale ML systems
- Develop tools to simulate training and inference workloads across distributed accelerator clusters
- Model distributed execution patterns including collectives, synchronisation, and communication bottlenecks
- Run experiments and benchmarks on real ML systems to calibrate and validate simulation models
- Analyse end‑to‑end performance: throughput, latency, scaling efficiency, and cost/performance tradeoffs
- Collaborate with hardware, software, networking, and ML teams, and communicate findings as design recommendations
What We’re Looking For
- Master’s or PhD in CS, Electrical or Computer Engineering, or a related field
- Strong background in ML systems, distributed systems, performance engineering, or simulation
- Experience analysing compute, communication, and memory behaviour in large-scale ML systems
- Hands‑on benchmarking, profiling, and measurement of ML systems
- Familiarity with distributed training concepts: data/tensor/pipeline parallelism, collectives, synchronisation
- Proficiency in Python, C++, or Rust
To find out more, please reach out to Charles Duran.
Senior ML Systems Engineer employer: IC Resources
Join a pioneering organisation at the forefront of large-scale machine learning infrastructure, where your expertise will directly shape architecture decisions and influence cutting-edge technology. With a strong emphasis on collaboration and innovation, we offer a dynamic work culture that fosters professional growth through hands-on experience and cross-disciplinary teamwork. Located in a vibrant tech hub, our company provides unique opportunities to engage with leading experts while enjoying a supportive environment that values your contributions.