ML Systems Simulation Architect

Full-Time, £70,000 to £90,000 per year (est.), no working from home possible
Oriole Networks Ltd

At a Glance

  • Tasks: Create and validate simulation models for large-scale ML systems to optimise performance.
  • Company: Join a leading tech firm at the forefront of machine learning innovation.
  • Benefits: Enjoy competitive pay, flexible working options, and opportunities for professional growth.
  • Other info: Collaborative environment with a focus on innovation and career advancement.
  • Why this job: Make a real impact in ML by solving complex challenges with cutting-edge technology.
  • Qualifications: Master’s or PhD in relevant fields with strong ML systems experience required.

The predicted salary is between £70,000 and £90,000 per year.

We are looking for a Senior ML Systems Engineer to build and validate simulation infrastructure for large-scale machine learning systems. This role focuses on modelling the compute and communication behaviour of systems used for ML training and inference, and using simulation to guide architecture, performance optimization, and capacity planning. The ideal candidate combines strong systems experience with hands-on experience in measurement, benchmarking, and performance analysis of modern ML systems.

What You’ll Do:

  • Build simulation models for compute, memory, interconnect, and communication behaviour in ML systems.
  • Develop tools to simulate performance for training and inference workloads.
  • Model distributed execution across accelerators, hosts, and network fabrics, including collectives, synchronization, and communication bottlenecks.
  • Use simulation and analytical modelling to evaluate tradeoffs, identify bottlenecks, and guide system design.
  • Run performance experiments and benchmarks on real ML systems to calibrate and validate simulation models.
  • Analyze end-to-end performance, including throughput, latency, scaling efficiency, utilisation, and cost/performance tradeoffs.
  • Partner with hardware, software, networking, and ML teams to align simulation with real workloads and constraints.
  • Create reproducible benchmarking methodologies across models and system configurations, and compare simulation results against real system measurements to demonstrate validity.
  • Communicate findings through technical reports and design recommendations.

Qualifications Required:

  • Master’s or PhD in Computer Science, Electrical Engineering, Computer Engineering, or a related field.
  • Strong experience in ML systems, distributed systems, performance engineering, computer architecture, or simulation.
  • Understanding of systems used for machine learning training and inference.
  • Experience analyzing compute, communication, and memory behaviour in large-scale ML systems.
  • Hands-on experience with performance benchmarking, profiling, and measurement of ML systems.
  • Experience with distributed training concepts such as data parallelism, tensor/model parallelism, pipeline parallelism, collectives, and synchronization overheads.
  • Proficiency in at least one of the following: Python, C++, or Rust.
  • Strong analytical skills and the ability to connect simulation results to real system behaviour.

Preferred:

  • Experience with system performance modelling, network simulation, or architecture evaluation tools.
  • Familiarity with accelerator-based systems such as GPUs, TPUs, or custom ML hardware.
  • Experience with PyTorch, JAX, TensorFlow, NCCL, XLA, CUDA, or similar tools.
  • Knowledge of interconnect and networking technologies such as InfiniBand, Ethernet/RDMA, NVLink, PCIe, or equivalent.
  • Experience evaluating both training throughput and inference latency/serving efficiency.
  • Background in workload characterization, trace-driven simulation, or model calibration.
  • Ability to work across hardware and software boundaries in a cross-functional environment.

What Success Looks Like:

  • Build simulation models that accurately predict performance trends and inform architectural decisions.
  • Identify compute and communication bottlenecks in ML training and inference systems.
  • Correlate simulation outputs with real-world benchmark data.
  • Improve system efficiency, scalability, and cost effectiveness through data-driven insights.

ML Systems Simulation Architect employer: Oriole Networks Ltd

As a leading innovator in machine learning systems, we pride ourselves on fostering a collaborative and inclusive work culture that empowers our employees to excel. Our commitment to professional development is evident through tailored growth opportunities and access to cutting-edge resources, ensuring that you can thrive in your role as an ML Systems Simulation Architect. Located in a vibrant tech hub, we offer a dynamic environment where creativity meets technology, making it an ideal place for those seeking meaningful and impactful work.

Contact Details:

Oriole Networks Ltd Recruitment Team

StudySmarter Expert Advice 🤫

We think this is how you could land the ML Systems Simulation Architect role

Network Like a Pro

Get out there and connect with folks in the industry! Attend meetups, conferences, or even online webinars related to ML systems. You never know who might have the inside scoop on job openings or can refer you directly.

Show Off Your Skills

Create a portfolio showcasing your simulation models and performance analyses. Use GitHub or a personal website to display your projects. This gives potential employers a taste of what you can do and sets you apart from the crowd.

Ace the Interview

Prepare for technical interviews by brushing up on your knowledge of ML systems and performance engineering. Practice explaining your past projects and how they relate to the role. Confidence and clarity can make a huge difference!

Apply Through Our Website

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are genuinely interested in joining our team.

We think you need these skills to ace the ML Systems Simulation Architect role

Simulation Modelling
Performance Benchmarking
Distributed Systems
Machine Learning Systems
Analytical Modelling
Performance Analysis
Python

Some tips for your application 🫡

Tailor Your CV: Make sure your CV reflects the skills and experiences that match the job description. Highlight your hands-on experience with ML systems, performance benchmarking, and any relevant tools like Python or C++. We want to see how you fit into our team!

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about ML systems and how your background aligns with our needs. Don’t forget to mention specific projects or achievements that showcase your expertise.

Showcase Your Analytical Skills: Since this role involves a lot of performance analysis and simulation, be sure to include examples of how you've tackled similar challenges in the past. We love seeing how you connect simulation results to real-world behaviour!

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it shows you’re keen on joining our team!

How to prepare for a job interview at Oriole Networks Ltd

Know Your Simulation Models

Make sure you understand the different simulation models relevant to ML systems. Brush up on how to build and validate these models, as well as their impact on performance optimisation and capacity planning. Being able to discuss specific examples from your experience will show that you’re not just familiar with the theory but can apply it in practice.

Showcase Your Hands-On Experience

Prepare to talk about your hands-on experience with performance benchmarking and profiling of ML systems. Be ready to share specific projects where you analysed compute, communication, and memory behaviour. This will demonstrate your practical skills and how they relate to the role.

Understand Distributed Systems

Since this role involves distributed execution across various accelerators and hosts, make sure you can explain concepts like data parallelism and synchronization overheads. Bring examples of how you've tackled challenges in distributed systems to the table, as this will highlight your expertise.

Communicate Findings Effectively

Practice how you would communicate technical findings through reports or presentations. The ability to convey complex information clearly is crucial. Think of a time when you had to present your findings to a non-technical audience and how you made it accessible for them.