ML Systems Engineer: Scalable Cloud & Infra in Cambridge

Cambridge | Full-Time | £60,000 - £80,000 / year (est.) | No working from home possible

At a Glance

  • Tasks: Build and manage scalable ML systems and cloud infrastructure for cutting-edge AI research.
  • Company: Join a nonprofit AI research organisation focused on solving complex societal problems.
  • Benefits: Competitive salary, collaborative culture, and opportunities for professional growth.
  • Other info: Work in a dynamic environment with excellent career advancement opportunities.
  • Why this job: Make a real impact in AI while working with innovative technologies and passionate researchers.
  • Qualifications: Experience in ML systems engineering and cloud administration is essential.

The predicted salary is between £60,000 and £80,000 per year.

About Basis

Basis is a nonprofit applied AI research organization with two mutually reinforcing goals. The first is to understand and build intelligence. This means to establish the mathematical principles of what it means to reason, to learn, to make decisions, to understand, and to explain; and to construct software that implements these principles. The second is to advance society’s ability to solve intractable problems. This means expanding the scale, complexity, and breadth of problems that we can solve today, and even more importantly, accelerating our ability to solve problems in the future. To achieve these goals, we’re building both a new technological foundation that draws inspiration from how humans reason, and a new kind of collaborative organization that puts human values first.

About the Role

ML Systems Engineers at Basis ensure training and evaluation infrastructure is fast, reliable, and scalable. You will own the full stack from distributed training frameworks through cloud administration, making it possible for researchers to iterate quickly on complex models while managing computational resources efficiently. We are looking for engineers who combine deep understanding of ML systems with operational excellence. The ideal ML Systems Engineer has experience with distributed training at scale, understands the intricacies of debugging numerical instabilities, and can manage cloud infrastructure that scales from experiments to production. You will be the guardian of training stability, the optimizer of compute costs, and the enabler of reproducible research.

This role spans traditional ML engineering and cloud/DevOps responsibilities. You will manage GPU clusters, optimize cloud spending, ensure security and compliance, and build the infrastructure that lets researchers focus on algorithms rather than operations. We seek individuals who aspire to build robust ML infrastructure, maintain a "logbook culture" for documenting issues and solutions, and treat operational excellence as a first-class concern.

We expect you to:

  • Have demonstrated expertise in ML systems engineering. Examples include:
    • Managing distributed training jobs across hundreds of GPUs
    • Debugging and fixing numerical instabilities in large-scale training
    • Building infrastructure for reproducible ML experiments
    • Optimizing training throughput and resource utilization
  • Possess deep knowledge of distributed training frameworks including PyTorch/JAX distributed strategies (DDP, FSDP, ZeRO), gradient accumulation, mixed precision training, and checkpoint/recovery systems.
  • Have strong cloud administration skills including AWS/GCP/Azure services, infrastructure as code (Terraform), Kubernetes orchestration, cost optimization, security best practices, and compliance requirements.
  • Understand the full ML stack from hardware (GPUs, interconnects, storage) through frameworks (PyTorch, JAX) to high-level training loops and evaluation pipelines.
  • Be skilled at debugging complex failures across the stack: GPU/NCCL issues, data loading bottlenecks, memory leaks, gradient explosions, and convergence problems.
  • Value documentation and knowledge sharing. You maintain comprehensive logs of issues encountered, solutions found, and lessons learned, building institutional knowledge.
  • Progress with autonomy while coordinating closely with researchers. You can anticipate infrastructure needs, prevent problems before they occur, and respond quickly when issues arise.

In addition, the following would be an advantage:

  • Experience at organizations training large models (OpenAI, Anthropic, Google, Meta).
  • Background in both ML research and production systems.
  • Contributions to ML frameworks or distributed training libraries.
  • Experience with on-premise GPU cluster management.
  • Knowledge of optimization theory and numerical methods.
  • Understanding of robotics-specific infrastructure requirements.

Responsibilities:

  • Own distributed training infrastructure including job launchers, checkpointing systems, recovery mechanisms, and monitoring that ensures experiments run reliably at scale.
  • Debug and resolve training failures by diagnosing issues across GPUs, networking, numerics, and data pipelines, maintaining detailed logs of problems and solutions.
  • Profile and optimize training performance by identifying bottlenecks in data loading, gradient computation, and communication overhead, and by implementing solutions that improve step time.
  • Manage cloud infrastructure and costs including capacity planning, spot instance strategies, storage optimization, and building tools that give researchers visibility into resource usage.
  • Implement security and compliance measures including access controls, data encryption, and audit logging, and ensure infrastructure meets requirements for handling sensitive data.
  • Build evaluation and benchmarking infrastructure that enables consistent, reproducible measurement of model performance across different conditions and datasets.
  • Develop monitoring and alerting systems that detect anomalies in training metrics, resource utilization, or system health, enabling rapid response to issues.
  • Maintain development environments including containerization, dependency management, and tools that ensure researchers can reproduce results across different systems.
  • Document and share knowledge through runbooks, post-mortems, and training materials that help the team understand and operate ML infrastructure effectively.
  • Collaborate with researchers to understand requirements, suggest infrastructure solutions, and ensure systems support rather than constrain research goals.

Role Details

Exceptional candidates who may not meet all of the following criteria are still encouraged to apply.

  • FT/PT: Full-time.
  • In-person Policy: We are in the office four days a week. Be prepared to attend multi‑day Basis‑wide in‑person events.
  • Location: New York City or Cambridge, MA.
  • Salary range: Competitive salary.

Non-Discrimination Notice

Basis Research Institute provides equal employment opportunities without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, or genetics and prohibits discrimination based on all protected characteristics.

Privacy Notice

By submitting your application, you grant Basis permission to use your materials for both hiring evaluation and recruitment‑related research and development purposes. Your information may be processed in different countries, including the US. You retain copyright while providing Basis a license to use these materials for the stated purposes.

ML Systems Engineer: Scalable Cloud & Infra in Cambridge employer: basis-research

Basis is an exceptional employer that prioritises human values and fosters a collaborative work culture, making it an ideal place for ML Systems Engineers to thrive. With a focus on employee growth, you will have the opportunity to work on cutting-edge AI research while enjoying competitive salaries and a supportive environment in vibrant locations like New York City or Cambridge, MA. Join us to contribute to meaningful advancements in technology and society, all while being part of a team that values operational excellence and knowledge sharing.

Contact Details:

basis-research Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land the ML Systems Engineer: Scalable Cloud & Infra role in Cambridge

Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have the inside scoop on job openings or can refer you directly.

Tip Number 2

Show off your skills! Create a portfolio showcasing your projects, especially those related to ML systems engineering. This gives potential employers a taste of what you can do and sets you apart from the crowd.

Tip Number 3

Prepare for interviews by brushing up on common technical questions and scenarios related to distributed training and cloud infrastructure. Practise explaining your thought process clearly; it’s all about demonstrating your problem-solving skills.

Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen. Plus, we love seeing candidates who are proactive about their job search!

We think you need these skills to ace the ML Systems Engineer: Scalable Cloud & Infra role in Cambridge

ML Systems Engineering
Distributed Training Frameworks
Debugging Numerical Instabilities
Cloud Administration (AWS/GCP/Azure)
Infrastructure as Code (Terraform)
Kubernetes Orchestration
Cost Optimization

Some tips for your application 🫡

Tailor Your Application: Make sure to customise your CV and cover letter to highlight your experience with ML systems and cloud infrastructure. We want to see how your skills align with the role, so don’t hold back on showcasing relevant projects!

Showcase Your Problem-Solving Skills: In your application, share specific examples of how you've debugged complex issues or optimised training performance in the past. We love candidates who can demonstrate their operational excellence and innovative thinking!

Highlight Your Collaborative Spirit: Since this role involves working closely with researchers, mention any experiences where you’ve collaborated effectively in a team. We value communication and teamwork, so let us know how you’ve contributed to group success!

Apply Through Our Website: Don’t forget to submit your application through our website! It’s the best way for us to receive your materials and ensures you’re considered for the role. We can’t wait to see what you bring to the table!

How to prepare for a job interview at basis-research

Know Your ML Systems Inside Out

Make sure you have a solid grasp of ML systems engineering principles. Brush up on distributed training frameworks like PyTorch and JAX, and be ready to discuss your experience with debugging numerical instabilities and managing GPU clusters.

Showcase Your Cloud Skills

Familiarise yourself with cloud administration tools and practices. Be prepared to talk about your experience with AWS, GCP, or Azure, and how you've optimised costs and ensured security in previous roles. Real-world examples will make your case stronger!

Demonstrate Problem-Solving Abilities

Expect to face technical questions that test your troubleshooting skills. Prepare to discuss specific instances where you diagnosed and resolved complex failures in ML systems. Highlight your approach to maintaining detailed logs and documentation.

Emphasise Collaboration and Communication

This role involves working closely with researchers, so be ready to share how you've collaborated in the past. Discuss how you anticipate infrastructure needs and communicate effectively to ensure research goals are met without constraints.