Member of Technical Staff - Training Infrastructure Engineer in Boston

Boston · Full-Time · No home office possible

About Liquid AI

Spun out of MIT CSAIL, we build AI systems that run where others stall: on CPUs, with low latency, minimal memory, and maximum reliability. We partner with enterprises across consumer electronics, automotive, life sciences, and financial services. We are scaling rapidly and need exceptional people to help us get there.

The Opportunity

Our Training Infrastructure team is building the distributed systems that power our next-generation Liquid Foundation Models. As we scale, we need to design, implement, and optimize the infrastructure that enables large-scale training. This is a high-ownership role on a small team with fast feedback loops. We’re looking for someone who wants to build critical systems from the ground up rather than inherit mature infrastructure.

While San Francisco and Boston are preferred, we are open to other locations.

What We’re Looking For

We need someone who:

  • Loves distributed systems complexity: Our team debugs training failures across GPU clusters, optimizes communication patterns, and builds data pipelines that handle multimodal workloads.

  • Wants to build: We have strong researchers. We need builders who find satisfaction in robust, fast, reliable infrastructure.

  • Thrives in ambiguity: Our systems support model architectures that are still evolving. We make decisions with incomplete information and iterate fast.

  • Takes direction and delivers: Our best engineers align with team priorities while pushing back when they see problems.

The Work

  • Design and implement a scalable training infrastructure for our GPU clusters

  • Build data loading systems that eliminate I/O bottlenecks for multimodal datasets

  • Develop checkpointing mechanisms balancing memory constraints with recovery needs

  • Optimize communication patterns to minimize distributed training overhead

  • Create monitoring and debugging tools for training stability

Desired Experience

Must-have:

  • Hands‑on experience building distributed training infrastructure (PyTorch Distributed, DeepSpeed, or Megatron‑LM)

  • Understanding of hardware accelerators and networking topologies

  • Experience optimizing data pipelines for ML workloads

Nice-to-have:

  • MoE (Mixture of Experts) training experience

  • Large‑scale distributed training (100+ GPUs)

  • Open‑source contributions to training infrastructure projects

What Success Looks Like (Year One)

  • Training run stability has improved (fewer failures, faster recovery)

  • Data loading bottlenecks are eliminated for multimodal workloads

  • Time‑to‑recovery from training failures has decreased

What We Offer

  • Greenfield challenges: Build systems from scratch for novel architectures. High ownership from day one.

  • Compensation: Competitive base salary with equity in a unicorn‑stage company

  • Health: We pay 100% of medical, dental, and vision premiums for employees and dependents

  • Financial: 401(k) matching up to 4% of base pay

  • Time Off: Unlimited PTO plus company‑wide Refill Days throughout the year



Contact Detail:

Liquid AI Recruiting Team
