About Liquid AI
Spun out of MIT CSAIL, we build AI systems that run where others stall: on CPUs, with low latency, minimal memory, and maximum reliability. We partner with enterprises across consumer electronics, automotive, life sciences, and financial services. We are scaling rapidly and need exceptional people to help us get there.
The Opportunity
Our Training Infrastructure team is building the distributed systems that power our next-generation Liquid Foundation Models. As we scale, we need to design, implement, and optimize the infrastructure that enables large-scale training. This is a high-ownership role on a small team with fast feedback loops. We're looking for someone who wants to build critical systems from the ground up rather than inherit mature infrastructure.
While San Francisco and Boston are preferred, we are open to other locations.
What We're Looking For
We need someone who:
- Loves distributed systems complexity: Our team debugs training failures across GPU clusters, optimizes communication patterns, and builds data pipelines that handle multimodal workloads.
- Wants to build: We have strong researchers. We need builders who find satisfaction in robust, fast, reliable infrastructure.
- Thrives in ambiguity: Our systems support model architectures that are still evolving. We make decisions with incomplete information and iterate fast.
- Takes direction and delivers: Our best engineers align with team priorities while pushing back when they see problems.
The Work
- Design and implement scalable training infrastructure for our GPU clusters
- Build data loading systems that eliminate I/O bottlenecks for multimodal datasets
- Develop checkpointing mechanisms that balance memory constraints with recovery needs
- Optimize communication patterns to minimize distributed training overhead
- Create monitoring and debugging tools for training stability
Desired Experience
Must-have:
- Hands-on experience building distributed training infrastructure (PyTorch Distributed, DeepSpeed, or Megatron-LM)
- Understanding of hardware accelerators and networking topologies
- Experience optimizing data pipelines for ML workloads
Nice-to-have:
- MoE (Mixture of Experts) training experience
- Large-scale distributed training (100+ GPUs)
- Open-source contributions to training infrastructure projects
What Success Looks Like (Year One)
- Training runs are more stable, with fewer failures
- Data loading bottlenecks are eliminated for multimodal workloads
- Time-to-recovery from training failures has decreased
What We Offer
- Greenfield challenges: Build systems from scratch for novel architectures. High ownership from day one.
- Compensation: Competitive base salary with equity in a unicorn-stage company
- Health: We pay 100% of medical, dental, and vision premiums for employees and dependents
- Financial: 401(k) matching up to 4% of base pay
- Time Off: Unlimited PTO plus company-wide Refill Days throughout the year
Contact Details:
Liquid AI Recruiting Team