Member of Engineering (Scalability) in London

London Full-Time £36,000 - £60,000 / year (est.) No home office possible

At a Glance

  • Tasks: Join our pre-training team to build and optimise Large Language Models.
  • Company: Poolside, a pioneering company in AI development and research.
  • Benefits: Enjoy fully remote work, flexible hours, and 37 days of vacation.
  • Why this job: Make a real impact in AI while working with cutting-edge technology.
  • Qualifications: Strong engineering skills and knowledge of LLMs and distributed systems.
  • Other info: Collaborate with a diverse team across Europe and North America.

The predicted salary is between £36,000 and £60,000 per year.

ABOUT POOLSIDE

In this decade, the world will create Artificial General Intelligence. Only a small number of companies will achieve this. Their ability to stack advantages and pull ahead will define the winners. These companies will move faster than anyone else. They will attract the world's most capable talent. They will be at the forefront of applied research, engineering, infrastructure and deployment at scale. They will continue to scale their training to larger and more capable models. They will be given the right to raise large amounts of capital along their journey to enable this. They will create powerful economic engines. They will obsess over the success of their users and customers. Poolside exists to be this company: to build a world where AI will be the engine behind economically valuable work and scientific progress.

ABOUT OUR TEAM

We are a remote-first team that sits across Europe and North America and comes together in person once a month for 3 days, and for longer offsites twice a year. Our R&D and production teams combine more research-oriented and more engineering-oriented profiles; however, everyone cares deeply about the quality of the systems we build and has a strong underlying knowledge of software development. We believe that good engineering leads to faster development iterations, which allows us to compound our efforts.

ABOUT THE ROLE

You would be working in our pre-training team focused on building out our distributed training and inference of Large Language Models (LLMs). This is a hands-on role that focuses on software reliability and fault tolerance. You will work on cross-platform checkpointing, NCCL recovery, and hardware fault detection. You will make high-level tools. You will not be afraid of debugging Linux kernel modules. You will have access to thousands of GPUs to test changes. Strong engineering skills are a prerequisite. We assume good knowledge of Torch, NVIDIA GPU architecture, reliability concepts, distributed systems, and best coding practices. A basic understanding of LLM training and inference principles is required. We look for fast learners who are prepared for a steep learning curve and are not afraid to step out of their comfort zone.

YOUR MISSION

To help train the best foundational models for source code generation in the world.

RESPONSIBILITIES

  • Identify, study, and troubleshoot hardware problems during training at scale
  • Minimise the GPU idle time during faults, both operationally and strategically
  • Design and develop tools and add-ons to accelerate the training recovery
  • Improve the performance and reliability of checkpointing
  • Write high-quality Python (PyTorch), Cython, C/C++, CUDA API code

SKILLS & EXPERIENCE

  • Understanding of Large Language Models (LLMs)
  • Basic knowledge of Transformers
  • Knowledge of deep learning fundamentals
  • Strong engineering background
  • Programming experience
  • Linux API, Linux kernel
  • Strong algorithmic skills
  • Python with numpy, PyTorch, or Jax
  • C/C++
  • NCCL
  • Use modern tools and are always looking to improve
  • Strong critical thinking and ability to question code quality policies when applicable
  • Distributed systems
  • Reliability
  • Observability
  • Fault-tolerance
  • K8s stack

PROCESS

  • Intro call with one of our Founding Engineers
  • Technical Interview(s) with one of our Founding Engineers
  • Team fit call with the People team
  • Final interview with Eiso, our CTO & Co-Founder

BENEFITS

  • Fully remote work & flexible hours
  • 37 days/year of vacation & holidays
  • Health insurance allowance for you and dependents
  • Company-provided equipment
  • Wellbeing, always-be-learning and home office allowances
  • Frequent team get-togethers
  • Great diverse & inclusive people-first culture

Member of Engineering (Scalability) in London employer: poolside

Poolside is an exceptional employer for those passionate about advancing Artificial General Intelligence, offering a fully remote work environment with flexible hours and a generous 37 days of vacation per year. Our inclusive culture prioritises employee wellbeing and continuous learning, while providing access to cutting-edge technology and opportunities for professional growth through hands-on experience in a dynamic team setting. Join us to be part of a pioneering journey that shapes the future of AI and contributes to meaningful scientific progress.

Contact Detail:

poolside Recruiting Team

StudySmarter Expert Advice

We think this is how you could land Member of Engineering (Scalability) in London

✨Tip Number 1

Get your networking game on! Reach out to current employees at Poolside or similar companies on LinkedIn. A friendly chat can give you insider info and might just get your foot in the door.

✨Tip Number 2

Show off your skills! If you’ve got a GitHub or portfolio, make sure it’s up to date with your best projects. Highlight any work related to distributed systems or LLMs to catch their eye.

✨Tip Number 3

Prepare for those technical interviews! Brush up on your coding skills, especially in Python and C++. Practice debugging and problem-solving scenarios that relate to reliability and fault tolerance.

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are proactive about their job search.

We think you need these skills to ace Member of Engineering (Scalability) in London

Software Reliability
Fault Tolerance
Distributed Training
Large Language Models (LLM)
Linux Kernel Debugging
NVIDIA GPU Architecture
Python (PyTorch)
C/C++
CUDA API
Deep Learning Fundamentals
Transformers
Algorithmic Skills
Critical Thinking
Distributed Systems
Observability

Some tips for your application

Show Your Passion for AI: When writing your application, let us see your enthusiasm for AI and how it can drive economic value. Share any personal projects or experiences that highlight your interest in the field, as we love to see candidates who are genuinely excited about what they do!

Tailor Your Skills to the Role: Make sure to align your skills and experiences with the specific requirements of the Member of Engineering (Scalability) role. Highlight your knowledge of distributed systems, reliability concepts, and programming languages like Python and C/C++. We want to see how you can contribute to our mission!

Be Clear and Concise: Keep your application clear and to the point. Use straightforward language to describe your experiences and achievements. We appreciate well-structured applications that make it easy for us to understand your background and how it fits with our team.

Apply Through Our Website: Don’t forget to submit your application through our website! It’s the best way for us to receive your details and ensures you’re considered for the role. Plus, it shows you’re serious about joining our remote-first team at Poolside!

How to prepare for a job interview at poolside

✨Know Your Tech Inside Out

Make sure you brush up on your knowledge of Large Language Models, distributed systems, and the specific technologies mentioned in the job description. Be ready to discuss your experience with Python, C/C++, and any relevant frameworks like PyTorch. The more you can demonstrate your technical expertise, the better!

✨Showcase Problem-Solving Skills

Prepare to talk about past experiences where you've identified and solved complex engineering problems. Think of examples that highlight your ability to troubleshoot hardware issues or improve system reliability. This role is all about fault tolerance, so showing that you can think critically under pressure will impress the interviewers.

✨Familiarise Yourself with Their Culture

Since Poolside values a remote-first culture and teamwork, be ready to discuss how you thrive in such environments. Share examples of how you've collaborated with teams across different locations and how you maintain communication and productivity while working remotely.

✨Ask Insightful Questions

Prepare thoughtful questions about the company's vision for AI and how they plan to scale their training processes. This shows your genuine interest in their mission and gives you a chance to assess if their goals align with your career aspirations. Plus, it’s a great way to engage with the interviewers!
