Pre-Training Data Acquisition Engineer (Web Crawling)

Full-Time | £60,000 - £80,000 / year (est.) | No working from home possible

At a Glance

Tasks: Design and operate large-scale web crawlers to acquire high-quality pre-training data.
Company: Join a pioneering AI company on the path to Artificial General Intelligence.
Benefits: Enjoy fully remote work, flexible hours, and generous vacation time.
Other info: Collaborative team culture with a focus on innovation and personal growth.
Why this job: Be at the forefront of AI development and make a real impact in tech.
Qualifications: Experience in distributed systems and web crawling; Python proficiency required.

The predicted salary is between £60,000 and £80,000 per year.

About Poolside

Poolside exists to build a world where AI is the engine behind economically valuable work and scientific progress. We believe the fastest way to reach AGI lies in accelerating software development itself, by reshaping the developer experience with agentic systems, coding assistants, and the frontier models that power them. We deploy these systems directly into the development environments of security‑conscious enterprises.

About Our Team

We were founded in the US and have our home there, but our team is distributed across Europe and North America. We get our fix of in‑person collaboration in Paris each month for 3 days, always Monday‑Wednesday, with an open invitation to stay the whole week. We also do longer off‑sites once a year. Our team is a multidisciplinary blend of research, engineering, and business experts. What unites us is our deep care for what we build together. We’re in a race that requires hard work, intellectual curiosity, and obsession; to balance this intensity, we’ve assembled a team of low ego and kind‑hearted individuals who have built the special culture Poolside has.

About The Role

You’ll be working alongside our pre‑training data team, focused on one of the most foundational challenges in training frontier LLMs: acquiring the best possible pre‑training data. The data we collect is upstream of everything. It directly shapes the capability of the models we train. As our first dedicated data acquisition engineer, you will spearhead and evolve systems that crawl the web at massive scale, rapidly ingest data from strategic partnerships, and build specialized tooling to maximize recall from high‑value sources. You’ll collaborate closely with pre‑training data researchers and engineers so that our data sourcing maps to our training needs and yields the most capable pre‑trained models.

YOUR MISSION

To deliver the highest‑quality, most diverse, and most comprehensive data corpus to fuel the pre‑training of frontier models for software development.

Responsibilities

Design, build, and operate a large‑scale web crawler responsible for acquiring all openly accessible data on the internet
Develop specialized deep crawlers targeting high‑value sources to improve recall and coverage
In collaboration with data researchers, own a long‑term road map for data acquisition
Build observability, monitoring, and debugging tooling to ensure reliability and transparency across crawl infrastructure
Collaborate with pre‑training, post‑training, and evaluations teams to align data acquisition priorities with model training needs
Build high‑throughput ingestion pipelines for rapidly onboarding partner data and evaluating it for quality

Skills & Experience

Strong distributed systems background with proven experience building and operating large‑scale infrastructure — data pipelines, web crawlers, or similar
Proficiency in Python, with comfort optimizing performance and debugging complex systems under production conditions
Hands‑on experience with web crawling or large‑scale data extraction: understanding of HTTP protocols, distributed job queues, and data parsing at scale
Familiarity with cloud platforms (AWS) and container orchestration (Kubernetes, Docker) for deploying and managing high‑throughput workloads
Awareness of the non‑technical dimensions of internet‑scale crawling: data privacy, robots.txt adherence, and responsible crawl practices

Nice to have:

Prior experience pre‑training LLMs
Experience in building trillion‑scale SOTA pre‑training datasets
Experience translating research to production at scale

Process

Intro call with one of our Founding Engineers
Technical interview(s) with a member of our Engineering team
Fit call with the People team
Final interview with one of our Founding Engineers

Benefits

Fully remote work & flexible hours
37 days/year of vacation & holidays
16 weeks of flexible, full‑pay parental leave
Health insurance allowance for you & dependents
Company‑provided equipment
Well‑being, always‑be‑learning & home office allowances
Frequent team get‑togethers
Diverse & inclusive people‑first culture

Pre-Training Data Acquisition Engineer (Web Crawling) employer: poolside

At Poolside, we pride ourselves on being at the forefront of AI development, offering a unique opportunity for our Pre-Training Data Acquisition Engineer to contribute to groundbreaking work in a supportive and collaborative environment. With fully remote work options, generous vacation policies, and a commitment to employee well-being, we foster a culture that values diversity and inclusivity while providing ample opportunities for professional growth. Join us in Paris for monthly team collaborations and experience a workplace where your contributions directly impact the future of AI.

Contact Details:

poolside Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land Pre-Training Data Acquisition Engineer (Web Crawling)

✨Tip Number 1

Network like a pro! Reach out to people in the industry, especially those at Poolside. Use LinkedIn or even Twitter to connect with current employees and ask about their experiences. A friendly chat can sometimes lead to job opportunities that aren't even advertised!

✨Tip Number 2

Prepare for your interviews by diving deep into the company’s mission and values. Understand how your skills as a Pre-Training Data Acquisition Engineer can contribute to their goal of reaching AGI. Show them you’re not just another candidate; you’re someone who genuinely cares about their mission.

✨Tip Number 3

Practice makes perfect! Get comfortable with common technical interview questions related to web crawling and data acquisition. You might even want to set up mock interviews with friends or use online platforms to sharpen your skills before the real deal.

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re serious about joining the team at Poolside and contributing to their exciting journey towards AGI.

We think you need these skills to ace Pre-Training Data Acquisition Engineer (Web Crawling)

Web Crawling

Data Acquisition

Distributed Systems

Large-Scale Infrastructure

Data Pipelines

Python

Performance Optimisation

Debugging Complex Systems

HTTP Protocols

Data Parsing

Cloud Platforms (AWS)

Container Orchestration (Kubernetes, Docker)

Data Privacy Awareness

Responsible Crawl Practices

Collaboration with Data Researchers

Some tips for your application 🫡

Show Your Passion: When writing your application, let your enthusiasm for AI and data acquisition shine through. We want to see that you’re genuinely excited about the role and our mission at Poolside.

Tailor Your Experience: Make sure to highlight your relevant experience in web crawling and data pipelines. We’re looking for specific examples that demonstrate your skills and how they align with what we do here at Poolside.

Be Clear and Concise: Keep your application straightforward and to the point. We appreciate clarity, so avoid jargon and focus on communicating your ideas effectively. Remember, less is often more!

Apply Through Our Website: Don’t forget to submit your application through our website! It’s the best way for us to receive your details and ensures you’re considered for the role. We can’t wait to hear from you!

How to prepare for a job interview at poolside

✨Know Your Tech Inside Out

Make sure you’re well-versed in the technologies mentioned in the job description, especially Python and distributed systems. Brush up on your knowledge of web crawling, data pipelines, and cloud platforms like AWS. Being able to discuss your hands-on experience with these tools will show that you're ready to hit the ground running.

✨Understand the Company’s Mission

Dive deep into Poolside's mission and values. Familiarise yourself with their focus on AGI and how they aim to reshape the developer experience. This will not only help you align your answers with their goals but also demonstrate your genuine interest in being part of their journey.

✨Prepare for Technical Challenges

Expect technical questions that assess your problem-solving skills and understanding of large-scale infrastructure. Practice coding problems related to data extraction and web crawling. You might even want to simulate a debugging session to showcase your thought process during the interview.
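
As one possible practice exercise for this tip, here is a minimal Python sketch of two ideas the job description mentions: parsing links out of a page and respecting robots.txt. It is an illustration only, not Poolside's approach or an actual interview question. The function name `allowed_links`, the class `LinkExtractor`, and the user agent `practice-crawler` are hypothetical, and the sketch works on strings you pass in rather than fetching anything over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "practice-crawler"  # hypothetical user agent name


class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def allowed_links(html, base_url, robots_txt):
    """Return de-duplicated links from html that robots_txt permits."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    extractor = LinkExtractor(base_url)
    extractor.feed(html)

    seen = set()
    result = []
    for link in extractor.links:
        if link not in seen and rules.can_fetch(USER_AGENT, link):
            seen.add(link)
            result.append(link)
    return result
```

To try it, pass a small HTML snippet, a base URL such as `https://example.com/`, and a robots.txt string that disallows one path, then check that links under the disallowed path are filtered out. A real crawler would also need rate limiting, retries, and a distributed job queue, which are good follow-up topics to talk through in an interview.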

✨Showcase Your Collaborative Spirit

Since the role involves working closely with researchers and engineers, be prepared to discuss your experience in collaborative environments. Share examples of how you've successfully worked in teams, tackled challenges together, and contributed to a positive team culture. This will resonate well with Poolside's emphasis on low ego and kind-hearted individuals.
