At a Glance
- Tasks: Own and enhance datasets for speech systems, ensuring top-notch data quality.
- Company: Join Cantina Labs, a pioneering social AI company transforming storytelling and creativity.
- Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
- Other info: Dynamic team environment with a focus on collaboration and cutting-edge technology.
- Why this job: Be at the forefront of AI innovation, shaping how we connect and create.
- Qualifications: Experience in ML-driven data quality systems and proficiency in Python and PyTorch.
The predicted salary is between €60,000 and €80,000 per year.
About Cantina
Cantina Labs is a social AI company, developing a suite of advanced real‐time models that push the boundaries of expression, personality, and realism. We bring characters to life, transforming how people tell stories, connect, and create. We build and power ecosystems. Cantina, our flagship social AI platform, is just the beginning.
If you're excited about the potential AI has to shape human creativity and social interactions, join us in building the future!
About The Role
We're looking for an ML Engineer focused on Data Quality to own the datasets that power our speech systems. You will be hands‐on with audio and text data: auditing, denoising, filtering, labeling, and building the tooling and models that turn messy, large‐scale data into reliable training corpora for TTS and adjacent tasks. You'll develop data quality metrics and classifiers, run human‐in‐the‐loop annotation programs, and integrate quality gates into our training and evaluation pipelines. Your work will directly improve model performance, robustness, and cost by driving the model‐data‐eval flywheel from the data side.
What You'll Do
- Dataset ownership: define specs; audit and curate large‐scale audio/text; close corpus gaps and fix sample‐level issues.
- Quality instrumentation: build automated gates/metrics (e.g., SNR, clipping, VAD, WER, SV/LID, safety) with dashboards; validate against listening tests.
- Classifiers and filters: train lightweight models to tag, score, and filter data (VAD, ASR gating, LID, SV/diarization, noise/safety); calibrate to subjective outcomes.
- Cleaning and integrity: apply denoise/dereverb/de‐clip when beneficial; deduplicate and decontaminate; prevent leakage; maintain lineage and versioned releases.
- Data selection: optimize mixtures via sampling, weighting, curriculum, and active learning; mine hard negatives and long‐tail cases.
- Tooling and pipelines: ship reproducible ETL and validation; integrate quality gates into training/eval; add monitoring and alerts.
- Human‐in‐the‐loop and compliance: run MTurk/vendor annotation with strong QC; ensure consent/licensing/policy compliance; collaborate across teams and document datasets.
What You'll Bring
- Strong experience building ML‐driven data quality systems for audio/speech, or equivalent data‐centric ML experience with a track record of improving model outcomes via better data.
- Proficiency in Python and PyTorch: training/fine‐tuning SSL‐ASR models (Whisper, Wav2Vec, BERT), building CNN‐based classifiers, and writing robust production code.
- Audio/speech fundamentals: torchaudio/librosa/ffmpeg, spectrogram features (e.g., log‐mel, MFCC), VAD/SAD, basic DSP, and audio QA.
- Scalable data engineering skills: Spark/Beam or similar, SQL, Airflow or equivalent orchestration, and cloud storage/computing (AWS/GCP).
- Familiarity with ASR/TTS metrics and tooling: WER, MOS/MOSNet, PESQ/STOI/ViSQOL, speaker verification (EER), diarization, language ID.
- Experience with dataset validation, versioning, and experiment tracking; comfort debugging data issues from single samples to fleet‐wide trends.
- Ability to balance rigor with speed, and to translate ambiguous requirements into measurable data improvements.
Preferred Experience
- Shipped datasets and/or data quality tooling that moved the needle for TTS/ASR/VC in production.
- Built and deployed classifiers for LID, SV/diarization, VAD, noise/glitch detection, or safety/content moderation for audio.
- Ran crowdsourcing/vendor annotation at scale with strong quality control (honeypots, IAA, label aggregation).
- Background in de‐noising/enhancement and their effects on downstream TTS quality.
- Contributions to open‐source or publications in speech/audio/ML.
- Experience with data governance, consent tracking, and policy enforcement.
Machine Learning Engineer, Core Data in London (employer: Cantina Labs)
Cantina Labs is an exceptional employer that fosters a culture of innovation and creativity, where employees are empowered to shape the future of social AI. With a focus on professional growth, we offer opportunities for hands-on experience in cutting-edge technology, collaborative projects, and a supportive environment that values diverse perspectives. Located in a vibrant tech hub, our team enjoys access to a dynamic work atmosphere, competitive benefits, and the chance to make a meaningful impact in the world of AI.
StudySmarter Expert Advice 🤫
We think this is how you could land the Machine Learning Engineer, Core Data role in London
✨Tip Number 1
Network like a pro! Reach out to folks in the industry on LinkedIn or at meetups. A friendly chat can sometimes lead to job opportunities that aren't even advertised yet.
✨Tip Number 2
Show off your skills! Create a portfolio showcasing your projects, especially those related to data quality and machine learning. This gives potential employers a taste of what you can do.
✨Tip Number 3
Prepare for interviews by brushing up on common ML concepts and tools. Be ready to discuss your experience with Python, PyTorch, and any relevant audio/speech fundamentals.
✨Tip Number 4
Don't forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are proactive!
Some tips for your application 🫡
Tailor Your CV: Make sure your CV is tailored to the Machine Learning Engineer role. Highlight your experience with data quality systems and any relevant projects you've worked on. We want to see how your skills align with what we're looking for!
Showcase Your Projects: Include specific examples of projects where you've improved model outcomes through better data. If you've built classifiers or worked with audio/speech data, let us know! This is your chance to shine.
Be Clear and Concise: When writing your application, keep it clear and to the point. Use bullet points for easy reading and make sure to highlight your key achievements. We appreciate a well-structured application that gets straight to the good stuff!
Apply Through Our Website: Don't forget to apply through our website! It’s the best way for us to receive your application and ensures you’re considered for the role. We can’t wait to see what you bring to the table!
How to prepare for a job interview at Cantina Labs
✨Know Your Data Inside Out
Make sure you understand the datasets you'll be working with. Familiarise yourself with audio and text data, and be ready to discuss how you've audited, filtered, or labelled similar datasets in the past. This will show your potential employer that you can take ownership of their data quality.
✨Show Off Your Technical Skills
Brush up on your Python and PyTorch skills, especially around training and fine-tuning models like Whisper or Wav2Vec. Be prepared to talk about your experience with scalable data engineering tools like Spark or Airflow, as well as any cloud computing platforms you've used. This is your chance to demonstrate your technical prowess!
✨Prepare for Practical Scenarios
Expect to face practical questions or case studies during the interview. Think about how you would approach building automated quality gates or validating datasets against listening tests. Practising these scenarios can help you articulate your thought process clearly.
✨Highlight Collaboration and Compliance
Since this role involves working across teams and ensuring compliance, be ready to share examples of how you've collaborated with others in previous roles. Discuss any experience you have with crowdsourcing or vendor annotation, and how you maintained quality control throughout the process.