Machine Learning Engineer, Core Data

Full-Time | 60,000 - 80,000 € / year (est.) | No home office possible
Cantina Labs

At a Glance

  • Tasks: Own and enhance datasets for speech systems, ensuring top-notch data quality.
  • Company: Join Cantina Labs, a pioneering social AI company transforming storytelling and creativity.
  • Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
  • Other info: Dynamic team environment with a focus on collaboration and cutting-edge technology.
  • Why this job: Be at the forefront of AI innovation, shaping how we connect and create.
  • Qualifications: Experience in ML-driven data quality systems and proficiency in Python and PyTorch.

The predicted salary is between 60,000 and 80,000 € per year.

About Cantina

Cantina Labs is a social AI company, developing a suite of advanced real‑time models that push the boundaries of expression, personality, and realism. We bring characters to life, transforming how people tell stories, connect, and create. We build and power ecosystems. Cantina, our flagship social AI platform, is just the beginning. If you’re excited about the potential AI has to shape human creativity and social interactions, join us in building the future!

About The Role

We’re looking for an ML Engineer focused on Data Quality to own the datasets that power our speech systems. You will be hands‑on with audio and text data: auditing, denoising, filtering, labeling, and building the tooling and models that turn messy, large‑scale data into reliable training corpora for TTS and adjacent tasks. You’ll develop data quality metrics and classifiers, run human‑in‑the‑loop annotation programs, and integrate quality gates into our training and evaluation pipelines. Your work will directly improve model performance, robustness, and cost by driving the model‑data‑eval flywheel from the data side.

What You’ll Do

  • Dataset ownership: define specs; audit and curate large‑scale audio/text; close corpus gaps and fix sample‑level issues.
  • Quality instrumentation: build automated gates/metrics (e.g., SNR, clipping, VAD, WER, SV/LID, safety) with dashboards; validate against listening tests.
  • Classifiers and filters: train lightweight models to tag, score, and filter data (VAD, ASR gating, LID, SV/diarization, noise/safety); calibrate to subjective outcomes.
  • Cleaning and integrity: apply denoise/dereverb/de‑clip when beneficial; deduplicate and decontaminate; prevent leakage; maintain lineage and versioned releases.
  • Data selection: optimize mixtures via sampling, weighting, curriculum, and active learning; mine hard negatives and long‑tail cases.
  • Tooling and pipelines: ship reproducible ETL and validation; integrate quality gates into training/eval; add monitoring and alerts.
  • Human‑in‑the‑loop and compliance: run MTurk/vendor annotation with strong QC; ensure consent/licensing/policy compliance; collaborate across teams and document datasets.
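
To make the idea of an automated quality gate from the list above a little more concrete, here is a minimal sketch in plain Python. It is an illustration only and not Cantina's actual tooling: the function names, the thresholds, and the frame-energy SNR heuristic are all assumptions, and a production gate would more likely rely on torchaudio or a dedicated SNR estimator and be validated against listening tests.

```python
import math
from typing import Sequence


def clipping_ratio(samples: Sequence[float], threshold: float = 0.999) -> float:
    """Fraction of samples at or above `threshold` in magnitude.

    Assumes a waveform normalised to the range [-1, 1].
    """
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= threshold)
    return clipped / len(samples)


def crude_snr_db(samples: Sequence[float], frame_len: int = 400,
                 noise_fraction: float = 0.1) -> float:
    """Rough SNR heuristic: treat the quietest frames as the noise floor.

    This is a hypothetical stand-in for a real estimator; it only compares
    mean frame energy against the mean energy of the quietest frames.
    """
    n_frames = len(samples) // frame_len
    if n_frames < 2:
        raise ValueError("need at least two full frames")
    energies = sorted(
        sum(s * s for s in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        + 1e-12  # avoid log(0) on digital silence
        for i in range(n_frames)
    )
    n_noise = max(1, int(n_frames * noise_fraction))
    noise = sum(energies[:n_noise]) / n_noise
    signal = sum(energies) / n_frames
    return 10.0 * math.log10(signal / noise)


def passes_gate(samples: Sequence[float], max_clip: float = 0.001,
                min_snr_db: float = 15.0) -> bool:
    """Accept a clip only if it is barely clipped and clearly above the noise floor."""
    return (clipping_ratio(samples) <= max_clip
            and crude_snr_db(samples) >= min_snr_db)
```

In practice the thresholds (`max_clip`, `min_snr_db`) would be calibrated against subjective outcomes rather than fixed up front, which is what the role description means by validating gates against listening tests.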

What You’ll Bring

  • Strong experience building ML‑driven data quality systems for audio/speech, or equivalent data‑centric ML experience with a track record of improving model outcomes via better data.
  • Proficiency in Python and PyTorch, including training/fine‑tuning SSL‑ASR models (Whisper, Wav2Vec, BERT) and CNN‑based classifiers, and writing robust production code.
  • Audio/speech fundamentals: torchaudio/librosa/ffmpeg, spectrogram features (e.g., log‑mel, MFCC), VAD/SAD, basic DSP, and audio QA.
  • Scalable data engineering skills: Spark/Beam or similar, SQL, Airflow or equivalent orchestration, and cloud storage/computing (AWS/GCP).
  • Familiarity with ASR/TTS metrics and tooling: WER, MOS/MOSNet, PESQ/STOI/ViSQOL, speaker verification (EER), diarization, language ID.
  • Experience with dataset validation, versioning, and experiment tracking; comfort debugging data issues from single samples to fleet‑wide trends.
  • Ability to balance rigor with speed, and to translate ambiguous requirements into measurable data improvements.

Preferred Experience

  • Shipped datasets and/or data quality tooling that moved the needle for TTS/ASR/VC in production.
  • Built and deployed classifiers for LID, SV/diarization, VAD, noise/glitch detection, or safety/content moderation for audio.
  • Ran crowdsourcing/vendor annotation at scale with strong quality control (honeypots, IAA, label aggregation).
  • Background in de‑noising/enhancement and their effects on downstream TTS quality.
  • Contributions to open‑source or publications in speech/audio/ML.
  • Experience with data governance, consent tracking, and policy enforcement.

Machine Learning Engineer, Core Data employer: Cantina Labs

Cantina Labs is an exceptional employer for those passionate about the intersection of AI and creativity. With a vibrant work culture that fosters innovation and collaboration, employees are encouraged to explore their ideas while contributing to groundbreaking projects in social AI. The company offers ample opportunities for professional growth, competitive benefits, and the chance to be part of a team that is shaping the future of storytelling and human interaction through advanced technology.

Contact Detail:

Cantina Labs Recruiting Team

StudySmarter Expert Advice🤫

We think this is how you could land the Machine Learning Engineer, Core Data role

Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have the inside scoop on job openings or can refer you directly.

Tip Number 2

Show off your skills! Create a portfolio showcasing your projects, especially those related to audio and speech data. This is your chance to demonstrate your expertise in ML-driven data quality systems and grab attention.

Tip Number 3

Prepare for interviews by brushing up on your technical knowledge and problem-solving skills. Be ready to discuss your experience with Python, PyTorch, and any relevant tools you've used. Practice common ML interview questions to boost your confidence.

Tip Number 4

Apply through our website! We love seeing candidates who are genuinely interested in joining us at Cantina Labs. Tailor your application to highlight how your skills align with our mission of pushing the boundaries of AI in storytelling.

We think you need these skills to ace the Machine Learning Engineer, Core Data role

Machine Learning
Data Quality Systems
Python
PyTorch
SSL-ASR Models
CNN Classifiers
Audio/Speech Fundamentals

Some tips for your application 🫡

Tailor Your CV: Make sure your CV is tailored to the role of Machine Learning Engineer. Highlight your experience with data quality systems, audio/speech fundamentals, and any relevant projects that showcase your skills in Python and PyTorch.

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're excited about the role and how your background aligns with Cantina's mission. Don't forget to mention specific experiences that demonstrate your ability to improve model outcomes through better data.

Showcase Your Projects: If you've worked on any relevant projects, whether personal or professional, make sure to include them. This could be anything from building classifiers to running crowdsourcing annotation. Real-world examples can really set you apart!

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to see your application and ensures you’re considered for the role. Plus, it shows you’re keen on joining our team at Cantina!

How to prepare for a job interview at Cantina Labs

Know Your Data Inside Out

Make sure you’re familiar with the datasets relevant to the role. Understand the nuances of audio and text data, and be ready to discuss how you’ve audited, curated, or improved data quality in your previous projects.

Showcase Your Technical Skills

Be prepared to demonstrate your proficiency in Python and PyTorch. Bring examples of your work with ML-driven data quality systems, especially those that have improved model outcomes. If you’ve worked with tools like torchaudio or Spark, mention specific projects where you applied these skills.

Discuss Quality Metrics and Tools

Familiarise yourself with key metrics like WER, MOS, and VAD. Be ready to explain how you’ve implemented quality instrumentation in past roles, and how you’ve used dashboards to monitor data quality. This shows you understand the importance of data integrity in machine learning.
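
If you want to refresh one of these metrics before the interview, word error rate (WER) is a good place to start: it is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The sketch below is a minimal pure-Python version; the function name and the simple lowercase-and-split normalisation are assumptions made for illustration, and real evaluations usually apply more careful text normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words.

    Normalisation here is only lowercasing and whitespace splitting,
    which is a simplifying assumption for this sketch.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is worth keeping in mind if you are asked how you would use it as a gating threshold for ASR-filtered data.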

Prepare for Scenario-Based Questions

Expect questions that assess your problem-solving skills. Think about scenarios where you had to clean messy data or implement human-in-the-loop processes. Prepare to discuss how you approached these challenges and what the outcomes were.