At a Glance
- Tasks: Design and scale infrastructure for AI, building data pipelines and optimising system performance.
- Company: CoreWeave, a pioneering cloud platform for AI, trusted by top innovators.
- Benefits: Comprehensive health insurance, pension contributions, tuition reimbursement, and a fun work culture.
- Other info: Hybrid work environment with opportunities for collaboration and career growth.
- Why this job: Join a fast-growing team and make a real impact in the AI space.
- Qualifications: 7+ years in data engineering or MLOps, with strong Python skills and experience in distributed systems.
The predicted salary is between €80,000 and €100,000 per year.
CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com. We’re proud to be a Living Wage accredited Employer.
What You’ll Do
The Data Science team is focused on developing an advanced reliability platform. This system covers various aspects of data processing and analysis, including data intake, deriving meaningful metrics, identifying unusual patterns, predicting potential issues, finding slow processes in distributed systems, and using automated analysis to determine causes. We collaborate closely with internal teams like Fleet, Infrastructure, and AI Platform to enhance system stability, optimize resource use, shorten resolution times, and maintain service availability and financial performance.
About The Role
As a Senior Data & MLOps Engineer, you will design and scale the infrastructure supporting the GPU Intelligence Platform. This involves building pipelines for handling data, features, model training, and delivering insights and predictions for system health and optimization. You will transition the system from initial prototypes to a production environment operating across the fleet, focusing on scalability, separating real‑time service from periodic processing, and dynamic resource management based on system load and data frequency. You will architect and deploy these scalable distributed services using orchestration technologies.
Key Responsibilities
- Design and implement scalable data ingestion pipelines.
- Build feature processing and baseline computation systems.
- Productionize models for prediction and detection.
- Develop and operate low‑latency services and robust offline workflows.
- Architect horizontally scalable services with clear separation between components, leveraging orchestration for distribution.
- Implement monitoring and feedback loops for continuous model and signal improvement.
- Collaborate with Platform teams to integrate operational signals into monitoring and diagnostics.
- Implement a scalable solution for mitigation and structured analysis.
Who You Are
- 7+ years of experience in data engineering, distributed systems, MLOps, or infrastructure ML roles in production environments.
- Proven experience building high‑throughput streaming or telemetry pipelines (e.g., Kafka, Pulsar, Kinesis, or equivalent).
- Strong experience designing time‑series feature pipelines and operating large‑scale observability systems.
- Experience building and maintaining feature stores and ensuring offline/online feature parity.
- Hands‑on experience deploying ML models to production, including versioning, monitoring, rollback, and drift detection.
- Experience designing scalable microservices deployed in Kubernetes‑based environments.
- Strong proficiency in Python and at least one systems language (Go, Rust, or C++).
- Experience working with distributed compute or training systems (e.g., NCCL, PyTorch Distributed, Spark, Ray, Slurm).
- Familiarity with GPU telemetry systems such as NVML or DCGM and hardware‑level monitoring concepts.
- Demonstrated experience scaling systems from Proof‑of‑Concept to production‑grade, fleet‑level deployments.
Preferred
- Experience working on GPU fleet management, hyperscale infrastructure, or AI training clusters.
- Experience building anomaly detection or failure prediction systems for hardware or distributed systems.
- Experience implementing distributed straggler detection or collective‑level performance analysis systems.
- Experience developing agentic or LLM‑powered reasoning systems for diagnostics or operational intelligence.
- Background in reliability engineering or SRE practices.
Wondering if you’re a good fit?
You love building systems that turn raw infrastructure telemetry into actionable intelligence. You’re curious about distributed systems failure modes, GPU performance pathologies, and reliability engineering at scale. You’re excited by the idea of moving from anomaly detection to prediction to autonomous root cause reasoning. You enjoy designing platforms that protect uptime, revenue, and customer trust through proactive systems thinking.
Why CoreWeave?
At CoreWeave, we work hard, have fun, and move fast! We’re in an exciting stage of hyper‑growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning.
What We Offer
- Family‑level Medical Insurance
- Family‑level Dental Insurance
- Generous Pension Contribution
- Life Assurance at 4x Salary
- Critical Illness Cover
- Employee Assistance Programme
- Tuition Reimbursement
- Work culture focused on innovative disruption
Our Workplace
While we prioritize a hybrid work environment, remote work may be considered for candidates located more than 30 miles from an office, based on role requirements for specialized skill sets. New hires will be invited to attend onboarding at one of our hubs within their first month. Teams also gather quarterly to support collaboration.
CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information.
Export Control Compliance
This position requires access to export controlled information. To conform to U.S. Government export regulations applicable to that information, applicant must either be (A) a U.S. person, defined as a (i) U.S. citizen or national, (ii) U.S. lawful permanent resident (green card holder), (iii) refugee under 8 U.S.C. 1157, or (iv) asylee under 8 U.S.C. 1158, (B) eligible to access the export controlled information without a required export authorization, or (C) eligible and reasonably likely to obtain the required export authorization from the applicable U.S. government agency. CoreWeave may, for legitimate business reasons, decline to pursue any export licensing process.
Senior Data & MLOps Engineer employer: CoreWeave
CoreWeave is an exceptional employer that champions innovation and growth in the AI sector, offering a dynamic work culture where curiosity and ownership are highly valued. With comprehensive benefits including family-level medical and dental insurance, generous pension contributions, and a commitment to employee development through tuition reimbursement, CoreWeave fosters an inclusive environment that empowers its team members to thrive. In its hybrid work setting, employees enjoy flexibility while collaborating with talented professionals dedicated to pushing the boundaries of technology.
StudySmarter Expert Advice🤫
We think this is how you could land the Senior Data & MLOps Engineer role
✨Tip Number 1
Network like a pro! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have the inside scoop on job openings or can refer you directly.
✨Tip Number 2
Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to data engineering and MLOps. This gives potential employers a taste of what you can do beyond your CV.
✨Tip Number 3
Prepare for interviews by practising common technical questions and scenarios relevant to the role. Mock interviews with friends or using online platforms can help you feel more confident and ready to impress.
✨Tip Number 4
Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re genuinely interested in joining our team at CoreWeave.
Some tips for your application 🫡
Tailor Your Application: Make sure to customise your CV and cover letter for the Senior Data & MLOps Engineer role. Highlight your experience with data pipelines, distributed systems, and any relevant projects that showcase your skills in building scalable solutions.
Showcase Your Technical Skills: We want to see your technical prowess! Include specific examples of your work with Python, Kubernetes, and any streaming technologies like Kafka or Kinesis. Don’t forget to mention your hands-on experience with ML models and observability systems.
Be Clear and Concise: When writing your application, keep it clear and to the point. Use bullet points where possible to make your achievements stand out. We appreciate a well-structured application that’s easy to read!
Apply Through Our Website: Don’t forget to submit your application through our website! It’s the best way for us to receive your details and ensures you’re considered for the role. We can’t wait to see what you bring to the table!
How to prepare for a job interview at CoreWeave
✨Know Your Tech Stack
Make sure you’re well-versed in the technologies mentioned in the job description, like Kafka, Kubernetes, and Python. Brush up on your experience with distributed systems and MLOps practices, as these will likely come up during the interview.
✨Showcase Your Problem-Solving Skills
Prepare to discuss specific examples where you've tackled complex issues in data engineering or infrastructure. Think about times when you’ve built scalable solutions or improved system reliability, and be ready to explain your thought process.
✨Understand CoreWeave's Mission
Familiarise yourself with CoreWeave’s focus on AI and how they support innovators. Being able to articulate how your skills align with their mission will show that you’re genuinely interested in the role and the company.
✨Ask Insightful Questions
Prepare thoughtful questions about the team dynamics, ongoing projects, or future challenges CoreWeave might face. This not only shows your interest but also helps you gauge if the company culture is a good fit for you.