Senior Researcher

Full-Time, £70,000 to £90,000 per year (est.), no working from home possible

At a Glance

Tasks: Lead innovative research in AI infrastructure and optimise GPU systems for real-world impact.
Company: Join CoreWeave, a pioneering cloud platform for AI, trusted by top innovators.
Benefits: Enjoy family-level medical and dental insurance, generous pension contributions, and tuition reimbursement.
Other info: Be part of an inclusive team focused on innovative disruption and career growth.
Why this job: Make a difference in AI reliability and performance while working with cutting-edge technology.
Qualifications: 8+ years in machine learning or applied AI; strong Python skills required.

The predicted salary is between £70,000 and £90,000 per year.

CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com. We’re proud to be a Living Wage accredited Employer.

Role Overview

We are looking for a Senior Researcher to join Monolith’s Research team, now part of CoreWeave. This is a high-impact, high-ownership role for a researcher who combines deep technical expertise in machine learning, statistical modelling, optimisation, and large-scale systems data with the ability to take complex, ambiguous problems from first principles through to production. The Monolith Data Science team is building a layered reliability and intelligence platform that shifts CoreWeave from reactive troubleshooting to proactive reliability engineering. The platform spans telemetry ingestion, feature engineering, anomaly detection, failure prediction, distributed straggler detection, performance modelling, workload optimisation, and agentic root‑cause analysis. You will work closely with Fleet, Infrastructure, AI Platform, engineering, product, and client‑facing teams to improve cluster reliability, increase effective utilisation, reduce MTTR, protect uptime, and turn large‑scale GPU infrastructure telemetry into measurable operational and commercial impact.

What You’ll Do

Research Leadership & Strategy:
Contribute meaningfully to Monolith and CoreWeave’s research direction by identifying high‑leverage problems in GPU infrastructure analytics, cluster reliability, workload performance, scheduling, and utilisation.
Originate novel research directions for turning raw infrastructure telemetry into actionable intelligence.
Evaluate emerging methods across statistical modelling, machine learning, observability, optimisation, simulation, reinforcement learning, anomaly detection, and autonomous diagnostics.
Champion rigour, reproducibility, and scientific integrity across research outputs, experiments, prototypes, and production validation.
Help establish a research foundation for understanding how large‑scale GPU systems behave, why workloads underperform, where bottlenecks emerge, and how reliability can be improved proactively.

Technical Depth & Execution:
Lead the design and development of sophisticated statistical, machine learning, and optimisation systems for large‑scale GPU infrastructure telemetry.
Develop advanced models and methodologies to optimise GPU utilisation, workload scheduling, infrastructure efficiency, and system reliability.
Build models and methods for anomaly detection, failure prediction, distributed straggler detection, degraded workload identification, bottleneck diagnosis, and agentic root‑cause analysis.
Design experiments, analyse large‑scale system telemetry, and prototype predictive and optimisation algorithms that directly inform production systems.
Drive technical decisions on difficult modelling problems involving noisy time‑series data, high‑dimensional telemetry, causal inference, uncertainty, robustness, generalisation, and out‑of‑distribution behaviour.
Explore simulation, digital‑twin, reinforcement learning, and adaptive scheduling approaches where they can improve understanding or optimisation of GPU clusters and distributed training environments.
Take end‑to‑end ownership of research work from problem framing and exploratory analysis through prototype development, validation, and collaboration with engineering teams on production deployment.
Maintain deep personal technical expertise; remain a hands‑on contributor in Python and modern scientific computing / machine learning tooling.

Organisational Influence & Collaboration:
Serve as a strong technical voice within the research organisation, helping shape how Monolith approaches complex infrastructure intelligence problems.
Work closely with Fleet, Infrastructure, AI Platform, engineering, product, and customer‑facing teams to ensure research work lands with real operational and commercial impact.
Translate research findings into production‑ready prototypes, deployable solutions, and technical recommendations that improve performance, reliability, utilisation, and cost efficiency.
Contribute to research practices and norms that improve how the team handles ambiguous, high‑dimensional, real‑world systems problems.
Communicate complex technical work and its implications clearly to a range of audiences, from close technical collaborators to senior leadership and external stakeholders.
Help build a shared understanding of how large‑scale AI infrastructure behaves, where it fails, and how it can be made more reliable, efficient, and intelligent.

Technical Focus

Applied machine learning for GPU infrastructure and distributed systems
Large‑scale telemetry ingestion, feature engineering, and infrastructure analytics
GPU cluster reliability, utilisation, observability, and performance analysis
Anomaly detection, degradation detection, and failure prediction
Distributed straggler detection and workload performance diagnosis
Agentic root‑cause analysis and autonomous diagnostic systems
Time‑series, high‑dimensional, structured, and operational systems data
Performance modelling for distributed workloads and AI training jobs
Workload scheduling, capacity planning, forecasting, and resource allocation modelling
Optimisation techniques including stochastic optimisation, convex optimisation, reinforcement learning, and adaptive scheduling
Simulation and digital‑twin approaches for complex infrastructure systems
Causal inference, controlled experiments, hypothesis testing, and statistical validation
End‑to‑end research systems: data pipelines, prototypes, validation, deployment, and monitoring

What We’re Looking For

8+ years of experience, or equivalent research experience, applying statistical modelling, machine learning, optimisation, or applied AI to large‑scale datasets.
MS or PhD in Computer Science, Statistics, Applied Mathematics, Machine Learning, Physics, Engineering, or a related quantitative field.
Strong proficiency in Python and scientific computing libraries such as NumPy, pandas, SciPy, scikit‑learn, PyTorch, or TensorFlow.
Experience working with large‑scale structured datasets, time‑series data, infrastructure telemetry, performance data, sensor data, or other complex operational data.
Experience designing and analysing controlled experiments, including A/B testing, hypothesis testing, causal inference, or rigorous model validation.
Experience building and validating predictive models in production or research environments.
Experience with distributed data systems such as Spark, Ray, Dask, or similar.
Proficiency in SQL and working with large‑scale structured data.
Strong understanding of optimisation techniques such as linear programming, convex optimisation, stochastic optimisation, reinforcement learning, or adaptive scheduling.
Demonstrated ability to solve ambiguous technical problems where the right approach is not already known.
Ability to translate research findings into production‑ready prototypes, deployable workflows, or operational tooling.
Strong scientific judgement, including experimental design, reproducibility, validation, and awareness of uncertainty.
The ability to communicate clearly and influence across research, engineering, product, infrastructure, and leadership audiences.

Preferred Experience

PhD with published research in systems optimisation, distributed computing, ML systems, performance modelling, reliability engineering, scientific computing, or a related area.
Experience with GPU workloads, distributed training, AI infrastructure, HPC, or large‑scale compute environments.
Familiarity with Kubernetes, containerised workloads, cloud‑native systems, or distributed infrastructure.
Experience developing reinforcement learning, adaptive scheduling, autonomous diagnostics, or agentic systems.
Background in capacity planning, forecasting, resource allocation modelling, or infrastructure efficiency.
Experience with observability, hardware telemetry, performance monitoring, root cause analysis, or failure prediction.
Contributions to open‑source machine learning, systems, infrastructure, or scientific computing projects.

What We Offer

Family‑level Medical Insurance
Family‑level Dental Insurance
Generous Pension Contribution
Life Assurance at 4x Salary
Critical Illness Cover
Employee Assistance Programme
Tuition Reimbursement
Work culture focused on innovative disruption

Equal Opportunity

CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information.

Senior Researcher employer: CoreWeave

CoreWeave is an exceptional employer, offering a dynamic work environment in Greater London where innovation meets reliability. With comprehensive benefits such as family-level medical and dental insurance, pension contributions, and tuition reimbursement, employees are supported both personally and professionally. The company fosters a culture of growth and collaboration, ensuring that every Senior Researcher has the opportunity to develop their skills while contributing to cutting-edge AI infrastructure.

Contact Details:

CoreWeave Recruitment Team

We think you need these skills to succeed as a Senior Researcher

Machine Learning
Statistical Modelling
Optimisation
Large-Scale Systems Data Analysis
Anomaly Detection
Failure Prediction
Distributed Systems
Python Programming
Scientific Computing Libraries (NumPy, pandas, SciPy, scikit-learn, PyTorch, TensorFlow)
SQL
Experimental Design
Causal Inference
Performance Modelling
Data Pipeline Development
Communication Skills
