Inference System & Performance Engineer - Member of Technical Staff

Full-Time, £80,000 - £100,000 per year (est.), no working from home possible

At a Glance

Tasks: Design and optimise cutting-edge inference systems across diverse hardware environments.
Company: Join Callosum, a pioneering Intelligent Systems Company in London.
Benefits: Competitive salary, equity, private healthcare, and relocation support.
Other info: Inclusive workplace committed to equal opportunities and personal growth.
Why this job: Be at the forefront of AI innovation and tackle complex system challenges.
Qualifications: Strong background in systems engineering and experience with GPU workloads.

The predicted salary is between £80,000 and £100,000 per year.

About Us

The last era of AI scaled on a single bet: bigger models, more identical chips, more data. As problems grow more complex and the requirements of intelligence more diverse, that bet is breaking down. Real-world problems are heterogeneous: no single model or chip can solve them alone. The next era of AI requires heterogeneity at the infrastructure level - diverse models on diverse chips, each with distinct strengths, co-evolving into systems of capability that move the Pareto frontier of what is possible. That's what we are building.

Callosum is the Intelligent Systems Company. We started by questioning what actually creates intelligence. We believe there is no single answer, but rather a system-level solution. We co-evolve models, workflows, and silicon together to show that intelligence does not come from a single component but emerges from a diversity of co-optimised mechanisms that work together and are aware of each other. Heterogeneity will define the next era of compute, and it is a principle that holds in biological, neuronal, and economic systems alike. In early 2026 we launched with results showing orders-of-magnitude improvements in performance, and this is only the beginning.

Agentic AI is the future of how intelligence is deployed: multi-step, long-horizon, and operating in changing environments. These systems are inherently heterogeneous, and can only be as powerful as the infrastructure that runs them.

We are engineers and scientists based in London, working together across the full depth of the stack. We are curious, intellectually honest, and building what doesn't exist yet. If you thrive on uncharted territory and are energised by the scale of the challenge, we'd love to hear from you.

About the Role

Standard inference architectures typically focus on monolithic chip types and model classes. Callosum intentionally breaks this mould, operating heterogeneous hardware at scale across a diverse model portfolio. Success in this environment requires an inference layer built entirely from first principles. Sitting at the heart of our technical mission, this position owns end-to-end performance for our inference platforms. Your focus will span KV cache strategies, batching internals, memory management, and multi-node scheduling. You will develop the core software driving execution speed, silicon efficiency, and platform scalability as we expand our hardware and model footprint. This is a high-leverage role tackling complex system challenges across the entire stack.

What You’ll Build

Design and optimise inference serving systems across heterogeneous multi-GPU and multi-node environments
Own KV cache lifecycle management, batching strategies, and memory allocation to maximise throughput and minimise latency
Profile and tune GPU kernels, identify bottlenecks across compute, memory, and network, and implement targeted optimisations
Build and improve scheduling logic for continuous batching, disaggregated prefill/decode, and speculative decoding
Work with networking primitives - NCCL, NVLink, RDMA, InfiniBand, RoCE - to optimise communication across distributed inference workloads
Develop tooling for performance visibility, regression detection, and benchmarking across hardware configurations

What You Bring

Deep understanding of LLM inference internals: KV cache lifecycle, memory management, attention mechanisms, and serving architectures
Strong systems engineering background with proven experience optimising distributed GPU workloads
Proficiency in C++, CUDA, Python, Rust, or similar - and the instinct to go low-level when it matters
Hands‑on debugging skills across GPU, networking, and Linux systems - able to work from first principles with limited tooling

What Sets You Apart

Experience building or significantly optimising production‑grade, high‑throughput model serving stacks
Multi‑GPU and multi‑node inference optimisation using NCCL, NVLink, RDMA, InfiniBand, or RoCE
GPU memory profiling and CUDA or Triton kernel optimisation
Linux performance analysis and optimisation

What We Offer

Competitive salary, determined by skills and experience
Equity and ownership
Private healthcare
Visa sponsorship and relocation benefits to hire the best in the world
We work in person at our London office. You'll have the tools, space and setup to do your best work, and if you have specific needs, just tell us

We're committed to building an inclusive workplace where everyone feels welcome, and believe in equal opportunities for all.

Inference System & Performance Engineer - Member of Technical Staff employer: jobr.pro

At Callosum, we are not just building AI; we are redefining the future of intelligent systems in a collaborative and innovative environment. Located in London, we offer a dynamic work culture that encourages curiosity and intellectual honesty, alongside competitive salaries, equity ownership, and comprehensive healthcare benefits. Our commitment to inclusivity and employee growth ensures that every team member has the opportunity to thrive while tackling complex challenges in a cutting-edge field.

Contact Details:

jobr.pro Recruitment Team

We think you need these skills to ace Inference System & Performance Engineer - Member of Technical Staff

Inference Layer Development

KV Cache Lifecycle Management

Batching Strategies

Memory Management

Multi-Node Scheduling

GPU Kernel Profiling and Tuning

Distributed GPU Workload Optimisation

Networking Primitives (NCCL, NVLink, RDMA, InfiniBand, RoCE)

Performance Visibility Tooling

C++ Proficiency

CUDA Proficiency

Python Proficiency

Rust Proficiency

Hands-on Debugging Skills

Linux Performance Analysis
