At a Glance
- Tasks: Build cutting-edge inference engines for diverse AI systems and optimise performance across hardware.
- Company: Join Callosum, a pioneering company in intelligent systems and AI infrastructure.
- Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
- Other info: Dynamic team environment focused on solving the impossible.
- Why this job: Tackle complex challenges and shape the future of AI with innovative technology.
- Qualifications: Experience with SGLang, vLLM, high-performance Python, and C++/CUDA.
The predicted salary is between £60,000 and £80,000 per year.
About Us
Artificial intelligence scaled on a bet: that bigger models, more identical chips, and more data would keep delivering. As problems grow more complex and the requirements of intelligence more diverse, that bet is breaking down. The next era belongs to heterogeneous intelligence: diverse models on diverse chips, each with distinct strengths, co-evolving into systems whose capabilities no single model or accelerator can reach. Callosum is the Intelligent Systems company. We built the infrastructure to make that possible. Our co-evolution engine optimises simultaneously across workflows, agents, and silicon. We launched in early 2026, showing orders-of-magnitude improvements in performance and a shift in the cost-performance frontier that no single chip or model provider can deliver. We believe intelligence comes from the system, not the model. We are scientists and engineers solving what others consider impossible. If you thrive on hard problems and are passionate about and energised by the scale of the challenge, we'd love to hear from you.
About the Role
Callosum believes that orders-of-magnitude improvements in AI systems will come through application-aware orchestration across heterogeneous hardware. We are building that vision: infrastructure that treats the full landscape of compute, well beyond GPUs, as a unified, co-evolving system. Inference engines were designed for single-model inference on homogeneous GPU clusters; this role takes them beyond that. Working directly on systems like vLLM and SGLang, you will adapt and extend them for heterogeneous resources, making them hardware-aware, with deeper optimisation of scheduling, memory, and execution. The execution strategies you design (parallelism, disaggregation, caching) will define what heterogeneous inference looks like at production scale. Your work ensures that the capabilities exposed by the lower layers of the stack translate into real, measurable gains, setting the new standard for how inference runs on diverse hardware.
What You'll Build
- Contribute upstream to SGLang and vLLM, and maintain internal forks where our requirements diverge
- Improve hardware-awareness within inference engines so that scheduling, memory management, and execution adapt to the capabilities of the underlying accelerator
- Design and implement bespoke parallelism and disaggregation strategies that go beyond default configurations to better exploit heterogeneous hardware
- Work closely with an Accelerator Systems Software engineer to ensure engine-level abstractions map cleanly onto diverse hardware capabilities
What You Bring
- Deep familiarity with the internals of SGLang, vLLM, or comparable inference serving frameworks: scheduler design, memory management, and execution pipelines
- Strong background in high-performance Python and C++/CUDA systems, particularly in the context of ML inference
- Experience designing or implementing parallelism strategies for large model serving
- Understanding of disaggregated serving architectures and the trade-offs involved in separating the stages of a workflow
- Demonstrable record of working effectively in fast-moving open source codebases with evolving APIs and design conventions
Inference Engine Development - Member of Technical Staff employer: Callosum
At Callosum, we pride ourselves on being at the forefront of artificial intelligence innovation, offering a dynamic work environment that fosters creativity and collaboration. Our commitment to employee growth is evident through our focus on cutting-edge projects and the opportunity to work alongside industry-leading scientists and engineers. Located in a vibrant tech hub, we provide a unique chance to tackle complex challenges while enjoying a supportive culture that values diverse perspectives and encourages continuous learning.
StudySmarter Expert Advice🤫
We think this is how you could land the Inference Engine Development - Member of Technical Staff role
✨Tip Number 1
Network like a pro! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have the inside scoop on job openings or can put in a good word for you.
✨Tip Number 2
Show off your skills! Create a portfolio or GitHub repository showcasing your projects related to inference engines or heterogeneous systems. This gives potential employers a taste of what you can do and sets you apart from the crowd.
✨Tip Number 3
Prepare for technical interviews by brushing up on your knowledge of SGLang, vLLM, and parallelism strategies. Practice coding challenges and system design questions that relate to high-performance Python and C++/CUDA systems.
✨Tip Number 4
Don’t forget to apply through our website! We love seeing passionate candidates who resonate with our mission. Tailor your application to highlight how your experience aligns with our vision of heterogeneous intelligence.
Some tips for your application 🫡
Show Your Passion: When you're writing your application, let your enthusiasm for AI and complex problem-solving shine through. We want to see that you're not just ticking boxes but genuinely excited about the challenges we tackle at Callosum.
Tailor Your Experience: Make sure to highlight your experience with SGLang, vLLM, or similar frameworks. We’re looking for specific examples of how you've worked on scheduling, memory management, or execution pipelines, so don’t hold back!
Be Clear and Concise: While we love detail, clarity is key! Keep your application straightforward and to the point. Use bullet points if it helps to make your achievements stand out; we want to see what you can bring to the table without wading through fluff.
Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way to ensure your application gets into the right hands. Plus, it shows us you’re keen on being part of our team at Callosum!
How to prepare for a job interview at Callosum
✨Know Your Frameworks
Make sure you’re well-versed in SGLang, vLLM, or similar inference serving frameworks. Brush up on scheduler design, memory management, and execution pipelines. Being able to discuss your experience with these frameworks will show that you’re ready to hit the ground running.
✨Showcase Your Coding Skills
Since high-performance Python and C++/CUDA systems are crucial for this role, prepare to demonstrate your coding skills. You might be asked to solve a problem on the spot, so practice coding challenges related to ML inference to build your confidence.
✨Understand Heterogeneous Systems
Familiarise yourself with the concept of heterogeneous hardware and how it impacts inference engines. Be ready to discuss your understanding of disaggregated serving architectures and the trade-offs involved. This knowledge will help you stand out as someone who can contribute to the team’s vision.
✨Be Ready for Open Source Discussions
Since the role involves working with fast-moving open source codebases, prepare to talk about your past experiences in this area. Highlight any contributions you've made, how you adapted to evolving APIs, and how you collaborated with others in the community.