Realtime AI Inference Engineer for On-Device, Low-Latency in London

London | Full-Time | £60,000 - £80,000 per year (est.) | Hybrid

At a Glance

Tasks: Architect and build high-performance inference engines for real-time AI in gaming.
Company: Join Iconic, a pioneering startup at the forefront of AI and interactive entertainment.
Benefits: Competitive salary, equity, 25 days leave, private healthcare, and hybrid work options.
Other info: Be part of a friendly, inclusive culture with exciting team socials and game breaks.
Why this job: Shape the future of storytelling with cutting-edge AI technology and creative freedom.
Qualifications: MSc/PhD in Computer Science or equivalent experience; strong skills in model optimization and C/C++.

The predicted salary is between £60,000 and £80,000 per year.

The Mission

At Iconic, our virtual actors don't just generate “text” or “actions”—they perform. They need to speak, move, and perceive in milliseconds, often running locally on a player's machine alongside a rendering engine. You will bridge the gap between massive research models and the constraints of real-time interactive entertainment.

The Role

You will architect and build the inference engine that powers our digital entities. Your main task will be tearing apart model architectures to make them run as fast as possible on consumer hardware while keeping their abilities intact for the intended use. As part of a small, focused team, you'll have significant autonomy and end-to-end ownership. You will work at the intersection of Systems ML and Game Tech. You might spend one day implementing a custom pruning algorithm for our TTS model, and the next day writing a C++ wrapper to expose that model to our game engine. You will work closely with our Character Research team to ensure that optimization never comes at the cost of the character's soul.

Key Responsibilities

Architect Low-Latency Runtimes: Build and maintain high-performance inference pipelines for Multimodal LLMs, TTS, and Vision models, targeting both server-side (H100/A100) and consumer edge (RTX 5090, Apple Silicon) environments.
State-of-the-Art Optimization: Implement advanced techniques like Speculative Decoding, KV-Cache quantization, PagedAttention, and Layer Pruning to minimize Time-To-First-Token (TTFT) and Time-Per-Output-Token (TPOT) while maximizing throughput.
Model Compression: Lead our efforts in post-training quantization (AWQ, GPTQ, GGUF) and distillation to fit massive models into consumer VRAM budgets.
Engine Integration: Collaborate with the game engineering team to ensure thread-safe, non-blocking asynchronous inference within the game loop.
Custom Kernel Development: Write custom ops in CUDA, Triton, or Metal when off-the-shelf kernels aren't fast enough.

Requirements

MSc or PhD in Computer Science, Machine Learning, or a related field (or equivalent industry experience)
Strong experience with model optimization techniques (quantization, pruning, distillation, knowledge transfer)
Experience with LLM-specific inference optimizations (KV-cache management, speculative decoding, attention mechanisms)
Proficiency in C/C++
Hands-on experience deploying ML models on-device or in latency-sensitive environments
Proficiency in Python and deep learning frameworks (PyTorch, JAX, or TensorFlow)
Experience with inference optimization tools and runtimes (TensorRT, ONNX Runtime, Core ML, or similar)
Strong systems and engineering skills
Excellent collaboration and communication skills

Nice to Have

Experience with On-Device AI stacks: ExecuTorch, Core ML, MLX, or ONNX Runtime
Experience in CUDA programming
Familiarity with non-NVIDIA compute (AMD/ROCm, DirectML, Vulkan Compute)
Background in real-time systems, game engines (Unreal, Unity), or real-time rendering
Publications or demonstrated work in efficient ML or model compression (NeurIPS, ICML, MLSys, etc.) or open-source contributions to projects like vLLM, SGLang, llama.cpp, or bitsandbytes

Why Join Us

Be a foundational member of a team innovating at the intersection of AI, art, and storytelling. You'll help shape the research direction, culture, and technical foundations of a company building toward something genuinely new.

What we offer

Competitive salary and equity compensation
25 days annual leave + bank holidays
Private healthcare
Based in London with hybrid work
Inclusive & friendly company culture with socials and game breaks

About Iconic

Iconic Interactive is a seed-stage startup building AI that breathes life into virtual worlds. The future of entertainment is personal: entire universes shaped around each of us, where you are not watching a story but living at the center of one, shaping it. We're building every layer of intelligence these experiences need: characters that feel and convey meaning, narrators that weave your story, and world directors that act like an ever-present game master: adapting, orchestrating, surprising. We're a growing team tackling some of the most fascinating problems in AI: creating minds that inhabit and shape new worlds.

Realtime AI Inference Engineer for On-Device, Low-Latency in London employer: Kindredventures

At Iconic, we pride ourselves on being an innovative employer at the forefront of AI and interactive entertainment. Our inclusive and friendly culture fosters collaboration and creativity, allowing you to take ownership of your projects while working alongside passionate professionals in a hybrid environment based in London. With competitive salaries, generous leave policies, and opportunities for personal and professional growth, joining our team means becoming part of a groundbreaking journey in shaping the future of storytelling.

Contact Details:

Kindredventures Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land the Realtime AI Inference Engineer for On-Device, Low-Latency role in London

✨Tip Number 1

Network like a pro! Reach out to folks in the industry on LinkedIn or at events. A friendly chat can open doors that a CV just can't.

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repo showcasing your projects, especially those related to AI and game tech. It’s a great way to demonstrate what you can do beyond the application.

✨Tip Number 3

Prepare for interviews by practicing common technical questions and coding challenges. We recommend using platforms like LeetCode or HackerRank to sharpen your skills before the big day.

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re genuinely interested in joining our team!

We think you need these skills to ace the Realtime AI Inference Engineer for On-Device, Low-Latency role in London

Model Optimization Techniques
Quantization
Pruning
Distillation
Knowledge Transfer
LLM-specific Inference Optimizations
KV-cache Management
Speculative Decoding
Attention Mechanisms
C/C++ Proficiency
Python Proficiency
Deep Learning Frameworks (PyTorch, JAX, TensorFlow)
Inference Optimization Tools (TensorRT, ONNX Runtime, Core ML)
Custom Kernel Development (CUDA, Triton, Metal)
Real-time Systems or Game Engines Experience

Some tips for your application 🫡

Tailor Your CV: Make sure your CV is tailored to the role of Realtime AI Inference Engineer. Highlight your experience with model optimization techniques and any relevant projects that showcase your skills in low-latency environments.

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about AI and how your background aligns with our mission at Iconic. Don’t forget to mention specific experiences that relate to the job description.

Showcase Your Technical Skills: Be sure to include any hands-on experience you have with C/C++, Python, and deep learning frameworks like PyTorch or TensorFlow. Mention any custom kernel development or inference optimization tools you've worked with to stand out!

Apply Through Our Website: We encourage you to apply through our website for the best chance of getting noticed. It’s the easiest way for us to keep track of your application and ensure it reaches the right people!

How to prepare for a job interview at Kindredventures

✨Know Your Tech Inside Out

Make sure you’re well-versed in the specific model optimisation techniques mentioned in the job description, like quantisation and pruning. Brush up on your C/C++ skills and be ready to discuss how you've applied these techniques in real-world scenarios.

✨Showcase Your Problem-Solving Skills

Prepare to discuss past projects where you tackled latency issues or optimised models for on-device performance. Use concrete examples to illustrate your thought process and how you approached challenges, especially in real-time systems or game engines.

✨Collaborate Like a Pro

Since this role involves working closely with the Character Research team and game engineers, be ready to demonstrate your collaboration skills. Share experiences where you successfully worked in a team setting, highlighting how you communicated technical concepts to non-technical colleagues.

✨Ask Insightful Questions

Prepare thoughtful questions about the company’s approach to AI and game tech. Inquire about their current projects or challenges they face in building low-latency inference engines. This shows your genuine interest and helps you gauge if the company is the right fit for you.
