Realtime AI Inference Engineer for On-Device, Low-Latency

Kindredventures

Full-Time, £60,000 - £80,000 / year (est.), hybrid (London-based)

At a Glance

Tasks: Architect and build high-performance inference engines for real-time AI in gaming.
Company: Join Iconic, a pioneering startup at the forefront of AI and interactive entertainment.
Benefits: Competitive salary, equity, 25 days leave, private healthcare, and hybrid work.
Other info: Be part of a friendly, inclusive culture with exciting team socials and game breaks.
Why this job: Shape the future of storytelling with cutting-edge AI technology and creative freedom.
Qualifications: MSc or PhD in Computer Science or equivalent experience; strong C/C++ skills required.

The predicted salary is between £60,000 and £80,000 per year.

The Mission

At Iconic, our virtual actors don't just generate “text” or “actions”—they perform. They need to speak, move, and perceive in milliseconds, often running locally on a player's machine alongside a rendering engine. You will bridge the gap between massive research models and the constraints of real-time interactive entertainment.

The Role

You will architect and build the inference engine that powers our digital entities. Your main task will be tearing apart model architectures to make them run as fast as possible on consumer hardware while keeping their abilities intact for the intended use. As part of a small, focused team, you'll have significant autonomy and end-to-end ownership. You will work at the intersection of System ML and Game Tech. You might spend one day implementing a custom pruning algorithm for our TTS model, and the next day writing a C++ wrapper to expose that model to our game engine. You will work closely with our Character Research team to ensure that optimization never comes at the cost of the character's soul.

Key Responsibilities

Architect Low-Latency Runtimes: Build and maintain high-performance inference pipelines for Multimodal LLMs, TTS, and Vision models, targeting both server-side (H100/A100) and consumer edge (RTX 5090, Apple Silicon) environments.
State-of-the-Art Optimization: Implement advanced techniques like Speculative Decoding, KV-Cache quantization, PagedAttention, and Layer Pruning to minimize Time-To-First-Token (TTFT) and Time-Per-Output-Token (TPOT) while maximizing throughput.
Model Compression: Lead our efforts in post-training quantization (AWQ, GPTQ, GGUF) and distillation to fit massive models into consumer VRAM budgets.
Engine Integration: Collaborate with the game engineering team to ensure thread-safe, non-blocking asynchronous inference within the game loop.
Custom Kernel Development: Write custom ops in CUDA, Triton, or Metal when off-the-shelf kernels aren't fast enough.
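
For readers less familiar with the two latency metrics named above, here is a minimal Python sketch of how TTFT and TPOT are commonly computed from token arrival timestamps. It is an illustration only, not Iconic's tooling: the function name `latency_metrics` and its parameters are hypothetical, and it assumes TPOT is taken as the mean gap between consecutive tokens after the first.

```python
from typing import Sequence, Tuple


def latency_metrics(request_start: float,
                    token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in seconds.

    request_start: time the request was issued (hypothetical parameter).
    token_times:   arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one generated token")

    # Time-To-First-Token: delay before the first token arrives.
    ttft = token_times[0] - request_start

    # Time-Per-Output-Token: mean gap between tokens after the first.
    # With a single token there are no gaps, so report 0.0.
    if len(token_times) == 1:
        return ttft, 0.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot
```

For example, a request issued at t = 0.0 s whose four tokens arrive at 0.25, 0.5, 0.75 and 1.0 s has a TTFT of 0.25 s and a TPOT of 0.25 s under this definition.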

Requirements

MSc or PhD in Computer Science, Machine Learning, or a related field (or equivalent industry experience).
Strong experience with model optimization techniques (quantization, pruning, distillation, knowledge transfer).
Experience with LLM-specific inference optimizations (KV-cache management, speculative decoding, attention mechanisms).
Proficiency in C/C++.
Hands-on experience deploying ML models on-device or in latency-sensitive environments.
Proficiency in Python and deep learning frameworks (PyTorch, JAX, or TensorFlow).
Experience with inference optimization tools and runtimes (TensorRT, ONNX Runtime, Core ML, or similar).
Strong systems and engineering skills.
Excellent collaboration and communication skills.

Nice to Have

Experience with On-Device AI stacks: ExecuTorch, Core ML, MLX, or ONNX Runtime.
Experience in CUDA programming.
Familiarity with non-NVIDIA compute (AMD/ROCm, DirectML, Vulkan Compute).
Background in real-time systems or game engines (Unreal, Unity) or Real-Time Rendering.
Publications or demonstrated work in efficient ML or model compression (NeurIPS, ICML, MLSys, etc.) or open-source contributions to projects like vLLM, SGLang, llama.cpp, or bitsandbytes.

Why Join Us

Be a foundational member of a team innovating at the intersection of AI, art, and storytelling. You'll help shape the research direction, culture, and technical foundations of a company building toward something genuinely new.

What we offer

Competitive salary and equity compensation.
25 days annual leave + bank holidays.
Private healthcare.
Based in London with hybrid work.
Inclusive & friendly company culture with socials and game breaks.

About Iconic

Iconic Interactive is a seed-stage startup building AI that breathes life into virtual worlds. The future of entertainment is personal: entire universes shaped around each of us, where you are not watching a story but living at the center of one, shaping it. We're building every layer of intelligence these experiences need: characters that feel and convey meaning, narrators that weave your story, and world directors that act like an ever-present game master: adapting, orchestrating, surprising. We're a growing team tackling some of the most fascinating problems in AI: creating minds that inhabit and shape new worlds.

Realtime AI Inference Engineer for On-Device, Low-Latency employer: Kindredventures

At Iconic, we pride ourselves on being an innovative employer at the forefront of AI and interactive entertainment. Our inclusive and friendly culture fosters collaboration and creativity, allowing you to take ownership of your projects while working alongside passionate professionals in a hybrid environment based in London. With competitive salaries, equity compensation, and ample opportunities for personal and professional growth, joining our team means becoming part of a groundbreaking journey that shapes the future of storytelling.

Contact Details:

Kindredventures Recruitment Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Realtime AI Inference Engineer for On-Device, Low-Latency role

✨Tip Number 1

Network like a pro! Reach out to folks in the industry on LinkedIn or at events. A friendly chat can open doors that a CV just can't.

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repo showcasing your projects, especially those related to AI and game tech. It’s a great way to demonstrate what you can do beyond the application.

✨Tip Number 3

Prepare for interviews by practising common technical questions and coding challenges. We recommend using platforms like LeetCode or HackerRank to sharpen your skills before the big day.

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re genuinely interested in joining our team!

We think you need these skills to ace the Realtime AI Inference Engineer for On-Device, Low-Latency role

Model Optimization Techniques

Quantization

Pruning

Distillation

Knowledge Transfer

LLM-specific Inference Optimizations

KV-cache Management

Speculative Decoding

Attention Mechanisms

C/C++ Proficiency

Python Proficiency

Deep Learning Frameworks (PyTorch, JAX, TensorFlow)

Inference Optimization Tools (TensorRT, ONNX Runtime, Core ML)

Custom Kernel Development (CUDA, Triton, Metal)

Real-time Systems or Game Engines Experience

Some tips for your application 🫡

Tailor Your CV: Make sure your CV reflects the skills and experiences that align with the role of Realtime AI Inference Engineer. Highlight your expertise in model optimization and any relevant projects you've worked on, especially those involving low-latency environments.

Craft a Compelling Cover Letter: Use your cover letter to tell us why you're passionate about AI and real-time systems. Share specific examples of how you've tackled similar challenges in the past, and don't forget to mention your enthusiasm for working at the intersection of AI and game tech!

Showcase Your Projects: If you've got any personal or professional projects that demonstrate your skills in C++, Python, or model optimization techniques, make sure to include them. We love seeing practical applications of your knowledge, so don’t hold back!

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for this exciting opportunity to join our innovative team at Iconic!

How to prepare for a job interview at Kindredventures

✨Know Your Tech Inside Out

Make sure you’re well-versed in the specific model optimisation techniques mentioned in the job description, like quantisation and pruning. Brush up on your C/C++ skills and be ready to discuss how you've applied these in real-time systems or game engines.

✨Showcase Your Problem-Solving Skills

Prepare to discuss past projects where you tackled complex challenges, especially those involving low-latency environments. Be ready to explain your thought process and the steps you took to optimise performance while maintaining functionality.

✨Collaborate Like a Pro

Since this role involves working closely with the Character Research team, think of examples that highlight your collaboration and communication skills. Be prepared to discuss how you’ve successfully worked in teams to achieve common goals, especially in tech-focused projects.

✨Demonstrate Your Passion for AI and Gaming

Express your enthusiasm for the intersection of AI and interactive entertainment. Share any personal projects or contributions to open-source initiatives that align with the company’s mission. This will show your genuine interest and commitment to the field.
