AI Inference Engineer (Member of Technical Staff) in London

London | Full-Time | 60000 - 80000 € / year (est.) | No home office possible
Deepstreamtech

At a Glance

  • Tasks: Join our team to build and optimise AI inference engines for cutting-edge models.
  • Company: Dynamic tech company focused on innovative AI solutions.
  • Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
  • Other info: Fast-paced environment with a focus on collaboration and continuous learning.
  • Why this job: Make a real impact in AI by working with advanced technologies and diverse projects.
  • Qualifications: 3+ years in software engineering with expertise in ML inference and GPU programming.

The predicted salary is between 60000 and 80000 € per year.

Requirements:

  • Deep experience with GPU programming and performance work (CUDA, Triton, CUTLASS, or similar).
  • Any other deep systems programming experience is a plus.
  • You understand modern LLM architectures and are able to bring them up reliably in a production environment.
  • You've built and operated production distributed systems under real load - ideally performance-critical ones.
  • Comfortable working across languages and layers: Rust for the serving runtime, Python for model code, CUDA/CuTe DSL for kernels.
  • You own problems end-to-end. You can read a research paper on Monday, write a kernel on Wednesday, and debug a production incident on Friday.
  • Self-directed. You do well in fast-moving environments where the path forward isn't laid out for you.
  • (Desirable) ML compilers and framework internals: PyTorch internals, torch.compile, custom operators.
  • (Desirable) Distributed GPU communication: NCCL, NVLink, InfiniBand, RDMA libraries, model/tensor parallelism.
  • (Desirable) Low-precision inference: INT8/FP8/FP4 quantization, mixed-precision serving.
  • (Desirable) Profiling and debugging tools: Nsight Compute/Systems, CUDA-GDB, PTX/SASS analysis.
  • (Desirable) Container orchestration: Kubernetes, GPU scheduling, autoscaling inference workloads.
  • 3+ years of professional software engineering experience with meaningful work on ML inference or high-performance systems.
  • Familiarity with at least one deep learning framework (PyTorch, JAX, TensorFlow).
  • Understanding of GPU architectures (memory hierarchy, warp scheduling, tensor cores).
  • Understanding of common LLM architectures and inference optimization techniques (e.g. quantization, speculative decoding, prefill-decode disaggregation).

What the job involves:

  • We are looking for an AI Inference Engineer to join our growing team. We build and run the inference engine behind every Perplexity query and deploy dozens of model architectures at scale with tight latency and cost budgets.
  • Our stack is Rust, Python, CUDA, and CuTe DSL.
  • New model support: Bring up transformer-based retrieval, text-generation, and multimodal models in our inference infrastructure, covering weight loading, request scheduling, KV-cache management, and API Gateway support.
  • GPU kernel migration to CuTe DSL: Port our in-house CUDA kernels to NVIDIA's CuTe DSL so they run on GB200 today and are portable to Vera Rubin racks tomorrow.
  • Rust-native serving runtime: Develop our internal Rust-based inference server to solve all Python pains and keep up with rapidly growing traffic.
  • Performance optimisation: Profile and fix bottlenecks across the whole request path, from network ingress through continuous batching to GPU kernel interleaving.
  • Reliability and observability: Build dashboards, alerts, and automated remediation so we catch regressions before users do. Respond to and learn from production incidents.

AI Inference Engineer (Member of Technical Staff) in London employer: Deepstreamtech

Join a dynamic and innovative team as an AI Inference Engineer, where you'll have the opportunity to work with cutting-edge technologies in a fast-paced environment. Our company fosters a collaborative work culture that encourages self-direction and continuous learning, providing ample opportunities for professional growth and development. Located in a vibrant tech hub, we offer competitive benefits and a unique chance to contribute to impactful projects that shape the future of AI.

Contact Details:

Deepstreamtech Recruiting Team

StudySmarter Expert Advice🤫

We think this is how you could land the AI Inference Engineer (Member of Technical Staff) role in London

Tip Number 1

Network, network, network! Get out there and connect with folks in the industry. Attend meetups, webinars, or even just grab a coffee with someone who’s already in the role you want. You never know who might have the inside scoop on job openings!

Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those involving GPU programming or ML inference. This gives potential employers a taste of what you can do and sets you apart from the crowd.

Tip Number 3

Prepare for technical interviews by brushing up on your knowledge of modern LLM architectures and performance optimisation techniques. Practice coding challenges and system design questions that relate to the job description to boost your confidence.

Tip Number 4

Don’t forget to apply through our website! We’re always on the lookout for talented individuals like you. Tailor your application to highlight your experience with Rust, Python, and CUDA, and let us know how you can contribute to our team.

We think you need these skills to ace the AI Inference Engineer (Member of Technical Staff) role in London

GPU Programming
CUDA
Triton
CUTLASS
Deep Systems Programming
Modern LLM Architectures
Production Distributed Systems

Some tips for your application 🫡

Show Off Your Skills: Make sure to highlight your deep experience with GPU programming and any performance work you've done. We want to see your expertise in CUDA, Triton, or similar technologies right from the get-go!

Tailor Your Application: Don’t just send a generic application! Tailor your CV and cover letter to reflect how your experience aligns with our needs, especially around modern LLM architectures and production environments. We love seeing how you can bring your unique skills to our team.

Be Yourself: Let your personality shine through in your written application. We’re looking for self-directed individuals who thrive in fast-moving environments, so don’t be afraid to show us how you tackle challenges and own problems end-to-end.

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to keep track of your application and ensure it gets the attention it deserves. Plus, it shows you’re keen on joining our team!

How to prepare for a job interview at Deepstreamtech

Know Your Tech Inside Out

Make sure you’re well-versed in GPU programming and the specific tools mentioned in the job description, like CUDA and Triton. Brush up on your knowledge of modern LLM architectures and be ready to discuss how you've implemented them in production environments.

Showcase Your Problem-Solving Skills

Prepare examples that demonstrate your ability to own problems end-to-end. Think of a time when you read a research paper, wrote a kernel, and debugged an incident all in one week. This will show your versatility and self-direction in fast-paced settings.

Familiarise Yourself with the Stack

Get comfortable with the tech stack mentioned, especially Rust, Python, and CUDA. If you have experience with container orchestration tools like Kubernetes, be ready to discuss how you've used them to manage workloads effectively.

Prepare for Technical Questions

Expect deep technical questions about ML inference, performance optimisation, and debugging tools. Review profiling techniques and be prepared to explain how you’ve tackled performance bottlenecks in previous projects.