AI Inference Engineer (Member of Technical Staff)

Full-Time | 60000 - 80000 € / year (est.) | No home office possible
Deepstreamtech

At a Glance

  • Tasks: Join our team to build and optimise AI inference engines for cutting-edge models.
  • Company: Dynamic tech company focused on innovative AI solutions.
  • Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
  • Other info: Fast-paced environment with a focus on collaboration and continuous learning.
  • Why this job: Make a real impact in AI by working with advanced technologies and diverse projects.
  • Qualifications: 3+ years in software engineering with expertise in ML inference and GPU programming.

The predicted salary is between 60000 and 80000 € per year.

Requirements

  • Deep experience with GPU programming and performance work (CUDA, Triton, CUTLASS, or similar).
  • Any other deep systems programming experience is a plus.
  • You understand modern LLM architectures and are able to bring them up reliably in a production environment.
  • You've built and operated production distributed systems under real load - ideally performance-critical ones.
  • Comfortable working across languages and layers: Rust for the serving runtime, Python for model code, CUDA/CuTe DSL for kernels.
  • You own problems end-to-end. You can read a research paper on Monday, write a kernel on Wednesday, and debug a production incident on Friday.
  • Self-directed. You do well in fast-moving environments where the path forward isn't laid out for you.
  • (Desirable) ML compilers and framework internals: PyTorch internals, torch.compile, custom operators.
  • (Desirable) Distributed GPU communication: NCCL, NVLink, InfiniBand, RDMA libraries, model/tensor parallelism.
  • (Desirable) Low-precision inference: INT8/FP8/FP4 quantization, mixed-precision serving.
  • (Desirable) Profiling and debugging tools: Nsight Compute/Systems, CUDA-GDB, PTX/SASS analysis.
  • (Desirable) Container orchestration: Kubernetes, GPU scheduling, autoscaling inference workloads.
  • 3+ years of professional software engineering experience with meaningful work on ML inference or high-performance systems.
  • Familiarity with at least one deep learning framework (PyTorch, JAX, TensorFlow).
  • Understanding of GPU architectures (memory hierarchy, warp scheduling, tensor cores).
  • Understanding of common LLM architectures and inference optimization techniques (e.g. quantization, speculative decoding, prefill-decode disaggregation).

What the job involves

  • We are looking for an AI Inference Engineer to join our growing team. We build and run the inference engine behind every Perplexity query and deploy dozens of model architectures at scale with tight latency and cost budgets.
  • Our stack is Rust, Python, CUDA, and CuTe DSL.
  • New model support. Bring up transformer-based retrieval, text-generation, and multimodal models in our inference infrastructure, covering weight loading, request scheduling, KV-cache management, and API Gateway support.
  • GPU kernel migration to CuTe DSL. Port our in-house CUDA kernels to NVIDIA's CuTe DSL so they run on GB200 today and are portable to Vera Rubin racks tomorrow.
  • Rust-native serving runtime. Develop our internal Rust-based inference server to address the pain points of Python-based serving and keep up with rapidly growing traffic.
  • Performance optimisation. Profile and fix bottlenecks across the whole request path, from network ingress through continuous batching to GPU kernel interleaving.
  • Reliability and observability. Build dashboards, alerts, and automated remediation so we catch regressions before users do. Respond to and learn from production incidents.

AI Inference Engineer (Member of Technical Staff) employer: Deepstreamtech

Join a dynamic and innovative team as an AI Inference Engineer, where you'll have the opportunity to work with cutting-edge technologies in a fast-paced environment. Our company fosters a collaborative work culture that encourages self-direction and continuous learning, providing ample opportunities for professional growth and development. Located in a vibrant tech hub, we offer competitive benefits and a unique chance to contribute to impactful projects that shape the future of AI.

Contact Detail:

Deepstreamtech Recruiting Team

StudySmarter Expert Advice🤫

We think this is how you could land the AI Inference Engineer (Member of Technical Staff) role

Tip Number 1

Network, network, network! Get out there and connect with folks in the industry. Attend meetups, webinars, or even online forums related to AI and GPU programming. You never know who might have a lead on your dream job!

Tip Number 2

Show off your skills! Create a portfolio showcasing your projects, especially those involving CUDA, Rust, or any deep learning frameworks. Having tangible examples of your work can really set you apart when chatting with potential employers.

Tip Number 3

Don’t just apply blindly! Tailor your approach for each company. Research their tech stack and mention how your experience aligns with their needs. When you apply through our website, make sure to highlight your relevant skills and experiences that match the job description.

Tip Number 4

Prepare for technical interviews by brushing up on your problem-solving skills. Practice coding challenges and be ready to discuss your past projects in detail. Remember, they want to see how you think and tackle problems, so be confident and show your passion for AI and performance optimisation!

We think you need these skills to ace the AI Inference Engineer (Member of Technical Staff) role

GPU Programming
CUDA
Triton
CUTLASS
Deep Systems Programming
Modern LLM Architectures
Production Distributed Systems

Some tips for your application 🫡

Show Off Your Skills: Make sure to highlight your deep experience with GPU programming and any performance work you've done. We want to see your expertise in CUDA, Triton, or similar technologies, so don’t hold back!

Tailor Your Application: When applying, customise your CV and cover letter to reflect the job description. Mention your experience with modern LLM architectures and how you’ve successfully operated production distributed systems under real load.

Be Yourself: We love self-directed individuals! Share examples of how you've tackled challenges in fast-moving environments. Let us know how you own problems end-to-end, from reading research papers to debugging production incidents.

Apply Through Our Website: Don’t forget to apply through our website! It’s the best way for us to receive your application and get to know you better. We can’t wait to see what you bring to the table!

How to prepare for a job interview at Deepstreamtech

Know Your Tech Inside Out

Make sure you’re well-versed in GPU programming and performance work. Brush up on CUDA, Triton, and any other relevant technologies. Be ready to discuss your experience with modern LLM architectures and how you've successfully deployed them in production.

Showcase Your Problem-Solving Skills

Prepare to share specific examples of how you've owned problems end-to-end. Whether it’s reading a research paper, writing a kernel, or debugging a production incident, have stories ready that highlight your self-directed approach and adaptability in fast-paced environments.

Demonstrate Your Cross-Language Proficiency

Since the role involves working across languages like Rust and Python, be prepared to discuss your experience with each. Have concrete examples ready of how you’ve used these languages in real-world applications, especially in high-performance systems.

Familiarise Yourself with the Stack

Get to know the tools and frameworks mentioned in the job description, such as Kubernetes for container orchestration and profiling tools like Nsight Compute. Being able to talk about your familiarity with these technologies will show that you're ready to hit the ground running.