AI Inference Engineer | GPU-Scale Rust/Python | Equity

Full-Time | 70,000 - 90,000 € / year (est.) | No home office possible
Perplexity

At a Glance

  • Tasks: Join us to develop and optimise AI inference engines using Rust, Python, and CUDA.
  • Company: Dynamic tech company focused on cutting-edge AI solutions.
  • Benefits: Competitive salary, equity options, and opportunities for professional growth.
  • Other info: Fast-paced environment with exciting challenges and career advancement.
  • Why this job: Make a real impact in AI while working with the latest technologies.
  • Qualifications: 3+ years in software engineering with GPU programming experience.

The predicted salary is between 70,000 and 90,000 € per year.

We are looking for an AI Inference Engineer to join our growing team. We build and run the inference engine behind every Perplexity query and deploy dozens of model architectures at scale with tight latency and cost budgets. Our stack is Rust, Python, CUDA, and CuTe DSL.

Responsibilities

  • New model support. Support transformer-based retrieval, text-generation, and multimodal models in our inference infrastructure, from weight loading, request scheduling, and KV-cache management to support in the API gateway.
  • GPU kernel migration to CuTe DSL. Port our in-house CUDA kernels to NVIDIA's CuTe DSL so they run on GB200 today and are portable to Vera Rubin racks tomorrow.
  • Rust-native serving runtime. Develop our internal Rust-based inference server to solve all Python pains and keep up with rapidly growing traffic.
  • Performance optimisation. Profile and fix bottlenecks from network ingress through continuous batching and GPU kernel interleaving.
  • Reliability and observability. Build dashboards, alerts, and automated remediation so we catch regressions before users do. Respond to and learn from production incidents.

Who We’re Looking For

  • Deep experience with GPU programming and performance work (CUDA, Triton, CUTLASS, or similar). Any other deep systems programming experience is a plus.
  • You understand modern LLM architectures and are able to bring them up reliably in a production environment.
  • You’ve built and operated production distributed systems under real load - ideally performance-critical ones.
  • Comfortable working across languages and layers: Rust for the serving runtime, Python for model code, CUDA/CuTe DSL for kernels.
  • You own problems end-to-end. You can read a research paper on Monday, write a kernel on Wednesday, and debug a production incident on Friday.
  • Self-directed. You do well in fast-moving environments where the path forward isn’t laid out for you.

Nice-to-have

  • ML compilers and framework internals: PyTorch internals, torch.compile, custom operators.
  • Distributed GPU communication: NCCL, NVLink, InfiniBand, RDMA libraries, model/tensor parallelism.
  • Low-precision inference: INT8/FP8/FP4 quantization, mixed-precision serving.
  • Profiling and debugging tools: Nsight Compute/Systems, CUDA-GDB, PTX/SASS analysis.
  • Container orchestration: Kubernetes, GPU scheduling, autoscaling inference workloads.

Qualifications

  • 3+ years of professional software engineering experience with meaningful work on ML inference or high-performance systems.
  • Familiarity with at least one deep learning framework (PyTorch, JAX, TensorFlow).
  • Understanding of GPU architectures (memory hierarchy, warp scheduling, tensor cores).
  • Understanding of common LLM architectures and inference optimization techniques (e.g. quantization, speculative decoding, prefill-decode disaggregation).

Final offer amounts are determined by multiple factors including experience and expertise. Equity: In addition to the base salary, equity may be part of the total compensation package.

AI Inference Engineer | GPU-Scale Rust/Python | Equity employer: Perplexity

Join our innovative team as an AI Inference Engineer, where you'll be at the forefront of cutting-edge technology in a dynamic and collaborative work environment. We offer competitive equity options, a culture that fosters continuous learning and growth, and the opportunity to work with advanced GPU programming and modern LLM architectures. Located in a vibrant tech hub, we provide a unique chance to contribute to impactful projects while enjoying a supportive atmosphere that values creativity and problem-solving.

Contact Details:

Perplexity Recruiting Team

StudySmarter Expert Advice🤫

We think this is how you could land the AI Inference Engineer | GPU-Scale Rust/Python | Equity role

Tip Number 1

Network, network, network! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have a lead on your dream job!

Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to GPU programming or AI inference. This gives potential employers a taste of what you can do.

Tip Number 3

Prepare for technical interviews by brushing up on your coding skills and understanding modern LLM architectures. Practice common algorithms and system design questions to impress during the interview.

Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are proactive about their job search.

We think you need these skills to ace the AI Inference Engineer | GPU-Scale Rust/Python | Equity role

GPU Programming
CUDA
Rust
Python
CuTe DSL
Performance Optimisation
Distributed Systems

Some tips for your application 🫡

Tailor Your CV: Make sure your CV highlights your experience with GPU programming and performance work. We want to see how your skills align with our tech stack, so don’t be shy about showcasing your Rust, Python, and CUDA expertise!

Craft a Compelling Cover Letter: Your cover letter is your chance to tell us why you’re the perfect fit for the AI Inference Engineer role. Share specific examples of your past projects, especially those involving modern LLM architectures or production distributed systems.

Show Off Your Problem-Solving Skills: We love candidates who can own problems end-to-end. In your application, mention instances where you’ve tackled complex issues, whether it’s debugging a production incident or optimising performance in a high-load environment.

Apply Through Our Website: Don’t forget to submit your application through our website! It’s the best way for us to keep track of your application and ensure it gets the attention it deserves. We can’t wait to hear from you!

How to prepare for a job interview at Perplexity

Know Your Tech Stack

Make sure you’re well-versed in Rust, Python, and CUDA. Brush up on how these technologies interact, especially in the context of AI inference. Being able to discuss your experience with GPU programming and performance optimisation will show that you’re ready to hit the ground running.

Demonstrate Problem-Solving Skills

Prepare to share specific examples of how you've tackled complex problems in production environments. Whether it’s debugging a kernel or optimising a model, having a few stories ready will highlight your end-to-end ownership of projects and your ability to thrive under pressure.

Familiarise Yourself with LLMs

Since the role involves working with modern LLM architectures, make sure you can discuss their intricacies. Understand common optimisation techniques like quantisation and be ready to explain how you’ve applied them in past projects. This will show your depth of knowledge and relevance to the role.

Ask Insightful Questions

Interviews are a two-way street! Prepare thoughtful questions about the company’s approach to AI inference and their tech stack. This not only shows your interest but also helps you gauge if the company is the right fit for you. Plus, it gives you a chance to demonstrate your understanding of the field.