AI Platform Support Engineer (EMEA) in London

London · Full-Time · £80,000 - £100,000 / year (est.) · Home office (partial)
Lightning AI

At a Glance

  • Tasks: Support ML engineers with large-scale AI workloads and troubleshoot complex systems.
  • Company: Join Lightning AI, the creators of PyTorch Lightning, in a dynamic tech environment.
  • Benefits: Competitive salary, equity options, comprehensive health coverage, and flexible work arrangements.
  • Other info: Hybrid role in London with excellent career growth and a focus on diversity.
  • Why this job: Be a technical partner in shaping the future of AI technology and infrastructure.
  • Qualifications: Strong software engineering skills and experience with Kubernetes and ML systems.

The predicted salary is between £80,000 and £100,000 per year.

Who We Are

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems—designed to take ideas from research to production with less friction. Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in. We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

What We’re Looking For

Lightning AI is hiring AI Platform Support Engineers to join our EMEA Customer Experience team, supporting ML engineers running large-scale training and inference workloads across cloud infrastructure, Kubernetes, and GPU platforms in production environments. This role is not a ticket router or traditional support engineer. You are a technical partner to ML teams, helping diagnose failures, improve reliability, and guide customers through complex distributed systems problems. The problems range from Kubernetes scheduling and GPU orchestration to distributed PyTorch failures, inference latency, networking bottlenecks, storage performance, and platform reliability. You’ll gain exposure to a wide variety of real-world AI workloads across industries and help shape the infrastructure powering the next generation of ML applications.

We are currently hiring for two EMEA shifts (9AM–7PM CET/CEST):

  • Sunday–Wednesday
  • Saturday–Tuesday OR Thursday–Sunday

This role is hybrid out of our London office, with an in-office requirement of at least 2 days per week and occasional team and company offsites. We are not able to provide visa sponsorship for this role at this time.

What You'll Do

  • Work Directly With ML Engineers
    • Partner directly with customer engineering teams running training and inference workloads in production
    • Help customers diagnose and resolve complex distributed systems and ML infrastructure issues
    • Act as a technical advisor during high-impact incidents and platform degradation events
    • Translate infrastructure-level issues into actionable guidance for ML engineers
    • Build credibility with customers through strong technical reasoning and clear communication
  • Debug ML Infrastructure & Distributed Workloads
    • Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems
    • Troubleshoot PyTorch, CUDA, NCCL, and inference-serving issues
    • Analyze logs, metrics, traces, and system behavior to isolate root causes
    • Debug containerized workloads running across Kubernetes and bare-metal GPU environments
    • Support customers scaling workloads across multi-node GPU systems
    • Diagnose performance bottlenecks involving compute, memory, networking, or storage
  • Improve Reliability & Platform Operations
    • Identify recurring patterns across customer issues and drive long-term reliability improvements
    • Contribute to post-incident reviews and operational improvements
    • Build internal tooling, automation, documentation, and runbooks
    • Partner closely with infrastructure, networking, and platform engineering teams
    • Help improve observability, operational visibility, and troubleshooting workflows
    • Improve the customer experience through better processes and technical guidance

What This Role Is Not

To set clear expectations:

  • This is not a traditional help desk or ticket routing support role
  • This is not purely customer success or account management
  • This is not a backend engineering role
  • This is not a passive escalation position

This role is for engineers who enjoy solving difficult technical problems while working closely with other engineers.

What You’ll Need

  • Required Qualifications
    • Infrastructure & Systems
      • Strong software engineering and systems troubleshooting background
      • Experience with Kubernetes and containerized environments
      • Linux systems knowledge, including networking, storage, process management, and performance tuning
      • Experience with cloud infrastructure and distributed systems
      • Experience with observability and debugging tools such as Prometheus, Grafana, or OpenTelemetry
    • ML Infrastructure Experience
      • Hands-on experience operating machine learning workloads in production or research environments
      • Experience with distributed ML systems and tooling such as PyTorch, CUDA, or NCCL
      • Familiarity with GPU infrastructure and orchestration
      • Experience troubleshooting performance, reliability, or scaling issues in ML infrastructure
      • Understanding of the operational challenges involved in running ML systems at scale
    • Collaboration
      • Strong communication skills and ability to work directly with highly technical customers and engineering teams
      • Comfortable operating in fast-moving, highly ambiguous environments
      • Enjoys solving complex technical problems collaboratively
  • Nice-to-Haves
    • Experience with large-scale model training or distributed inference systems
    • Familiarity with Ray, Kubeflow, Slurm, or similar distributed scheduling platforms
    • Experience with InfiniBand, RDMA, or high-performance networking
    • Experience operating bare-metal infrastructure
    • Familiarity with storage systems commonly used in ML environments
    • Experience working at an AI infrastructure, cloud, MLOps, or developer tooling company
    • Contributions to platform engineering, developer infrastructure, or operational tooling projects
    • Experience writing automation, tooling, or scripts in Python or similar languages

Compensation

We are committed to offering competitive compensation that reflects the value each team member brings to our mission. Final offers are based on factors such as experience, skills, geographic location, and role expectations. In addition to base salary, our total rewards package for eligible roles includes a discretionary bonus, a meaningful equity component, and comprehensive benefits. The anticipated annual base salary range for this role is £75,000 - £95,000.

Benefits and Perks

We offer a comprehensive and competitive benefits package designed to support our employees’ health, well-being, and long-term success. Benefits may vary by location, team, and role. Benefits include:

  • Comprehensive medical, dental and vision coverage (U.S.); Private medical and dental insurance (U.K.)
  • Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
  • Generous paid time off, plus holidays
  • Paid parental leave
  • Professional development support
  • Wellness and work-from-home stipends
  • Flexible work environment

At Lightning AI, we are committed to fostering an inclusive and diverse workplace. We believe that diverse teams drive innovation and create better products. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic. We are dedicated to building a culture where everyone can thrive and contribute to their fullest potential.

AI Platform Support Engineer (EMEA) in London employer: Lightning AI

Lightning AI is an exceptional employer that prioritises employee growth and well-being, offering a hybrid work environment in the vibrant city of London. With a strong focus on collaboration and innovation, employees have the opportunity to work closely with cutting-edge AI technologies while enjoying comprehensive benefits, including private medical insurance and generous paid time off. The company fosters a diverse and inclusive culture, ensuring that every team member can thrive and contribute meaningfully to the future of machine learning.

Contact Details: Lightning AI Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land the AI Platform Support Engineer (EMEA) role in London

Join Local Tech Meetups

Get out there and mingle with fellow developers by joining local tech meetups. It’s a fantastic way to meet people who might be working at Lightning AI or know someone who does. Plus, you can pick up new skills and keep up with current trends while you're at it!

Contribute to Open Source Projects

Show off your coding chops by jumping into open-source projects. Not only does this give you practical experience, but it also gets you noticed in the dev community. You'll create a killer portfolio that speaks volumes about your skills to Lightning AI.

Tap into Online Developer Communities

Don’t underestimate the power of online developer communities like GitHub, Stack Overflow, and even Reddit. Participate in discussions, share your projects, and build your visibility. You can often find opportunities through these channels that lead to a full-time role at companies like Lightning AI.

Explore Job Boards Specifically for Tech Roles

Keep your eyes peeled on job boards that focus on tech roles. Sites like TechCareers often carry listings for companies like Lightning AI that might not show up on broader job sites. Make it a habit to check these regularly, and don’t hesitate to apply directly through our website!

We think you need these skills to ace the AI Platform Support Engineer (EMEA) role in London

Kubernetes
Containerized Environments
Linux Systems Knowledge
Networking
Storage Management
Performance Tuning
Cloud Infrastructure

Some tips for your application 🫡

Show off your coding skills: When applying for a software engineering role, it's super important to showcase your coding skills. Make sure your CV includes your tech stack, any relevant programming languages you’re comfortable with, and examples of projects you've worked on. If you have a GitHub profile, link it up! We love to see code in action.

Tailor your portfolio: For a full-time role, we’d expect to see some solid examples of your work in your portfolio. Make sure to include at least two or three projects that highlight your problem-solving skills and your ability to work with different technologies. Focus on the projects that are most relevant to the position at Lightning AI.

Craft a killer cover letter: Your cover letter is your chance to stand out, so make it personal! Explain why you want to work at Lightning AI and how your skills align with the role. Show us your passion for software development. We dig enthusiastic candidates who understand the value of collaboration and continuous learning!

Be clear and concise: When it comes to writing your CV and cover letter, clarity is key. Avoid jargon that could confuse us and stick to simple, direct language. Highlight your achievements with quantifiable results where possible, and keep everything easy to read. A well-organised application goes a long way!

How to prepare for a job interview at Lightning AI

Brush Up on Your Coding Skills

For a full-time software engineering role, it's crucial to keep your coding abilities sharp. Expect technical questions that might involve solving problems on the spot or discussing algorithms. Practise on platforms like LeetCode or HackerRank to get comfortable with the types of questions that often come up.

Know Your Tools and Frameworks

Make sure you’re well-acquainted with the tools and technologies listed in the job description. This posting names Kubernetes, PyTorch, CUDA, and NCCL, along with observability tools such as Prometheus, Grafana, and OpenTelemetry, so be ready to discuss how you’ve used them in previous projects or coursework.

Showcase Your Projects

Bring along a portfolio that highlights your best work. This could be code samples, GitHub repositories, or any side projects you’ve built. Make sure you can talk through your thought process for each project, especially the challenges you faced and how you solved them; this shows your problem-solving skills in action.

Prepare for Behavioural Questions

While technical skills are key, full-time positions also require cultural fit. Be ready to discuss your previous experience and how you handle teamwork, conflict, and deadlines. Brush up on the STAR method (Situation, Task, Action, Result) to clearly articulate how you've contributed to a team.