AI Platform Engineer - HPC

Full-Time | £42,000 - £84,000 / year (est.) | No home office possible

At a Glance

  • Tasks: Provide technical support for an innovative AI platform and troubleshoot issues.
  • Company: Join a pioneering tech company focused on renewable energy and AI.
  • Benefits: Competitive salary, flexible working options, and opportunities for professional growth.
  • Why this job: Be part of the future of AI and make a real difference in technology.
  • Qualifications: Experience in technical support and a passion for AI and cloud technologies.
  • Other info: Dynamic work environment with a focus on collaboration and innovation.

The predicted salary is between £42,000 and £84,000 per year.

We are building the UK’s next-generation AI platform, powered by renewable energy, rooted in sovereign capability, and designed to give enterprises and innovators the compute they need. We need a Support Engineer / Cluster Administrator to provide Level 1 and Level 2 support for the AI platform. This role is customer-facing and involves technical troubleshooting and collaboration with vendor engineering teams to ensure seamless AI platform operations.

Key Responsibilities

  • L1 support for customer-reported issues and requests
  • L2 support by diagnosing, replicating, and troubleshooting issues across platform and infrastructure
  • Escalate complex issues (L3) to product/engineering teams (including vendors), coordinate their resolution, and manage vendor responses
  • Monitor system health, alerts, and customer usage patterns
  • Document solutions, workarounds, and support procedures, and create and maintain the knowledge base
  • Automate common tasks and fixes
  • Configure and integrate tooling to support optimal operation of the platform, and support tool selection
  • Assist customers with platform configuration, onboarding, and usage best practices
  • Collaborate with platform and infrastructure support/engineering teams to resolve platform integration issues
  • Ensure SLAs and customer satisfaction targets are met
  • Work with customers and multiple stakeholders to understand requirements and challenges, and provide reporting on usage, workflow, and billing

Technical Responsibilities

  • Cluster Infrastructure management: Managing the Nvidia GPU cluster.
  • High availability and resilience: Implement failover strategies and manage maintenance events to minimise downtime.
  • Resource allocation and optimisation: Resource partitioning (GPU resources), workload scheduling, capacity planning.
  • Performance monitoring and troubleshooting: Performance analysis and real-time monitoring with available Nvidia and HPE tools.
  • Incident response: Manage node failures, network issues, and driver issues; troubleshoot common issues and work with vendor support to resolve critical ones.
  • Security and access control: Manage user permissions, RBAC, security hardening, data protection.

Required Skills & Experience

  • Extensive experience in technical support, system engineering, or platform operations.
  • Solid understanding of L1 and L2 support processes (ticketing, escalation, troubleshooting).
  • Familiarity with cloud-based platforms, APIs, and distributed systems.
  • Understanding of AI/ML concepts and tooling (basics of model training, inference, and data pipelines).
  • Experience with monitoring/logging tools (e.g., Grafana, Kibana, Splunk).
  • Excellent communication skills to interface with both customers and internal / vendor teams.
  • Good understanding of the tooling requirements of ML engineers and data scientists, and of how to optimise their experience.

Core Technical Skills

  • System administration experience with operating systems such as RHEL/CentOS and Ubuntu, including Linux kernel tuning.
  • Proficiency with Ansible, Nvidia and CUDA toolkits, Kubernetes and container orchestration.
  • Understanding of automation, monitoring, and security in a GPU-as-a-service environment.

Preferred Experience

  • Experience supporting HPE PCAI or other AI/HPC infrastructure and platforms.
  • Experience with GPU resource allocation (across instances, GPU count, and time).
  • Advanced networking skills, including high-performance networking, troubleshooting, and fine-tuning.
  • Background in DevOps or SRE practices.

Success Metrics

  • Customers receive timely, effective support with minimal escalations.
  • Issues are resolved or routed correctly with high-quality documentation.
  • The platform maintains strong uptime and customer satisfaction.

AI Platform Engineer - HPC employer: Carbon3ai Limited.

Join us in shaping the future of AI technology at our cutting-edge facility, where we prioritise a collaborative and innovative work culture. As an AI Platform Engineer - HPC, you'll benefit from extensive professional development opportunities, a commitment to sustainability through renewable energy, and the chance to work with leading experts in the field. Our supportive environment fosters growth and ensures that you play a vital role in delivering exceptional service to our customers while maintaining high operational standards.

Contact Details:

Carbon3ai Limited. Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land AI Platform Engineer - HPC

✨Tip Number 1

Network, network, network! Get out there and connect with people in the AI and HPC space. Attend meetups, webinars, or industry events. You never know who might have a lead on your dream job!

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects related to AI platforms or HPC. This gives potential employers a taste of what you can do and sets you apart from the crowd.

✨Tip Number 3

Prepare for interviews by brushing up on common technical questions related to L1 and L2 support processes. Practice troubleshooting scenarios and be ready to discuss how you’d handle customer issues effectively.

✨Tip Number 4

Don’t forget to apply through our website! We’re always looking for passionate individuals to join our team. Make sure your application stands out by tailoring it to the specific role and highlighting your relevant experience.

We think you need these skills to ace AI Platform Engineer - HPC

Technical Support
System Engineering
Platform Operations
L1 and L2 Support Processes
Cloud-Based Platforms
APIs
Distributed Systems
AI/ML Concepts
Monitoring Tools (e.g., Grafana, Kibana, Splunk)
Communication Skills
System Administration (RHEL/CentOS, Ubuntu)
Ansible
Nvidia and CUDA Toolkits
Kubernetes
Automation and Security with GPU as a Service

Some tips for your application 🫡

Tailor Your CV: Make sure your CV reflects the skills and experience mentioned in the job description. Highlight your technical support experience, especially with L1 and L2 processes, and any familiarity with AI/ML concepts. We want to see how you fit into our vision!

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about AI platforms and how your background aligns with our needs. Don’t forget to mention any experience you have with tools like Ansible or Kubernetes!

Showcase Problem-Solving Skills: In your application, give examples of how you've tackled technical issues in the past. We love seeing candidates who can diagnose and troubleshoot effectively, so share those success stories that demonstrate your skills!

Apply Through Our Website: We encourage you to apply directly through our website for the best chance of getting noticed. It’s the easiest way for us to keep track of your application and ensure it reaches the right team!

How to prepare for a job interview at Carbon3ai Limited.

✨Know Your Tech Inside Out

Make sure you brush up on your technical skills, especially around L1 and L2 support processes. Familiarise yourself with Nvidia GPU cluster management and with tools like Ansible and Kubernetes. Being able to discuss these confidently will show that you're ready to tackle the role head-on.

✨Show Off Your Troubleshooting Skills

Prepare to share specific examples of how you've diagnosed and resolved technical issues in the past. Think about times when you had to collaborate with vendor teams or manage complex incidents. This will demonstrate your problem-solving abilities and your experience in a customer-facing role.

✨Understand the Customer Perspective

Since this role is customer-facing, it's crucial to convey your understanding of customer needs and how to meet them. Be ready to discuss how you would handle customer queries and ensure their satisfaction while maintaining SLAs. This shows that you value the customer experience as much as the technical side.

✨Prepare Questions for Them

Interviews are a two-way street! Prepare insightful questions about the AI platform, the team dynamics, and the challenges they face. This not only shows your interest in the role but also helps you gauge if the company is the right fit for you.
