Senior GPU AI Compute Orchestration Engineer

Full-time · £80,000–£100,000 / year (est.) · Remote (EMEA)
Submer

At a Glance

  • Tasks: Design and operate a cutting-edge GPU-native cloud platform for AI workloads.
  • Company: Join Radian Arc, part of InferX, a leader in AI cloud infrastructure.
  • Benefits: Enjoy competitive pay, flexible remote work, and a vibrant team culture.
  • Why this job: Make a real impact in the AI and gaming industry with innovative technology.
  • Qualifications: Experience in distributed systems, CloudStack, Kubernetes, and strong coding skills.
  • Other info: Be part of a diverse team with excellent career growth opportunities.

The predicted salary is between £80,000 and £100,000 per year.

Location & work modality: EMEA (remote)

Start: ASAP

Type of Contract: Permanent, full-time

About Radian Arc

Radian Arc, now part of InferX, Submer's AI cloud and GPU infrastructure platform, provides an infrastructure-as-a-service (IaaS) platform for running cloud gaming, artificial intelligence and machine learning applications inside telecommunication carrier networks. Our teams across the USA, Australia, Central Europe, Malaysia, Singapore and Japan offer telecom operators a GPU-based edge computing platform without the need for capital expenditure, facilitating low latency and improved economics for value-added services and the monetization of 5G investments.

What impact you will have

Mission: Design, build, and operate the compute orchestration layer powering a GPU-native cloud platform for AI and high-performance workloads (CloudStack, Kubernetes, Slurm, Argo). The platform orchestrates GPU clusters supporting large-scale AI training and inference workloads across distributed compute infrastructure. This role bridges the current production platform, based on CloudStack, with the next-generation orchestration architecture built around Kubernetes, modern batch scheduling frameworks, and workflow orchestration systems.

You will be responsible for maintaining and evolving the existing CloudStack-based deployments while actively contributing to the design and implementation of the next-generation compute platform supporting distributed AI workloads. The role combines deep hands-on engineering with ownership of critical orchestration components, including Kubernetes-based compute orchestration, Slurm-based distributed training and batch scheduling, and workflow automation through Argo. Working closely with networking, storage, and platform engineers, you will help implement the platform primitives that expose GPU infrastructure as a scalable, multi-tenant compute platform.

What you’ll do

  • CloudStack Platform Maintenance
    • Maintain the existing CloudStack code base used in current production deployments.
    • Integrate new upstream CloudStack releases into the internal platform fork.
    • Perform upgrades of existing customer environments to newer CloudStack versions.
    • Design and execute safe upgrade paths for running production environments.
    • Troubleshoot orchestration and provisioning issues in existing deployments.
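To give a flavour of the upgrade-path work above, here is a minimal illustrative sketch (not Radian Arc's actual tooling; the version numbers and hop table are hypothetical) of computing a safe, stepwise upgrade path where only tested version-to-version hops are allowed:

```python
# Illustrative sketch: compute a stepwise upgrade path where each hop must be
# a tested, supported upgrade. Versions and the hop table are hypothetical.
def upgrade_path(current, target, supported_hops):
    """supported_hops maps a version to the next version with a tested upgrade."""
    path = [current]
    while path[-1] != target:
        nxt = supported_hops.get(path[-1])
        if nxt is None:
            raise ValueError(f"no tested upgrade hop from {path[-1]}")
        path.append(nxt)
    return path

hops = {"4.17": "4.18", "4.18": "4.19"}
print(upgrade_path("4.17", "4.19", hops))  # ['4.17', '4.18', '4.19']
```

The point of the sketch: production environments are never jumped straight to the target release; they move through a chain of validated intermediate upgrades.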
  • CloudStack Networking & VPC Infrastructure
    • Maintain and troubleshoot CloudStack VPC networking.
    • Work with and understand CloudStack Debian VPC routers.
    • Manage networking implementations based on:
      • Open vSwitch (OVS)
      • OVN
    • Improve reliability of network orchestration components.
  • Hypervisors & GPU Passthrough
    • Manage hypervisor implementations based on:
      • KVM
      • QEMU
    • Maintain and evolve the code responsible for QEMU GPU passthrough, including PCI mapping and exposure of L40S, RTX 6000 Pro, and H200 GPUs to virtual machines.
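As a rough illustration of what PCI mapping for GPU passthrough involves (a sketch, not the platform's code; the PCI address is made up), this renders the libvirt `<hostdev>` stanza that hands a host GPU to a guest via VFIO:

```python
# Illustrative sketch: render a libvirt <hostdev> stanza that passes a host
# PCI device (e.g. a GPU) through to a VM via VFIO. Address is hypothetical.
def hostdev_xml(pci_addr):
    """pci_addr like '0000:3b:00.0' (domain:bus:slot.function)."""
    dom_bus, slot_fn = pci_addr.rsplit(":", 1)
    domain, bus = dom_bus.split(":")
    slot, function = slot_fn.split(".")
    return (
        "<hostdev mode='subsystem' type='pci' managed='yes'>\n"
        "  <source>\n"
        f"    <address domain='0x{domain}' bus='0x{bus}' "
        f"slot='0x{slot}' function='0x{function}'/>\n"
        "  </source>\n"
        "</hostdev>"
    )

print(hostdev_xml("0000:3b:00.0"))
```

The orchestration layer's job is to discover these addresses on each host, keep the host-to-VM mapping consistent, and regenerate stanzas like this as VMs are created and destroyed.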
  • Next-Generation Compute Orchestration
    • Design orchestration and scheduling primitives for the next-generation platform based on:
      • Kubernetes
      • Slurm
      • Argo Workflows
    • Build orchestration workflows that expose GPU and CPU compute resources to platform users.
    • Integrate compute orchestration with storage and networking services.
    • Work closely with networking, storage, and platform software engineers to integrate platform primitives.
  • Kubernetes GPU Scheduling & Cluster Orchestration
    • Design and implement Kubernetes-based GPU/CPU scheduling infrastructure for multi-tenant AI workloads.
    • Configure and maintain GPU device plugins and resource allocation mechanisms.
    • Implement GPU scheduling strategies, including:
      • GPU partitioning, such as MIG where supported
      • Multi-GPU job placement
      • Topology-aware scheduling for distributed training and inference
    • Design node lifecycle automation for GPU clusters, including:
      • Node provisioning
      • Node draining
      • Workload migration
    • Implement Kubernetes scheduling extensions where necessary, such as custom schedulers or batch schedulers.
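For candidates less familiar with the area, one common placement heuristic behind "multi-GPU job placement" can be sketched in a few lines (illustrative only; node names and the best-fit policy are assumptions, not the platform's scheduler):

```python
# Illustrative best-fit placement sketch: among nodes with enough free GPUs,
# prefer the one whose free count most tightly fits the request, so large
# jobs are less likely to be blocked by GPU fragmentation.
def place_gpu_job(free_gpus, requested):
    """free_gpus: node name -> free GPU count. Returns best-fit node or None."""
    candidates = [(free, node) for node, free in free_gpus.items() if free >= requested]
    if not candidates:
        return None  # no single node fits; needs multi-node placement
    return min(candidates)[1]  # tightest fit wins

nodes = {"node-a": 8, "node-b": 4, "node-c": 2}
print(place_gpu_job(nodes, 4))   # node-b: exact fit, keeps node-a whole
print(place_gpu_job(nodes, 16))  # None: must span nodes
```

A real scheduler extension layers topology (NVLink domains, NUMA, network locality) on top of simple counting like this.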
  • Slurm Integration & HPC Scheduling
    • Design and operate Slurm-based HPC scheduling environments integrated with Kubernetes clusters.
    • Implement Slurm compute partitions mapped to Kubernetes-managed GPU/CPU nodes.
    • Develop mechanisms to submit distributed training, fine-tuning, or batch workloads from platform APIs into Slurm clusters.
    • Implement support for:
      • Multi-node distributed GPU training
      • Gang scheduling
      • GPU topology-aware scheduling
    • Build automation for:
      • Dynamic Slurm node registration
      • Elastic compute capacity
      • Node health monitoring and recovery
    • Integrate Slurm job lifecycle events with platform orchestration services.
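The essence of "gang scheduling" above can be shown in a toy admission check (a sketch under assumed node names, not Slurm's implementation): a distributed job starts only when its entire node set is free at once, rather than holding a partial allocation.

```python
# Illustrative gang-scheduling admission check: admit a distributed job only
# if every node it needs is simultaneously free; otherwise it waits.
def admit_gang(free_nodes, job_nodes):
    """free_nodes: set of free node names. job_nodes: nodes the job requires."""
    return set(job_nodes) <= free_nodes

free = {"gpu-01", "gpu-02", "gpu-03"}
print(admit_gang(free, ["gpu-01", "gpu-02"]))  # True: whole gang fits now
print(admit_gang(free, ["gpu-01", "gpu-04"]))  # False: wait, don't start half
```

This all-or-nothing property is what prevents deadlocks where two multi-node jobs each hold half of the nodes the other needs.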
  • Argo Workflow Orchestration
    • Design and implement workflow orchestration using Argo Workflows.
    • Develop reusable workflow templates for common platform workloads, including:
      • AI training pipelines
      • Data preprocessing pipelines
      • Batch inference workloads
      • Platform operational workflows
    • Implement DAG-based execution pipelines coordinating compute workloads across Kubernetes and Slurm clusters.
    • Build workflow primitives that expose platform capabilities to users, such as:
      • Distributed training workflows
      • Model evaluation pipelines
      • Batch GPU compute workflows
    • Integrate workflow execution with platform APIs and platform user interfaces.
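At its core, a DAG-based pipeline like the ones above just orders steps so each runs only after its dependencies finish. A minimal sketch with made-up step names (illustrative, not an Argo template):

```python
from graphlib import TopologicalSorter

# Illustrative DAG sketch: each step lists the steps it depends on, and a
# topological order gives a valid execution sequence. Step names are made up.
dag = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "batch-infer": {"train"},
}
order = list(TopologicalSorter(dag).static_order())
print(order)  # 'preprocess' runs first; 'train' precedes both downstream steps
```

Argo Workflows expresses the same dependency structure declaratively and additionally handles retries, artifacts, and fan-out across the cluster.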
  • Distributed AI Workload Orchestration
    • Implement orchestration support for distributed AI workloads, including:
      • Multi-node training
      • Distributed inference
      • Large-model fine-tuning workloads
    • Support execution environments such as:
      • PyTorch distributed training
      • MPI-based workloads
      • Containerized training jobs
    • Implement mechanisms to coordinate GPU workloads across nodes with low-latency networking.
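Coordinating a multi-node training launch largely means giving every worker process a shared rendezvous address and a unique rank. A toy sketch of that bookkeeping (hostnames are hypothetical; this mirrors the PyTorch-distributed convention but is not its implementation):

```python
# Illustrative sketch: build per-process environment variables for a
# multi-node, multi-GPU launch in the PyTorch-distributed style.
def rendezvous_env(hosts, gpus_per_node):
    """hosts: ordered node names. Returns one env dict per worker process."""
    world = len(hosts) * gpus_per_node
    envs = []
    for node_idx, _host in enumerate(hosts):
        for local in range(gpus_per_node):
            envs.append({
                "MASTER_ADDR": hosts[0],  # rank-0 node hosts the rendezvous
                "WORLD_SIZE": str(world),
                "RANK": str(node_idx * gpus_per_node + local),
                "LOCAL_RANK": str(local),
            })
    return envs

envs = rendezvous_env(["node-a", "node-b"], gpus_per_node=2)
print(envs[3]["RANK"])  # last worker is rank 3 of a world size of 4
```

The orchestration layer's real work is doing this reliably at scale: picking hosts, wiring the low-latency fabric, and tearing everything down if any rank fails.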
  • Platform Multi-Tenancy & Resource Isolation
    • Design and maintain mechanisms for multi-tenant GPU resource allocation.
    • Implement quota and fairness policies for compute workloads.
    • Develop resource isolation strategies across tenants, including:
      • Namespace isolation
      • Compute quotas
      • GPU allocation limits
    • Integrate compute orchestration with platform billing and metering systems.
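The quota side of multi-tenancy reduces to a simple invariant: no allocation may push a tenant past its configured limit. A minimal sketch (tenant names and limits are hypothetical; real enforcement would live in an admission controller or scheduler):

```python
# Illustrative per-tenant GPU quota sketch: admit an allocation only if it
# keeps the tenant within its configured limit. Names are hypothetical.
class GpuQuota:
    def __init__(self, limits):
        self.limits = limits                       # tenant -> GPU cap
        self.used = {t: 0 for t in limits}         # tenant -> GPUs in use

    def try_allocate(self, tenant, gpus):
        if self.used[tenant] + gpus > self.limits[tenant]:
            return False  # over quota: reject; the workload must queue
        self.used[tenant] += gpus
        return True

q = GpuQuota({"team-a": 8, "team-b": 4})
print(q.try_allocate("team-a", 6))  # True: 6 of 8 used
print(q.try_allocate("team-a", 4))  # False: would exceed team-a's 8-GPU cap
```

Fairness policies then decide which queued request gets the next freed GPU, and the same usage counters feed billing and metering.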
  • Technical Stack

    • Programming languages
      • Java, Python, Bash, and SQL for CloudStack-related work
      • Go and Python for Kubernetes-related components
    • Orchestration
      • CloudStack
      • Kubernetes
      • KubeVirt
      • Slurm/SUNK
      • Argo Workflows
      • Kubernetes CRDs and controllers
      • Batch scheduling frameworks
    • Networking
      • OVS
      • OVN
      • Linux networking
      • VPC networking
      • BlueField networking
    • Infrastructure
      • GPU infrastructure
      • Distributed compute clusters
      • High-performance networking for distributed AI workloads

    What you’ll need

    • Platform & Distributed Systems
      • Proven experience working with large-scale distributed compute environments at a neo-cloud, hyperscaler, or HPC provider.
      • Strong experience with CloudStack internals, including extending and maintaining platform functionality.
      • Experience operating cloud orchestration platforms in production environments.
      • Experience running GPU-heavy infrastructure for AI training, inference, or HPC workloads.
    • Software Engineering
      • Experience maintaining or extending large Java codebases, ideally within infrastructure platforms.
      • Strong programming skills in Go and Python, with experience building cloud-native platform components.
      • Experience designing and maintaining control-plane services for infrastructure platforms.
    • Compute Orchestration
      • Deep practical knowledge of Kubernetes internals and Slurm scheduling systems.
      • Experience building or operating compute orchestration layers for large-scale clusters.
      • Familiarity with workflow orchestration systems such as Argo Workflows.
    • Networking & Infrastructure
      • Familiarity with virtual networking and distributed networking technologies such as OVS, OVN, VPC networking, RDMA, RoCE, ECMP, EVPN/VXLAN, and leaf-spine fabrics.
      • Experience with GPU infrastructure, including virtualization and passthrough mechanisms (QEMU PCI passthrough, NVIDIA MIG), GPU scheduling, and lifecycle management of GPUs in distributed clusters.
    • Leadership & Architecture
      • Able to independently own major compute-orchestration initiatives from design through rollout and operational stabilization.
      • Comfortable solving difficult implementation and operational problems across CloudStack, Kubernetes, Slurm, and workflow orchestration.
      • Improves orchestration quality through code, automation, and practical design decisions.
      • Collaborates effectively across compute, networking, storage, and platform teams, and influences engineering practices through expertise and delivery.
      • Comfortable mentoring peers and improving implementation quality, documentation, operational workflows, and platform reliability within the compute orchestration domain.

    What we offer

    • Attractive compensation package reflecting your expertise and experience.
    • A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach.
    • You’ll be part of a fast-growing scale-up with a mission to make a positive impact, offering an exciting career evolution.
    • Our job titles may span more than one job level. The actual base pay is dependent on a number of factors, such as transferable skills, work experience, business needs and market demands.

    Our Inclusive Responsibility

    Radian Arc is committed to creating a diverse and inclusive environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, veteran status, or any other protected category under applicable law.


    Contact Detail:

    Submer Recruiting Team

    StudySmarter Expert Advice 🤫

    We think this is how you could land Senior GPU AI Compute Orchestration Engineer

    ✨Tip Number 1

    Network like a pro! Reach out to folks in the industry, attend meetups, and connect with potential colleagues on LinkedIn. You never know who might have the inside scoop on job openings or can put in a good word for you.

    ✨Tip Number 2

    Show off your skills! Create a portfolio or GitHub repository showcasing your projects related to GPU orchestration, Kubernetes, or any relevant tech. This gives you a chance to demonstrate your expertise beyond just a CV.

    ✨Tip Number 3

    Prepare for interviews by brushing up on common technical questions and scenarios related to AI workloads and cloud platforms. Practice explaining your thought process clearly; it’s all about showing how you tackle problems!

    ✨Tip Number 4

    Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are proactive about their job search!

    We think you need these skills to ace Senior GPU AI Compute Orchestration Engineer

    CloudStack
    Kubernetes
    Slurm
    Argo Workflows
    GPU Scheduling
    Distributed Compute Environments
    Java
    Python
    Go
    Networking Technologies
    GPU Virtualization
    Multi-Tenancy
    Resource Allocation
    Workflow Automation
    Problem-Solving Skills

    Some tips for your application 🫡

    Tailor Your Application: Make sure to customise your CV and cover letter for the Senior GPU AI Compute Orchestration Engineer role. Highlight your experience with CloudStack, Kubernetes, and any relevant orchestration technologies. We want to see how your skills align with our mission!

    Showcase Your Projects: If you've worked on any projects involving distributed compute environments or GPU orchestration, don’t hold back! Share specific examples that demonstrate your hands-on experience and problem-solving skills. This helps us understand your practical knowledge.

    Be Clear and Concise: When writing your application, keep it clear and to the point. Use bullet points where possible to make it easy for us to read through your qualifications and experiences. We appreciate a well-structured application!

    Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it’s super easy to do!

    How to prepare for a job interview at Submer

    ✨Know Your Tech Stack

    Make sure you’re well-versed in the technologies mentioned in the job description, like CloudStack, Kubernetes, and Slurm. Brush up on your knowledge of GPU orchestration and distributed systems, as these will likely come up during technical discussions.

    ✨Showcase Your Problem-Solving Skills

    Prepare to discuss specific challenges you've faced in previous roles, especially related to cloud orchestration or GPU infrastructure. Use the STAR method (Situation, Task, Action, Result) to structure your answers and highlight how you tackled complex issues.

    ✨Understand the Company’s Mission

    Familiarise yourself with Radian Arc's mission and how they integrate AI and GPU technology into telecom networks. Being able to articulate how your skills align with their goals will demonstrate your genuine interest in the role and the company.

    ✨Ask Insightful Questions

    Prepare thoughtful questions about the team dynamics, ongoing projects, and future developments in their platform. This not only shows your enthusiasm but also helps you gauge if the company culture and work environment are a good fit for you.
