Lead Cluster Operations Support Engineer (m/f/d)

London, Full-Time, £48,000 - £84,000 / year (est.), no home office possible

At a Glance

  • Tasks: Lead a team providing 24x7 support for large GPU clusters during training.
  • Company: Join a cutting-edge tech company focused on machine learning and infrastructure.
  • Benefits: Enjoy flexible work hours, collaborative culture, and opportunities for professional growth.
  • Why this job: Be part of an innovative team shaping the future of AI with hands-on problem-solving.
  • Qualifications: Expertise in Kubernetes and large cluster management, plus familiarity with ML tools, is required.
  • Other info: Work across time zones with a diverse team and engage in continuous learning.

The predicted salary is between £48,000 and £84,000 per year.

This team will provide 24×7 white-glove support to people using large blocks of GPUs (6,000+ contiguous GPUs) for a short period of time (e.g. 6 weeks or 12 weeks) to perform Managed Post Training (MPT). This includes helping with preparation, providing 24×7 support during training to ensure full utilization of the GPU clusters, and off-boarding. The team spans three time zones (US, Europe and India), with hand-off protocols to enable 24×7 support. While you can be a specialist in infrastructure and cluster operations, you need to know enough about ML.

Job Responsibilities

  • You will help shape and iterate this new white glove model training support service on large GPU clusters.
  • You will work in a collaborative team with Machine Learning Engineers and Infrastructure Engineers.
  • You will contribute to accelerator development: find gaps in the tooling, missing automation, or recurring patterns for which we could develop accelerators that make the next round of this service faster and more efficient. For example, we may need to improve observability, automate user onboarding, or bring in a new tool that everyone wants to use. This will probably involve a combination of Terraform/Pulumi, Helm Charts, Python and shell scripts.
  • You will help assess the model training readiness and data preparation.
  • You will provide model training support on rotating daytime weekend shifts, with pagers, for any issues users may encounter. These can range from infrastructure issues to data science issues or anything in between, e.g. GCP changed a configuration in GKE that affects the training.
  • You will facilitate collaborative problem solving within the team by actively listening, communicating effectively and mentoring other engineers.
  • You will proactively identify and address challenges related to the white glove service for continued pre-training, proposing solutions and implementing improvements.

Job Qualifications

Technical Skills

  • Deep expertise in Kubernetes administration and debugging at scale.
  • Deep knowledge of managing large clusters with 1000s of nodes with K8s.
  • Knowledge of running training workloads on 1000s of GPUs.
  • Knowledge of working with the Lustre filesystem is a plus.
  • Knowledge of working with NVIDIA NeMo Framework (Docker image for model training).
  • Knowledge of working with NVIDIA NeMo NIMs (Docker images for inference).
  • Terraform / Pulumi, Helm Charts, Linux, other Infrastructure-as-code tools.
  • Nice to have: Run:ai, TrueFoundry, Hugging Face platform, etc. (training can be provided).
  • Knowledge of working with HPC technologies such as Slurm is a bonus.

Professional Skills

  • You will be part of a high-value, client-facing white glove service, where a high level of professionalism is required.
  • You understand the importance of stakeholder management and can easily liaise between clients and other key stakeholders throughout projects, ensuring buy-in and gaining trust along the way.
  • You are resilient in ambiguous situations and can adapt your role to approach challenges from multiple perspectives.
  • You don’t shy away from risks or conflicts; instead, you take them on and skillfully manage them.
  • You are eager to coach, mentor and motivate others and you aspire to influence teammates to take positive action and accountability for their work.
  • You enjoy influencing others and always advocate for technical excellence while being open to change when needed.
  • You have an insatiable curiosity and a drive to learn new things.

Lead Cluster Operations Support Engineer (m/f/d) employer: Thoughtworks Inc.

At our company, we pride ourselves on fostering a dynamic and inclusive work culture that empowers our employees to thrive. As a Lead Cluster Operations Support Engineer, you will benefit from continuous learning opportunities, collaborative teamwork with top-tier Machine Learning and Infrastructure Engineers, and the chance to shape innovative support services for cutting-edge GPU clusters. Located in a vibrant tech hub, we offer competitive benefits, flexible work arrangements, and a commitment to professional growth, making us an exceptional employer for those seeking meaningful and rewarding careers.

Contact Details:

Thoughtworks Inc. Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land Lead Cluster Operations Support Engineer (m/f/d)

✨Tip Number 1

Make sure to showcase your deep expertise in Kubernetes administration and debugging. Highlight any specific experiences you have managing large clusters, especially with thousands of nodes, as this is crucial for the role.

✨Tip Number 2

Demonstrate your knowledge of running training workloads on GPUs. If you have experience with the NVIDIA NeMo Framework or the Lustre filesystem, be sure to mention it, as these are valuable assets for the position.

✨Tip Number 3

Emphasize your ability to work collaboratively in a team environment. Share examples of how you've facilitated problem-solving and mentored others, as this aligns with the collaborative nature of the role.

✨Tip Number 4

Show your eagerness to learn and adapt by discussing any new technologies or tools you've recently explored. This will demonstrate your insatiable curiosity and drive to stay updated in the fast-evolving tech landscape.

We think you need these skills to ace Lead Cluster Operations Support Engineer (m/f/d)

Kubernetes Administration
Debugging at Scale
Cluster Management
GPU Workload Optimization
Lustre Filesystem Knowledge
NVIDIA NeMo Framework
NVIDIA NeMo NIMs
Terraform
Pulumi
Helm Charts
Linux
Infrastructure-as-Code Tools
HPC Technologies (e.g., Slurm)
Stakeholder Management
Professionalism in Client Interactions
Problem-Solving Skills
Adaptability
Coaching and Mentoring
Influencing Skills
Curiosity and Drive to Learn

Some tips for your application 🫡

Understand the Role: Make sure to thoroughly read the job description for the Lead Cluster Operations Support Engineer position. Understand the technical and professional skills required, as well as the responsibilities involved in providing 24x7 support for GPU clusters.

Highlight Relevant Experience: In your application, emphasize your experience with Kubernetes administration, managing large clusters, and any relevant tools like Terraform or Helm Charts. Be specific about your past roles and how they relate to the responsibilities of this position.

Showcase Problem-Solving Skills: Demonstrate your ability to handle challenges and propose solutions in your application. Provide examples of how you've successfully navigated ambiguous situations or conflicts in previous roles, especially in a client-facing environment.

Tailor Your Application: Customize your CV and cover letter to reflect the language and requirements mentioned in the job description. Use keywords related to the role, such as 'white glove service', 'collaborative problem solving', and 'stakeholder management' to make your application stand out.

How to prepare for a job interview at Thoughtworks Inc.

✨Show Your Technical Expertise

Be prepared to discuss your deep expertise in Kubernetes administration and managing large clusters. Highlight specific experiences where you've successfully debugged issues at scale, especially with GPU workloads.

✨Demonstrate Collaborative Problem Solving

Since the role involves working closely with Machine Learning Engineers and Infrastructure Engineers, share examples of how you've facilitated collaborative problem solving in past projects. Emphasize your active listening skills and effective communication.

✨Highlight Your Professionalism

This position requires a high level of professionalism in client-facing situations. Be ready to discuss how you've managed stakeholder relationships and gained trust in previous roles, especially in ambiguous or challenging scenarios.

✨Express Your Curiosity and Willingness to Learn

Convey your insatiable curiosity and drive to learn new technologies. Mention any relevant tools or frameworks you're eager to explore, such as NVIDIA NeMo or Terraform, and how you stay updated with industry trends.

Application deadline: 2027-03-20
