Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London

Full-Time | £43,200 - £72,000 / year (est.) | No home office possible
Enigma

At a Glance

  • Tasks: Own and optimise Kubernetes for AI agent environments, ensuring stability and efficiency.
  • Company: Early-stage AI company focused on innovative infrastructure for reinforcement learning.
  • Benefits: Competitive salary, equity, and the chance to make a real impact.
  • Why this job: Join a pioneering team and shape the future of AI technology.
  • Qualifications: Experience with Kubernetes, Python, and distributed systems is essential.
  • Other info: Dynamic work environment with opportunities for personal and professional growth.

The predicted salary is between £43,200 and £72,000 per year.

About the company

We are an early-stage AI company building infrastructure for long-horizon reinforcement learning: agents that operate for extended periods and execute tools within high-fidelity environments. The team has deep experience in large-scale AI systems and open-source ML, and the company is well funded by experienced operators and technical leaders in the field. We build environment infrastructure to train and evaluate agents on frontier tasks such as automated research and scientific discovery. Our customers include leading AI research organisations and fast-growing, AI-native startups.

Technical stack

  • Managed Kubernetes (cloud-based)
  • Redis
  • Distributed compute frameworks (e.g. Ray)
  • Observability stack (OpenTelemetry-style)
  • 50+ containerised evaluation environments

What you’ll do

  • Own the Kubernetes runtime for agent environments
      ◦ Own scheduling, lifecycle management, stability, and operations for long-running, failure-prone workloads
      ◦ Operate and evolve a production Kubernetes platform supporting multi-hour or multi-day agent runs
  • Improve environment infrastructure for long-horizon training and evaluation
      ◦ Maintain a large suite of containerised evaluation environments (ML benchmarks, code execution, scientific tasks) with fast cold-start times
      ◦ Optimise GPU utilisation and scheduling for distributed workloads
      ◦ Design storage patterns for large datasets, model checkpoints, and episodic session state
      ◦ Improve environment bootstrap times and resource efficiency through image layering and caching strategies
  • Make observability excellent
      ◦ Implement metrics, logs, and traces that enable fast root-cause analysis
      ◦ Build dashboards and alerting tied to SLOs (e.g. rollout success rate, environment health, tool latency, queue time)
      ◦ Create debugging playbooks for common failure modes such as OOMs, memory leaks, performance regressions, and network or storage issues
  • Reliability engineering
      ◦ Design retry and backoff strategies for long-running agent sessions that may fail mid-execution
      ◦ Implement session recovery mechanisms such as checkpointing and idempotent operations
      ◦ Build graceful degradation paths for node failures, OOMs, and GPU errors without losing progress
      ◦ Create runbooks for common failure modes (e.g. sidecar health timeouts, stream lag, pod eviction cascades)
      ◦ Develop chaos-testing strategies for multi-hour runs (network partitions, node drains, API rate limits)
      ◦ Define and track SLOs for session creation latency, environment availability, and tool execution success rates
  • Security and sandboxing for tool-using agents
      ◦ Harden container isolation for untrusted code execution (e.g. sandboxed runtimes or microVM-based approaches)
      ◦ Implement network policies to restrict outbound access from evaluation environments
      ◦ Design secrets management for API keys used by agent tools, including rotation and least-privilege access
      ◦ Build audit logging for tool invocations and filesystem access
      ◦ Implement rate limiting and circuit breakers for external API calls made by agents
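
As an illustration of the circuit-breaker responsibility listed above, here is a minimal Python sketch. It is not taken from the company's codebase: the class name `CircuitBreaker`, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions chosen for the example. The breaker opens after a run of consecutive failures, rejects calls while open, and lets a single trial call through once the timeout has elapsed.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker guarding an external API call made by an agent tool."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self._clock = clock                         # injectable so tests need not sleep
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open state, allow this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call re-opens immediately; otherwise open at the threshold.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A production version would typically also need thread safety and per-endpoint state, which this sketch leaves out.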

Must-have experience

  • Kubernetes: resource requests and limits, affinity/taints, priorities, autoscaling, and preemption
  • Debugging networking, DNS, storage performance, and node health issues
  • Strong distributed‑systems fundamentals: idempotency, retries, failure domains, and incident response
  • Practical observability experience with metrics, structured logging, and tracing
  • Ability to build internal tools in Python and/or Go
  • Infrastructure‑as‑code and automation experience (Helm, scripting, GitOps‑style workflows; Terraform a plus)
  • Experience using Redis for high‑throughput, session‑oriented workloads

Nice-to-have experience

  • Experience with machine learning systems or language models
  • Expertise in a specific infrastructure domain
      ◦ ML or reinforcement learning training infrastructure (checkpointing, distributed training, GPU scheduling)
      ◦ Sandboxing technologies for untrusted code execution
      ◦ Deep expertise in container runtimes, Linux performance tuning, or networking

Compensation

Competitive salary and meaningful equity. Early‑team impact with direct ownership and high leverage.

Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London employer: Enigma

As an early-stage AI company based in London, we offer a dynamic work environment where innovation thrives and your contributions directly impact the future of reinforcement learning. Our culture fosters collaboration and growth, providing employees with opportunities to work alongside industry leaders while developing cutting-edge infrastructure for AI applications. With competitive salaries, meaningful equity, and a focus on employee development, we are committed to creating a rewarding workplace that empowers you to excel in your career.

Contact Detail:

Enigma Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London

✨Tip Number 1

Network like a pro! Attend meetups, conferences, or even local tech events in London. Chatting with folks in the industry can lead to opportunities that aren’t even advertised yet.

✨Tip Number 2

Show off your skills! Create a GitHub repo showcasing your projects, especially those involving Kubernetes, Docker, or Python. This gives potential employers a taste of what you can do and sets you apart from the crowd.

✨Tip Number 3

Prepare for technical interviews by practising common questions related to distributed systems and observability. Use platforms like LeetCode or HackerRank to sharpen your coding skills and get comfortable with problem-solving on the spot.

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are genuinely interested in joining our team!

We think you need these skills to ace Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London

Kubernetes
Docker
Terraform
Python
GPU Utilisation
Distributed Systems Fundamentals
Observability (Metrics, Logging, Tracing)
Resource Management (Requests, Limits, Autoscaling)
Debugging Networking and Storage Performance
Infrastructure-as-Code
Redis
Session Recovery Mechanisms
Sandboxing Technologies
Chaos Testing Strategies
Security and Secrets Management

Some tips for your application 🫡

Tailor Your CV: Make sure your CV reflects the skills and experiences that match our job description. Highlight your experience with Kubernetes, Docker, and Python, as these are key for us. A tailored CV shows us you’re genuinely interested in the role!

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're excited about working with us and how your background fits into our mission. Be specific about your achievements and how they relate to the role.

Showcase Your Projects: If you've worked on relevant projects, don’t hesitate to mention them! Whether it's a personal project or something from your previous job, we love seeing practical examples of your skills in action, especially around infrastructure and AI.

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it makes the process smoother for both of us!

How to prepare for a job interview at Enigma

✨Know Your Tech Stack

Make sure you’re well-versed in Kubernetes, Docker, and Terraform. Brush up on your Python skills too, as you'll likely be asked to demonstrate your understanding of these technologies. Familiarise yourself with how they interact within the context of AI infrastructure.

✨Showcase Problem-Solving Skills

Prepare to discuss specific challenges you've faced in previous roles, especially around reliability engineering and debugging. Think about examples where you implemented retry strategies or improved observability. This will show that you can handle the complexities of long-running workloads.
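
If you want something concrete to talk through, here is a minimal Python sketch of one common retry strategy: exponential backoff with full jitter. It is only an illustration, not the company's approach. The function name `retry_with_backoff` and its parameters are assumptions made for this example, and the `sleep` and `rng` arguments are injectable purely so the behaviour can be tested without waiting.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or `max_attempts` is exhausted.

    The wait before retry n is a random value in [0, min(max_delay, base_delay * 2**n)),
    which spreads retries out so many failed sessions do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: random fraction of the capped delay
```

In an interview it is worth pointing out that retries are only safe when the operation is idempotent, and that a real implementation would usually retry only on errors known to be transient rather than on every exception.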

✨Understand the Company’s Mission

Research the company’s focus on long-horizon reinforcement learning and their customer base. Be ready to discuss how your experience aligns with their goals, particularly in building infrastructure for AI systems. This shows genuine interest and helps you stand out.

✨Prepare Questions

Have a list of insightful questions ready to ask at the end of the interview. Inquire about their current projects, team dynamics, or future challenges they foresee in AI infrastructure. This not only demonstrates your enthusiasm but also helps you gauge if the company is the right fit for you.
