At a Glance
- Tasks: Own and optimise Kubernetes for AI agent environments, ensuring stability and efficiency.
- Company: Early-stage AI company focused on innovative infrastructure for reinforcement learning.
- Benefits: Competitive salary, equity, and the chance to make a real impact.
- Why this job: Join a pioneering team and shape the future of AI technology.
- Qualifications: Experience with Kubernetes, Python, and distributed systems is essential.
- Other info: Dynamic work environment with opportunities for personal and professional growth.
The predicted salary is between £43,200 and £72,000 per year.
About the company
We are an early-stage AI company building infrastructure for long-horizon reinforcement learning: agents that operate for extended periods and execute tools within high-fidelity environments. The team has deep experience in large-scale AI systems and open-source ML, and the company is well funded by experienced operators and technical leaders in the field. We build environment infrastructure to train and evaluate agents on frontier tasks such as automated research and scientific discovery. Our customers include leading AI research organisations and fast-growing, AI-native startups.
Technical stack
- Managed Kubernetes (cloud-based)
- Redis
- Distributed compute frameworks (e.g. Ray)
- Observability stack (OpenTelemetry-style)
- 50+ containerised evaluation environments
What you’ll do
- Own the Kubernetes runtime for agent environments
  - Own scheduling, lifecycle management, stability, and operations for long-running, failure-prone workloads
  - Operate and evolve a production Kubernetes platform supporting multi-hour and multi-day agent runs
- Improve environment infrastructure for long-horizon training and evaluation
  - Maintain a large suite of containerised evaluation environments (ML benchmarks, code execution, scientific tasks) with fast cold-start times
  - Optimise GPU utilisation and scheduling for distributed workloads
  - Design storage patterns for large datasets, model checkpoints, and episodic session state
  - Improve environment bootstrap times and resource efficiency through image layering and caching strategies
- Make observability excellent
  - Implement metrics, logs, and traces that enable fast root-cause analysis
  - Build dashboards and alerting tied to SLOs (e.g. rollout success rate, environment health, tool latency, queue time)
  - Create debugging playbooks for common failure modes such as OOMs, memory leaks, performance regressions, and network or storage issues
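As an illustrative sketch of the SLO-and-alerting idea, not any particular stack: a rollout success rate can be tracked with plain counters and compared against an error budget. The `RolloutSLO` class below is invented for this example.

```python
from dataclasses import dataclass

@dataclass
class RolloutSLO:
    """Track rollout success rate against a target SLO (illustrative only)."""
    target: float = 0.99   # e.g. 99% of rollouts must succeed
    successes: int = 0
    failures: int = 0

    def record(self, ok: bool) -> None:
        if ok:
            self.successes += 1
        else:
            self.failures += 1

    @property
    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 1.0

    @property
    def error_budget_remaining(self) -> float:
        # Share of the allowed error budget (1 - target) still unspent.
        allowed = 1.0 - self.target
        spent = 1.0 - self.success_rate
        return max(0.0, 1.0 - spent / allowed) if allowed > 0 else 0.0

# Usage: feed in rollout outcomes; page someone when the budget runs out.
slo = RolloutSLO(target=0.99)
for ok in [True] * 98 + [False] * 2:
    slo.record(ok)
should_alert = slo.error_budget_remaining == 0.0
```

In practice these counters would come from a metrics backend rather than in-process state, but the budget arithmetic is the same.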
- Reliability engineering
  - Design retry and backoff strategies for long-running agent sessions that may fail mid-execution
  - Implement session recovery mechanisms such as checkpointing and idempotent operations
  - Build graceful degradation paths for node failures, OOMs, and GPU errors without losing progress
  - Create runbooks for common failure modes (e.g. sidecar health timeouts, stream lag, pod eviction cascades)
  - Develop chaos-testing strategies for multi-hour runs (network partitions, node drains, API rate limits)
  - Define and track SLOs for session creation latency, environment availability, and tool execution success rates
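A minimal sketch of how retry, backoff, and checkpointing combine for a long-running job. `run_with_recovery` is a hypothetical name, and a real system would persist the checkpoint durably rather than keeping it in a local variable; idempotent steps are what make re-running from the last checkpoint safe.

```python
import random
import time

def run_with_recovery(step_fn, total_steps, *, max_retries=5, base_delay=1.0,
                      sleep=time.sleep):
    """Run a long job step by step, resuming from the last checkpoint on failure.

    `step_fn(i)` performs step i and may raise a transient error. Only steps
    after the last completed one are re-run, so progress survives retries.
    """
    checkpoint = 0  # index of the next step to run (last completed = checkpoint - 1)
    retries = 0
    while checkpoint < total_steps:
        try:
            step_fn(checkpoint)
            checkpoint += 1
            retries = 0  # reset the backoff once we make progress
        except Exception:
            retries += 1
            if retries > max_retries:
                raise
            # Exponential backoff with jitter to avoid thundering herds.
            delay = base_delay * (2 ** (retries - 1)) * random.uniform(0.5, 1.5)
            sleep(delay)
    return checkpoint
```

Injecting `sleep` keeps the backoff testable without actually waiting.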
- Security and sandboxing for tool-using agents
  - Harden container isolation for untrusted code execution (e.g. sandboxed runtimes or microVM-based approaches)
  - Implement network policies to restrict outbound access from evaluation environments
  - Design secrets management for API keys used by agent tools, including rotation and least-privilege access
  - Build audit logging for tool invocations and filesystem access
  - Implement rate limiting and circuit breakers for external API calls made by agents
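To make the circuit-breaker half of the last bullet concrete, here is a minimal sketch assuming a consecutive-failure threshold and a cooldown before half-open trial calls. The class name, thresholds, and injected clock are illustrative choices, not a specific library.

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller checks `allow()` before each external API call and reports the outcome back; while the circuit is open, calls fail fast instead of piling onto a struggling upstream.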
Must-have experience
- Deep hands-on Kubernetes experience: resource requests and limits, affinity/taints, priorities, autoscaling, and preemption
- Debugging networking, DNS, storage performance, and node health issues
- Strong distributed‑systems fundamentals: idempotency, retries, failure domains, and incident response
- Practical observability experience with metrics, structured logging, and tracing
- Ability to build internal tools in Python and/or Go
- Infrastructure‑as‑code and automation experience (Helm, scripting, GitOps‑style workflows; Terraform a plus)
- Experience using Redis for high‑throughput, session‑oriented workloads
Nice-to-have experience
- Experience with machine learning systems or language models
- Expertise in a specific infrastructure domain, such as:
  - ML or reinforcement learning training infrastructure (checkpointing, distributed training, GPU scheduling)
  - Sandboxing technologies for untrusted code execution
  - Deep expertise in container runtimes, Linux performance tuning, or networking
Compensation
Competitive salary and meaningful equity. Early‑team impact with direct ownership and high leverage.
Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London employer: Enigma
Contact Detail:
Enigma Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London
✨Tip Number 1
Network like a pro! Attend meetups, conferences, or even local tech events in London. Chatting with folks in the industry can lead to opportunities that aren’t even advertised yet.
✨Tip Number 2
Show off your skills! Create a GitHub repo showcasing your projects, especially those involving Kubernetes, Docker, or Python. This gives potential employers a taste of what you can do and sets you apart from the crowd.
✨Tip Number 3
Prepare for technical interviews by practising common questions related to distributed systems and observability. Use platforms like LeetCode or HackerRank to sharpen your coding skills and get comfortable with problem-solving on the spot.
✨Tip Number 4
Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are genuinely interested in joining our team!
We think you need these skills to ace Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London
Some tips for your application 🫡
Tailor Your CV: Make sure your CV reflects the skills and experiences that match our job description. Highlight your experience with Kubernetes, Docker, and Python, as these are key for us. A tailored CV shows us you’re genuinely interested in the role!
Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're excited about working with us and how your background fits into our mission. Be specific about your achievements and how they relate to the role.
Showcase Your Projects: If you've worked on relevant projects, don’t hesitate to mention them! Whether it's a personal project or something from your previous job, we love seeing practical examples of your skills in action, especially around infrastructure and AI.
Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it makes the process smoother for both of us!
How to prepare for a job interview at Enigma
✨Know Your Tech Stack
Make sure you’re well-versed in Kubernetes, Docker, and Terraform. Brush up on your Python skills too, as you'll likely be asked to demonstrate your understanding of these technologies. Familiarise yourself with how they interact within the context of AI infrastructure.
✨Showcase Problem-Solving Skills
Prepare to discuss specific challenges you've faced in previous roles, especially around reliability engineering and debugging. Think about examples where you implemented retry strategies or improved observability. This will show that you can handle the complexities of long-running workloads.
✨Understand the Company’s Mission
Research the company’s focus on long-horizon reinforcement learning and their customer base. Be ready to discuss how your experience aligns with their goals, particularly in building infrastructure for AI systems. This shows genuine interest and helps you stand out.
✨Prepare Questions
Have a list of insightful questions ready to ask at the end of the interview. Inquire about their current projects, team dynamics, or future challenges they foresee in AI infrastructure. This not only demonstrates your enthusiasm but also helps you gauge if the company is the right fit for you.