At a Glance
- Tasks: Own and optimise Kubernetes for AI agent environments, ensuring stability and efficiency.
- Company: Early-stage AI company focused on innovative infrastructure for reinforcement learning.
- Benefits: Competitive salary, equity, and the chance to make a real impact.
- Why this job: Join a pioneering team and shape the future of AI technology.
- Qualifications: Experience with Kubernetes, Python, and distributed systems is essential.
- Other info: Dynamic work environment with opportunities for personal and professional growth.
The predicted salary is between £43,200 and £72,000 per year.
About the company
We are an early-stage AI company building infrastructure for long-horizon reinforcement learning: agents that operate for extended periods and execute tools within high-fidelity environments. The team has deep experience in large-scale AI systems and open-source ML, and the company is well funded by experienced operators and technical leaders in the field. We build environment infrastructure to train and evaluate agents on frontier tasks such as automated research and scientific discovery. Our customers include leading AI research organisations and fast-growing, AI-native startups.
Technical stack
- Managed Kubernetes (cloud-based)
- Redis
- Distributed compute frameworks (e.g. Ray)
- Observability stack (OpenTelemetry-style)
- 50+ containerised evaluation environments
What you’ll do
- Own the Kubernetes runtime for agent environments
  - Own scheduling, lifecycle management, stability, and operations for long-running, failure-prone workloads
  - Operate and evolve a production Kubernetes platform supporting multi-hour and multi-day agent runs
- Improve environment infrastructure for long-horizon training and evaluation
  - Maintain a large suite of containerised evaluation environments (ML benchmarks, code execution, scientific tasks) with fast cold-start times
  - Optimise GPU utilisation and scheduling for distributed workloads
  - Design storage patterns for large datasets, model checkpoints, and episodic session state
  - Improve environment bootstrap times and resource efficiency through image layering and caching strategies
- Make observability excellent
  - Implement metrics, logs, and traces that enable fast root-cause analysis
  - Build dashboards and alerting tied to SLOs (e.g. rollout success rate, environment health, tool latency, queue time)
  - Create debugging playbooks for common failure modes such as OOMs, memory leaks, performance regressions, and network or storage issues
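As an illustrative sketch of the SLO-and-alerting idea, not any particular stack: a rollout success rate can be tracked with plain counters and compared against an error budget. The `RolloutSLO` class below is invented for this example.

```python
from dataclasses import dataclass

@dataclass
class RolloutSLO:
    """Track rollout success rate against a target SLO (illustrative only)."""
    target: float = 0.99   # e.g. 99% of rollouts must succeed
    successes: int = 0
    failures: int = 0

    def record(self, ok: bool) -> None:
        if ok:
            self.successes += 1
        else:
            self.failures += 1

    @property
    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 1.0

    @property
    def error_budget_remaining(self) -> float:
        # Share of the allowed error budget (1 - target) still unspent.
        allowed = 1.0 - self.target
        spent = 1.0 - self.success_rate
        return max(0.0, 1.0 - spent / allowed) if allowed > 0 else 0.0

# Usage: feed in rollout outcomes; page someone when the budget runs out.
slo = RolloutSLO(target=0.99)
for ok in [True] * 98 + [False] * 2:
    slo.record(ok)
should_alert = slo.error_budget_remaining == 0.0
```

In practice these counters would come from a metrics backend rather than in-process state, but the budget arithmetic is the same.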
- Reliability engineering
  - Design retry and backoff strategies for long-running agent sessions that may fail mid-execution
  - Implement session recovery mechanisms such as checkpointing and idempotent operations
  - Build graceful degradation paths for node failures, OOMs, and GPU errors without losing progress
  - Create runbooks for common failure modes (e.g. sidecar health timeouts, stream lag, pod eviction cascades)
  - Develop chaos-testing strategies for multi-hour runs (network partitions, node drains, API rate limits)
  - Define and track SLOs for session creation latency, environment availability, and tool execution success rates
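A minimal sketch of how retry, backoff, and checkpointing combine for a long-running job. `run_with_recovery` is a hypothetical name, and a real system would persist the checkpoint durably rather than keeping it in a local variable; idempotent steps are what make re-running from the last checkpoint safe.

```python
import random
import time

def run_with_recovery(step_fn, total_steps, *, max_retries=5, base_delay=1.0,
                      sleep=time.sleep):
    """Run a long job step by step, resuming from the last checkpoint on failure.

    `step_fn(i)` performs step i and may raise a transient error. Only steps
    after the last completed one are re-run, so progress survives retries.
    """
    checkpoint = 0  # index of the next step to run (last completed = checkpoint - 1)
    retries = 0
    while checkpoint < total_steps:
        try:
            step_fn(checkpoint)
            checkpoint += 1
            retries = 0  # reset the backoff once we make progress
        except Exception:
            retries += 1
            if retries > max_retries:
                raise
            # Exponential backoff with jitter to avoid thundering herds.
            delay = base_delay * (2 ** (retries - 1)) * random.uniform(0.5, 1.5)
            sleep(delay)
    return checkpoint
```

Injecting `sleep` keeps the backoff testable without actually waiting.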
- Security and sandboxing for tool-using agents
  - Harden container isolation for untrusted code execution (e.g. sandboxed runtimes or microVM-based approaches)
  - Implement network policies to restrict outbound access from evaluation environments
  - Design secrets management for API keys used by agent tools, including rotation and least-privilege access
  - Build audit logging for tool invocations and filesystem access
  - Implement rate limiting and circuit breakers for external API calls made by agents
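To make the circuit-breaker half of the last bullet concrete, here is a minimal sketch assuming a consecutive-failure threshold and a cooldown before half-open trial calls. The class name, thresholds, and injected clock are illustrative choices, not a specific library.

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller checks `allow()` before each external API call and reports the outcome back; while the circuit is open, calls fail fast instead of piling onto a struggling upstream.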
Must-have experience
- Deep hands-on Kubernetes experience: resource requests and limits, affinity/taints, priorities, autoscaling, and preemption
- Debugging networking, DNS, storage performance, and node health issues
- Strong distributed‑systems fundamentals: idempotency, retries, failure domains, and incident response
- Practical observability experience with metrics, structured logging, and tracing
- Ability to build internal tools in Python and/or Go
- Infrastructure‑as‑code and automation experience (Helm, scripting, GitOps‑style workflows; Terraform a plus)
- Experience using Redis for high‑throughput, session‑oriented workloads
Nice-to-have experience
- Experience with machine learning systems or language models
- Expertise in a specific infrastructure domain, such as:
  - ML or reinforcement learning training infrastructure (checkpointing, distributed training, GPU scheduling)
  - Sandboxing technologies for untrusted code execution
  - Deep expertise in container runtimes, Linux performance tuning, or networking
Compensation
Competitive salary and meaningful equity. Early‑team impact with direct ownership and high leverage.
Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London employer: Enigma
Contact Detail:
Enigma Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London
✨Tip Number 1
Network like a pro! Attend meetups, conferences, or even local tech events in London. Chatting with folks in the industry can lead to opportunities that aren’t even advertised yet.
✨Tip Number 2
Show off your skills! Create a GitHub repo showcasing your projects, especially those involving Kubernetes, Docker, or Python. This gives potential employers a taste of what you can do and sets you apart from the crowd.
✨Tip Number 3
Prepare for technical interviews by practising common questions related to distributed systems and observability. Use platforms like LeetCode or HackerRank to sharpen your coding skills and get comfortable with problem-solving on the spot.
✨Tip Number 4
Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are genuinely interested in joining our team!
We think you need these skills to ace Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London
Some tips for your application 🫡
Tailor Your CV: Make sure your CV reflects the skills and experiences that match our job description. Highlight your experience with Kubernetes, Docker, and Python, as these are key for us. A tailored CV shows us you’re genuinely interested in the role!
Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're excited about working with us and how your background fits into our mission. Be specific about your achievements and how they relate to the role.
Showcase Your Projects: If you've worked on relevant projects, don’t hesitate to mention them! Whether it's a personal project or something from your previous job, we love seeing practical examples of your skills in action, especially around infrastructure and AI.
Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it makes the process smoother for both of us!
How to prepare for a job interview at Enigma
✨Know Your Tech Stack
Make sure you’re well-versed in Kubernetes, Docker, and Terraform. Brush up on your Python skills too, as you'll likely be asked to demonstrate your understanding of these technologies. Familiarise yourself with how they interact within the context of AI infrastructure.
✨Showcase Problem-Solving Skills
Prepare to discuss specific challenges you've faced in previous roles, especially around reliability engineering and debugging. Think about examples where you implemented retry strategies or improved observability. This will show that you can handle the complexities of long-running workloads.
✨Understand the Company’s Mission
Research the company’s focus on long-horizon reinforcement learning and their customer base. Be ready to discuss how your experience aligns with their goals, particularly in building infrastructure for AI systems. This shows genuine interest and helps you stand out.
✨Prepare Questions
Have a list of insightful questions ready to ask at the end of the interview. Inquire about their current projects, team dynamics, or future challenges they foresee in AI infrastructure. This not only demonstrates your enthusiasm but also helps you gauge if the company is the right fit for you.