AI Ops Platform SRE — Reliability & Security in Harrow

Harrow, Full-Time, Home office (partial)

At a Glance

Tasks: Own the reliability and security of our AI operations platform for enterprise customers.
Company: Join a fast-moving team on a mission to revolutionise data operations.
Benefits: Competitive salary, equity options, unlimited AI budget, and autonomy in your work.
Other info: Be part of a motivated team that values ownership, feedback, and rapid iteration.
Why this job: Make a real impact by building cutting-edge AI solutions for retail and CPG enterprises.
Qualifications: Distributed systems experience, a security mindset, and automation skills are essential.

Who we are

Enterprise teams still copy data between systems all day. Work gets stuck in emails, legacy UIs, and handoffs. That chaos is costly, slow, and risky. We're a fast-moving team on a mission to end it for good. Traction is strong and we're solving real problems for real customers—but to win, we need exceptional talent. We stay humble, do the work, and let results speak.

What we are building

We're building the AI operations platform for retail and CPG enterprises—a horizontal platform where AI agents execute end-to-end work across UIs and APIs with governance built in. Where copilots stop, Duvo finishes the job. Business users specify the outcome; agents plan, act, request approvals on exceptions, and learn with every run. We start with a retail wedge (category management, supply chain, finance ops) where ROI is obvious, then expand to adjacent functions and sectors. Velocity is our moat: ship fast, iterate faster, compound learning.

The role

You will own the reliability, security, and infrastructure that let our platform run AI agents for enterprise customers. This isn't traditional web app SRE: our agents execute arbitrary code in sandboxes, make unpredictable external API calls, and run for hours. Keeping this reliable, secure, and observable is the job. You'll be one of the first members of a newly formed SRE team. Infrastructure is currently owned collectively by product engineers. You'll take ownership, inherit real infrastructure (25+ Terraform modules, full OpenTelemetry pipeline, Prometheus/Grafana monitoring), and build the reliability practice from scratch. Your unit of ownership: platform reliability, infrastructure, observability, and incident response. You own sandbox infrastructure and capacity; the AI Platform Engineer owns sandbox behaviour and runtime logic. We're a growing product team scaling into multiple initiatives, each with a lead, engineers, a design engineer, and an AI-focused engineer.

What we're looking for

Distributed systems experience. You've designed and operated systems that scale. You understand failure modes, capacity planning, and the trade-offs between consistency, availability, and latency in real production environments.
Security mindset. You'll handle enterprise data flowing through sandboxed environments, manage KMS encryption, configure Cloud Armor WAF rules, and ensure network isolation between tenant workloads. Security is a default consideration, not an afterthought.
Observability and incident response. You build monitoring and alerting that catches problems before customers do. When incidents happen, you lead structured responses, find root causes, and drive lasting fixes — not just restarts.
Infrastructure as code and automation. You automate everything you can. You've worked with IaC tools, CI/CD pipelines, and container orchestration in production. Manual runbooks make you uncomfortable.
Shipping and ownership. You don't just maintain systems — you improve them. You take ownership of reliability projects from proposal to production, and you measure the results.
Judgment on where to invest. You'll decide what to automate first and where to invest in reliability vs. shipping speed, and you'll make incident calls with incomplete information.

You might also

Have experience with GCP, Kubernetes, or similar cloud-native infrastructure.
Have worked with sandboxed execution environments or multi-tenant isolation.
Be comfortable with AI/ML production systems — understanding the unique reliability challenges of LLM-based applications.
Have a product engineering background — you've built features and understand the developer experience you're supporting.

This is not for you if

You want a traditional ops role where you follow runbooks — we're building the reliability practice, not maintaining one.
You want to build AI features — see AI Platform Engineer.

Our tech stack

GCP (Cloud Run, GKE, GCS)
Terraform, Docker
Prometheus, Grafana, Loki, OpenTelemetry
TypeScript and Python services (you'll read and occasionally modify application code, but deep language expertise isn't required)
Postgres, Redis

How we work

These are real trade-offs we've made, not aspirations.

Initiative-driven. We organise around customer problems, not org charts. Problems surface through product feedback, competitive analysis, and direct customer conversations; then we prioritise, build, and ship weekly.
Customer-obsessed. We solve real problems, not hypothetical ones. Features that don't move customer metrics get cut.
Iterative by default. We ship small, learn fast, and never get attached to yesterday's code. This means things break sometimes, and we fix forward.
AI-first leverage. We use AI to move faster and focus human time where it matters most. If a tool can do it, a person shouldn't.
Direct feedback. We give each other actionable feedback immediately. This can feel uncomfortable, but we think that's worth it.
Autonomy with accountability. We trust people to make decisions and hold them to outcomes, not process.

What we offer

Unlimited AI budget. We don't just allow AI tools — we strongly encourage them. Want to try a new tool? Buy it. Want to automate part of your workflow? Do it.
Autonomy to do your best work. Want to meet someone to learn from? Set it up. Want a mentor? Go get one. Want to fly out to talk to an important customer? Just ask.
A real AI product with real customers. You're not building demos or internal tools. Enterprise customers use what you ship, and their feedback drives what you build next.
A sharp, motivated team that values ownership and candour.

Compensation: 250,000 CZK / month with a meaningful equity component. You can trade salary for additional equity if you prefer more upside.

How we hire

We respect your time and aim to move fast.

Hiring manager screen (30 min). We'll talk about systems you've built and operated, how you handle incidents, and whether there's mutual fit.
Remote task (async, time-boxed, ~1 hour). A realistic infrastructure or reliability exercise: an incident response scenario, an IaC task, or a monitoring design challenge. Not LeetCode.
Technical interview (Prague, ~1 hour). Meet the team. We'll go deeper on system design, security thinking, and incident response. No trick questions; we want to see how you reason about production systems.
On-site trial day (2 days). Work on a real infrastructure problem with us and see how we operate. Fully compensated.

Employer: duvo.ai

At Duvo, we pride ourselves on being an exceptional employer that fosters a culture of innovation and autonomy. Our team is driven by a shared mission to solve real problems for our enterprise customers, offering unlimited AI budgets and opportunities for personal growth in a fast-paced environment. With a focus on collaboration and direct feedback, we empower our employees to take ownership of their work while providing the support needed to thrive in their roles.

Contact Details:

duvo.ai Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land AI Ops Platform SRE — Reliability & Security in Harrow

✨Tip Number 1

Network like a pro! Attend industry meetups, webinars, or even local tech events. Chatting with folks in the field can lead to opportunities that aren’t even advertised yet.

✨Tip Number 2

Show off your skills! Create a personal project or contribute to open-source. This not only sharpens your abilities but also gives you something tangible to discuss during interviews.

✨Tip Number 3

Prepare for those interviews! Research common questions for SRE roles and practice your responses. We want to see how you think on your feet, so be ready to tackle real-world scenarios.

✨Tip Number 4

Apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re genuinely interested in joining our mission to revolutionise AI operations.

We think you need these skills to ace AI Ops Platform SRE — Reliability & Security in Harrow

Distributed Systems Experience

Security Mindset

Observability and Incident Response

Infrastructure as Code (IaC)

Automation

Capacity Planning

Monitoring and Alerting

Cloud Infrastructure (GCP, Kubernetes)

Sandboxed Execution Environments

AI/ML Production Systems

Technical Judgement

Terraform

Docker

Prometheus

Grafana

Some tips for your application 🫡

Show Your Passion: When you're writing your application, let your enthusiasm for the role shine through! We want to see that you’re genuinely excited about the opportunity to work with us and tackle the challenges we face in AI operations.

Tailor Your Experience: Make sure to highlight your relevant experience with distributed systems, security, and observability. We’re looking for specific examples that demonstrate how you've tackled similar challenges in the past. This will help us see how you fit into our team!

Be Clear and Concise: Keep your application straightforward and to the point. We appreciate clarity, so avoid jargon and focus on what matters most. This will make it easier for us to understand your qualifications and how you can contribute to our mission.

Apply Through Our Website: Don’t forget to submit your application through our website! It’s the best way for us to keep track of your application and ensure it gets the attention it deserves. Plus, it shows you’re serious about joining our team!

How to prepare for a job interview at duvo.ai

✨Know Your Tech Stack

Familiarise yourself with the technologies mentioned in the job description, like Postgres, Redis, and Terraform. Be ready to discuss your experience with these tools and how you've used them in past projects. This shows you’re not just a fit on paper but also have practical knowledge.

✨Demonstrate Your Security Mindset

Since security is a key focus for this role, prepare examples of how you've implemented security measures in previous projects. Discuss your experience with KMS encryption and network isolation, and be ready to explain why security should be a default consideration in any system design.

✨Showcase Your Incident Response Skills

Be prepared to talk about specific incidents you've managed in the past. Highlight how you led structured responses, identified root causes, and implemented lasting fixes. This will demonstrate your ability to handle real-world challenges effectively.

✨Emphasise Your Ownership and Initiative

This role requires someone who takes ownership of reliability projects. Share examples of how you've improved systems from proposal to production, and discuss your thought process when deciding what to automate first. This will show that you’re proactive and results-driven.
