Graduate Engineer: AI Tooling and Site Reliability in Cardiff

Cardiff | Full-Time | £60,000 - £80,000 per year (est.) | No working from home possible
Critical Cloud Limited

At a Glance

  • Tasks: Build AI tooling and manage real production environments from day one.
  • Company: Join Critical Cloud, a pioneering tech company transforming cloud operations.
  • Benefits: Enjoy 25 days holiday, flexible working, and paid certifications.
  • Other info: Gain ownership and direct access to founders in a dynamic work environment.
  • Why this job: Make a real impact by automating operational challenges with cutting-edge AI technology.
  • Qualifications: Degree in Computer Science or related field; solid Python skills required.

We're building an internal AI platform from scratch, the tooling that will define how Critical Cloud operates as we scale across Europe. This isn't a rotation or a shadow programme. From week one you'll be shipping real tooling and operating real production environments for real customers. The two tracks exist because they make each other better. That's the design.

About the Role

This isn't a rotation programme. From week one, you'll contribute to both tracks: shipping AI tooling that helps us run cloud operations better, and operating real production infrastructure for real customers. Two disciplines, one engineer, no siloes.

Critical Cloud is the world's first "Powered by Datadog" accredited MSP, a Datadog-native cloud MSP built for European tech‑led SMBs. We're building an internal AI platform (the Critical Cloud Platform) to automate and augment how we operate customer environments. This role sits at the centre of that programme.

Half your time will be engineering AI‑assisted tooling: LLM integrations, agents, and automation workflows that reduce toil and improve our operational quality. The other half will be hands‑on SRE work: monitoring, incident support, infrastructure‑as‑code, and customer‑facing operations. Each half makes you better at the other.

What You’ll Do

  • AI Tooling Track
    • Build and iterate on AI‑assisted automation workflows using LLM APIs (Claude, OpenAI) integrated with cloud and observability tooling
    • Develop tooling for automated infrastructure discovery, customer onboarding, and operational runbook generation
    • Contribute to the Critical Cloud Platform: our internal AI governance framework and agent operating model
    • Design and implement MCP (Model Context Protocol) integrations connecting AI agents to Datadog, AWS, and Azure APIs
    • Write evaluation harnesses and regression tests to keep AI tool output reliable and auditable
    • Document AI system behaviour against our constitutional operating framework and ISO 27001 controls
  • Site Reliability Track
    • Monitor and triage alerts across customer AWS and Azure environments using Datadog as the primary observability platform
    • Support incident response workflows and contribute to postmortem documentation alongside the SRE team
    • Support Datadog onboarding for new customers: instrumentation, dashboards, monitors, and SLO configuration
    • Write and maintain Terraform modules for infrastructure provisioning and change management
    • Produce and maintain operational runbooks, escalation guides, and change records to ISO 27001 standards
    • Contribute SRE context back into AI tooling: you'll know what's worth automating because you've done it manually

Requirements

  • A degree in Computer Science, Software Engineering, or a related technical field (2:1 or above)
  • Solid Python: comfortable writing scripts, working with APIs, and handling structured data
  • Familiarity with cloud fundamentals (AWS or Azure), ideally through coursework, personal projects, or placement
  • Experience consuming REST APIs or LLM APIs, whether through a project, dissertation, or side work
  • Clear written communication: you'll be writing docs and talking to customers

Nice to Have

  • Hands‑on LLM work: prompt engineering, tool use, agent frameworks, or evaluation pipelines
  • Terraform or any IaC tooling (even tutorials count)
  • Datadog experience, even a free tier account you've played with
  • Kubernetes or containerised workload exposure
  • Any cloud or AI certification (AWS, Azure, Google, or Datadog)
  • A GitHub profile with something worth showing us

AI & Automation

  • Claude / Anthropic API – Primary LLM platform
  • Datadog – Core observability platform
  • AWS – Primary cloud, multi‑account
  • Azure – Secondary cloud workloads
  • Terraform – Infrastructure as code
  • GitHub Actions – CI/CD pipelines

Benefits

  • 25 days holiday plus bank holidays, and a paid day off to take in your birthday month
  • Holiday grows with tenure: +1 day per year after your second work anniversary, up to 28 days total
  • Enhanced maternity pay: 26 weeks at your full basic salary
  • Enhanced paternity pay: 2 weeks at your full basic salary
  • Datadog, AWS, Azure, and AI tooling certifications paid for by the company: a contractual obligation, not a discretionary budget
  • Flexible working requests from your first day of employment: a statutory right, supported in full
  • Company‑provided laptop and peripherals, set up before you start

Who Thrives Here

The ideal candidate doesn't have to choose between writing code and running infrastructure. They're curious about both and understand that the two inform each other. You'll build AI tooling that automates real operational problems precisely because you've experienced those problems hands‑on in the SRE track. We operate to ISO 27001. Everything we build, including AI systems, has to be explainable, auditable, and consistent with our governance framework. If you care about building AI tools that are reliable, not just impressive demos, you'll fit right in. This is an early career role, but we don't run it like one. You'll have genuine ownership, direct access to founders, and the chance to shape a platform that will define how Critical Cloud operates at scale.

How We Work

  • Own the Problem: When something breaks in a customer environment, you take it through to resolution and document it properly. Not "I raised a ticket." Not "I told the senior." You own it.
  • Stay Curious: The AI tooling track exists because engineers asked "what if we automated that?" This role rewards people who look at repetitive manual work and immediately start thinking about whether they could build their way out of it. The worst automation is the one nobody trusts because it's too complicated. Build for the on‑call engineer picking it up at 3am without context. A runbook anyone can follow is worth more than one only you understand.
  • Be Resourceful: You’ll hit problems on both tracks where the answer isn't in a tutorial. The engineers who thrive here figure things out, with what they have, in the time they have, to the standard required.
