Infrastructure Tooling & Observability Engineer (UK) in London

London | Full-Time | £70,000 to £90,000 / year (est.) | No working from home possible
Radiant

At a Glance

  • Tasks: Design and build internal tooling for large-scale infrastructure operations.
  • Company: Fast-growing GPU-as-a-Service provider with a focus on AI and HPC workloads.
  • Benefits: Competitive salary, remote work options, and opportunities for professional growth.
  • Other info: Collaborative culture with high ownership and exposure to large-scale distributed systems.
  • Why this job: Make a real impact by improving observability and automation in cutting-edge tech environments.
  • Qualifications: Experience in infrastructure engineering or DevOps, strong programming skills, and familiarity with observability systems.

The predicted salary is between £70,000 and £90,000 per year.

About Us

We’re a fast-growing GPU-as-a-Service provider, delivering scalable, high-performance compute infrastructure purpose-built for AI and HPC workloads. Operating across global data centres, we run mission-critical environments where uptime, throughput, and ultra-low latency are non-negotiable.

Role Overview

We are seeking an Infrastructure Tooling & Observability Engineer to act as a key engineering force within our global Infrastructure Operations organisation. Working closely with our SRE teams, you will translate high-level reliability objectives into scalable, production-ready systems that directly improve the resilience, efficiency, and performance of our global infrastructure.

This role goes beyond traditional monitoring. You will help design and build the internal control plane that enables operations at scale across a rapidly growing GPU fleet. Your work will focus on transforming complex, high-volume telemetry—spanning logs, metrics, and events across HPC, networking, and platform layers—into actionable insight that drives operational excellence and proactive reliability.

A core part of your responsibility will be developing intelligent observability and automation systems, including advanced alerting strategies, anomaly detection, and AI-driven tooling that reduces L1/L2 escalations and removes operational toil. You will also contribute to Continual Service Improvement (CSI) initiatives by building frameworks for reliability measurement, automated remediation, and system health evaluation.

In addition, you will play a central role in turning SRE reliability initiatives into scalable engineering solutions. This includes designing and delivering capabilities such as inventory management systems, performance testing frameworks, and automated performance result collection. You will also help eliminate manual workflows involved in onboarding new regions, facilities, and clusters, embedding automation and standardisation into every stage of infrastructure deployment.

As the organisation scales, you will act as a critical interface between operations and engineering teams. You will evaluate and mature internally built tooling—from capacity planning systems to autonomous remediation pipelines—and help integrate these capabilities into core infrastructure platforms to ensure consistent, high-performance, and highly reliable global operations.

What’s In It for You?

Join a team building the internal platforms that enable large-scale infrastructure to operate reliably, efficiently, and at speed. As an Infrastructure Tooling & Observability Engineer, you will design and develop the systems that power visibility, automation, and operational intelligence across complex distributed environments.

This role goes beyond traditional monitoring. You will build the internal control plane that transforms high-volume telemetry—logs, metrics, and events—into actionable insight for engineering and operations teams. Your work will improve observability across infrastructure systems, strengthen signal quality, and help teams understand and respond to system behaviour in real time.

Working closely with SRE and infrastructure engineering teams, you will translate reliability goals into scalable, production-grade tooling. This includes frameworks for observability, alerting, anomaly detection, capacity planning, and service health tracking.

A key focus of the role is automation. You will help eliminate manual processes across infrastructure operations, including environment provisioning, cluster onboarding, inventory management, and recurring operational workflows. You will also contribute to performance engineering initiatives, building tooling for testing, benchmarking, and automated results collection at scale.

You will play a central role in turning SRE reliability initiatives into reusable engineering solutions, including automated remediation systems and tooling that reduces operational toil while improving system resilience.

You can also expect:

  • Exposure to large-scale distributed infrastructure systems
  • Opportunities to shape foundational internal platforms
  • A collaborative, engineering-led culture with strong ownership
  • High-impact work spanning observability, automation, and reliability
  • Close partnership with SRE and infrastructure engineering teams
  • A fast-moving environment where tooling directly improves operational performance

Key Responsibilities

  • Design, build, and evolve internal tooling and observability platforms that support large-scale infrastructure operations across distributed environments.
  • Develop systems that turn high-volume telemetry (logs, metrics, events) into actionable insight, improving visibility, alerting quality, and operational decision-making.
  • Translate SRE reliability requirements into scalable, production-ready software solutions, including automation for incident detection, prevention, and remediation.
  • Drive automation across infrastructure operations, reducing manual effort in areas such as environment provisioning, cluster onboarding, inventory management, and lifecycle workflows.
  • Build tooling for capacity management, performance testing, benchmarking, and automated collection and analysis of results.
  • Contribute to Continual Service Improvement (CSI) initiatives by identifying operational inefficiencies and delivering durable engineering solutions.
  • Work closely with SRE and infrastructure engineering teams to embed observability and reliability into core platform workflows.
  • Interface with Platform Engineering teams to ensure tooling aligns with broader orchestration and infrastructure strategy.
  • Integrate and extend existing systems written in Ruby/Rails and Go, contributing to a consistent and maintainable engineering ecosystem.
  • Develop and maintain automation workflows using Ansible and AWX.
  • Support CI/CD-driven operational tooling, including GitHub Actions and self-hosted runners.

Essential Skills & Experience

  • Degree in Computer Science/Software Engineering, or equivalent experience
  • 6–8 years of experience in infrastructure engineering, DevOps, SRE, and/or software engineering roles, with a strong focus on operational systems.
  • Proven experience in at least one recent DevOps or software engineering role, building or maintaining production infrastructure tooling or platform systems.
  • Experience working in large-scale or distributed infrastructure environments (hyperscale, enterprise, or similarly complex systems).
  • Strong programming ability in at least one of: Ruby (Rails), Go, or similar systems languages, with willingness and ability to work across multiple languages and codebases.
  • Hands-on experience with infrastructure automation tools such as Ansible and orchestration platforms such as AWX.
  • Strong experience with observability systems, including the Grafana stack (Prometheus, Loki, Mimir, and Grafana Alloy).
  • Familiarity with low-level telemetry and infrastructure protocols such as SNMP and syslog.
  • Experience working with Kubernetes or similar orchestration platforms in production environments.
  • Understanding of API design and integration patterns, particularly REST-based services and service-to-service communication.
  • Experience building and maintaining CI/CD pipelines, including GitHub Actions and self-hosted runners.
  • Strong understanding of operational reliability concepts, including monitoring, alerting, capacity planning, and incident response.
  • Comfortable working closely with SRE, Platform Engineering, and infrastructure teams to translate operational needs into maintainable software systems.

Preferred Qualifications

  • Certified Kubernetes Administrator (CKA)
  • Cloud-native observability training courses, or attendance at industry conferences in this field
  • CompTIA Security+ qualification
  • LPI/LPIC certification

Infrastructure Tooling & Observability Engineer (UK) in London employer: Radiant

As a fast-growing GPU-as-a-Service provider, we offer an exceptional work environment that fosters innovation and collaboration. Our team is dedicated to building cutting-edge infrastructure solutions, providing employees with opportunities for professional growth and the chance to make a significant impact in the AI and HPC sectors. With a strong focus on automation and observability, we empower our engineers to take ownership of their projects in a dynamic and supportive culture, all while working in a location that is at the forefront of technological advancement.

Radiant

Contact Details:

Radiant Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land the Infrastructure Tooling & Observability Engineer (UK) role in London

Tip Number 1

Network, network, network! Get out there and connect with folks in the industry. Attend meetups, webinars, or conferences related to infrastructure and observability. You never know who might have a lead on your dream job!

Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those involving automation and observability tools. This gives potential employers a taste of what you can do and sets you apart from the crowd.

Tip Number 3

Don’t just apply blindly! Tailor your approach for each role. Research the company’s tech stack and mention how your experience aligns with their needs. This shows you’re genuinely interested and not just sending out cookie-cutter applications.

Tip Number 4

Leverage our website to apply! We’ve got a streamlined application process that makes it easy for you to showcase your skills and experience. Plus, it helps us get to know you better right from the start!

We think you need these skills to ace the Infrastructure Tooling & Observability Engineer (UK) role in London

Infrastructure Engineering
DevOps
Site Reliability Engineering (SRE)
Software Engineering
Observability Systems
Telemetry Analysis
Automation Tools (Ansible, AWX)

Some tips for your application 🫡

Tailor Your CV: Make sure your CV reflects the skills and experiences that align with the role of Infrastructure Tooling & Observability Engineer. Highlight your experience with observability systems, automation tools, and any relevant programming languages like Ruby or Go.

Craft a Compelling Cover Letter: Use your cover letter to tell us why you're passionate about infrastructure engineering and how your background makes you a great fit for our team. Be sure to mention specific projects or achievements that demonstrate your expertise in building scalable systems.

Showcase Your Problem-Solving Skills: In your application, include examples of how you've tackled complex challenges in previous roles. We love seeing candidates who can think critically and come up with innovative solutions, especially in high-pressure environments.

Apply Through Our Website: We encourage you to apply directly through our website. This way, your application will be reviewed by our team promptly, and you'll have the best chance of making a great first impression!

How to prepare for a job interview at Radiant

Know Your Tech Stack

Make sure you’re well-versed in the technologies mentioned in the job description, especially Ruby, Go, and Ansible. Brush up on your knowledge of observability systems like Grafana and be ready to discuss how you've used these tools in past projects.

Showcase Your Problem-Solving Skills

Prepare examples of how you've tackled complex infrastructure challenges. Think about specific instances where you improved system reliability or automated processes, and be ready to explain your thought process and the impact of your solutions.

Understand the Company’s Mission

Familiarise yourself with the company’s focus on GPU-as-a-Service and high-performance compute infrastructure. Be prepared to discuss how your skills can contribute to their goals, particularly in enhancing operational excellence and reliability.

Ask Insightful Questions

Prepare thoughtful questions that demonstrate your interest in the role and the company. Inquire about their current challenges in observability and automation, or ask how they measure success in their engineering initiatives. This shows you’re engaged and thinking critically about the position.