Observability Platform Engineer New UK

Full-Time | £60,000 - £80,000 / year (est.) | No working from home possible
Nscale Ltd.

At a Glance

  • Tasks: Design and operate observability platforms for our global AI datacentre infrastructure.
  • Company: Join Nscale, the GPU cloud powering AI innovation with a culture of ownership and accountability.
  • Benefits: Competitive salary, inclusive workplace, and opportunities for professional growth.
  • Other info: Diverse and inclusive environment encouraging applications from all backgrounds.
  • Why this job: Make a real impact in AI development while working with cutting-edge technologies.
  • Qualifications: Experience with observability tools and cloud-native infrastructure is essential.

The predicted salary is between £60,000 and £80,000 per year.

Nscale is the GPU cloud engineered for AI. We provide cost‑effective, high-performance infrastructure for AI start‑ups and large enterprise customers. Nscale enables AI‑focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility. We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.

About The Role

Nscale is seeking an Observability Platform Engineer to design, deploy, and operate the monitoring, logging, and tracing systems that power observability across our global AI datacentre infrastructure. You will focus on scalability, automation, and integration of observability platforms, ensuring that metrics, logs, and traces are accurate, accessible, and actionable. You’ll work closely with SRE, Infrastructure, and Engineering teams to ensure our GPU‑powered cloud is fully instrumented for reliability and performance.

What You’ll Be Doing

  • Design, build, and maintain observability platforms (monitoring, logging, tracing, alerting) at global scale.
  • Deploy and manage tools such as Prometheus, Grafana, Datadog, ELK/OpenSearch, OpenTelemetry, and Jaeger.
  • Automate observability infrastructure using Infrastructure‑as‑Code and CI/CD pipelines.
  • Partner with engineering and SRE teams to instrument applications and systems for telemetry.
  • Develop dashboards, alerts, and analytics to provide real‑time visibility into infrastructure health.
  • Ensure observability data is accurate, reliable, and retained per compliance requirements.
  • Troubleshoot observability platform issues, ensuring high availability and performance.
  • Drive adoption of best practices for monitoring, logging, and tracing across the company.
  • Contribute to continuous improvement of incident detection, response, and resolution.
  • Document observability standards, tools, and processes.

About You (Skills / Qualifications)

  • Strong experience in designing and operating observability platforms at scale.
  • Hands‑on expertise with monitoring, logging, and tracing tools (Prometheus, Grafana, Datadog, ELK/OpenSearch, Splunk, OpenTelemetry, Jaeger).
  • Experience with cloud‑native infrastructure (Kubernetes, containers, service meshes).
  • Proficiency in scripting/automation (e.g., Python, Go, Bash).
  • Knowledge of Infrastructure‑as‑Code (Terraform, Ansible, Pulumi) and CI/CD practices.
  • Strong understanding of distributed systems reliability and incident management.
  • Excellent problem‑solving skills with the ability to diagnose performance issues across systems.
  • Good collaboration skills to work with engineering, operations, and product teams.

Nice to have:

  • Experience with AI/ML workload observability.
  • Familiarity with hyperscale datacentre environments.
  • Knowledge of AIOps and advanced telemetry analytics.
  • Exposure to sustainability monitoring (e.g., power usage effectiveness, efficiency metrics).

At Nscale, we are committed to fostering an inclusive, diverse, and equitable workplace. We believe that a variety of perspectives enriches our work environment, and we encourage applications from candidates of all backgrounds, experiences, and abilities. We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio‑economic backgrounds.

Observability Platform Engineer New UK employer: Nscale Ltd.

Nscale champions a culture of innovation, ownership, and accountability, making it a strong fit for people passionate about AI technology. With a commitment to employee growth and a collaborative environment, Nscale offers opportunities to work on cutting-edge observability platforms while contributing to projects that shape the future of AI. Based in the UK, Nscale offers employees a diverse and inclusive atmosphere that values every perspective.

Contact Details:

Nscale Ltd. Recruitment Team

We think you need these skills to ace Observability Platform Engineer New UK

Observability Platforms Design
Monitoring Tools (Prometheus, Grafana, Datadog, ELK/OpenSearch, Splunk, OpenTelemetry, Jaeger)
Cloud-Native Infrastructure (Kubernetes, containers, service meshes)
Scripting/Automation (Python, Go, Bash)
Infrastructure-as-Code (Terraform, Ansible, Pulumi)
CI/CD Practices
Distributed Systems Reliability