Senior Observability & Telemetry Engineer - Radian Arc

Full-time, £70,000 - £90,000 / year (est.), remote (EMEA)
Submer

At a Glance

  • Tasks: Design and build observability platforms for cutting-edge GPU cloud infrastructure.
  • Company: Join Radian Arc, part of InferX, a leader in AI cloud solutions.
  • Benefits: Competitive salary, flexible remote work, and a vibrant international team.
  • Why this job: Make a real impact on the future of cloud gaming and AI technology.
  • Qualifications: Experience in observability systems and strong programming skills in Go, Python, or Rust.
  • Other info: Be part of a fast-growing scale-up with excellent career evolution opportunities.

The predicted salary is between £70,000 and £90,000 per year.

Location & work modality: EMEA (remote)

Start: ASAP

Type of Contract: Permanent, full-time

About Radian Arc

Radian Arc, now part of InferX, Submer's AI cloud and GPU infrastructure platform, provides an infrastructure-as-a-service (IaaS) platform for running cloud gaming, artificial intelligence and machine learning applications inside telecommunication carrier networks. Our teams across the USA, Australia, Central Europe, Malaysia, Singapore and Japan offer telecom operators a GPU-based edge computing platform without the need for capital expenditure, facilitating low latency and improved economics for value-added services and the monetization of 5G investments.

What impact you will have

Mission: Design and build the observability platform that powers visibility, reliability, and performance insights for large-scale GPU cloud infrastructure as well as smaller edge deployments.

This role is responsible for designing and implementing key parts of the observability architecture across the platform, enabling engineering, operations, and customers to understand system behavior in real time across distributed AI workloads, GPU clusters, networking fabrics, storage systems, and edge inference environments. You will design and operate low-latency, high-scale telemetry pipelines that collect, process, and analyze metrics, logs, and traces from infrastructure running across core datacenter clusters and smaller edge deployments. The platform you build will support internal operations, automated reliability mechanisms, and customer-facing observability experiences.

As a senior engineer, you will lead delivery of major observability initiatives, contribute to the evolution of telemetry standards and SLO implementation, and work with other teams to ensure observability is effectively integrated into the platform architecture from infrastructure to application layers. You will collaborate closely with infrastructure, networking, storage, and platform engineering teams to provide clear visibility into performance bottlenecks, infrastructure degradation, and distributed workload behavior across both hyperscale GPU environments and smaller edge installations.

This role contributes directly to improving platform reliability by analyzing production telemetry, identifying systemic issues, and driving improvements in performance, efficiency, and operational stability across the stack.

What you’ll do

  • Observability Platform Architecture
    • Design and implement scalable telemetry pipelines for metrics, logs, and traces across distributed GPU infrastructure.
    • Architect observability systems capable of ingesting high-cardinality telemetry from thousands of nodes and services.
    • Build and operate telemetry storage systems optimized for large-scale time-series and event data.
    • Contribute to observability standards across services, including metrics, tracing instrumentation, logging, and SLO implementation.
  • Infrastructure and Platform Observability
    • Build visibility across compute, storage, and networking layers of the platform.
    • Instrument GPU clusters, inference workloads, and distributed training environments.
    • Detect infrastructure degradation such as GPU throttling, network congestion, storage latency, and hardware degradation.
    • Implement telemetry pipelines for GPU, CPU, network, and storage performance metrics.
  • Customer-Facing Observability
    • Build dashboards and monitoring tools that expose system health and performance to both internal teams and customers.
    • Provide insights into workload performance, including GPU utilization, storage throughput, network latency, and distributed inference performance.
    • Develop performance analysis tools that help customers understand system bottlenecks.
  • Network and Infrastructure Telemetry
    • Develop and maintain network observability platforms.
    • Build telemetry collectors and exporters using Python or Go.
    • Ingest telemetry from infrastructure components including NVIDIA Cumulus Linux, VyOS routers, and Citrix NetScaler / WAF.
    • Design telemetry ingestion pipelines using protocols such as gNMI, SNMP, and streaming telemetry.
  • Reliability Engineering
    • Design advanced alerting and anomaly detection systems.
    • Contribute to platform SLOs, SLIs, and reliability metrics.
    • Build automated detection of infrastructure anomalies.
    • Integrate observability signals with operational workflows and incident management systems.
    • Participate in on-call rotations supporting platform observability and telemetry infrastructure.
  • Cross-Team Collaboration
    • Partner with platform, networking, storage, and compute teams to instrument services.
    • Work closely with operations teams to improve monitoring and incident response.
    • Provide guidance and mentorship to engineers on observability best practices.
    • Promote good observability practices across teams and help engineers adopt effective instrumentation and monitoring patterns.

What you’ll need

  • Required Experience
    • Proven experience operating large distributed infrastructure platforms.
    • Strong background in observability systems and telemetry pipelines.
    • Experience building metrics, logging, tracing, alerting, and dashboards at production scale.
    • Strong programming skills in Go, Python, or Rust.
    • Experience with large-scale time-series data platforms.
    • Experience with large-scale GPU cloud platforms, HPC environments, or AI infrastructure.
    • Experience monitoring AI workloads such as training or inference clusters.
  • Infrastructure Knowledge
    • Deep understanding of distributed systems observability.
    • Familiarity with cloud-native infrastructure such as Kubernetes, automation, and CI/CD.
    • Experience operating observability systems for high-performance or large-scale environments.
  • Networking and Infrastructure Telemetry
    • Experience monitoring complex networking environments.
    • Familiarity with telemetry protocols such as gNMI, SNMP, and streaming telemetry.
    • Experience integrating network and system telemetry into centralized monitoring platforms.
  • Analytical Skills
    • Strong data analysis capabilities.
    • Ability to interpret complex telemetry signals and translate them into actionable insights.
    • Ability to diagnose performance issues across distributed systems.
  • Technical Stack
    • Observability Framework: Prometheus, OpenTelemetry, Grafana, distributed logging systems, and high-scale telemetry databases such as ClickHouse or similar.
    • Hardware and Infrastructure Telemetry: Redfish / BMC telemetry, IPMI, Linux system metrics, hardware health monitoring, and node lifecycle telemetry.
    • NVIDIA GPU Telemetry: NVIDIA DCGM, DCGM Exporter, NVML, NVIDIA GPU Operator telemetry stack, and NVSwitch / NVLink telemetry.
    • AI Workload Telemetry: distributed training telemetry, inference latency and throughput metrics, NCCL communication health, GPU synchronization latency, KV-cache access latency for inference workloads, and dataset loading and storage I/O performance.
    • Networking Telemetry: NVIDIA NetQ, gNMI streaming telemetry, SNMP, network flow telemetry, and RDMA / RoCE performance monitoring.

What we offer

Attractive compensation package reflecting your expertise and experience. A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach. You'll be part of a fast-growing scale-up with a mission to make a positive impact, offering exciting career growth.

Our job titles may span more than one job level. The actual base pay is dependent on a number of factors, such as transferable skills, work experience, business needs and market demands.

Our Inclusive Responsibility

Radian Arc is committed to creating a diverse and inclusive environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, veteran status, or any other protected category under applicable law.

    Senior Observability & Telemetry Engineer - Radian Arc employer: Submer

    Radian Arc, part of InferX, offers an exceptional work environment for Senior Observability & Telemetry Engineers, characterised by a commitment to innovation and inclusivity. With a flexible remote work modality across the EMEA region, employees benefit from a diverse culture, competitive compensation, and opportunities for professional growth within a fast-paced scale-up focused on cutting-edge AI and cloud technologies.

    Contact Detail:

    Submer Recruiting Team

    StudySmarter Expert Advice 🤫

    We think this is how you could land Senior Observability & Telemetry Engineer - Radian Arc

    ✨Tip Number 1

    Network like a pro! Reach out to folks in the industry, especially those at Radian Arc or similar companies. Use LinkedIn to connect and engage with them; you never know who might have the inside scoop on job openings.

    ✨Tip Number 2

    Prepare for interviews by diving deep into observability and telemetry topics. Brush up on your knowledge of GPU cloud platforms and be ready to discuss how you can contribute to building scalable telemetry pipelines. Show us your passion!

    ✨Tip Number 3

    Don’t just wait for job postings—create opportunities! If you see a gap in the market or a problem that needs solving, pitch your ideas directly to us through our website. We love innovative thinkers!

    ✨Tip Number 4

    Follow up after interviews! A quick thank-you email can go a long way. Mention something specific from your conversation to remind us why you’re the perfect fit for the Senior Observability & Telemetry Engineer role.

    We think you need these skills to ace Senior Observability & Telemetry Engineer - Radian Arc

    Observability Systems
    Telemetry Pipelines
    Metrics Collection
    Logging
    Tracing
    Alerting
    Dashboards
    Programming in Go
    Python
    Rust
    Distributed Systems Observability
    Cloud-Native Infrastructure
    Kubernetes
    Data Analysis
    NVIDIA GPU Telemetry

    Some tips for your application 🫡

    Tailor Your Application: Make sure to customise your CV and cover letter for the Senior Observability & Telemetry Engineer role. Highlight your experience with observability systems and telemetry pipelines, as well as any relevant programming skills in Go or Python. We want to see how your background aligns with our mission!

    Showcase Your Projects: If you've worked on any projects related to large-scale GPU cloud platforms or AI infrastructure, don’t hold back! Share specific examples that demonstrate your ability to design and implement scalable telemetry pipelines. This will help us understand your hands-on experience.

    Be Clear and Concise: When writing your application, keep it clear and to the point. Use bullet points where possible to make your achievements stand out. We appreciate a well-structured application that makes it easy for us to see your qualifications at a glance.

    Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it shows us you’re keen on joining our team at Radian Arc!

    How to prepare for a job interview at Submer

    ✨Know Your Tech Stack

    Make sure you’re well-versed in the technologies mentioned in the job description, like Prometheus, Grafana, and telemetry protocols. Brush up on your programming skills in Go or Python, as these will likely come up during technical discussions.

    ✨Showcase Your Experience

    Prepare to discuss your previous experience with large distributed infrastructure platforms. Be ready to share specific examples of how you've designed and implemented observability systems or telemetry pipelines, highlighting any challenges you overcame.

    ✨Understand the Business Impact

    Radian Arc is focused on improving platform reliability and performance. Think about how your work can directly impact customer experiences and operational efficiency. Be prepared to discuss how you can contribute to their mission of enhancing visibility and reliability.

    ✨Ask Insightful Questions

    Prepare thoughtful questions that show your interest in the role and the company. Inquire about their current observability challenges or how they measure success in their telemetry initiatives. This not only demonstrates your enthusiasm but also helps you gauge if the company is the right fit for you.
