Site Reliability Engineer (Observability)
London - Hybrid / 3 Days
Contract Inside IR35 - 6 Months Initially
We’re looking for a Site Reliability Engineer (SRE) to join our client to build and maintain observability systems and to ensure their core services remain reliable, scalable, and high-performing.
Responsibilities:
- Deploy and manage observability tools using a Prometheus-like metrics store and Grafana Enterprise.
- Automate monitoring, alerting, and incident response.
- Build Grafana dashboards for system insights.
- Apply Infrastructure as Code (IaC) principles.
- Develop tooling in Golang (preferred) or Python.
- Advocate for SRE principles such as SLOs, SLIs, and error budgets.
- Integrate monitoring with incident management workflows.
Requirements:
- Expertise in SRE principles and reliability engineering.
- Solid familiarity with Linux.
- Strong experience building and deploying containers with Podman or Docker.
- Golang (preferred) or Python for automation and API integration.
- Experience with Grafana, VictoriaMetrics, and PromQL.
- Experience deploying and managing centralized logging solutions.
- Strong Infrastructure as Code (IaC) knowledge.
Nice to Have:
- OpenTelemetry experience.
- Terraform, Ansible, or CI/CD knowledge.
- Background in datacentre and compute hardware services.
- AWS infrastructure configuration and deployment.
- Familiarity with Kubernetes and cloud-native systems.
- Incident response automation expertise.
Contact Details:
Levy Global Recruiting Team