Site Reliability Engineer in Penarth

Penarth | Full-Time | £36,000 - £60,000 / year (est.) | No home office possible

At a Glance

  • Tasks: Own and evolve enterprise observability and reliability platforms for cloud-native applications.
  • Company: Join a forward-thinking tech company focused on reliability engineering.
  • Benefits: Competitive salary, flexible hours, and opportunities for professional growth.
  • Why this job: Make a real impact by enhancing the reliability of cutting-edge applications.
  • Qualifications: Experience with Kubernetes, Prometheus, and a passion for reliability engineering.
  • Other info: Dynamic team environment with a strong focus on innovation and collaboration.

The predicted salary is between £36,000 and £60,000 per year.

We are looking for a highly skilled Site Reliability Engineer (SRE) to own and evolve our enterprise observability and reliability platforms. This role is responsible for ensuring availability, performance, scalability, and reliability of large-scale, cloud-native applications running on Kubernetes and OpenShift. The SRE will partner closely with application and platform teams to embed reliability engineering, SLO-driven operations, and automation-first practices.

Key Responsibilities

  • Reliability Engineering & SRE Practices: Define, implement, and continuously improve SLIs, SLOs, and error budgets for enterprise applications. Drive reliability-focused decision making using error budgets, MTTD, MTTR, and service health metrics. Proactively identify reliability risks and performance bottlenecks and drive remediation. Lead incident response, post-incident reviews (blameless postmortems), and reliability improvements.
  • Observability Platform Ownership: Own and operate open-source–based observability platforms covering metrics, logging, and distributed tracing. Enhance, optimize, and migrate observability solutions to improve scalability, resilience, and cost efficiency. Maintain and tune Prometheus and other TSDBs, including cardinality management and resource optimization. Operate distributed tracing platforms such as OpenTelemetry, Jaeger, and Zipkin, including tuning sampling strategies and troubleshooting microservices traces.
  • Kubernetes & OpenShift Reliability: Support and enable application teams to migrate workloads to newer OpenShift/Kubernetes versions. Deploy, manage, and troubleshoot stateful and stateless workloads on Kubernetes platforms. Improve platform reliability through automation, self-healing, and standardized deployment patterns. Partner with developers to implement application instrumentation and reliability best practices.
  • Logging, Alerting & Incident Response: Operate enterprise logging platforms such as ELK Stack and Grafana Loki, including Elasticsearch cluster management and index lifecycle management. Design and maintain actionable alerting aligned to SLOs and business impact. Integrate alerting platforms with PagerDuty, Microsoft Teams, and other incident management tools. Reduce alert fatigue by implementing alert hygiene and signal-to-noise optimization.
  • Dashboards & Service Visibility: Deploy and administer visualization tools such as Grafana and Kibana. Create standardized, reusable dashboards for service health, reliability, and capacity planning. Implement and manage RBAC across observability platforms.
  • Infrastructure, Security & Automation: Troubleshoot observability infrastructure issues across Linux VMs and Kubernetes pods. Secure observability and platform endpoints using TLS, reverse proxies, and authentication mechanisms (MFA, LDAPS, OAuth). Build and maintain CI/CD pipelines for observability and reliability tooling. Extend pipelines to support multiple environments and regions with consistency and repeatability.
  • Reliability Culture & Enablement: Champion an SRE and observability-first culture across engineering teams. Coach teams on golden signals, service health modeling, and reliability trade-offs. Enable teams to move from reactive monitoring to proactive reliability engineering.

Required Skills & Experience

  • Core Technical Skills: Strong hands-on experience with: Prometheus, Grafana; Elasticsearch, Kibana (cluster operations, ILM, tuning); OpenTelemetry, Jaeger, Zipkin; Kubernetes & OpenShift; Linux OS troubleshooting; CI/CD pipelines and automation. Solid understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management. Experience supporting production, highly available, distributed systems.

Working Hours: Monday to Friday, 9:00 AM – 6:00 PM. Occasional weekend support may be required for critical deployments or incidents; compensatory time off will be provided.

Site Reliability Engineer in Penarth employer: ELLIOTT MOSS CONSULTING PTE. LTD.

Join a forward-thinking company that prioritises innovation and reliability in the tech landscape. As a Site Reliability Engineer, you will thrive in a collaborative environment that fosters continuous learning and professional growth, with access to cutting-edge tools and technologies. Enjoy a supportive work culture that values work-life balance and offers flexible hours, ensuring you can contribute effectively while maintaining personal well-being.

Contact Details:

ELLIOTT MOSS CONSULTING PTE. LTD. Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Site Reliability Engineer role in Penarth

✨Tip Number 1

Network like a pro! Attend meetups, webinars, or tech conferences related to Site Reliability Engineering. It's a great way to meet industry folks and get your name out there. Plus, you never know who might be hiring!

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those involving Kubernetes, Prometheus, or any of the tools mentioned in the job description. This gives potential employers a taste of what you can do.

✨Tip Number 3

Prepare for interviews by brushing up on SRE principles and incident management. Practice explaining your past experiences with SLIs, SLOs, and error budgets. We want to see how you think and solve problems under pressure!

✨Tip Number 4

Don't forget to apply through our website! It’s the best way to ensure your application gets seen. Plus, we love seeing candidates who are proactive about their job search!

We think you need these skills to ace the Site Reliability Engineer role in Penarth

Site Reliability Engineering (SRE)
Kubernetes
OpenShift
Prometheus
Grafana
Elasticsearch
Kibana
OpenTelemetry
Jaeger
Zipkin
CI/CD Pipelines
Incident Management
SLIs
SLOs
Error Budgets

Some tips for your application 🫡

Tailor Your CV: Make sure your CV is tailored to the Site Reliability Engineer role. Highlight your experience with Kubernetes, OpenShift, and observability tools like Prometheus and Grafana. We want to see how your skills align with our needs!

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Share your passion for reliability engineering and how you’ve implemented SLOs and error budgets in past roles. Let us know why you’re excited about joining the company!

Showcase Your Problem-Solving Skills: In your application, don’t forget to mention specific examples of how you've tackled reliability issues or improved system performance. We love seeing candidates who can think critically and act decisively!

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it’s super easy!

How to prepare for a job interview at ELLIOTT MOSS CONSULTING PTE. LTD.

✨Know Your SRE Principles

Make sure you brush up on your understanding of SLIs, SLOs, and error budgets. Be ready to discuss how you've applied these concepts in past roles, as this will show your familiarity with the core principles of Site Reliability Engineering.

✨Demonstrate Your Technical Skills

Prepare to showcase your hands-on experience with tools like Prometheus, Grafana, and Kubernetes. You might be asked to solve a problem or explain how you've used these technologies to improve reliability in previous projects, so have some examples ready.

✨Incident Management Experience

Be prepared to talk about your experience with incident response and post-incident reviews. Highlight any blameless postmortems you've led and how you've driven improvements based on those incidents. This shows your proactive approach to reliability.

✨Cultural Fit and Collaboration

Since the role involves partnering closely with application and platform teams, be ready to discuss how you foster collaboration and a reliability-first culture. Share examples of how you've coached teams on best practices and improved service health together.
