Cloud Site Reliability Engineer

GBST

Full-time. No working from home possible.

We are looking for a Cloud Site Reliability Engineer to join a global, diverse team working with cross-functional stakeholders. This is a permanent full-time opportunity based in London.

Candidate Profile

Ability to work on multiple tasks in parallel
Problem solver
Excellent communicator
Desire to improve things

Technical Skills

Kubernetes

Kubernetes and application troubleshooting
Application deployment (GitOps / ArgoCD)
Kubernetes and application logging (Loki / Fluent Bit)
Service Mesh (Linkerd preferred)
Ingress configuration and troubleshooting (AWS Load Balancer Controller / NGINX)
Autoscaling configuration (Karpenter)
Certificate management (cert-manager)

AWS Services

EKS
RDS, DMS, RDS Proxy
AWS Backup
API Gateway
RabbitMQ
AWS Transfer Family (SFTP / SFTP Connector)
AWS NGFW, TGW, PrivateLink
AppStream
Lambda – Python
IAM
Kinesis
DynamoDB

Infrastructure Automation

Troubleshooting defects (Terragrunt / Terraform)
Helm / ArgoCD

Observability Tooling

Grafana, Prometheus, Loki, and CloudWatch configuration and dashboard creation

CI/CD

Git / CodeDeploy / CodePipeline

Platform Operations

Managing and optimising our infrastructure to ensure high availability and system reliability
Delivering 24/7 support through an on-call rotation for out-of-hours issues

Infrastructure Automation Expertise

Experience with the AWS cloud platform including designing, deploying, and maintaining scalable infrastructure.

Additional Qualifications

Strong knowledge of container orchestration tools like Kubernetes and Docker.
Familiarity with deploying Infrastructure as Code (IaC) with Terraform and CloudFormation.
Chaos engineering proficiency: knowledge of resilience testing strategies and tools such as AWS Fault Injection Service, Gremlin, Chaos Monkey, and LitmusChaos.
Monitoring and Observability with Prometheus, ADOT, Grafana, Datadog, New Relic, Elastic Stack.
Automation and Scripting: proficiency in Python, Go, Shell, Ruby, Java.
Incident Management and Root Cause Analysis: participation in incident response, triage, and mitigation, using tools such as PagerDuty or Opsgenie.
Resilience and Scalability Design: system design principles, high-availability architectures, load testing (JMeter, Locust, k6), disaster recovery strategies.

Benefits

2 days flexible/hybrid working arrangement
Instant savings and discounts on major retailers across the country
Private Health Insurance including Dental and Optical Cover
Non-contributory Pension Scheme
Salary Sacrifice Schemes – Car, Cycle to Work and Additional Pension Contributions
Additional GBST & U day off every year
Employee Assistance Program (EAP)
LinkedIn Learning

Contact Details:

GBST Recruitment Team
