Cloud Site Reliability Engineer

Full-Time, no working from home possible

We are looking for a Cloud Site Reliability Engineer to join a global, diverse team working with cross-functional stakeholders. This is a permanent full-time opportunity based in London.

Candidate Profile

  • Ability to work on multiple tasks in parallel
  • Problem solver
  • Excellent communicator
  • Desire to improve things

Technical Skills

Kubernetes

  • Kubernetes and application troubleshooting
  • Application deployment (GitOps / ArgoCD)
  • K8s and application logging (Loki / Fluent Bit)
  • Service Mesh (Linkerd preferred)
  • Ingress Config / Troubleshooting (AWS LB Controller / Nginx)
  • Autoscaling configuration (Karpenter)
  • Certificate management (cert-manager)

AWS Services

  • EKS
  • RDS, DMS, RDS Proxy
  • AWS Backup
  • API Gateway
  • RabbitMQ
  • AWS Transfer Family (SFTP / SFTP Connector)
  • AWS NGFW, TGW, PrivateLink
  • AppStream
  • Lambda – Python
  • IAM
  • Kinesis
  • DynamoDB

Infrastructure Automation

  • Troubleshooting defects (Terragrunt / Terraform)
  • Helm / ArgoCD

Observability Tooling

  • Grafana, Prometheus, Loki, CloudWatch configuration and dashboard creation

CI/CD

  • Git / CodeDeploy / CodePipeline

Platform Operations

  • Managing and optimising our infrastructure to ensure high availability and system reliability
  • Delivering 24/7 support via an on-call rotation for out-of-hours issues

Infrastructure Automation Expertise

  • Experience with the AWS cloud platform including designing, deploying, and maintaining scalable infrastructure.

Additional Qualifications

  • Strong knowledge of containerisation and orchestration tools such as Docker and Kubernetes.
  • Familiarity with deploying Infrastructure as Code (IaC) with Terraform and CloudFormation.
  • Chaos Engineering proficiency; knowledge of resilience testing strategies, AWS Fault Injection, Gremlin, Chaos Monkey, LitmusChaos.
  • Monitoring and Observability with Prometheus, ADOT, Grafana, Datadog, New Relic, Elastic Stack.
  • Automation and Scripting: proficiency in Python, Go, Shell, Ruby, Java.
  • Incident Management and Root Cause Analysis: participate in incident response, triage, mitigation; tools like PagerDuty or Opsgenie.
  • Resilience and Scalability Design: system design principles, high-availability architectures, load testing (JMeter, Locust, k6), disaster recovery strategies.

Benefits

  • 2 days flexible/hybrid working arrangement
  • Instant savings and discounts on major retailers across the country
  • Private Health Insurance including Dental and Optical Cover
  • Non-contributory Pension Scheme
  • Salary Sacrifice Schemes – Car, Cycle to Work and Additional Pension Contributions
  • Additional GBST & U day off every year
  • Employee Assistance Program (EAP)
  • LinkedIn Learning

Contact Details:

GBST Recruitment Team