SRE Engineer in London

London | Full-Time | No working from home possible

Permanent, full-time opportunity based in London.

Responsibilities

Ability to work on multiple tasks in parallel
Problem solver
Excellent communicator
Desire to improve things

Skills

Kubernetes
- Kubernetes and application troubleshooting
- Application deployment with GitOps / ArgoCD
- K8s and application logging (Loki / Fluent Bit)
- Service Mesh (Linkerd preferred)
- Ingress configuration / troubleshooting (AWS LB Controller / Nginx)
- Autoscaling configuration (Karpenter)
- Certificate management (cert-manager)
AWS services
- EKS
- RDS, DMS, RDS Proxy
- AWS Backup
- API Gateway
- RabbitMQ
- AWS Transfer Family (SFTP / SFTP Connector)
- AWS NGFW, TGW, PrivateLink
- AppStream
- Lambda – Python
- IAM
- Kinesis
- DynamoDB
Terragrunt / Terraform
- Troubleshooting defects
GitOps
- Helm / ArgoCD
Observability Tooling
- Grafana, Prometheus, Loki, CloudWatch configuration / dashboard creation
CI/CD
- Git / CodeDeploy / CodePipeline

What you will do

Platform Operations:
- Managing and optimising our infrastructure to ensure high availability and system reliability.
- Delivering 24/7 support via an on-call rotation for after-hours issues.
Infrastructure Automation Expertise:
- Experience with the AWS cloud platform, including designing, deploying, and maintaining scalable infrastructure.

You will be someone with

Strong knowledge of containers and container orchestration, with tools like Docker and Kubernetes.
Familiarity with deploying Infrastructure as Code (IaC) with Terraform and CloudFormation.
Chaos Engineering Proficiency:
- Understanding of how to implement resilience testing strategies.
- Implementing chaos engineering tools such as AWS Fault Injection Service, Gremlin, Chaos Monkey, or LitmusChaos to design and execute fault injection experiments.
- Knowledge of modern chaos engineering trends, such as adaptive resilience testing or AI-driven fault detection.
Monitoring and Observability:
- Experience with monitoring and observability tools (e.g., Prometheus, ADOT, Grafana, Datadog, New Relic, Elastic Stack).
- Strong understanding of instrumenting infrastructure with metrics, logging, and tracing.
Automation and Scripting:
- Proficiency in scripting and automation languages (e.g., Python, Go, Shell, Ruby, or Java).
- Demonstrated ability to automate infrastructure and operational processes.
Incident Management and Root Cause Analysis:
- Participating in incident response processes, including triage, mitigation, and communication.
- Familiarity with incident management tools like PagerDuty or Opsgenie.
- Responding to production incidents, troubleshooting issues across the full stack, and ensuring minimal downtime by driving root cause analysis and applying long-term fixes.
- Conducting blameless post-mortems to identify root causes and derive actionable insights, ensuring continuous improvement.
- Developing playbooks for common incidents, reducing Mean Time to Resolution (MTTR).
Resilience and Scalability Design:
- Understanding of system design principles, scalability, and high-availability architectures.
- Practical experience with load testing and performance benchmarking tools (e.g., JMeter, Locust, k6).
- Designing and testing disaster recovery (DR) strategies to ensure minimal downtime and data integrity during failures.

Benefits

2 days flexible/hybrid working arrangement
Instant savings and discounts at major retailers across the country
Private Health Insurance including Dental and Optical Cover
Non-contributory Pension Scheme
Salary Sacrifice Schemes – Car, Cycle to Work and Additional Pension Contributions
Additional GBST & U day off every year
Employee Assistance Program (EAP)
LinkedIn Learning

Contact Details:

GBST Holdings Ltd Recruitment Team
