We are looking for a Cloud Site Reliability Engineer to join a global, diverse team working with cross-functional stakeholders. This is a permanent full-time opportunity based in London.
Candidate Profile
- Ability to work on multiple tasks in parallel
- Problem solver
- Excellent communicator
- Desire to improve things
Technical Skills
Kubernetes
- Kubernetes and application troubleshooting
- Application deployment (GitOps / ArgoCD)
- K8s and application logging (Loki / Fluent Bit)
- Service Mesh (Linkerd preferred)
- Ingress configuration / troubleshooting (AWS Load Balancer Controller / NGINX)
- Autoscaling configuration (Karpenter)
- Certificate management (cert-manager)
AWS Services
- EKS
- RDS, DMS, RDS Proxy
- AWS Backup
- API Gateway
- RabbitMQ
- AWS Transfer Family (SFTP / SFTP Connector)
- AWS NGFW, TGW, PrivateLink
- AppStream
- Lambda (Python)
- IAM
- Kinesis
- DynamoDB
Infrastructure Automation
- Troubleshooting defects (Terragrunt / Terraform)
- Helm / ArgoCD
Observability Tooling
- Grafana, Prometheus, Loki, CloudWatch configuration/dashboard creation
CI/CD
- Git / CodeDeploy / CodePipeline
Platform Operations
- Managing and optimising our infrastructure to ensure high availability and system reliability
- Delivering 24/7 support via an on-call rotation for out-of-hours issues
Infrastructure Automation Expertise
- Experience with the AWS cloud platform including designing, deploying, and maintaining scalable infrastructure.
Additional Qualifications
- Strong knowledge of container orchestration tools like Kubernetes and Docker.
- Familiarity with deploying Infrastructure as Code (IaC) using Terraform and CloudFormation.
- Chaos Engineering proficiency; knowledge of resilience testing strategies, AWS Fault Injection, Gremlin, Chaos Monkey, LitmusChaos.
- Monitoring and Observability with Prometheus, ADOT, Grafana, Datadog, New Relic, Elastic Stack.
- Automation and Scripting: proficiency in Python, Go, Shell, Ruby, Java.
- Incident Management and Root Cause Analysis: participate in incident response, triage, mitigation; tools like PagerDuty or Opsgenie.
- Resilience and Scalability Design: system design principles, high-availability architectures, load testing (JMeter, Locust, k6), disaster recovery strategies.
Benefits
- 2 days flexible/hybrid working arrangement
- Instant savings and discounts on major retailers across the country
- Private Health Insurance including Dental and Optical Cover
- Non-contributory Pension Scheme
- Salary Sacrifice Schemes: Car, Cycle to Work and Additional Pension Contributions
- Additional GBST & U day off every year
- Employee Assistance Program (EAP)
- LinkedIn Learning