Permanent, full-time opportunity based in London.
Responsibilities
- Ability to work on multiple tasks in parallel
- Problem solver
- Excellent communicator
- Desire to improve things
Skills
- Kubernetes
  - Kubernetes and application troubleshooting
  - Application deployment (GitOps / ArgoCD)
  - K8s and application logging (Loki / Fluent Bit)
  - Service Mesh (Linkerd preferred)
  - Ingress Config / Troubleshooting (AWS LB Controller / Nginx)
  - Autoscaling configuration (Karpenter)
  - Certificate management (cert-manager)
- AWS services
  - EKS
  - RDS, DMS, RDS Proxy
  - AWS Backup
  - API Gateway
  - RabbitMQ
  - AWS Transfer Family (SFTP / SFTP Connector)
  - AWS NGFW, TGW, PrivateLink
  - AppStream
  - Lambda (Python)
  - IAM
  - Kinesis
  - DynamoDB
- Terragrunt / Terraform
  - Troubleshooting defects
- GitOps
  - Helm / ArgoCD
- Observability Tooling
  - Grafana, Prometheus, Loki, CloudWatch configuration/dashboard creation
- CI/CD
  - Git / CodeDeploy / CodePipeline
What U will do
- Platform Operations:
- Managing and optimising our infrastructure to ensure high availability and system reliability.
- Delivering 24/7 support via an on-call rotation for after-hours issues.
- Infrastructure Automation Expertise:
- Experience with the AWS cloud platform including designing, deploying, and maintaining scalable infrastructure.
U will be someone with
- Strong knowledge of container orchestration tools like Kubernetes and Docker.
- Familiarity with deploying Infrastructure as Code (IaC) using Terraform and CloudFormation.
- Chaos Engineering Proficiency:
- Understanding of how to implement resilience testing strategies.
- Using chaos engineering tools such as AWS Fault Injection Service, Gremlin, Chaos Monkey, or LitmusChaos to design and execute fault injection experiments.
- Knowledge of modern chaos engineering trends, such as adaptive resilience testing or AI-driven fault detection.
- Monitoring and Observability:
- Experience with monitoring and observability tools (e.g., Prometheus, ADOT, Grafana, Datadog, New Relic, Elastic Stack).
- Strong understanding of instrumenting infrastructure with metrics, logging, and tracing.
- Automation and Scripting:
- Proficiency in scripting and automation languages (e.g., Python, Go, Shell, Ruby, or Java).
- Demonstrated ability to automate infrastructure and operational processes.
- Incident Management and Root Cause Analysis:
- Participating in incident response processes, including triage, mitigation, and communication.
- Familiarity with incident management tools like PagerDuty or Opsgenie.
- Responding to production incidents, troubleshooting issues across the full stack, and ensuring minimal downtime by driving root cause analysis and applying long-term fixes.
- Conducting blameless post-mortems to identify root causes and derive actionable insights, ensuring continuous improvement.
- Developing playbooks for common incidents, reducing Mean Time to Resolution (MTTR).
- Resilience and Scalability Design:
- Understanding of system design principles, scalability, and high-availability architectures.
- Practical experience with load testing and performance benchmarking tools (e.g., JMeter, Locust, k6).
- Designing and testing disaster recovery (DR) strategies to ensure minimal downtime and data integrity during failures.
Benefits
- 2 days flexible/hybrid working arrangement
- Instant savings and discounts on major retailers across the country
- Private Health Insurance including Dental and Optical Cover
- Non-contributory Pension Scheme
- Salary Sacrifice Schemes: Car, Cycle to Work and Additional Pension Contributions
- Additional GBST & U day off every year
- Employee Assistance Program (EAP)
- LinkedIn Learning