Job Description
Job Title: Site Reliability Engineer – Manager
Location: Hybrid Remote – London EC2M
Contract: 12 months
Rate: Outside IR35 – £300 to £330 per day
About the Role:
We are partnering with one of the top companies in the mobile industry to hire a Site Reliability Engineer (SRE) Manager. In this role, you will collaborate with cross-functional teams to drive the design, development, and delivery of high-performing, scalable, and reliable infrastructure and services. You’ll be responsible for building robust systems, automating operations, and enhancing observability and deployment pipelines for modern cloud-native applications.
Key Responsibilities:
- System Reliability & Performance:
  - Maintain and scale critical services and infrastructure. Identify performance bottlenecks and work closely with product engineers to optimise applications.
- Kubernetes Operations:
  - Administer, scale, and troubleshoot clusters in GKE, EKS, or other Kubernetes environments.
- Infrastructure as Code (IaC):
  - Design and maintain scalable infrastructure using Terraform and automate deployments across public, private, or hybrid clouds (mainly AWS).
- CI/CD Pipeline Enhancement:
  - Build and improve robust CI/CD pipelines to support fast and safe deployment cycles.
- Observability & Monitoring:
  - Implement code-based instrumentation and telemetry. Ensure systems are observable with tools for logging, metrics, and alerting.
- Automation & Scripting:
  - Write tooling and automation scripts in Python, Go, or Rust to reduce toil and manual intervention.
- Storage & Networking:
  - Manage and optimise storage services such as Amazon S3 or Google Cloud Storage (GCS). Resolve complex networking issues in multi-cloud environments.
Essential Requirements:
- 5+ years of hands-on experience as a Site Reliability Engineer.
- Proven expertise in Kubernetes (GKE/EKS).
- Strong proficiency in Python, Go, or Rust.
- Solid experience with AWS and Infrastructure as Code using Terraform.
- Deep understanding of Linux internals, standard networking protocols, and distributed systems architecture.
- Hands-on experience with automation and performance optimisation.
- Strong knowledge of SRE principles and methodologies.
- Experience with observability tools and telemetry systems.
- Exposure to Google Cloud Platform (GCP).
- Familiarity with hybrid or multi-cloud architecture.
- Experience with service meshes or edge proxies (e.g., Envoy, Istio).
- Working knowledge of container security best practices.
Contact Details:
TECEZE Recruiting Team