Key Responsibilities
- Lead the design and implementation of SRE frameworks for Azure and Google Cloud Platform (GCP) environments, ensuring high availability and performance.
- Define, monitor, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across global services.
- Collaborate with cross‑functional teams to embed reliability best practices into development, deployment, and incident response processes.
- Drive automation and tooling initiatives, using Infrastructure as Code (IaC), CI/CD pipelines, and AI‑driven observability to reduce toil and accelerate incident resolution.
- Provide technical mentorship and guidance to SRE and DevOps teams, fostering a culture of continuous improvement and operational excellence.
Requirements
- 10+ years of experience in site reliability engineering or cloud operations, with deep expertise in Azure and GCP.
- Proven track record of defining and managing SLOs, leading incident response, and conducting post‑mortem analysis at scale.
- Strong knowledge of cloud infrastructure, networking, security, and automation tools (Terraform, Ansible, Kubernetes).
- Experience using AI/ML‑based monitoring and observability platforms to improve reliability.
- Excellent communication skills and the ability to work in a global, follow‑the‑sun environment.
Contact Details
Gravity Engineering Services Pvt Ltd. Recruitment Team