Description
We have a Lead Site Reliability Engineer (SRE) opportunity within our Google Cloud Site Reliability Engineering team.
As a Lead Site Reliability Engineer at JPMorgan Chase within the Infrastructure Platform - Cloud Foundational Services SRE organization, you will join our Google Cloud Site Reliability Engineering team operating within a global follow-the-sun support model.
Job Responsibilities:
- Lead and implement SRE frameworks to support global Google Cloud environments and meet the highest service level objectives (SLOs) through operational excellence
- Apply mastery of application, data, infrastructure, and agentic AI disciplines
- Apply a keen understanding of financial control and budget management, working in partnership with colleagues throughout the firm and leading collaborative teams to achieve common goals
- Use enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis, validating outputs and handling operational data according to sensitivity and security requirements.
- Help develop and improve the quality of technical engineering documentation
- Provide technical supervision, oversight and problem resolution for engineering activities
- Champion a DevOps model so that services are automated and elastic across all platforms
Required qualifications, capabilities, and skills:
- Google Cloud and Azure expertise in a mission-critical production environment
- Strong understanding of container technologies such as Docker, Kubernetes, GKE, and Helm
- Programming experience in one of the following languages: Python, shell scripting, or Go, along with a good understanding of REST APIs
- Hands-on experience with cloud-based technologies and tools, especially in deployment, monitoring, and operations, such as Google Cloud Observability, Azure Monitor, Datadog, Prometheus, Splunk, Elasticsearch, and Grafana.
- Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows (e.g., incident investigation support and knowledge capture) with strong validation habits and awareness of data sensitivity.
- Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align to resiliency and security expectations.
- Strong understanding of Google Cloud governance, compliance, and cost management
- Strong working knowledge of modern development technologies and tools such as Agile, CI/CD, Git, Infrastructure as Code, Terraform, and Jenkins.
- Google Cloud certification or equivalent technical experience in the public cloud.
- Good understanding of Agentic AI SDKs and GitHub Copilot Skills.
Preferred qualifications, capabilities, and skills:
- Good understanding of operating systems such as Windows and Linux (Red Hat / Ubuntu)
- Good understanding of LLMs and other AI/ML frameworks that can be used in AIOps