The Role
- Automate for insight and scale: Build systems that make troubleshooting fast, safe, and scalable across thousands of Neo4j instances. From internal tools that surface clear insights to canaries that support safe rollouts, you’ll focus on automation that elevates reliability engineering.
- Treat operations as a software problem: Replace tribal knowledge and ad-hoc scripts with tools and systems that codify best practices—making operations predictable, scalable, and repeatable.
- Design for resilience, learn from failure: Own and evolve the tooling and processes behind incident response. From clear alerts to blameless reviews, you’ll help ensure teams respond with confidence and learn with clarity.
- Champion reliability as a product feature: Help teams define and act on SLIs and SLOs, turning reliability into a shared, data-driven priority across engineering.
- Create signals, not noise: Shape an observability stack that tells us what matters, when it matters—so we can detect issues early and resolve them quickly.
Qualifications
- Writing backend tools and automation in Go—the primary language—with an emphasis on sound architecture, testing, and maintainability. Strong software skills in other languages, like Python, are also welcome.
- Applying SRE practices in real-world environments: defining SLIs and SLOs, reducing toil through automation, and driving reliability through engineering.
- Collaborating with other teams to promote SRE thinking—educating on principles like observability, ownership, and service level objectives.
- Troubleshooting large-scale, cloud-based systems with confidence and curiosity.
- Monitoring distributed systems and understanding their performance characteristics.
- Designing systems with reliability, safety, and debuggability as first-class concerns.
- Working with observability tools like the OpenTelemetry (OTel) Collector, Prometheus, Grafana, and Google Cloud’s operations suite.
- Deploying and managing applications on Kubernetes; cluster-level administration is a plus.
- Managing infrastructure with Kustomize and Terraform—keeping it clear, modular, and easy to evolve.
- Building and maintaining CI/CD workflows—ours run on GitHub Actions.
- Participating in on-call rotations and incident response with a focus on improvement, not blame.
- Writing and contributing to postmortems that lead to meaningful, lasting changes.