Tbwa Chiat/Day Inc

Senior Site Reliability Engineer, Observability

London | Full-Time | £43,200 - £72,000 / year (est.) | No home office possible

At a Glance

Tasks: Join our team to enhance monitoring and observability for our digital experience platform.
Company: Cisco ThousandEyes delivers flawless digital experiences across networks using AI and telemetry data.
Benefits: Enjoy a diverse workplace, opportunities for growth, and the chance to work with cutting-edge technology.
Why this job: Be part of a dynamic team that values innovation and collaboration in a fast-paced environment.
Qualifications: Strong skills in Infrastructure as Code, logging tools, and coding in Python or Go are essential.
Other info: We encourage applicants from all backgrounds, even if you don't meet every qualification.

The predicted salary is between £43,200 and £72,000 per year.

Senior Site Reliability Engineer, Observability

Who We Are

Cisco ThousandEyes is a Digital Experience Assurance platform that empowers organizations to deliver flawless digital experiences across every network – even the ones they don’t own. Powered by AI and an unmatched set of cloud, internet and enterprise network telemetry data, ThousandEyes enables IT teams to proactively detect, diagnose, and remediate issues – before they impact end-user experiences.

ThousandEyes is deeply integrated across the entire Cisco technology portfolio and beyond, helping customers deploy at scale while also delivering AI-powered assurance insights within Cisco’s leading Networking, Security, Collaboration, and Observability portfolios.

About The Role

The Site Reliability Engineering team focused on Observability is responsible for providing the tools, services, and infrastructure to monitor and observe the ThousandEyes platform. Leveraging cloud native tools like Prometheus, Grafana, Kibana, and even ThousandEyes itself, we enable our developers to instrument, analyze, and monitor their applications. The Senior Site Reliability Engineer in this role will work together with the team to own our logging pipeline and monitoring stack while working with developers to continuously improve our view of the platform.

What You’ll Do

As we expand our platform to a multi-region scale, it is essential to design and implement strategies that enhance visibility. This involves designing, deploying, and maintaining cloud-native monitoring services that are both elastic and resilient to failure across AWS and GCP. It is also crucial to establish standards and best practices for the instrumentation of container-based services and cloud-managed services. The maintenance of our alerting pipeline is key to ensuring that notifications are timely, accurate, and directed to the appropriate channels. Automation is a priority, as it allows our monitoring platforms to scale effortlessly, promoting a self-service approach. Additionally, active participation and contribution to the improvement of our 24×7 incident response and on-call rotation are vital to the robustness of our operational response.

Qualifications

Strong Infrastructure as Code skills, ideally with Terraform and Kubernetes.
Strong knowledge of modern logging tool sets, including Logstash or Fluentd.
Understanding of Prometheus and its ecosystem, including Alertmanager.
Good knowledge of Application Performance Monitoring tools and crash reporting tools, such as Sentry.
Good knowledge of cloud provider managed services, and how they can be leveraged in our context.
Ability to write high quality code in Python, Go, or equivalent languages.

Cisco values the perspectives and skills that emerge from employees with diverse backgrounds. That’s why Cisco is expanding the boundaries of discovering top talent by not only focusing on candidates with educational degrees and experience but also placing more emphasis on unlocking potential. We believe that everyone has something to offer and that diverse teams are better equipped to solve problems, innovate, and create a positive impact.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification. Research shows that people from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy. We urge you not to prematurely exclude yourself and to apply if you’re interested in this work.

Cisco is an Affirmative Action and Equal Opportunity Employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, sexual orientation, national origin, genetic information, age, disability, veteran status, or any other legally protected basis. Cisco will consider for employment, on a case by case basis, qualified applicants with arrest and conviction records.

Senior Site Reliability Engineer, Observability employer: Tbwa Chiat/Day Inc

At Cisco ThousandEyes, we pride ourselves on fostering a dynamic work culture that champions innovation and collaboration. As a Senior Site Reliability Engineer in London, you'll benefit from our commitment to employee growth through continuous learning opportunities and access to cutting-edge technologies. Join us to be part of a diverse team that values unique perspectives and empowers you to make a meaningful impact in delivering exceptional digital experiences.

Contact Details:

Tbwa Chiat/Day Inc Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land Senior Site Reliability Engineer, Observability

✨Tip Number 1

Familiarize yourself with the specific tools mentioned in the job description, such as Prometheus, Grafana, and Kubernetes. Having hands-on experience or projects showcasing your skills with these technologies can set you apart during the interview process.

✨Tip Number 2

Engage with the Site Reliability Engineering community online. Participate in forums, attend webinars, or join relevant groups on platforms like LinkedIn. This not only helps you stay updated on industry trends but also allows you to network with professionals who might provide insights or referrals.

✨Tip Number 3

Prepare to discuss your experience with Infrastructure as Code, particularly with Terraform. Be ready to share specific examples of how you've implemented IaC in past projects, as this is a crucial skill for the role.

✨Tip Number 4

Showcase your problem-solving skills by preparing for scenario-based questions. Think about challenges you've faced in previous roles related to monitoring and observability, and how you approached them. This will demonstrate your critical thinking and adaptability.

We think you need these skills to ace Senior Site Reliability Engineer, Observability

Infrastructure as Code

Terraform

Kubernetes

Logstash

Fluentd

Prometheus

Alertmanager

Application Performance Monitoring

Sentry

Cloud Provider Managed Services

Python

Monitoring Services Design

Automation

Incident Response

Some tips for your application 🫡

Understand the Role: Make sure to thoroughly read the job description for the Senior Site Reliability Engineer position. Understand the key responsibilities and qualifications required, especially around cloud-native tools and Infrastructure as Code.

Highlight Relevant Experience: In your CV and cover letter, emphasize your experience with tools like Terraform, Kubernetes, Prometheus, and any logging tools you have used. Provide specific examples of how you've implemented monitoring solutions or improved observability in past roles.

Showcase Your Coding Skills: Since coding is a crucial part of this role, include examples of your work in Python, Go, or similar languages. If possible, link to any relevant projects or repositories that demonstrate your coding abilities.

Express Your Interest in Diversity: Cisco values diverse backgrounds and perspectives. In your application, mention how your unique experiences can contribute to the team and the importance of diversity in problem-solving and innovation.

How to prepare for a job interview at Tbwa Chiat/Day Inc

✨Showcase Your Technical Skills

Be prepared to discuss your experience with Infrastructure as Code, particularly with Terraform and Kubernetes. Highlight specific projects where you've implemented these technologies and how they contributed to the success of the project.

✨Demonstrate Your Understanding of Observability Tools

Familiarize yourself with the tools mentioned in the job description, such as Prometheus, Grafana, and Kibana. Be ready to explain how you've used these tools in past roles to enhance monitoring and observability.

✨Discuss Automation Strategies

Since automation is a priority for this role, come prepared to share examples of how you've automated processes in previous positions. Discuss the impact of these automations on efficiency and incident response times.

✨Emphasize Collaboration and Communication

The role involves working closely with developers and participating in incident response. Be ready to talk about your experience in cross-functional teams and how you ensure effective communication during incidents.

Application deadline: 2027-02-28
Tbwa Chiat/Day Inc
