Tbwa Chiat/Day Inc

Site Reliability Engineer, Observability

London | Full-Time | £43,200 - £72,000 / year (est.) | No home office possible

At a Glance

Tasks: Join our team to enhance observability for the ThousandEyes platform using cutting-edge cloud-native tools.
Company: Cisco ThousandEyes empowers organizations to deliver flawless digital experiences across every network.
Benefits: Enjoy a diverse workplace, opportunities for growth, and a focus on innovation and collaboration.
Why this job: Be part of a dynamic team that values operational excellence and embraces automation in monitoring.
Qualifications: Strong coding skills in Python or Go; familiarity with observability concepts and AWS services.
Other info: We encourage applicants from all backgrounds, even if you don't meet every qualification.

The predicted salary is between £43,200 and £72,000 per year.

Site Reliability Engineer, Observability

Who We Are

Cisco ThousandEyes is a Digital Assurance platform that empowers organizations to deliver flawless digital experiences across every network – even the ones they don’t own. Powered by AI and an unmatched set of cloud, internet and enterprise network telemetry data, ThousandEyes enables IT teams to proactively detect, diagnose, and remediate issues – before they impact end-user experiences.

ThousandEyes is deeply integrated across the entire Cisco technology portfolio and beyond, helping customers deploy at scale while also delivering AI-powered assurance insights within Cisco’s leading Networking, Security, Collaboration, and Observability portfolios.

About The Role

The Site Reliability Engineering team focused on Observability is responsible for providing the tools, services, and infrastructure to monitor and observe the ThousandEyes platform. Leveraging cloud native tools like Prometheus, Grafana, Kibana, and even ThousandEyes itself, we enable our developers to instrument, analyze, and monitor their applications. The Site Reliability Engineer in this role will work together with the team to own our observability stack while working with developers to continuously improve our view of the platform.

What You’ll Do

As we expand our platform to a multi-region scale, it is essential to design and implement strategies that enhance visibility. This involves designing, deploying, and maintaining cloud-native monitoring services that are both elastic and resilient to failure. It is also crucial to establish standards and best practices for the instrumentation of container-based services and cloud-managed services. The maintenance of our alerting pipeline is key to ensuring that notifications are timely, accurate, and directed to the appropriate channels. Automation is a priority, as it allows our monitoring platforms to scale effortlessly, promoting a self-service approach. Additionally, active participation and contribution to the improvement of our 24×7 incident response and on-call rotation are vital to the robustness of our operational response.

Qualifications

Ability to write high quality code, preferably in Python or Go.
Passion for SRE / DevOps roles and Operational Excellence.
Familiarity with the most common Observability concepts: metrics, logs and traces.
Understanding of monitoring and alerting systems. Experience with the Grafana Observability Stack is a plus.
Good understanding of AWS services.
Infrastructure as Code skills, ideally with Terraform.

Cisco values the perspectives and skills that emerge from employees with diverse backgrounds. That’s why Cisco is expanding the boundaries of discovering top talent by not only focusing on candidates with educational degrees and experience but also placing more emphasis on unlocking potential. We believe that everyone has something to offer and that diverse teams are better equipped to solve problems, innovate, and create a positive impact. We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification. Research shows that people from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy. We urge you not to prematurely exclude yourself and to apply if you’re interested in this work.

Cisco is an Affirmative Action and Equal Opportunity Employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, sexual orientation, national origin, genetic information, age, disability, veteran status, or any other legally protected basis. Cisco will consider for employment, on a case by case basis, qualified applicants with arrest and conviction records.

Site Reliability Engineer, Observability employer: Tbwa Chiat/Day Inc

At Cisco ThousandEyes, we pride ourselves on being an exceptional employer that fosters a culture of innovation and collaboration in the heart of London. Our commitment to employee growth is evident through continuous learning opportunities and a supportive environment that values diverse perspectives. With a focus on operational excellence and cutting-edge technology, we empower our Site Reliability Engineers to make a meaningful impact while enjoying the benefits of a flexible work-life balance and a vibrant team atmosphere.

Contact Details:

Tbwa Chiat/Day Inc Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Site Reliability Engineer, Observability role

✨Tip Number 1

Familiarize yourself with the specific tools mentioned in the job description, such as Prometheus, Grafana, and Kibana. Having hands-on experience or projects showcasing your skills with these tools can set you apart from other candidates.

✨Tip Number 2

Showcase your understanding of cloud-native monitoring services and how they can enhance visibility at multi-region scale. Be prepared to discuss strategies for improving observability that you have implemented in previous roles or would consider for this one.

✨Tip Number 3

Highlight any experience you have with Infrastructure as Code, particularly with Terraform, and with container platforms such as Kubernetes. Demonstrating that you can automate processes will resonate well with the team’s focus on operational excellence.

✨Tip Number 4

Engage with the SRE and DevOps communities online. Participating in forums or contributing to open-source projects can not only enhance your knowledge but also provide networking opportunities that may lead to referrals or insights about the role.

We think you need these skills to ace the Site Reliability Engineer, Observability role

High-Quality Code Writing (Python or Go)

Site Reliability Engineering (SRE) Knowledge

DevOps Practices

Operational Excellence

Observability Concepts (Metrics, Logs, Traces)

Monitoring and Alerting Systems Understanding

Grafana Observability Stack Experience

AWS Services Proficiency

Infrastructure as Code (Terraform) and Container Orchestration (Kubernetes)

Cloud-Native Monitoring Services Design

Automation Skills

Incident Response Participation

Collaboration with Development Teams

Elastic and Resilient System Design

Some tips for your application 🫡

Understand the Role: Make sure to thoroughly read the job description for the Site Reliability Engineer position. Understand the key responsibilities and qualifications required, especially around observability tools and cloud-native monitoring.

Highlight Relevant Experience: In your CV and cover letter, emphasize any experience you have with Python or Go, as well as your familiarity with observability concepts like metrics, logs, and traces. Mention any specific projects where you've used Grafana or AWS services.

Showcase Your Passion: Express your enthusiasm for SRE and DevOps roles in your application. Share examples of how you've contributed to operational excellence or improved monitoring systems in previous positions.

Tailor Your Application: Customize your CV and cover letter to reflect the values and mission of Cisco ThousandEyes. Highlight your understanding of diverse teams and how your unique background can contribute to their goals.

How to prepare for a job interview at Tbwa Chiat/Day Inc

✨Show Your Passion for SRE and DevOps

Make sure to express your enthusiasm for Site Reliability Engineering and DevOps during the interview. Share specific examples of projects or experiences that highlight your commitment to operational excellence and how you’ve contributed to improving system reliability.

✨Demonstrate Your Technical Skills

Be prepared to discuss your coding abilities, especially in Python or Go. You might be asked to solve a coding problem or explain your approach to writing high-quality code, so brush up on your technical skills and be ready to showcase them.

✨Familiarize Yourself with Observability Tools

Since the role involves working with tools like Prometheus, Grafana, and Kibana, it’s crucial to have a solid understanding of these tools and of the observability concepts behind them (metrics, logs, and traces). Be ready to discuss how you’ve used them in past projects and how they can enhance visibility in a multi-region environment.

✨Highlight Your Experience with Cloud Services

Understanding AWS services is key for this position. Prepare to talk about your experience with cloud-native monitoring services and Infrastructure as Code, particularly with Terraform, as well as container platforms such as Kubernetes. Discuss how you’ve implemented these technologies to improve system performance and reliability.

Application deadline: 2027-02-28