At a Glance
- Tasks: Design and maintain monitoring systems using Prometheus and Grafana.
- Company: Join a global cybersecurity team focused on Data Loss Prevention.
- Benefits: Gain experience in a cloud-first environment with opportunities for growth.
- Why this job: Perfect for those wanting to dive into cybersecurity while enhancing their engineering skills.
- Qualifications: Experience with Prometheus, Splunk, and a programming language is essential.
- Other info: Participate in a 24/7 on-call support rota and collaborate in an Agile setting.
The predicted salary is between £43,200 and £72,000 per year.
Site Reliability Engineer (Prometheus and Grafana) (15797)
London, England
About the Role
Join a global team of engineers, operators, and Agile practitioners responsible for building and operating a world-class Data Loss Prevention (DLP) infrastructure. This role is within the Cybersecurity organization, focusing on enhancing observability and telemetry across the DLP stack to support a cloud-first strategy while maintaining strong on-premise capabilities. This is an exciting opportunity for engineers with strong SRE and monitoring experience, and also a great entry point for professionals looking to transition into cybersecurity.
Key Responsibilities
- Design and maintain Prometheus metrics collection and PromQL queries
- Build, review, and optimize Grafana and Splunk dashboards using observability best practices (e.g., Four Golden Signals, RED methodology)
- Refine alerting rules across tools like PagerDuty, Prometheus, and Splunk to eliminate noise and identify gaps
- Work closely with engineering squads to implement and maintain SLO/SLIs and error budgets
- Operate Prometheus in agent mode and troubleshoot issues
- Use telemetry data to generate actionable insights for the DLP teams
- Drive continuous improvement of monitoring and observability systems
- Participate in a 24/7 on-call support rota for DLP products
- Collaborate in a DevOps and Agile environment
Required Skills and Experience
- Strong hands-on experience with Prometheus and PromQL
- Solid experience with Splunk dashboarding and queries
- Deep understanding of observability and monitoring principles
- Familiarity with SRE practices, SLO/SLIs, and error budget management
- Experience with PagerDuty or similar alerting/orchestration platforms
- Fluent in at least one programming or scripting language
- Knowledge of CI/CD tools (e.g., Jenkins, Bitbucket)
- Experience working in cloud environments (AWS or similar) or Unix/Linux systems
- Excellent collaboration, communication, and problem-solving skills
Nice to Have
Experience with:
- Cybersecurity or DLP products
- Incident, problem, and change management tools
- OpenTelemetry or telemetry pipeline tooling
- Automation and scripting for monitoring
- Working in Agile or operational environments
Why Join?
- Work on a globally distributed, high-impact security team
- Learn and grow in a DevOps-driven, cloud-first organization
- Transition into cybersecurity or expand your existing expertise
Application Process
Please include your:
- First and last name
- Email address
- Phone number (including country code)
- CV / Resume
Additionally, indicate your eligibility to work in the country you are applying to:
- Yes, I am currently eligible to work (work permit/visa/citizenship)
- No, I am not currently eligible to work (work permit/visa/citizenship)
Contact Details:
Robert Walters Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Site Reliability Engineer (Prometheus and Grafana) role
✨Tip Number 1
Familiarise yourself with Prometheus and Grafana by exploring their documentation and community forums. Engaging with these resources can help you understand common challenges and best practices, which will be beneficial during interviews.
✨Tip Number 2
Join online communities or local meetups focused on Site Reliability Engineering and observability tools. Networking with professionals in the field can provide insights into the role and may even lead to referrals.
✨Tip Number 3
Consider contributing to open-source projects related to monitoring and observability. This not only enhances your skills but also showcases your commitment and expertise to potential employers.
✨Tip Number 4
Prepare for technical interviews by practising problem-solving scenarios that involve SLOs, SLIs, and error budgets. Being able to discuss these concepts confidently will demonstrate your understanding of key SRE principles.
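As one scenario to practise, the error-budget arithmetic behind an availability SLO can be sketched in a few lines of Python. This is a minimal illustration, not something taken from the job description: the function names, the 99.9% target, and the request counts are all assumed for the example.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 1 - failed_requests / total_requests


def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent in the window.

    The budget is the number of failures the SLO tolerates:
    (1 - slo_target) * total_requests. A negative result means
    the budget is overspent and the SLO has been breached.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Hypothetical 30-day window: 1,000,000 requests, 250 of them failed,
# measured against an assumed 99.9% availability SLO.
sli = availability_sli(1_000_000, 250)
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"SLI: {sli:.5f}, error budget remaining: {remaining:.0%}")
```

With these assumed numbers the SLO allows about 1,000 failed requests, so 250 failures use roughly a quarter of the budget. Being able to talk through this calculation, and what a team should do as the remaining budget approaches zero, is the kind of reasoning such interview scenarios tend to probe.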
Some tips for your application 🫡
Tailor Your CV: Make sure your CV highlights your experience with Prometheus, Grafana, and any relevant SRE practices. Use specific examples that demonstrate your skills in observability and monitoring.
Craft a Strong Cover Letter: In your cover letter, express your enthusiasm for the role and the company. Mention how your background aligns with the responsibilities listed, particularly your experience with cloud environments and CI/CD tools.
Showcase Relevant Skills: Clearly outline your hands-on experience with PromQL, Splunk dashboarding, and alerting platforms like PagerDuty. Highlight any familiarity with cybersecurity or DLP products, as this could set you apart.
Follow Application Instructions: Ensure you include all required information such as your name, contact details, and eligibility to work in the country. Double-check for any additional documents that may be requested in the application process.
How to prepare for a job interview at Robert Walters
✨Showcase Your Technical Skills
Be prepared to discuss your hands-on experience with Prometheus, PromQL, and Splunk. Bring examples of dashboards you've built or optimised, and be ready to explain the observability best practices you applied.
✨Understand SRE Principles
Familiarise yourself with SLOs, SLIs, and error budgets. Be ready to discuss how you've implemented these concepts in previous roles and how they can enhance monitoring and observability.
✨Demonstrate Problem-Solving Abilities
Prepare to share specific instances where you've troubleshot issues in a cloud environment or Unix/Linux systems. Highlight your approach to identifying gaps and refining alerting rules to reduce noise.
✨Emphasise Collaboration and Communication
Since this role involves working closely with engineering squads, be ready to discuss your experience in collaborative environments. Share examples of how you've effectively communicated technical information to non-technical stakeholders.