At a Glance
- Tasks: Design and maintain monitoring systems using Prometheus and Grafana.
- Company: Join a global cybersecurity team focused on Data Loss Prevention.
- Benefits: Gain experience in a cloud-first environment with opportunities for growth.
- Why this job: Perfect for those wanting to dive into cybersecurity while enhancing their engineering skills.
- Qualifications: Strong experience with Prometheus, Splunk, and observability principles required.
- Other info: Participate in a 24/7 on-call support rota and collaborate in an Agile setting.
The predicted salary is between £36,000 and £60,000 per year.
Join a global team of engineers, operators, and Agile practitioners responsible for building and operating a world-class Data Loss Prevention (DLP) infrastructure. This role is within the Cybersecurity organization, focusing on enhancing observability and telemetry across the DLP stack to support a cloud-first strategy while maintaining strong on-premises capabilities. This is an exciting opportunity for engineers with strong SRE and monitoring experience, and also a great entry point for professionals looking to transition into cybersecurity.
Key Responsibilities
- Design and maintain Prometheus metrics collection and PromQL queries
- Build, review, and optimize Grafana and Splunk dashboards using observability best practices (e.g., Four Golden Signals, RED methodology)
- Refine alerting rules across tools like PagerDuty, Prometheus, and Splunk to eliminate noise and identify gaps
- Work closely with engineering squads to implement and maintain SLOs, SLIs, and error budgets
- Operate Prometheus in agent mode and troubleshoot issues
- Use telemetry data to generate actionable insights for the DLP teams
- Drive continuous improvement of monitoring and observability systems
- Participate in a 24/7 on-call support rota for DLP products
- Collaborate in a DevOps and Agile environment
Required Skills and Experience
- Strong hands-on experience with Prometheus and PromQL
- Solid experience with Splunk dashboarding and queries
- Deep understanding of observability and monitoring principles
- Familiarity with SRE practices, SLOs/SLIs, and error budget management
- Experience with PagerDuty or similar alerting/orchestration platforms
- Fluent in at least one programming or scripting language
- Knowledge of CI/CD tools (e.g., Jenkins, Bitbucket)
- Experience working in cloud environments (AWS or similar) or Unix/Linux systems
- Excellent collaboration, communication, and problem-solving skills
Nice to Have
Experience with:
- Cybersecurity or DLP products
- Incident, problem, and change management tools
- OpenTelemetry or telemetry pipeline tooling
- Automation and scripting for monitoring
- Working in Agile or operational environments
Why Join?
- Work on a globally distributed, high-impact security team
- Learn and grow in a DevOps-driven, cloud-first organization
- Transition into cybersecurity or expand your existing expertise
Application Process
Please include your first and last name, email address, phone number (including country code), and CV/resume. Additionally, indicate whether you are currently eligible to work in the country you are applying to (work permit, visa, or citizenship): yes or no.
Site Reliability Engineer (Prometheus and Grafana) employer: Robert Walters
Contact Detail:
Robert Walters Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Site Reliability Engineer (Prometheus and Grafana) role
✨Tip Number 1
Familiarise yourself with Prometheus and Grafana by exploring their documentation and community forums. Engaging with these resources can help you understand best practices and common pitfalls, which will be beneficial during interviews.
✨Tip Number 2
Join online communities or local meetups focused on Site Reliability Engineering and observability tools. Networking with professionals in the field can provide insights into the role and may even lead to referrals.
✨Tip Number 3
Consider contributing to open-source projects related to monitoring and observability. This not only enhances your skills but also showcases your commitment and expertise to potential employers.
✨Tip Number 4
Prepare for technical interviews by practising problem-solving scenarios that involve SLOs, SLIs, and error budgets. Being able to discuss these concepts confidently will demonstrate your understanding of SRE principles.
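As one way to practise, the arithmetic behind an error budget can be sketched in a few lines of Python. The function name `error_budget`, the 99.9% target, and the request counts are illustrative assumptions, not figures taken from this role.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarise SLI and error-budget usage for one measurement window.

    Hypothetical helper for interview practice; names and inputs are assumptions.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The SLI here is the fraction of requests that succeeded.
    sli = 1 - failed_requests / total_requests
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1 - consumed,
    }


# Assumed example: a 99.9% availability SLO over one million requests.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"SLI: {report['sli']:.5f}")
print(f"Budget consumed: {report['budget_consumed']:.0%}")
```

With these assumed inputs the SLO tolerates about 1,000 failed requests, so 250 failures use roughly a quarter of the budget. Being able to walk through that reasoning aloud is often more useful in an interview than the code itself.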
Some tips for your application 🫡
Tailor Your CV: Make sure your CV highlights your experience with Prometheus, Grafana, and any relevant SRE practices. Use specific examples that demonstrate your skills in observability and monitoring.
Craft a Strong Cover Letter: In your cover letter, express your enthusiasm for the role and the company. Mention how your background aligns with the responsibilities listed, particularly your experience with metrics collection and alerting tools.
Showcase Relevant Skills: Clearly outline your hands-on experience with Prometheus and Splunk in your application. Include any familiarity with CI/CD tools and cloud environments, as these are crucial for the role.
Highlight Collaboration Experience: Since the role involves working closely with engineering squads, emphasise any past experiences where you collaborated in a DevOps or Agile environment. This will show your ability to work effectively within teams.
How to prepare for a job interview at Robert Walters
✨Showcase Your Technical Skills
Be prepared to discuss your hands-on experience with Prometheus and PromQL in detail. Highlight specific projects where you've designed metrics collection or optimised dashboards, as this will demonstrate your technical expertise relevant to the role.
✨Understand Observability Principles
Familiarise yourself with observability best practices, such as the Four Golden Signals and the RED methodology. Be ready to explain how you have applied these principles in past roles to enhance monitoring and alerting systems.
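To make the RED methodology concrete, a minimal Python sketch might derive Rate, Errors, and Duration from a batch of request records. The `Request` type, the sample data, the choice to count 5xx statuses as errors, and the nearest-rank 95th percentile are all assumptions for illustration; in a real setup these values would more likely come from Prometheus metrics than from in-memory lists.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; the field names are assumptions."""
    duration_s: float
    status: int


def red_summary(requests: list[Request], window_s: float) -> dict:
    """Compute Rate, Errors, and Duration for one observation window."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")

    total = len(requests)
    # Assumption: any 5xx status counts as an error.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank 95th percentile of request duration.
    p95_index = math.ceil(0.95 * total) - 1

    return {
        "rate_per_s": total / window_s,       # Rate
        "error_ratio": errors / total,        # Errors
        "p95_duration_s": durations[p95_index],  # Duration
    }


# Assumed sample: 20 requests in a 10-second window, 2 of them slow failures.
sample = [Request(0.1, 200)] * 18 + [Request(2.0, 500)] * 2
summary = red_summary(sample, window_s=10)
print(summary)
```

A sketch like this can help when explaining why an average latency would hide the two slow requests while a high percentile surfaces them.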
✨Prepare for Scenario-Based Questions
Expect questions that assess your problem-solving skills in real-world scenarios. Think of examples where you've refined alerting rules or collaborated with engineering teams to implement SLOs/SLIs, as these experiences are crucial for the role.
✨Demonstrate Collaboration and Communication Skills
Since the role involves working closely with various teams, be prepared to discuss how you've effectively communicated and collaborated in a DevOps or Agile environment. Share specific instances where your communication skills led to successful project outcomes.