At a Glance
- Tasks: Ensure reliability and efficiency of platforms, focusing on observability and continuous improvement.
- Company: Join a dynamic team dedicated to optimising cloud environments and enhancing developer experiences.
- Benefits: Enjoy 100% remote work, competitive pay, and opportunities for personal and professional growth.
- Why this job: Be part of a culture that values innovation, collaboration, and impactful contributions across various industries.
- Qualifications: Experience with SRE principles, observability tools, and cloud environments is essential.
- Other info: Must have SC Clearance or be eligible; out-of-hours support may be required.
As a Site Reliability Engineer (SRE), you will play a key role in ensuring the reliability, scalability, and efficiency of our clients' platforms. Your focus will include building strong observability practices, aligning with the SRE mindset & principles, and driving continuous improvement. This will involve:
- Defining and implementing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure and maintain system and application performance, ensuring services meet agreed reliability targets.
- Instrumenting applications to collect key metrics, logs, and traces that enable proactive monitoring and troubleshooting.
- Creating dashboards and configuring alerts to provide real-time visibility into system health, enabling teams to quickly detect and resolve issues.
- Assessing and enhancing Kubernetes capabilities, improving DevOps efficiency through innovation, agility and cost optimisation.
- Taking a holistic approach to modernising the developer experience, focusing on organisational culture, DevOps practices, processes, automation and tooling.
- Architecting scalable and resilient cloud infrastructure to ensure the seamless deployment and optimisation of containerised applications.
- Collaborating with cross-functional teams to implement automation strategies that reduce operational complexity and drive continuous improvement.
- Providing out-of-hours or on-call support where client requirements call for it.
Key expectations of this role include:
- Lead site reliability engineering initiatives with a strong emphasis on observability, ensuring high performance and reliability of applications & infrastructure.
- Provide strategic insights to shape the overall SRE strategy while collaborating on the design and implementation of scalable and reliable solutions.
- Establish effective monitoring, alerting and incident response strategies to maintain system availability, and work with team members to deliver observability best practices and SRE methodologies that promote continuous improvement.
As part of your role you will also have the opportunity to contribute to the business and to your own personal growth through activities in the following categories:
- Business Development - Leading/contributing to proposals, RFPs, bids, proposition development, client pitch contribution, client hosting at events.
- Internal contribution - Campaign development, internal think-tanks, whitepapers, practice development (operations, recruitment, team events & activities), offering development.
- Learning & development - Training to support your career development and the skills demand within the company, certifications etc.
Your Profile:
We are looking for someone with experience in implementing SRE principles, with a focus on observability and optimising applications & cloud environments. You will be comfortable working in a dynamic, technology-driven environment, while bringing proven expertise in the following areas:
- Strong understanding of the SRE mindset and principles, including the creation and management of Service Level Indicators (SLIs), Service Level Objectives (SLOs) and error budgets to ensure reliability and performance.
- Experience in implementing observability, instrumenting applications to provide insights into system performance.
- Hands-on experience with tools such as Dynatrace, Prometheus and OpenTelemetry for monitoring, tracing, and real-time alerting is highly sought after.
- An understanding of microservices and container orchestration with the ability to optimise containerised applications for reliability and scalability.
- Experience enabling continuous delivery pipelines, with a focus on ensuring system reliability, quality, and performance through automated deployment, scaling, and observability tools.
- Understanding of build and deployment pipelines, and experience collaborating with developers to improve observability and monitoring practices.
- Strong collaboration skills with the ability to work effectively both independently and as part of a team.
- Comfort interacting and engaging with clients, although a consulting background is not a prerequisite.
- Enthusiasm for working with a wide range of technology stacks and cloud providers across the varied clients and industries we support.
You must have SC (Security Check) Clearance, or be eligible and willing to gain this level of clearance. You must also be able to work out of hours or on call if the role you are assigned to requires it.
Site Reliability Engineer employer: Experis - ManpowerGroup
Contact Detail:
Experis - ManpowerGroup Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Site Reliability Engineer role
✨Tip Number 1
Familiarise yourself with the specific tools mentioned in the job description, such as Dynatrace, Prometheus, and OpenTelemetry. Having hands-on experience with these tools will not only boost your confidence but also demonstrate your readiness to hit the ground running.
✨Tip Number 2
Showcase your understanding of SRE principles by discussing real-world examples where you've implemented SLIs and SLOs. This will help you stand out as someone who not only knows the theory but has practical experience in applying it.
✨Tip Number 3
Prepare to discuss your approach to collaboration and communication, especially in a remote setting. Highlight any past experiences where you've successfully worked with cross-functional teams to improve system reliability and performance.
✨Tip Number 4
Be ready to talk about your adaptability in dynamic environments. Share examples of how you've embraced change and driven continuous improvement in your previous roles, as this aligns well with the expectations of the Site Reliability Engineer position.
Some tips for your application 🫡
Tailor Your CV: Make sure your CV highlights your experience with SRE principles, observability tools, and cloud environments. Use specific examples that demonstrate your expertise in creating SLIs and SLOs, as well as your hands-on experience with monitoring tools like Dynatrace and Prometheus.
Craft a Compelling Cover Letter: In your cover letter, express your enthusiasm for the role and the company. Discuss how your skills align with the job description, particularly your ability to drive continuous improvement and collaborate with cross-functional teams. Mention any relevant projects or achievements that showcase your capabilities.
Highlight Relevant Experience: When detailing your work history, focus on roles where you implemented SRE practices or worked with cloud infrastructure. Be specific about your contributions to improving system reliability and performance, and include metrics or outcomes where possible.
Showcase Your Soft Skills: Since collaboration is key in this role, emphasise your teamwork and communication skills. Provide examples of how you've successfully worked with others to achieve common goals, especially in dynamic, technology-driven environments.
How to prepare for a job interview at Experis - ManpowerGroup
✨Understand SRE Principles
Make sure you have a solid grasp of Site Reliability Engineering principles, especially around SLIs and SLOs. Be prepared to discuss how you've implemented these in past roles and the impact they had on system reliability.
✨Showcase Your Technical Skills
Highlight your hands-on experience with monitoring tools like Dynatrace, Prometheus, and OpenTelemetry. Be ready to provide examples of how you've used these tools to enhance observability and troubleshoot issues.
✨Demonstrate Collaboration Abilities
Since this role involves working with cross-functional teams, be prepared to share examples of how you've successfully collaborated with developers and other stakeholders to improve system performance and reliability.
✨Prepare for Scenario-Based Questions
Expect scenario-based questions that assess your problem-solving skills in real-time situations. Think about past challenges you've faced in SRE roles and how you approached them, focusing on your decision-making process and outcomes.
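If you are asked to explain SLIs, SLOs and error budgets, a small worked example can help you talk through the arithmetic. The sketch below is illustrative only: the `AvailabilitySLO` class, its method names and the request counts are hypothetical and are not taken from the job description or from any specific tool. It assumes a simple request-based availability SLI (successful requests divided by total requests).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Hypothetical request-based availability SLO for one measurement window."""

    target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def __post_init__(self) -> None:
        # A target of exactly 1.0 would leave no error budget at all.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def sli(self, good: int, total: int) -> float:
        """SLI: fraction of requests in the window that succeeded."""
        if total <= 0:
            raise ValueError("total must be positive")
        return good / total

    def error_budget(self, total: int) -> float:
        """Number of requests that may fail before the SLO is breached."""
        return (1.0 - self.target) * total

    def budget_remaining(self, good: int, total: int) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        failed = total - good
        return 1.0 - failed / self.error_budget(total)


if __name__ == "__main__":
    slo = AvailabilitySLO(target=0.999)
    good, total = 999_500, 1_000_000  # made-up numbers for illustration
    print(f"SLI: {slo.sli(good, total):.4%}")
    print(f"Error budget (requests): {slo.error_budget(total):.0f}")
    print(f"Budget remaining: {slo.budget_remaining(good, total):.1%}")
```

In this made-up window, 500 of the roughly 1,000 permitted failures have been used, so about half of the error budget remains. Being able to walk through a calculation like this, and to say what you would do as the budget runs down, tends to be more convincing than reciting definitions.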