At a Glance
- Tasks: Drive IT operations modernization through observability and automation.
- Company: Join a forward-thinking tech company focused on innovation and reliability.
- Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
- Other info: Collaborative environment with a strong focus on continuous improvement and mentorship.
- Why this job: Make a real impact by enhancing system reliability and efficiency with cutting-edge tools.
- Qualifications: Expertise in SRE principles, observability tools, and automation techniques required.
The predicted salary is between £60,000 and £80,000 per year.
The SRE will play a pivotal role in driving the modernization of IT operations by implementing observability practices and automating toil. This position requires a deep understanding of Site Reliability Engineering (SRE) principles, modern observability tools, and automation techniques to ensure scalability, reliability, and efficiency in IT systems. It calls for a strategic thinker with hands‑on expertise who can lead modernization efforts while fostering a culture of reliability and innovation.
Work closely with the Product Engineering team to implement strategies for modernizing IT operations, enhancing observability, and reducing toil.
Responsibilities
- Architect and deploy observability platforms to monitor system health, performance, and reliability effectively.
- Propose & drive strategies for AI‑driven alerting and proactive anomaly detection to reduce MTTD & MTTR.
- Develop and enforce SRE best practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.
- Establish an AIOps roadmap for improving operational efficiency.
- Lead efforts to automate repetitive tasks (toil) using scripting, orchestration tools, and AI/ML‑based solutions.
- Drive toil automation initiatives, including automated incident response and self‑healing, to achieve autonomous operations.
- Collaborate with cross‑functional teams to ensure systems are scalable, resilient, and maintainable.
- Drive incident management and root cause analysis processes through automation, ensuring continuous improvement to enable autonomous operations.
- Partner with engineering, architecture, and product teams to enable shift‑left engineering practices that ensure reliability.
- Mentor and guide teams on adopting SRE principles and tools.
- Advocate for a culture of reliability, automation, and continuous improvement across the organization.
- Strong expertise in implementing Site Reliability Engineering (SRE) principles.
- Advanced knowledge of establishing observability with Dynatrace and Datadog (primary skills).
- Proficiency in automation and scripting using Python and Ansible (primary skills).
- Strong experience with cloud platforms: AWS and Azure (primary skills).
- Solid understanding of containerization and orchestration tools like Docker and Kubernetes.
- Proficiency in cloud‑native distributed systems and microservices architecture.
- Exposure to AI/ML techniques for predictive analytics and automated problem resolution.
- Familiarity with CI/CD pipelines and with enabling automated release and deployment solutions.
- Nice to have: experience with chaos engineering tools such as Gremlin or Chaos Monkey, and with implementing automation frameworks for resilience tracking.
- Ability to manage and prioritize multiple projects in a fast‑paced environment.
- Strong interpersonal and communication skills to work effectively across teams.
- Excellent problem‑solving, analytical thinking, and adaptability.
- Strategic mindset balancing engineering excellence with business priorities.
- 12+ years of experience in IT operations, SRE, or DevOps roles.
- Proven track record of implementing observability and automation solutions in large‑scale environments.
- Certifications in cloud platforms, observability tools, and other SRE‑related areas.
IT Operations/Site Reliability Engineer (Datadog/Dynatrace) employer: Infoplus Technologies UK Ltd
As an employer, we pride ourselves on fostering a culture of innovation and reliability, making us an excellent choice for IT Operations/Site Reliability Engineers. Our commitment to employee growth is evident through continuous learning opportunities and mentorship programmes, while our collaborative work environment encourages cross-functional teamwork. Located in a vibrant tech hub, we offer competitive benefits and the chance to be at the forefront of modernising IT operations with cutting-edge tools like Datadog and Dynatrace.
Contact Details:
Infoplus Technologies UK Ltd Recruitment Team