IT Operations/Site Reliability Engineer (Datadog/Dynatrace)

Full-Time 60000 - 80000 £ / year (est.) No working from home possible
Infoplus Technologies UK Ltd

At a Glance

  • Tasks: Drive IT operations modernization through observability and automation.
  • Company: Join a forward-thinking tech company focused on innovation and reliability.
  • Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
  • Other info: Collaborative environment with a strong focus on continuous improvement and mentorship.
  • Why this job: Make a real impact by enhancing system reliability and efficiency with cutting-edge tools.
  • Qualifications: Expertise in SRE principles, observability tools, and automation techniques required.

The predicted salary is between £60,000 and £80,000 per year.

The SRE will play a pivotal role in driving the modernization of IT operations by implementing observability practices and automating toil. The position requires a deep understanding of Site Reliability Engineering (SRE) principles, modern observability tools, and automation techniques to ensure scalability, reliability, and efficiency in IT systems. It suits a strategic thinker with hands‑on expertise who can lead modernization efforts while fostering a culture of reliability and innovation.

Work closely with the Product Engineering team to implement strategies for modernizing IT operations, enhancing observability, and reducing toil.

Responsibilities
  • Architect and deploy observability platforms to monitor system health, performance, and reliability effectively.
  • Propose & drive strategies for AI‑driven alerting and proactive anomaly detection to reduce MTTD & MTTR.
  • Develop and enforce SRE best practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.
  • Establish an AIOps roadmap for improving operational efficiency.
  • Lead efforts to automate repetitive tasks (toil) using scripting, orchestration tools, and AI/ML‑based solutions.
  • Drive toil automation initiatives, including automated incident response and self‑healing automation, to achieve autonomous operations.
  • Collaborate with cross‑functional teams to ensure systems are scalable, resilient, and maintainable.
  • Drive incident management and root cause analysis processes through automation, ensuring continuous improvement to enable autonomous operations.
  • Partner with engineering, architecture, and product teams to enable shift‑left engineering practices that ensure reliability.
  • Mentor and guide teams on adopting SRE principles and tools.
  • Advocate for a culture of reliability, automation, and continuous improvement across the organization.

Qualifications
  • Strong expertise in implementing Site Reliability Engineering (SRE) principles.
  • Advanced knowledge of establishing observability using Dynatrace & Datadog (primary skills).
  • Proficiency in automation & scripting using Python & Ansible (primary skills).
  • Strong experience with cloud platforms - AWS & Azure (primary skills).
  • Solid understanding of containerization and orchestration tools like Docker and Kubernetes.
  • Proficiency in cloud‑native distributed systems & microservices architecture.
  • Exposure to AI/ML techniques for predictive analytics and automated problem resolution.
  • Familiarity with CI/CD pipelines and with enabling automated release & deployment engineering solutions.
  • Nice to have: experience with chaos engineering tools like Gremlin or Chaos Monkey and with implementing automation frameworks for resilience tracking.
  • Ability to manage and prioritize multiple projects in a fast‑paced environment.
  • Strong interpersonal and communication skills to work effectively across teams.
  • Excellent problem‑solving, analytical thinking, and adaptability.
  • Strategic mindset balancing engineering excellence with business priorities.
  • 12+ years of experience in IT operations, SRE, or DevOps roles.
  • Proven track record of implementing observability and automation solutions in large‑scale environments.
  • Certifications in cloud platforms, observability tools & other SRE‑related areas.

IT Operations/Site Reliability Engineer (Datadog/Dynatrace) employer: Infoplus Technologies UK Ltd

As an employer, we pride ourselves on fostering a culture of innovation and reliability, making us an excellent choice for IT Operations/Site Reliability Engineers. Our commitment to employee growth is evident through continuous learning opportunities and mentorship programmes, while our collaborative work environment encourages cross-functional teamwork. Located in a vibrant tech hub, we offer competitive benefits and the chance to be at the forefront of modernising IT operations with cutting-edge tools like Datadog and Dynatrace.

Contact Details:

Infoplus Technologies UK Ltd Recruitment Team

We think you need these skills to ace IT Operations/Site Reliability Engineer (Datadog/Dynatrace)

Site Reliability Engineering (SRE) principles
Observability tools (Dynatrace, Datadog)
Automation & Scripting (Python, Ansible)
Cloud platforms (AWS, Azure)
Containerization and orchestration (Docker, Kubernetes)
Cloud native distributed systems
Microservices architecture