IT Operations/Site Reliability Engineer (Datadog/Dynatrace)

Infoplus Technologies UK Ltd

Full-Time | £60,000 - £80,000 / year (est.) | No working from home possible

At a Glance

Tasks: Drive IT operations modernization through observability and automation.
Company: Join a forward-thinking tech company focused on innovation and reliability.
Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
Other info: Collaborative environment with a strong focus on continuous improvement and mentorship.
Why this job: Make a real impact by enhancing system reliability and efficiency with cutting-edge tools.
Qualifications: Expertise in SRE principles, observability tools, and automation techniques required.

The predicted salary is between £60,000 and £80,000 per year.

The SRE will play a pivotal role in modernizing IT operations by implementing observability practices and automating toil. The position requires a deep understanding of Site Reliability Engineering (SRE) principles, modern observability tools, and automation techniques to ensure scalability, reliability, and efficiency in IT systems. It calls for a strategic thinker with hands‑on expertise who can lead modernization efforts while fostering a culture of reliability and innovation.

Work closely with the Product Engineering team to implement strategies for modernizing IT operations, enhancing observability, and reducing toil.

Responsibilities

Architect and deploy observability platforms to monitor system health, performance, and reliability effectively.
Propose & drive strategies for AI‑driven alerting and proactive anomaly detection to reduce MTTD & MTTR.
Develop and enforce SRE best practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.
Establish an AIOps roadmap for improving operational efficiency.
Lead efforts to automate repetitive tasks (toil) using scripting, orchestration tools, and AI/ML‑based solutions.
Drive toil automation initiatives, including automated incident response and self‑healing automation, to achieve autonomous operations.
Collaborate with cross‑functional teams to ensure systems are scalable, resilient, and maintainable.
Drive incident management and root cause analysis processes through automation, ensuring continuous improvement to enable autonomous operations.
Partner with engineering, architecture, and product teams to enable shift‑left engineering practices that ensure reliability.
Mentor and guide teams on adopting SRE principles and tools.
Advocate for a culture of reliability, automation, and continuous improvement across the organization.

Qualifications

Strong expertise in implementing Site Reliability Engineering (SRE) principles.
Advanced knowledge of establishing observability using Dynatrace and Datadog (primary skills).
Proficiency in automation and scripting using Python and Ansible (primary skills).
Strong experience with the AWS and Azure cloud platforms (primary skills).
Solid understanding of containerization and orchestration tools like Docker and Kubernetes.
Proficiency in cloud‑native distributed systems and microservices architecture.
Exposure to AI/ML techniques for predictive analytics and automated problem resolution.
Familiarity with CI/CD pipelines and with enabling automated release and deployment engineering solutions.
Good to have: experience with chaos engineering tools such as Gremlin or Chaos Monkey, and with implementing automation frameworks for resilience tracking.
Ability to manage and prioritize multiple projects in a fast‑paced environment.
Strong interpersonal and communication skills to work effectively across teams.
Excellent problem‑solving, analytical thinking, and adaptability.
Strategic mindset balancing engineering excellence with business priorities.
12+ years of experience in IT operations, SRE, or DevOps roles.
Proven track record as an SRE implementing observability and automation solutions in large‑scale environments.
Certifications in cloud platforms, observability tools, and other SRE‑related areas.

IT Operations/Site Reliability Engineer (Datadog/Dynatrace) employer: Infoplus Technologies UK Ltd

As an employer, we pride ourselves on fostering a culture of innovation and reliability, making us an excellent choice for IT Operations/Site Reliability Engineers. Our commitment to employee growth is evident through continuous learning opportunities and mentorship programmes, while our collaborative work environment encourages cross-functional teamwork. Located in a vibrant tech hub, we offer competitive benefits and the chance to be at the forefront of modernising IT operations with cutting-edge tools like Datadog and Dynatrace.

Contact Details:

Infoplus Technologies UK Ltd Recruitment Team

We think you need these skills to ace IT Operations/Site Reliability Engineer (Datadog/Dynatrace)

Site Reliability Engineering (SRE) principles

Observability tools (Dynatrace, Datadog)

Automation & Scripting (Python, Ansible)

Cloud platforms (AWS, Azure)

Containerization and orchestration (Docker, Kubernetes)

Cloud native distributed systems

Microservices architecture

AI/ML techniques for predictive analytics

CI/CD pipelines

Chaos engineering tools (Gremlin, Chaos Monkey)

Incident management

Root cause analysis

Interpersonal and communication skills

Analytical thinking

Project management
