Senior Site Reliability Engineer - DevOps

London | Full-Time | £54,000 - £84,000 / year (est.) | No home office possible

About Us: We love going to work and think you should too. Our team is dedicated to trust, customer obsession, agility, and striving to be better every day. These values serve as the foundation of our culture, guiding our actions and driving us towards excellence. This position is located in London, England. Our office is situated in a core location near Waterloo and Blackfriars on the Southbank.

What You'll Do: This role takes a lead in maintaining the operational uptime and continued expansion of LogicMonitor's Edwin AI infrastructure, serving as a facilitator of operational excellence. Responsibilities include:

Designing and implementing new production deployments of SOA-based software across cloud datacentres.
Providing guidance on organizing, securing and automating existing infrastructure and deployments.
Maintaining uptime of LogicMonitor's Edwin AI SaaS-based service and driving technical and process enhancements to improve uptime.
Leading efforts to design and implement resilient IT applications using DevOps and SRE principles.
Deploying production applications and driving improvements to the deployment process.
Monitoring system performance and troubleshooting issues to ensure high availability and reliability.
Designing and deploying new application components.
Designing and deploying new infrastructure components and integrations.
Ensuring security of the production environment.
Developing and implementing automated disaster recovery processes to minimise system downtime.
Identifying opportunities for improvement in system performance, deployment speed, and scalability.
Writing high-quality code to automate various aspects of infrastructure maintenance and deployment.
Supporting engineering and working closely with engineers to drive operational and architectural/design changes.
Owning, managing, and executing multiple large and technically complex projects across teams.
Providing direct technical guidance to help team members achieve goals and improve their productivity.
Participating in the recruitment and hiring of new engineers.

What You'll Need:

5+ years as a DevOps Engineer or SRE, with experience designing and implementing resilient IT applications using DevOps and SRE principles.
Good understanding of Linux system administration, with 3+ years of hands-on experience.
Good understanding of networking technologies.
Experience building IaC automations using Terraform.
Production experience of containers and container orchestration tools (Docker/Kubernetes).
Good understanding of Amazon Web Services.
Experience of designing/implementing CI/CD pipelines including production deployments.
Experience building and working with logging and metrics solutions such as Prometheus.
Experience programming with RESTful web services.
Proficient Python developer.
Well-versed in security principles, both systems and network.
Excellent written and verbal communication skills with a track record of improving documentation and processes.
Experience carrying out problem determination and Root Cause Analysis across complex distributed systems.