Site Reliability Engineer in London

London, Full-Time, £70,000 - £90,000 / year (est.), no working from home possible

At a Glance

  • Tasks: Automate and optimise processes for the Market Risk Platform using Python and AI.
  • Company: Join a leading financial institution focused on innovation and reliability.
  • Benefits: Competitive pay, hands-on experience, and opportunities for professional growth.
  • Other info: Dynamic work environment with a focus on collaboration and continuous improvement.
  • Why this job: Make a real impact by reducing operational toil and enhancing automation.
  • Qualifications: 8+ years SRE experience, strong Python skills, and a passion for process optimisation.

The predicted salary is between £70,000 and £90,000 per year.

NOTE: VISA SPONSORSHIP IS NOT PROVIDED

Location: London, UK (5 Days / Week Onsite)

Type: Contract Inside IR35 / Permanent

Experience: Minimum 8 Years

Skills:

  • SRE experience with Python-based applications (not Java)
  • Exposure to cloud technologies
  • Familiarity with Athena ecosystem or similar (SecDB, Quartz)
  • Banking and risk domain exposure

SRE Role description

We need an experienced SRE to focus predominantly on automation, optimization, and process re-engineering using AI for the Market Risk Platform. Success is measured by capacity created (toil eliminated, fewer manual steps, faster recovery, safer/faster changes), not by acting as the primary BAU support resource. Strong Python skills and provable agentic AI delivery experience are required.

Primary Objectives:

  • Eliminate operational toil and recurring manual work through durable automation
  • Re-engineer support/change processes to reduce handoffs, approval friction, and rerun complexity
  • Industrialize reliability operations so existing SREs spend less time firefighting and more time engineering

Key Responsibilities (Automation & Process first):

Automation Engineering (Core)

  • Build production-grade automation in Python (tools, services, workflows) to remove repetitive work: environment checks, dependency validation, automated reruns/reprocessing, safe restarts, drift detection, remediation actions, and standardized operational tasks
  • Create self-service capabilities for common requests (guardrailed, auditable, repeatable)
  • Implement automation with safety built in: idempotency, dry-run modes, approval gates where needed, rollback/undo strategies, and clear audit trails

Process Re-engineering (Core)

  • Map current operational processes (incident/problem/change, release readiness, rerun/recovery, access/entitlements, environment onboarding) and redesign them to remove waste and reduce cycle time
  • Standardize runbooks/playbooks into executable workflows, and reduce tribal knowledge via templates, checklists, and automated pre-flight controls
  • Define and track operational KPIs (toil hours removed, alert volume reduction, MTTR improvements, change failure rate reduction, rerun time reduction)

Agentic AI

  • Design and implement agentic workflows that take action using tools/runbooks (e.g., diagnostics, evidence gathering, correlation, guided remediation, change-risk checks, automated rerun orchestration)
  • Put strong controls in place: scoped permissions, deterministic fallbacks, human-in-the-loop approvals for risky actions, evaluation harnesses and measurable outcomes
  • Productionize with monitoring, logging, and post-incident learnings feeding back into the agent/tooling

Observability (enablement for automation)

Required Skills & Experience:

  • Senior SRE experience on distributed systems and batch/intraday workloads in a production environment
  • Strong Python
  • Provable agentic AI experience demonstrating tool integration, guardrails, and an evaluation approach
  • Measurable impact (toil reduction, MTTR reduction, alert reduction, etc.)
  • Demonstrated process optimization ability (removing steps/handoffs, standardizing workflows, implementing lightweight controls with metrics)
  • Strong Linux and troubleshooting fundamentals across application/system/network layers
  • Experience working across mixed estates (on-prem VMs + cloud, with some Kubernetes exposure for operational monitoring/reruns)

Differentiators:

  • Exposure to Banking/Finance Market Risk Domains
  • Experience with the Athena ecosystem or similar (SecDB, Quartz)

Site Reliability Engineer in London employer: Neev Limited

As a Site Reliability Engineer in London, you will join a dynamic team dedicated to innovation and excellence in the banking and risk domain. Our company fosters a collaborative work culture that prioritises employee growth through continuous learning and development opportunities, while also offering competitive benefits and a focus on automation and process optimisation. With a commitment to reducing operational toil and enhancing efficiency, we provide a unique environment where your contributions directly impact the success of our Market Risk Platform.

Contact Details:

Neev Limited Recruitment Team

We think you need these skills to ace Site Reliability Engineer in London

Site Reliability Engineering (SRE)
Python
Cloud Technologies
Athena Ecosystem
Automation Engineering
Process Re-engineering
Agentic AI