Site Reliability Engineer, Cloud Incident Response in London

London | Full-Time | £36,000 - £60,000 / year (est.) | Hybrid (2 days per week onsite)

At a Glance

  • Tasks: Enhance production reliability and scalability using cutting-edge tools like Kubernetes and AWS.
  • Company: Join SS&C, a leader in tech innovation with a diverse and collaborative culture.
  • Benefits: Enjoy a competitive salary, bonuses, and comprehensive benefits in a hybrid work environment.
  • Why this job: Make a real impact by driving strategic improvements and enhancing cloud infrastructure.
  • Qualifications: 5+ years in SRE or DevOps, with strong skills in observability and cloud technologies.
  • Other info: Dynamic team focused on continuous learning and engineering excellence.

The predicted salary is between £36,000 and £60,000 per year.

Get To Know Us: SS&C is leading the way. We continue to look for today’s and tomorrow’s brightest talent: people who strive to improve not only their own lives but also the lives of those around them. From college students to seasoned professionals, we encourage you to apply. SS&C prides itself on hiring diverse, honest, dynamic individuals who value collaboration, accountability, and innovation.

Location: London office, hybrid — 2 days per week onsite

About the Role: We’re seeking a hands-on Site Reliability Engineer to enhance our production reliability, scalability, and operability. You’ll use your expertise across observability, Kubernetes, AWS, and infrastructure as code to investigate issues, implement tactical fixes quickly, and drive strategic improvements that raise availability and reduce toil. This is a hybrid role with two days per week in the office. You’ll collaborate closely with engineering, product, and support to design, build, and run robust platforms that meet demanding SLAs/SLOs.

What You’ll Do:

  • Keep production healthy: Monitor, troubleshoot, and resolve incidents across services and infrastructure; reduce MTTR and prevent recurrences through high-quality post-incident actions.
  • Observability as a first-class practice: Use Grafana, Datadog, and Splunk (and related tools like Prometheus/OpenTelemetry) to detect anomalies, identify root causes, and create actionable alerts and dashboards.
  • Run Kubernetes at scale: Operate and harden Kubernetes (EKS preferred); manage deployments, autoscaling, rollouts/rollbacks, service mesh/ingress, and cluster upgrades.
  • Build reliable cloud foundations: Design and operate AWS workloads (networking, IAM, EC2/EKS, RDS/Aurora, S3, CloudWatch, ALB/NLB, VPC, Security Groups) with a security-first mindset.
  • Automate with IaC: Codify and continuously improve infrastructure using Terraform (modules, workspaces, remote state, policy as code).
  • Enable fast, safe delivery: Partner with teams to enhance CI/CD pipelines (e.g., GitHub Actions/Jenkins/Argo CD), progressive delivery, and change management to lower the change failure rate.
  • Own reliability metrics: Define and iterate on SLOs/SLIs/error budgets; champion blameless post-mortems and reliability reviews.
  • Participate in on-call: Join a fair, well-documented on-call rota; improve runbooks, automation, and alert quality to make on-call sustainable.
  • Drive strategic improvements: Identify systemic issues and deliver durable fixes (architecture, capacity, scaling, caching, resilience patterns, rate limiting, back-pressure, circuit breakers, chaos engineering).

What you will bring:

  • 5+ years operating production systems as an SRE, DevOps engineer, or software engineer.
  • Observability: Hands-on with Grafana, Datadog, and Splunk for incident investigation, dashboarding, alerting, tracing/logs/metrics correlation, and performance analysis.
  • Kubernetes: Strong experience running and troubleshooting workloads (controllers, pods, networking, storage, HPA/VPA, Helm/Kustomize).
  • AWS: Solid practical knowledge of core services and best practices for security, cost, and reliability.
  • Terraform: Confident with module design, state management, DRY patterns, and CI for IaC.
  • On-call experience: Demonstrated participation in a production on-call rota, effective incident communication, and post-incident follow-through.
  • Collaboration & communication: Ability to work cross-functionally, write clear runbooks/RFCs, and influence engineering practices.

Nice-to-Have:

  • EKS internals, cluster autoscaler, managed node groups/Fargate; service mesh (Istio/Linkerd), ingress controllers (Nginx/ALB).
  • Prometheus, OpenTelemetry, Loki/Tempo, alert tuning and SLO burn-rate alerts.
  • Argo CD/FluxCD, Helm chart authoring, Kustomize.
  • CD patterns (blue/green, canary, feature flags), GitOps workflows.
  • Database operations (Postgres/MySQL), caching (Redis), message queues (Kafka/SQS).
  • Security & compliance (CIS benchmarks, IAM boundaries, secrets management, Vault/Sealed Secrets).
  • Resilience testing/chaos engineering.
  • Relevant certs (AWS Solutions Architect/DevOps Engineer, CKA/CKAD, Terraform Associate).

How We Work:

  • Hybrid: Two days per week in the office for collaboration and incident/architecture reviews; remote the rest.
  • Engineering excellence: Blameless culture, well-defined SLOs, automation-first, and continuous learning.
  • Impact focus: Measure success via availability, latency, MTTR, change failure rate, toil reduction, and customer outcomes.

On-Call Expectations:

  • Participate in a rotating on-call schedule with clear escalation paths.
  • Improve alert signal-to-noise ratio and operational readiness (dashboards, runbooks, playbooks).
  • Post-incident reviews focused on learning and durable improvements—no blame.

Benefits:

  • Competitive salary + bonus (DOE)
  • Pension and comprehensive benefits
  • Modern tooling and time allocated for reliability improvements

We encourage applications from people of all backgrounds to enable us to bring diverse perspectives to our thinking and conversation. It's important to us that we strive to have a workforce that is diverse in the widest sense.

About the employer: SS&C Technologies Holdings

At SS&C, we foster a dynamic and inclusive work culture that prioritises collaboration, accountability, and innovation. As a Site Reliability Engineer in our London office, you'll benefit from a hybrid work model, competitive salary, and opportunities for professional growth while working with cutting-edge technologies in a supportive environment that values diverse perspectives. Join us to make a meaningful impact on our production systems and enhance your career in a company that champions continuous learning and engineering excellence.

Contact Details:

SS&C Technologies Holdings Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Site Reliability Engineer, Cloud Incident Response role in London

✨Tip Number 1

Network like a pro! Reach out to current employees at SS&C on LinkedIn or other platforms. Ask them about their experiences and any tips they might have for your application process. Personal connections can make a huge difference!

✨Tip Number 2

Prepare for the interview by brushing up on your technical skills. Make sure you can confidently discuss your experience with Kubernetes, AWS, and observability tools. Practice common SRE scenarios and be ready to showcase your problem-solving abilities.

✨Tip Number 3

Show off your passion for reliability engineering! During interviews, share examples of how you've improved system reliability in past roles. Highlight any innovative solutions you've implemented that align with SS&C's focus on collaboration and accountability.

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you're genuinely interested in joining the SS&C team. Good luck!

We think you need these skills to ace the Site Reliability Engineer, Cloud Incident Response role in London

Site Reliability Engineering
Kubernetes
AWS
Infrastructure as Code (IaC)
Terraform
Observability
Grafana
Datadog
Splunk
Incident Management
Continuous Integration/Continuous Deployment (CI/CD)
Collaboration
Communication Skills
Post-Incident Review
Cloud Security Best Practices

Some tips for your application 🫡

Tailor Your CV: Make sure your CV is tailored to the Site Reliability Engineer role. Highlight your experience with Kubernetes, AWS, and observability tools like Grafana and Datadog. We want to see how your skills align with what we're looking for!

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Share your passion for reliability engineering and how you’ve tackled challenges in the past. Explain why you’re excited about joining the team at SS&C.

Showcase Your Collaboration Skills: Since this role involves working closely with various teams, make sure to highlight your collaboration and communication skills. Share examples of how you've worked cross-functionally to achieve goals or solve problems.

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it shows you’re keen on joining our team!

How to prepare for a job interview at SS&C Technologies Holdings

✨Know Your Tools Inside Out

Make sure you’re well-versed in the tools mentioned in the job description, like Grafana, Datadog, and Kubernetes. Prepare to discuss your hands-on experience with these tools, including specific incidents where you used them to troubleshoot or improve system reliability.

✨Showcase Your Problem-Solving Skills

Be ready to share examples of how you've tackled complex issues in production systems. Think about times when you reduced MTTR or implemented strategic improvements. Use the STAR method (Situation, Task, Action, Result) to structure your answers.

✨Understand the Company Culture

Research SS&C’s values around collaboration, accountability, and innovation. Be prepared to discuss how your personal values align with theirs and provide examples of how you’ve embodied these traits in your previous roles.

✨Prepare for On-Call Scenarios

Since on-call participation is part of the role, think about your past experiences with on-call duties. Be ready to discuss how you handled incidents, communicated with teams, and contributed to post-incident reviews. Highlight your approach to improving alert quality and operational readiness.
