Site Reliability Engineer, Cloud Incident Response

Full-Time £36,000 - £60,000 / year (est.) Home office (partial)

At a Glance

  • Tasks: Enhance production reliability and scalability using cutting-edge tech like Kubernetes and AWS.
  • Company: Join SS&C, a leading financial services and healthcare tech company with a diverse culture.
  • Benefits: Enjoy competitive salary, bonuses, comprehensive benefits, and modern tools for your work.
  • Why this job: Make a real impact in a hybrid role while collaborating with talented teams.
  • Qualifications: 5+ years in SRE or DevOps, with skills in observability, Kubernetes, and AWS.
  • Other info: Dynamic environment focused on engineering excellence and continuous learning.

The predicted salary is between £36,000 and £60,000 per year.

As a leading financial services and healthcare technology company based on revenue, SS&C is headquartered in Windsor, Connecticut, and has 27,000+ employees in 35 countries. Some 20,000 financial services and healthcare organizations, from the world's largest companies to small and mid‑market firms, rely on SS&C for expertise, scale, and technology.

SS&C is leading the way. We continue to look for today’s and tomorrow’s brightest talent, those who embody a spirit to improve not only their lives, but those around them. From college students to seasoned and experienced professionals, we encourage you to apply. SS&C prides itself on hiring diverse, honest, dynamic individuals who value collaboration, accountability, and innovation, to name a few.

Location: London office, hybrid — 2 days per week onsite

About the Role

We’re seeking a hands‑on Site Reliability Engineer to enhance our production reliability, scalability, and operability. You’ll use your expertise across observability, Kubernetes, AWS, and infrastructure as code to investigate issues, implement tactical fixes quickly, and drive strategic improvements that raise availability and reduce toil. This is a hybrid role with two days per week in the office. You’ll collaborate closely with engineering, product, and support to design, build, and run robust platforms that meet demanding SLAs/SLOs.

What You’ll Do

  • Keep production healthy: Monitor, troubleshoot, and resolve incidents across services and infrastructure; reduce MTTR and prevent recurrences through high-quality post‑incident actions.
  • Observability as a first‑class practice: Use Grafana, Datadog, and Splunk (and related tools like Prometheus/OpenTelemetry) to detect anomalies, root cause issues, and create actionable alerts and dashboards.
  • Run Kubernetes at scale: Operate and harden Kubernetes (EKS preferred); manage deployments, autoscaling, rollouts/rollbacks, service mesh/ingress, and cluster upgrades.
  • Build reliable cloud foundations: Design and operate AWS workloads (networking, IAM, EC2/EKS, RDS/Aurora, S3, CloudWatch, ALB/NLB, VPC, Security Groups) with a security‑first mindset.
  • Automate with IaC: Codify and continuously improve infrastructure using Terraform (modules, workspaces, remote state, policy as code).
  • Enable fast, safe delivery: Partner with teams to enhance CI/CD pipelines (e.g., GitHub Actions/Jenkins/Argo CD), progressive delivery, and change management to lower the change failure rate.
  • Own reliability metrics: Define and iterate on SLOs/SLIs/error budgets; champion blameless post‑mortems and reliability reviews.
  • Participate in on‑call: Join a fair, well‑documented on‑call rota; improve runbooks, automation, and alert quality to make on‑call sustainable.
  • Drive strategic improvements: Identify systemic issues and deliver durable fixes (architecture, capacity, scaling, caching, resilience patterns, rate limiting, back‑pressure, circuit breakers, chaos engineering).

What You’ll Bring

  • 5+ years operating production systems as an SRE, DevOps engineer, or software engineer.
  • Observability: Hands‑on with Grafana, Datadog, and Splunk for incident investigation, dashboarding, alerting, tracing/logs/metrics correlation, and performance analysis.
  • Kubernetes: Strong experience running and troubleshooting workloads (controllers, pods, networking, storage, HPA/VPA, Helm/Kustomize).
  • AWS: Solid practical knowledge of core services and best practices for security, cost, and reliability.
  • Terraform: Confident with module design, state management, DRY patterns, and CI for IaC.
  • On‑call experience: Demonstrated participation in a production on‑call rota, effective incident communication, and post‑incident follow‑through.
  • Scripting & engineering fundamentals: Proficiency in at least one of Python, Go, or Bash; strong Linux, networking (DNS, TLS, HTTP, TCP), and Git.
  • Collaboration & communication: Ability to work cross‑functionally, write clear runbooks/RFCs, and influence engineering practices.

Nice‑to‑Have

  • EKS internals, cluster autoscaler, managed node groups/Fargate; service mesh (Istio/Linkerd), ingress controllers (Nginx/ALB).
  • Prometheus, OpenTelemetry, Loki/Tempo, alert tuning and SLO burn‑rate alerts.
  • Argo CD/FluxCD, Helm chart authoring, Kustomize.
  • CD patterns (blue/green, canary, feature flags), GitOps workflows.
  • Database operations (Postgres/MySQL), caching (Redis), message queues (Kafka/SQS).
  • Security & compliance (CIS benchmarks, IAM boundaries, secrets management, Vault/Sealed Secrets).
  • Resilience testing/chaos engineering.
  • Relevant certs (AWS Solutions Architect/DevOps Engineer, CKA/CKAD, Terraform Associate).

How We Work

  • Hybrid: Two days per week in the office for collaboration and incident/architecture reviews; remote the rest.
  • Engineering excellence: Blameless culture, well‑defined SLOs, automation‑first, and continuous learning.
  • Impact focus: Measure success via availability, latency, MTTR, change failure rate, toil reduction, and customer outcomes.

On‑Call Expectations

  • Participate in a rotating on‑call schedule with clear escalation paths.
  • Improve alert signal‑to‑noise ratio and operational readiness (dashboards, runbooks, playbooks).
  • Post‑incident reviews focused on learning and durable improvements—no blame.

Benefits

  • Competitive salary + bonus (DOE)
  • Pension and comprehensive benefits
  • Modern tooling and time allocated for reliability improvements

We encourage applications from people of all backgrounds to enable us to bring diverse perspectives to our thinking and conversation. It’s important to us that we strive to have a workforce that is diverse in the widest sense.

Thank you for your interest in SS&C! If applicable, to further explore this opportunity, please apply directly with us through our Careers page on our corporate website.

Unless explicitly requested or approached by SS&C Technologies, Inc. or any of its affiliated companies, the company will not accept unsolicited resumes from headhunters, recruitment agencies, or fee‑based recruitment services.

SS&C Technologies is an Equal Employment Opportunity employer and does not discriminate against any applicant for employment or employee on the basis of race, color, religious creed, gender, age, marital status, sexual orientation, national origin, disability, veteran status or any other classification protected by applicable discrimination laws.

Site Reliability Engineer, Cloud Incident Response employer: SS&C Technologies

SS&C is an exceptional employer that fosters a collaborative and innovative work culture, offering employees the opportunity to thrive in a hybrid environment from our London office. With a strong focus on professional growth, competitive benefits, and a commitment to diversity, we empower our Site Reliability Engineers to make a meaningful impact while enjoying modern tooling and a supportive team atmosphere.

Contact Details:

SS&C Technologies Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Site Reliability Engineer, Cloud Incident Response role

✨Tip Number 1

Network like a pro! Reach out to current employees at SS&C on LinkedIn or other platforms. Ask them about their experiences and any tips they might have for landing the Site Reliability Engineer role.

✨Tip Number 2

Prepare for the technical interview by brushing up on your skills with Kubernetes, AWS, and Terraform. We recommend setting up a mini-project to showcase your expertise in these areas—it's a great way to demonstrate your hands-on experience!

✨Tip Number 3

Don’t forget to highlight your collaboration skills! SS&C values teamwork, so be ready to share examples of how you've worked cross-functionally in the past. This will show that you can thrive in their dynamic environment.

✨Tip Number 4

Finally, apply through our website! It’s the best way to ensure your application gets seen. Plus, it shows you're genuinely interested in joining our team at SS&C. Good luck!

We think you need these skills to ace the Site Reliability Engineer, Cloud Incident Response role

Site Reliability Engineering
Kubernetes
AWS
Terraform
Grafana
Datadog
Splunk
Incident Management
CI/CD Pipelines
Scripting (Python, Go, Bash)
Linux
Networking (DNS, TLS, HTTP, TCP)
Collaboration
Communication Skills
Observability

Some tips for your application 🫡

Tailor Your CV: Make sure your CV is tailored to the Site Reliability Engineer role. Highlight your experience with Kubernetes, AWS, and incident response. We want to see how your skills align with what we're looking for!

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Share your passion for reliability engineering and how you can contribute to our team. Be sure to mention any relevant projects or experiences that showcase your expertise.

Showcase Your Problem-Solving Skills: In your application, give examples of how you've tackled complex issues in production systems. We love candidates who can demonstrate their ability to think critically and drive strategic improvements.

Apply Through Our Website: Don't forget to apply directly through our Careers page! It’s the best way for us to receive your application and ensures you’re considered for the role. We can't wait to hear from you!

How to prepare for a job interview at SS&C Technologies

✨Know Your Tech Stack

Make sure you’re well-versed in the tools mentioned in the job description, like Grafana, Datadog, and Kubernetes. Brush up on your AWS knowledge too, especially around security best practices and core services. Being able to discuss these confidently will show that you're ready to hit the ground running.

✨Demonstrate Problem-Solving Skills

Prepare to share specific examples of how you've tackled incidents in the past. Think about times when you reduced MTTR or implemented effective post-incident actions. This will highlight your hands-on experience and ability to improve production reliability.

✨Showcase Collaboration Experience

Since this role involves working closely with various teams, be ready to discuss how you've collaborated in previous roles. Share examples of how you’ve influenced engineering practices or contributed to cross-functional projects. This will demonstrate your ability to work well in a team environment.

✨Ask Insightful Questions

Prepare thoughtful questions about the company’s approach to reliability and incident response. Inquire about their blameless culture or how they measure success in terms of availability and latency. This shows your genuine interest in the role and helps you assess if the company is the right fit for you.
