Site Reliability Engineer II

Full-time; £60,000 to £80,000 per year (est.); no working from home possible

At a Glance

Tasks: Collaborate with engineering teams to enhance system resilience and implement automation tools.
Company: American Express focuses on innovation and trust, powering growth through technology services.
Benefits: Enjoy competitive salaries, comprehensive medical benefits, and flexible working arrangements.
Other info: Participation in a 24/7 on-call rotation is expected.
Why this job: Join a team that drives technology innovation and supports global operations 24/7.
Qualifications: Bachelor’s degree in Computer Science or related field; knowledge of cloud platforms like AWS required.

The predicted salary is between £60,000 and £80,000 per year.

The Enterprise Technology Services organization partners with every part of the American Express business to power the company’s growth and innovation with trust and efficiency, and drive competitive differentiation with speed. We support the delivery and operations of technology, digital, and data capabilities, platforms, and services globally. Specifically, our team is responsible for the company’s technology engineering, architecture, and infrastructure, providing 24x7 support to ensure an uninterrupted, high-quality experience for customers and colleagues.

The Site Reliability Engineer II collaborates with engineering teams to enhance system resilience, scalability, and performance through feature development, automation, architectural design, resiliency testing, and disaster recovery planning, while promoting best practices for continuous improvement.

Responsibilities

Collaborates with Software Engineering teams to design, develop, and implement features that enhance system resilience, scalability, and performance, while identifying and addressing potential system bottlenecks and failure points with guidance from senior colleagues.
Develops and implements automation tools and frameworks, including infrastructure as code (IaC) practices to streamline operational workflows, deployment processes, and infrastructure management, with guidance from peers and leaders.
Collaborates with senior engineers to contribute to the architectural design of systems, ensuring that reliability, scalability, and performance considerations are integrated into design discussions and decision‑making processes.
Collaborates in the design and execution of chaos engineering experiments and other resiliency testing, analyzing results and implementing improvements to enhance system robustness and recovery capabilities, with guidance from peers and leaders.
Develops and implements disaster recovery plans and business continuity strategies, ensuring systems can recover quickly and effectively from unexpected disruptions.
Collaborates with senior engineers to promote and implement best practices such as error budgeting, service‑level objectives (SLOs), and service‑level indicators (SLIs), contributing to a culture of continuous improvement and reliability.
Collaborates and co‑creates effectively with teams in product and the business to align technology initiatives with business objectives.
Participates in a 24x7 on‑call rotation, including a weekend shift at least once every 4–6 weeks.

Qualifications

Education Qualifications:

Bachelor’s degree in Computer Science, Information Technology, or Engineering, or comparable experience; advanced degree preferred.

Knowledge of a modern observability stack (Splunk, Elasticsearch, Prometheus, Grafana).

Knowledge of containerization technologies (e.g., Kubernetes, Docker) and microservices architecture.

Knowledge of observability tools and methodologies, including experience with logging, monitoring, tracing, and performance analysis platforms.

Knowledge of cloud‑based Site Reliability Engineering (SRE) practices and experience with public cloud platforms such as AWS, Azure, or Google Cloud.

Work Experience

Experience in software development, or technology operations, with a focus on Site Reliability Engineering.
Experience with Linux/Unix systems, object‑oriented programming languages (e.g., Java), scripting languages (e.g., Python, Bash), and cloud platforms (e.g., AWS, Azure, GCP).

Employment Eligibility

Employment eligibility to work with American Express in the UK is required as the company will not pursue visa sponsorship for these positions.

Benefits

We back you with benefits that support your holistic well‑being so you can be and deliver your best. This means caring for you and your loved ones’ physical, financial, and mental health, as well as providing the flexibility you need to thrive personally and professionally:

Competitive base salaries.
Bonus incentives.
Support for financial well‑being and retirement.
Comprehensive medical, dental, vision, life insurance, and disability benefits (depending on location).
Flexible working model with hybrid, onsite or virtual arrangements depending on role and business need.
Generous paid parental leave policies (depending on your location).
Free access to global on‑site wellness centers staffed with nurses and doctors (depending on location).
Free and confidential counseling support through our Healthy Minds program.
Career development and training opportunities.

Offer of employment with American Express is conditioned upon the successful completion of a background verification check, subject to applicable laws and regulations.

Site Reliability Engineer II employer: American Express

American Express offers a hybrid working model and comprehensive wellness benefits, including free access to on-site wellness centres. The company values innovation and has a strong commitment to customer service, making it an exciting place to grow your career.

Contact Details:

American Express Recruitment Team

We think you need these skills to ace Site Reliability Engineer II

System Resilience

Scalability

Performance Enhancement

Automation Tools and Frameworks

Infrastructure as Code (IaC)

Chaos Engineering

Disaster Recovery Planning

Business Continuity Strategies

Error Budgeting

Service-Level Objectives (SLOs)

Service-Level Indicators (SLIs)

Observability Stack (e.g., Splunk, Elasticsearch, Prometheus, Grafana)

Containerization Technologies (e.g., Kubernetes, Docker)

Cloud-Based Site Reliability Engineering (SRE) Practices

Linux/Unix Systems
