Staff Software Engineer, AI Reliability Engineering

London | Full-Time | 60000 - 84000 £ / year (est.) | Home office (partial)

At a Glance

  • Tasks: Join us to enhance AI reliability through innovative engineering and monitoring systems.
  • Company: Anthropic is on a mission to create safe, reliable AI for everyone.
  • Benefits: Enjoy flexible hours, generous leave, and a collaborative office environment.
  • Why this job: Be part of groundbreaking AI research that impacts society positively.
  • Qualifications: Bachelor's degree or equivalent experience in a related field required.
  • Other info: We value diverse perspectives and encourage all candidates to apply.

The predicted salary is between 60000 - 84000 £ per year.

London, UK

About Anthropic

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the role

Anthropic is seeking talented and experienced Reliability Engineers, including Software Engineers and Systems Engineers with experience and interest in reliability, to join our team. We will be defining and achieving reliability metrics for all of Anthropic’s internal and external products and services. While significantly improving reliability for Anthropic’s services, we plan to use the developing capabilities of modern AI models to reengineer the way we work. This team will be a critical part of Anthropic’s mission to bring the capabilities of groundbreaking AI technologies to benefit humanity in a safe and reliable way.

Responsibilities:

  • Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity
  • Design and implement monitoring systems including availability, latency and other salient metrics
  • Assist in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of millions of external customers and high-traffic internal workloads
  • Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers
  • Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident
  • Build and maintain cost optimization systems for large-scale AI infrastructure, focusing on accelerator (GPU/TPU/Trainium) utilization and efficiency

You may be a good fit if you:

  • Have extensive experience with distributed systems observability and monitoring at scale
  • Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
  • Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
  • Are comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
  • Have experience with chaos engineering and systematic resilience testing
  • Can effectively bridge the gap between ML engineers and infrastructure teams
  • Have excellent communication skills

Strong candidates may also:

  • Have experience operating large-scale model training infrastructure or serving infrastructure (>1000 GPUs)
  • Have experience with one or more ML hardware accelerators (e.g., GPUs, TPUs, Trainium)
  • Understand ML-specific networking optimizations like RDMA and InfiniBand
  • Have expertise in AI-specific observability tools and frameworks
  • Understand ML model deployment strategies and their reliability implications
  • Have contributed to open-source infrastructure or ML tooling

Deadline to apply: None. Applications will be reviewed on a rolling basis.

The expected salary range for this position is:

Logistics

Education requirements: We require at least a Bachelor’s degree in a related field or equivalent experience.
Location-based hybrid policy:
Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.

Visa sponsorship: We do sponsor visas! However, we aren’t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you’re interested in this work. We think AI systems like the ones we’re building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team.

How we’re different

We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact (advancing our long-term goals of steerable, trustworthy AI) rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We’re an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills.

The easiest way to understand our research directions is to read our recent research. This research continues many of the directions our team worked on prior to Anthropic, including: GPT-3, Circuit-Based Interpretability, Multimodal Neurons, Scaling Laws, AI & Compute, Concrete Problems in AI Safety, and Learning from Human Preferences.

Come work with us!

Anthropic is a public benefit corporation headquartered in San Francisco. We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues.

Apply for this job

* indicates a required field

First Name *

Last Name *

Email *

Phone

Resume/CV

Enter manually

Accepted file types: pdf, doc, docx, txt, rtf

(Optional) Personal Preferences *

How do you pronounce your name?

LinkedIn Profile

Please provide either your LinkedIn profile or your resume; we require at least one of the two.

Website

Publications (e.g. Google Scholar) URL

Are you open to working in-person in one of our offices 25% of the time? * Select…

When is the earliest you would want to start working with us?

Do you have any deadlines or timeline considerations we should be aware of?

AI Policy for Application * Select…

While we encourage people to use AI systems during their role to help them work faster and more effectively, please do not use AI assistants during the application process. We want to understand your personal interest in Anthropic without mediation through an AI system, and we also want to evaluate your non-AI-assisted communication skills. Please indicate ‘Yes’ if you have read and agree.

Why Anthropic? *

Why do you want to work at Anthropic? (We value this response highly – great answers are often 200-400 words.)

Will you now or will you in the future require employment visa sponsorship to work in the country in which the job you’re applying for is located? * Select…

Do you require visa sponsorship? * Select…

Additional Information *

Add a cover letter or anything else you want to share.

Are you open to relocation for this role? * Select…

What is the address from which you plan on working? If you would need to relocate, please type “relocating”.

Have you ever interviewed at Anthropic before? * Select…

Contact Detail:

Anthropic Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Staff Software Engineer, AI Reliability Engineering role

Tip Number 1

Familiarise yourself with the latest trends in AI reliability engineering. Understanding the current challenges and advancements in AI infrastructure will help you engage in meaningful conversations during interviews.

Tip Number 2

Network with professionals in the AI and reliability engineering fields. Attend relevant meetups, webinars, or conferences to connect with potential colleagues and learn more about Anthropic's work culture.

Tip Number 3

Prepare to discuss your experience with distributed systems and monitoring at scale. Be ready to share specific examples of how you've tackled reliability issues in past projects, as this will demonstrate your expertise.

Tip Number 4

Showcase your communication skills by practicing clear and concise explanations of complex technical concepts. This is crucial for bridging the gap between ML engineers and infrastructure teams, which is a key aspect of the role.

We think you need these skills to ace the Staff Software Engineer, AI Reliability Engineering role

Distributed Systems Observability
Monitoring at Scale
Service Level Objectives (SLO) Implementation
AI Infrastructure Management
High-Availability System Design
Automated Failover and Recovery Systems
Incident Response Management
Cost Optimisation for AI Infrastructure
Chaos Engineering
Resilience Testing
Communication Skills
Machine Learning Hardware Accelerators (GPUs, TPUs, Trainium)
ML-Specific Networking Optimisations (RDMA, InfiniBand)
AI-Specific Observability Tools
ML Model Deployment Strategies

Some tips for your application 🫡

Tailor Your Application: Make sure to customise your CV and cover letter to highlight your experience with reliability engineering, distributed systems, and AI infrastructure. Use specific examples that align with the responsibilities mentioned in the job description.

Showcase Relevant Skills: Emphasise your skills in monitoring systems, SLO/SLA frameworks, and incident response. Mention any experience you have with chaos engineering or resilience testing, as these are crucial for the role.

Craft a Compelling 'Why Anthropic?' Response: Take time to articulate why you want to work at Anthropic. Reflect on their mission to create reliable AI systems and how your values align with theirs. A strong answer can set you apart from other candidates.

Follow Application Guidelines: Ensure you follow all application instructions carefully. Provide either your LinkedIn profile or resume, and avoid using AI tools during the application process to demonstrate your personal interest and communication skills.

How to prepare for a job interview at Anthropic

Understand the Role

Make sure you thoroughly understand the responsibilities of a Staff Software Engineer in AI Reliability Engineering. Familiarise yourself with concepts like Service Level Objectives (SLOs), monitoring systems, and incident response. This will help you articulate how your experience aligns with the role.

Showcase Relevant Experience

Prepare to discuss your past experiences with distributed systems, AI infrastructure, and reliability metrics. Be ready to provide specific examples of how you've implemented SLO/SLA frameworks or handled incident responses in previous roles.

Communicate Effectively

Since communication is highly valued at Anthropic, practice explaining complex technical concepts in simple terms. This will demonstrate your ability to bridge the gap between ML engineers and infrastructure teams, which is crucial for this role.

Research Anthropic's Work

Familiarise yourself with Anthropic's recent research and projects. Understanding their mission and the impact of their work on AI safety and reliability will show your genuine interest in the company and its goals during the interview.
