Lead Infrastructure Engineer – AI Clusters & SRE

Full-Time | No working from home possible

At a Glance

Tasks: Lead the development of cutting-edge AI infrastructure and ensure system reliability.
Company: Join Anthropic, a leader in AI technology with a mission to create safe AI systems.
Benefits: Competitive salary, equity options, unlimited PTO, and comprehensive health benefits.
Other info: Diverse team culture with excellent career growth opportunities and support for underrepresented groups.
Why this job: Make a real impact on AI technology while working with top talent in a dynamic environment.
Qualifications: 4+ years in infrastructure engineering, strong programming skills, and a passion for scalable systems.

Anthropic is seeking talented and experienced Infrastructure Engineers to join our team and support the development, scaling, and maintenance of our cutting-edge AI systems. By joining our Infrastructure team, you will have the opportunity to work on groundbreaking AI technologies and contribute to the development of frontier models, supporting Anthropic's mission to create safe and reliable AI systems that benefit humanity.

We currently have openings on:

Data Infrastructure: The Data Infrastructure team is responsible for designing, building, and maintaining the data infrastructure that powers our AI research and products. You will collaborate with cross-functional teams to understand data requirements, deliver efficient and reliable data solutions, and continuously improve our data infrastructure. Your role will involve building and optimizing data pipelines, implementing data governance best practices, monitoring and troubleshooting, and setting technical strategies for high-scale, reliable data infrastructure and pipelines. You will work with technologies such as Spark, Airflow, dbt, and cloud services from GCP and AWS.
Research Infrastructure: The Research Infrastructure team develops and scales the systems that let researchers iterate quickly. It also scales key systems and components that researchers use during development so that they work at production scale as our model footprint grows.
Site Reliability Engineering: As an SRE at Anthropic, you will design and implement scalable solutions, collaborate with development teams to improve infrastructure reliability, and establish monitoring systems, SLOs, and SLIs. You will implement fault-tolerant design patterns, build automation tools, and participate in an on-call rotation. Utilizing IaC principles, you will collaborate with cross-functional teams to ensure reliability and scalability in new features and services.
Systems: The systems team is responsible for supporting some of the largest, most sophisticated clusters in industry used to train, research, and ultimately serve AI models. Your work will be crucial in ensuring Anthropic is able to continue reliably and safely training frontier models. You will be responsible for building systems and running large Kubernetes clusters with GPU/TPU/Trainium workloads.
Observability: The observability team is responsible for designing, building, and maintaining the observability infrastructure that ensures the reliability, performance, and efficiency of our AI systems and services. You will collaborate with cross-functional teams to understand their observability requirements and deliver solutions using technologies such as Prometheus, Splunk, Cloud Logging, Grafana, and Honeycomb. Your role will involve developing a config-driven approach to manage dashboards and alerts, implementing structured logging and tracing, optimizing the observability stack, and building a reliable system that requires minimal maintenance.

Responsibilities:

Lead the build-out of industry-leading AI clusters (thousands to hundreds of thousands of machines), partnering closely with cloud service providers on cluster build-out and required features.
Consult with different stakeholders to deeply understand infrastructure, data and compute needs, identifying potential solutions to support frontier research and product development.
Set technical strategy and oversee development of high-scale, reliable infrastructure systems.
Mentor top technical talent.
Design processes (e.g. postmortem review, incident response, on-call rotations) that help the team operate effectively and never fail the same way twice.

You may be a good fit if you:

Have 4+ years of relevant industry experience and 1+ years leading large-scale, complex projects or teams as an engineer or tech lead.
Are obsessed with distributed systems at scale, infrastructure reliability, scalability, security, and continuous improvement.
Have strong proficiency in at least one programming language (e.g., Python, Rust, Go, Java).
Have strong problem-solving skills and the ability to work independently.
Have a passion for supporting internal partners like research to understand their needs.
Have excellent communication skills to build consensus with stakeholders, both internally and externally.
Possess deep knowledge of modern cloud infrastructure including Kubernetes, Infrastructure as Code, AWS, and GCP.

Strong candidates may also:

Have security and privacy best practice expertise.
Have experience with machine learning infrastructure like GPUs, TPUs, or Trainium, as well as supporting networking infrastructure like NCCL.
Have low-level systems experience, for example Linux kernel tuning and eBPF.

Technical expertise: Quickly understanding systems design tradeoffs, keeping track of rapidly evolving software systems.

Deadline to apply: None. Applications will be reviewed on a rolling basis.

The expected salary range for this position is: Annual Salary: £225,000—£390,000 GBP.

Logistics:

Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.
US visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate; operations roles are especially difficult to support.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work. We think AI systems like the ones we're building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team.

Compensation and Benefits:

Anthropic’s compensation package consists of three elements: salary, equity, and benefits. We are committed to pay fairness and aim for these three elements collectively to be highly competitive with market rates.

Equity: For eligible roles, equity will be a major component of the total compensation.
US Benefits: The following benefits are for our US-based employees:

Optional equity donation matching.
Comprehensive health, dental, and vision insurance for you and all your dependents.
401(k) plan with 4% matching.
22 weeks of paid parental leave.
Unlimited PTO – most staff take between 4-6 weeks each year, sometimes more!
Stipends for education, home office improvements, commuting, and wellness.
Fertility benefits via Carrot.
Daily lunches and snacks in our office.
Relocation support for those moving to the Bay Area.

UK Benefits: The following benefits are for our UK-based employees:

Optional equity donation matching.
Private health, dental, and vision insurance for you and your dependents.
Pension contribution (matching 4% of your salary).
21 weeks of paid parental leave.
Unlimited PTO – most staff take between 4-6 weeks each year, sometimes more!
Health cash plan.
Life insurance and income protection.
Daily lunches and snacks in our office.

Lead Infrastructure Engineer – AI Clusters & SRE employer: Anthropic

Anthropic is an exceptional employer for those passionate about advancing reinforcement learning in a collaborative and innovative environment. With competitive compensation, generous vacation and parental leave, and flexible working hours, employees enjoy a supportive work culture that prioritises both personal and professional growth. Located in a vibrant office space, team members have the unique opportunity to engage directly with cutting-edge research while making meaningful contributions to the responsible scaling of AI.

Contact Details:

Anthropic Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land Lead Infrastructure Engineer – AI Clusters & SRE

✨Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have the inside scoop on job openings or can put in a good word for you.

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects and contributions. This is a great way to demonstrate your expertise in infrastructure engineering and make a lasting impression.

✨Tip Number 3

Prepare for interviews by brushing up on technical concepts and common questions. Practice explaining your thought process and problem-solving approach, especially around distributed systems and cloud infrastructure.

✨Tip Number 4

Don't forget to apply through our website! It’s the best way to ensure your application gets seen. Plus, we love seeing candidates who are genuinely interested in joining our team at Anthropic.

We think you need these skills to ace Lead Infrastructure Engineer – AI Clusters & SRE

Infrastructure Engineering

Data Infrastructure Design

Data Pipeline Optimization

Cloud Services (GCP, AWS)

Kubernetes Management

Site Reliability Engineering (SRE)

Programming (Python, Rust, Go, Java)

Distributed Systems

Infrastructure as Code (IaC)

Monitoring Systems (Prometheus, Grafana)

Incident Response Processes

Technical Strategy Development

Communication Skills

Problem-Solving Skills

Machine Learning Infrastructure

Some tips for your application 🫡

Tailor Your Application: Make sure to customise your CV and cover letter for the Lead Infrastructure Engineer role. Highlight your experience with distributed systems, cloud infrastructure, and any relevant projects that showcase your skills in building scalable solutions.

Showcase Your Technical Skills: Don’t hold back on your technical expertise! Mention your proficiency in programming languages like Python or Go, and any hands-on experience with Kubernetes or AWS. We want to see how you can contribute to our cutting-edge AI systems.

Communicate Clearly: Your written application should reflect your excellent communication skills. Be clear and concise, and make sure to explain your thought process when discussing past projects or challenges you've faced. This helps us understand how you approach problem-solving.

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it shows you’re keen on joining our team at Anthropic!

How to prepare for a job interview at Anthropic

✨Know Your Tech Inside Out

Make sure you’re well-versed in the technologies mentioned in the job description, like Kubernetes, AWS, and GCP. Brush up on your programming skills in languages like Python or Go, as you might be asked to solve technical problems on the spot.

✨Showcase Your Problem-Solving Skills

Prepare to discuss specific challenges you've faced in previous roles, especially related to infrastructure reliability and scalability. Use the STAR method (Situation, Task, Action, Result) to structure your answers and highlight your problem-solving prowess.

✨Understand Their Mission

Familiarise yourself with Anthropic's mission to create safe and reliable AI systems. Be ready to discuss how your experience aligns with their goals and how you can contribute to their cutting-edge projects.

✨Ask Insightful Questions

Prepare thoughtful questions that show your interest in the role and the company. Inquire about their current projects, team dynamics, or how they measure success in their infrastructure initiatives. This not only demonstrates your enthusiasm but also helps you gauge if the company is the right fit for you.
