At a Glance
- Tasks: Join us in enhancing AI reliability and ensuring Claude's performance across critical systems.
- Company: Dynamic tech company focused on AI and reliability engineering.
- Benefits: Competitive salary, hybrid work model, visa sponsorship, and opportunities for professional growth.
- Other info: Exciting role with exposure to cutting-edge AI technologies and significant career advancement potential.
- Why this job: Make a real impact on AI systems while collaborating with diverse teams.
- Qualifications: Experience in distributed systems, strong communication skills, and a passion for reliability.
The predicted salary is between €70,000 and €90,000 per year.
Requirements
- Have strong distributed systems, infrastructure, or reliability backgrounds -- we're looking for reliability-minded software engineers and SREs.
- Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don't have deep expertise yet.
- Think holistically about how systems compose and where the seams are.
- Can build lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.
- Care about users and feel ownership over outcomes, even for systems you don't own.
- Have excellent communication and collaboration skills -- you'll be partnering across the entire company.
- Bring diverse experience -- the team's strength comes from people who've built product stacks, scaled databases, run massive distributed systems, and everything in between.
Desirable
- Have been an SRE or Production Engineer, or held a similar reliability-focused role, on large-scale systems.
- Have experience operating large-scale model serving or training infrastructure (>1000 GPUs).
- Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).
- Understand ML-specific networking optimizations like RDMA and InfiniBand.
- Have expertise in AI-specific observability tools and frameworks.
- Have experience with chaos engineering and systematic resilience testing.
- Have contributed to open-source infrastructure or ML tooling.
Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience.
Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.
Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.
What the job involves
Claude has your back. AIRE has Claude's. Help us keep Claude reliable for everyone who depends on it.
AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across our most critical serving paths -- every hop from the SDK through our network, API layers, serving infrastructure, and accelerators, and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, whether during an incident or while collaborating on projects. Reliability here is an emergent phenomenon that transcends any single team's boundaries, so someone has to zoom out and look at the whole picture. That's us -- and it means few teams at Anthropic offer this kind of dynamic, cross-cutting exposure to the systems that matter most.
- Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.
- Design and implement monitoring and observability systems across the token path.
- Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
- Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
- Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic's safety commitments.
Senior Software Engineer (AI Reliability Engineering) in London, employer: Deepstreamtech
At Anthropic, we pride ourselves on being an exceptional employer that fosters a culture of collaboration and innovation. Our Senior Software Engineers in AI Reliability Engineering enjoy a dynamic work environment where they can engage with diverse teams, contribute to critical projects, and develop their skills in cutting-edge technologies. With a strong commitment to employee growth, competitive benefits, and a hybrid work policy that promotes work-life balance, Anthropic is the ideal place for those looking to make a meaningful impact in the AI landscape.
StudySmarter Expert Advice🤫
We think this is how you could land the Senior Software Engineer (AI Reliability Engineering) role in London
✨Tip Number 1
Network, network, network! Get out there and connect with folks in the industry. Attend meetups, webinars, or even just grab a coffee with someone who works in AI reliability engineering. Building relationships can open doors that a CV just can't.
✨Tip Number 2
Show off your skills in real-time! Consider contributing to open-source projects or engaging in hackathons. This not only showcases your expertise but also demonstrates your commitment to the field and your ability to collaborate with others.
✨Tip Number 3
When you land an interview, be ready to dive deep into your experiences. Share specific examples of how you've tackled reliability issues or improved systems. We want to hear about your thought process and how you approach problem-solving!
✨Tip Number 4
Don't forget to apply through our website! It’s the best way to ensure your application gets the attention it deserves. Plus, we love seeing candidates who are proactive about joining our team!
Some tips for your application 🫡
Show Your Reliability Mindset: Make sure to highlight your experience with distributed systems and reliability in your application. We want to see how you've tackled challenges in the past and how you think about system resilience.
Be Curious and Brave: Don’t shy away from mentioning times when you jumped into unfamiliar territory to solve a problem. We love candidates who are willing to dive into the unknown and help drive resolution, even if they don’t have all the answers right away.
Communicate Clearly: Since collaboration is key for us, ensure your application reflects your communication skills. Share examples of how you've built relationships across teams and contributed to projects, as this will show us you're a team player.
Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to keep track of your application and ensure it gets the attention it deserves. Plus, we can’t wait to see what you bring to the table!
How to prepare for a job interview at Deepstreamtech
✨Know Your Systems
Make sure you brush up on your knowledge of distributed systems and reliability engineering. Be ready to discuss your past experiences with large-scale systems, especially any incidents you've handled. This will show that you're not just familiar with the theory but have practical experience too.
✨Show Your Curiosity
During the interview, don't hesitate to ask questions about the systems you'll be working with. Demonstrating your curiosity and willingness to dive into unfamiliar territory can set you apart. It shows that you're brave and ready to tackle challenges head-on.
✨Communicate Effectively
Since collaboration is key in this role, practice articulating your thoughts clearly. Use examples from your previous work to illustrate how you've built relationships across teams. Good communication skills can make a huge difference in how you're perceived by the interviewers.
✨Emphasise User Ownership
Be prepared to discuss how you take ownership of outcomes, even for systems you don’t directly manage. Share specific instances where you’ve prioritised user needs and contributed to improving system reliability. This will resonate well with the team’s focus on user-centric solutions.