Senior Software Engineer (AI Reliability Engineering)

Full-Time 70000 - 90000 € / year (est.) Home office (partial)
Deepstreamtech

At a Glance

  • Tasks: Join us in enhancing AI reliability and ensuring Claude serves users seamlessly.
  • Company: Dynamic tech company focused on AI and reliability engineering.
  • Benefits: Competitive salary, visa sponsorship, hybrid work model, and growth opportunities.
  • Other info: Engage with diverse teams and tackle exciting challenges in a supportive environment.
  • Why this job: Make a real impact on AI systems that millions rely on every day.
  • Qualifications: Experience in distributed systems and strong collaboration skills required.

The predicted salary is between 70000 and 90000 € per year.

Requirements

  • Have strong distributed systems, infrastructure, or reliability backgrounds -- we're looking for reliability-minded software engineers and SREs.
  • Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don't have deep expertise yet.
  • Think holistically about how systems compose and where the seams are.
  • Can build lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.
  • Care about users and feel ownership over outcomes, even for systems you don't own.
  • Have excellent communication and collaboration skills -- you'll be partnering across the entire company.
  • Bring diverse experience -- the team's strength comes from people who've built product stacks, scaled databases, run massive distributed systems, and everything in between.

Desirable

  • Have been an SRE, Production Engineer, or in a similar reliability-focused role on large-scale systems.
  • Have experience operating large-scale model serving or training infrastructure (>1000 GPUs).
  • Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).
  • Understand ML-specific networking optimizations like RDMA and InfiniBand.
  • Have expertise in AI-specific observability tools and frameworks.
  • Have experience with chaos engineering and systematic resilience testing.
  • Have contributed to open-source infrastructure or ML tooling.

Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience.

Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.

Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

What the job involves

Claude has your back. AIRE has Claude's. Help us keep Claude reliable for everyone who depends on it. AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across our most critical serving paths -- every hop from the SDK through our network, API layers, serving infrastructure, and accelerators and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, be it during an incident or collaborating on projects. Reliability here is an emergent phenomenon that transcends any single team's boundaries, so someone has to zoom out and look at the whole picture. That's us -- and it means few teams at Anthropic offer this kind of dynamic, cross-cutting exposure to the systems that matter most.

  • Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.
  • Design and implement monitoring and observability systems across the token path.
  • Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
  • Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
  • Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic's safety commitments.

Senior Software Engineer (AI Reliability Engineering) employer: Deepstreamtech

At Anthropic, we pride ourselves on fostering a collaborative and innovative work culture that empowers our employees to take ownership of their projects and drive meaningful outcomes. As a Senior Software Engineer in AI Reliability Engineering, you'll have the opportunity to work alongside talented teams, enhancing the reliability of critical systems while benefiting from our commitment to professional growth and development. With a hybrid work policy and visa sponsorship available, we ensure that our diverse workforce feels valued and supported in a dynamic environment.

Contact Details:

Deepstreamtech Recruiting Team

StudySmarter Expert Advice

We think this is how you could land the Senior Software Engineer (AI Reliability Engineering) role

✨Tip Number 1

Network, network, network! Get out there and connect with folks in the industry. Attend meetups, webinars, or even just grab a coffee with someone who works in AI reliability engineering. Building relationships can open doors that a CV just can't.

✨Tip Number 2

Show off your skills in real time! If you get the chance, participate in hackathons or coding challenges related to distributed systems. This not only showcases your expertise but also demonstrates your ability to collaborate under pressure.

✨Tip Number 3

When you land an interview, be ready to dive deep into your past experiences. Share specific examples of how you've tackled reliability issues or improved system performance. We love hearing about your hands-on experience!

✨Tip Number 4

Don't forget to apply through our website! It’s the best way to ensure your application gets the attention it deserves. Plus, we’re always on the lookout for passionate individuals who care about user outcomes and system reliability.

We think you need these skills to ace the Senior Software Engineer (AI Reliability Engineering) role

Distributed Systems
Infrastructure Reliability
Incident Management
Cross-Team Collaboration
Communication Skills
Large-Scale Systems Operation
Model Serving Infrastructure

Some tips for your application

Show Your Reliability Mindset: Make sure to highlight your experience with distributed systems and reliability in your application. We want to see how you've tackled challenges in the past and how you think about system resilience.

Be Curious and Brave: Don’t shy away from mentioning times when you jumped into unfamiliar territory to solve a problem. We love candidates who are willing to dive into the unknown and help drive resolution, even if they don’t have all the answers right away.

Communicate Clearly: Since collaboration is key for us, ensure your application reflects your communication skills. Share examples of how you've built relationships across teams and contributed to projects that required teamwork.

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and get to know you better. Plus, it shows you're serious about joining our team!

How to prepare for a job interview at Deepstreamtech

✨Know Your Systems

Make sure you brush up on your knowledge of distributed systems and reliability engineering. Be ready to discuss your past experiences with large-scale systems, especially any incidents you've managed. This will show that you're not just familiar with the theory but have practical insights to share.

✨Show Your Curiosity

During the interview, demonstrate your curiosity and willingness to dive into unfamiliar systems. Share examples of times when you tackled challenges without having all the answers upfront. This will highlight your bravery and problem-solving mindset, which are key traits for this role.

✨Build Connections

Emphasise your ability to build relationships across teams. Talk about how you've collaborated with different departments in the past and how you approach teamwork. This is crucial since the role involves working closely with various teams to enhance system reliability.

✨Communicate Clearly

Excellent communication skills are a must. Practice explaining complex technical concepts in simple terms, as you'll need to partner with non-technical teams too. Being able to articulate your thoughts clearly will set you apart and show that you can bridge gaps between teams effectively.