Site Reliability Engineer, Inference Infrastructure

Full-Time · £70,000 to £90,000 per year (est.) · No working from home possible

At a Glance

Tasks: Build and manage high-performance AI systems for cutting-edge applications.
Company: Join a diverse team at Cohere, leading the AI revolution.
Benefits: Enjoy flexible remote work, generous vacation, and health perks.
Other info: Collaborative culture with opportunities for personal and professional growth.
Why this job: Shape the future of AI while working with top-notch talent.
Qualifications: 5+ years in engineering with expertise in Kubernetes and distributed systems.

The predicted salary is between £70,000 and £90,000 per year.

Who are we?

Our mission is to scale intelligence to serve humanity. We’re training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.

We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. We like to work hard and move fast to do what’s best for our customers.

Cohere is a team of researchers, engineers, designers, and more, who are passionate about their craft. Each person is one of the best in the world at what they do. We believe that a diverse range of perspectives is a requirement for building great products. Join us on our mission and shape the future!

Why this role?

Are you energized by building high-performance, scalable, and reliable machine learning systems? Do you want to help define and build the next generation of AI platforms powering advanced NLP applications? We are looking for a Site Reliability Engineer to join the Model Serving team at Cohere. The team is responsible for developing, deploying, and operating the AI platform that delivers Cohere's large language models through easy-to-use API endpoints. In this role, you will work closely with many teams to deploy optimized NLP models to production in low-latency, high-throughput, and high-availability environments. You will also get the opportunity to interface with customers and create customized deployments to meet their specific needs.

As a Site Reliability Engineer You Will:

Build self-service systems that automate managing, deploying and operating services. This includes our custom Kubernetes operators that support language model deployments.
Automate environment observability and resilience.
Enable all developers to troubleshoot and resolve problems.
Take steps required to ensure we hit defined SLOs, including participation in an on-call rotation.
Build strong relationships with internal developers and influence the Infrastructure team’s roadmap based on their feedback.
Develop our team through knowledge sharing and an active review process.

You May Be a Good Fit If You Have:

5+ years of engineering experience running production infrastructure at a large scale.
Experience designing large, highly available distributed systems with Kubernetes, and GPU workloads on those clusters.
Experience with Kubernetes development, production coding, and support.
Experience with GCP, Azure, AWS, OCI, and multi-cloud, on-prem, or hybrid serving.
Experience in designing, deploying, supporting, and troubleshooting complex Linux-based computing environments.
Experience in compute/storage/network resource and cost management.
Excellent collaboration and troubleshooting skills to build mission-critical systems, and ensure smooth operations and efficient teamwork.
The grit and adaptability to solve complex technical challenges that evolve day to day.
Familiarity with computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators), especially how they influence latency and throughput of inference.
Strong understanding or working experience with distributed systems.
Experience in Golang, C++ or other languages designed for high-performance scalable servers.

We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.

Full-Time Employees At Cohere Enjoy These Perks:

An open and inclusive culture and work environment.
Work closely with a team on the cutting edge of AI research.
Weekly lunch stipend, in-office lunches & snacks.
Full health and dental benefits, including a separate budget to take care of your mental health.
100% Parental Leave top-up for up to 6 months.
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement.
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend.
6 weeks of vacation (30 working days!).

Site Reliability Engineer, Inference Infrastructure employer: Cohere

Cohere is an exceptional employer that champions innovation and collaboration, particularly in the dynamic field of AI deployment. With a flexible remote work environment, generous vacation policies, and robust training stipends, we prioritise employee well-being and professional growth. Join us to be part of a forward-thinking team that is making significant impacts in sectors like finance and healthcare.

Contact Details:

Cohere Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land Site Reliability Engineer, Inference Infrastructure

✨Tip Number 1

Network like a pro! Reach out to current employees at Cohere on LinkedIn or other platforms. Ask them about their experiences and any tips they might have for landing a role like the Site Reliability Engineer position.

✨Tip Number 2

Prepare for technical interviews by brushing up on your skills. Focus on Kubernetes, distributed systems, and troubleshooting in complex environments. We recommend doing mock interviews with friends or using online platforms to get comfortable.

✨Tip Number 3

Showcase your passion for AI and reliability engineering during interviews. Share specific projects or experiences where you’ve built scalable systems or solved complex problems. This will help us see how you can contribute to our mission.

✨Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re genuinely interested in joining our team at Cohere.

We think you need these skills to ace Site Reliability Engineer, Inference Infrastructure

Site Reliability Engineering

Kubernetes

Machine Learning Systems

NLP Applications

Production Infrastructure

GCP

Azure

AWS

Linux-based Computing Environments

Distributed Systems

Golang

C++

Troubleshooting Skills

Collaboration Skills

Adaptability

Some tips for your application 🫡

Show Your Passion: When you're writing your application, let your enthusiasm for AI and machine learning shine through. We want to see that you’re not just looking for a job, but that you’re genuinely excited about contributing to our mission of scaling intelligence to serve humanity.

Tailor Your Experience: Make sure to highlight your relevant experience in building scalable systems and working with Kubernetes. We love seeing how your background aligns with the role, so don’t be shy about showcasing your skills in distributed systems and cloud environments.

Be Clear and Concise: Keep your application straightforward and to the point. We appreciate clarity, so avoid jargon and focus on what makes you a great fit for the Site Reliability Engineer position. Remember, we’re looking for someone who can communicate effectively!

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it shows you’re serious about joining our team at Cohere!

How to prepare for a job interview at Cohere

✨Know Your Stuff

Make sure you brush up on your knowledge of Kubernetes, distributed systems, and the specific technologies mentioned in the job description. Be ready to discuss your past experiences with high-performance, scalable systems and how you've tackled challenges in those areas.

✨Show Your Passion for AI

Cohere is all about building magical AI experiences, so let your enthusiasm shine through! Talk about projects you've worked on that relate to AI and NLP, and express why you're excited about contributing to the future of AI platforms.

✨Prepare for Technical Questions

Expect some technical grilling during the interview. Practice explaining complex concepts clearly and concisely. You might be asked to solve problems on the spot, so consider doing mock interviews or coding challenges to sharpen your skills.

✨Build Relationships

Since this role involves collaboration with various teams, demonstrate your teamwork skills. Share examples of how you've built strong relationships in previous roles and how you’ve influenced project outcomes through effective communication and collaboration.
