At a Glance
- Tasks: Design, build, and maintain scalable infrastructures while troubleshooting production issues.
- Company: Join a fast-paced startup focused on AI/ML solutions and high-performance computing.
- Benefits: Participate in open-source projects and contribute to research publications and conferences.
- Other info: Experience with tools like Docker, Kubernetes, Prometheus, and Terraform is essential.
- Why this job: Shape the reliability and performance of a cutting-edge Cloud platform.
- Qualifications: Master’s degree in Computer Science and 5+ years in a DevOps/SRE role required.
The predicted salary is between £60,000 and £80,000 per year.
Requirements
- Master’s degree in Computer Science, Engineering or a related field
- 5+ years of experience in a DevOps/SRE role
- Strong experience with bare metal infrastructure and highly available distributed systems
- Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)
- Experience working against reliability KPIs (observability, alerting, SLAs)
- Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)
- Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)
- Familiarity with infrastructure-as-code tools like Terraform or CloudFormation
- Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices
- Strong understanding of networking, security, and system administration concepts
- Excellent problem-solving and communication skills
- Self-motivated and able to work well in a fast-paced startup environment
- Experience in an AI/ML environment
- Experience with high-performance computing (HPC) systems and workload managers (Slurm)
- Experience with modern AI-oriented solutions (Fluidstack, CoreWeave, Vast...)
What the job involves
- We are seeking highly experienced Site Reliability Engineers (SREs) to shape the reliability, scalability and performance of our Cloud platform and customer-facing applications.
- You will work closely with our software engineers and product teams to ensure our systems meet and exceed our internal and external customers' expectations.
- Design, build, and maintain scalable, highly available and fault-tolerant infrastructures.
- Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, user administration, data extraction, infrastructure scaling, etc.).
- Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime.
- Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our customer-facing APIs and large training runs.
- Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences.
- Drive continuous improvement in infrastructure automation, deployment, and orchestration.
- Collaborate with software engineers to develop and implement solutions that enable safe and reproducible model-training experiments.
- Help build a cloud platform offering an abstraction layer between science, engineering and infrastructure.
- Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.).
- Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements.
- Document processes and procedures to ensure consistency and knowledge sharing across the team.
- Contribute to open-source projects, research publications, blog articles and conferences.
Site Reliability Engineer (Mistral Cloud) employer: Mistral AI
This innovative startup is located in a dynamic tech hub, offering opportunities to work on AI-oriented solutions. Employees enjoy contributing to open-source projects and engaging in research publications, fostering a collaborative environment.
We think you need these skills to ace the Site Reliability Engineer (Mistral Cloud) role
DevOps
Site Reliability Engineering (SRE)
Bare Metal Infrastructure
Distributed Systems
Root Cause Analysis
CI/CD
Containerization