At a Glance
- Tasks: Ensure reliability and performance of our global GPU cloud while tackling complex production issues.
- Company: Fluidstack, a leader in AI infrastructure, partnering with top labs and enterprises.
- Benefits: Competitive salary, equity options, health insurance, and generous PTO.
- Why this job: Join us to accelerate the future of intelligence and work on cutting-edge technology.
- Qualifications: 2+ years in SRE, DevOps, or similar roles; strong coding skills in Go, Python, Bash.
- Other info: Dynamic environment with opportunities for growth and innovation.
The predicted salary is between £126,000 and £232,000 per year.
About Fluidstack
At Fluidstack, we’re building the infrastructure for abundant intelligence. We partner with top AI labs, governments, and enterprises to unlock compute at the speed of light. We’re working with urgency to make AGI a reality. Our team is highly motivated and committed to delivering world‑class infrastructure. We treat our customers’ outcomes as our own, taking pride in the systems we build and the trust we earn. If you’re motivated by purpose, obsessed with excellence, and ready to work very hard to accelerate the future of intelligence, join us in building what’s next.
About the Role
Senior / Staff SREs at Fluidstack sit at the core of our infrastructure, working across software, hardware, and operations to ensure the reliability and performance of our global GPU cloud. They partner closely with teams including networking, platform engineering, and data center operations to build systems that scale with the demands of AI workloads. SREs are hands‑on and possess deep systems knowledge and strong communication skills. You’ll be responsible for tackling complex production issues, deploying resilient infrastructure, and continuously improving the stability and observability of our platform as we grow.
A typical day may involve:
- Deploying clusters of 1,000+ GPUs using custom-written playbooks, and modifying these tools as necessary to fit each customer's requirements.
- Validating correctness and performance of underlying compute, storage, and networking infrastructure, and working with providers to optimize these subsystems.
- Migrating petabytes of data from public cloud platforms to local storage as quickly and cost-effectively as possible.
- Debugging issues anywhere in the stack, from “this server’s fan is blocked by a plastic bag” to “optimizing S3 dataloaders from buckets in different regions”.
- Building internal tooling to decrease deployment time and increase cluster reliability, including automation where the customer benefits clearly outweigh the implementation overhead.
This role includes participating in an on‑call rotation for up to one week per month.
Focus
- A customer‑centric attitude, an accountability mindset, and a bias to action.
- A track record of shipping clean, well‑documented code in complex environments.
- An ability to create structure from chaos, navigate ambiguity, and adapt to the dynamic nature of the AI ecosystem.
- Strong technical and interpersonal communication skills, a low ego, and a positive mental attitude.
Minimum Requirements
- 2+ years of SRE, DevOps, Sysadmin, and/or HPC engineering experience.
- Great verbal and written communication skills in English.
- Experience deploying and operating Kubernetes and/or SLURM clusters.
- Experience writing Go, Python, and Bash.
- Experience using Ansible, Terraform, and other automation or IaC tools.
- Strong engineering background, preferably in Computer Science, Software Engineering, Math, Computer Engineering, or similar fields.
Nice to Haves
- You have built and operated an AI workload at 1000+ GPU scale.
- You have built multi‑tenant, hyperscale Kubernetes-based services.
- You have physically deployed infrastructure in a datacenter, or managed bare metal hardware via MAAS, NetBox, or similar tools.
- You have deployed and managed multi‑tenant InfiniBand or RoCE networks.
- You have deployed and managed petabyte-scale all‑flash storage systems, such as DDN, VAST, or Weka, or open-source systems such as Ceph or Lustre.
Salary & Benefits
- Competitive total compensation package (salary + equity).
- Retirement or pension plan, in line with local norms.
- Health, dental, and vision insurance.
- Generous PTO policy, in line with local norms.
The base salary range for this position is $175,000 - $320,000 per year, depending on experience, skills, qualifications, and location. This range represents our good faith estimate of the compensation for this role at the time of posting. Total compensation may also include equity in the form of stock options. We are committed to pay equity and transparency.
Fluidstack is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans’ status, or any other characteristic protected by law. Fluidstack will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.
Senior / Staff Site Reliability Engineer employer: FluidStack
Contact Details:
FluidStack Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Senior / Staff Site Reliability Engineer role
✨Tip Number 1
Network like a pro! Reach out to current employees at Fluidstack on LinkedIn or other platforms. Ask them about their experiences and any tips they might have for the interview process. It’s all about making connections!
✨Tip Number 2
Prepare for technical interviews by brushing up on your SRE skills. Practice coding challenges, especially in Go, Python, and Bash. We recommend using platforms like LeetCode or HackerRank to get those problem-solving muscles flexed!
✨Tip Number 3
Showcase your projects! If you’ve built or contributed to any relevant tools or systems, make sure to highlight them during your interviews. We love seeing hands-on experience that aligns with what we do at Fluidstack.
✨Tip Number 4
Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re genuinely interested in joining our team at Fluidstack.
Some tips for your application 🫡
Tailor Your Application: Make sure to customise your CV and cover letter for the Senior / Staff SRE role. Highlight your experience with Kubernetes, automation tools, and any relevant projects that showcase your skills in building reliable infrastructure.
Showcase Your Communication Skills: Since strong communication is key for this role, don’t shy away from demonstrating your ability to explain complex technical concepts clearly. Use examples from your past experiences where you successfully collaborated with teams or resolved issues.
Be Specific About Your Experience: When detailing your experience, be specific about the technologies you've worked with, like Go, Python, or Terraform. Mention any large-scale projects you've been involved in, especially those related to AI workloads or GPU clusters.
Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it shows your enthusiasm for joining our team at Fluidstack!
How to prepare for a job interview at FluidStack
✨Know Your Tech Inside Out
Make sure you’re well-versed in the technologies mentioned in the job description, like Kubernetes, SLURM, and automation tools like Ansible and Terraform. Brush up on your coding skills in Go, Python, and Bash, as you might be asked to demonstrate your knowledge during the interview.
✨Showcase Your Problem-Solving Skills
Prepare to discuss specific examples of complex production issues you've tackled in the past. Fluidstack values hands-on experience, so be ready to explain how you approached these challenges and what solutions you implemented to improve system reliability.
✨Communicate Clearly and Confidently
Strong communication skills are crucial for this role. Practice articulating your thoughts clearly, especially when discussing technical concepts. Remember, it’s not just about what you know, but how you convey that knowledge to others.
✨Emphasise Your Customer-Centric Mindset
Fluidstack is all about customer outcomes, so be prepared to discuss how you’ve prioritised customer needs in your previous roles. Share examples of how you’ve built systems or tools that directly benefited users, showcasing your accountability and bias to action.