At a Glance
- Tasks: Own observability, design distributed systems, and architect auto-scaling infrastructure.
- Company: Fast-growing developer infrastructure startup with a dynamic team.
- Benefits: Competitive salary, equity, and remote work options.
- Other info: Join a small team where engineers own their projects and collaborate directly with founders.
- Why this job: Tackle real challenges in scaling and make a significant impact.
- Qualifications: Experience in observability, distributed systems, TypeScript/Go, and Kubernetes.
The predicted salary is between £136,000 and £180,000 per year.
We're partnering with a fast-growing developer infrastructure startup on a senior SRE hire at a pivotal moment in their growth. The platform runs AI agents and background workflows in production at massive scale, handling hundreds of millions of executions per month on infrastructure they run themselves.
The team is ~13 people, with no engineering managers. Engineers own large parts of the system and work directly with the founders. The core challenge right now is scale. Execution volume is growing faster than the team can build, which means the next hires are walking into genuine distributed systems problems, not a greenfield rebuild or a dashboard feature.
What you'll be working on:
- Owning observability across the platform: OpenTelemetry, metrics, logs, traces, and making them genuinely useful at 3am
- Designing and operating distributed systems primitives under real production load — queues, schedulers, checkpoints, backpressure
- Architecting and tuning auto-scaling infrastructure that runs untrusted customer code at high throughput
- Hardening multi-tenant sandbox isolation, secrets handling, network policy, and supply chain security
- Owning Terraform and IaC as a first principle across a cloud-native footprint
- Running on-call practice: SLOs, runbooks, blameless postmortems, paging hygiene
What they're looking for:
- Strong observability background: production experience with OpenTelemetry, Prometheus or equivalent
- Distributed systems experience: you've designed or operated systems with non-trivial failure modes
- Strong with TypeScript and/or Go; the codebase is TypeScript-heavy with Go emerging as a second language
- Self-managed Kubernetes in production, not just managed control planes
- Performance and scaling instincts; you've chased real bottlenecks across app, database, and infra layers
- Terraform as a first principle, run at meaningful scale
- Security mindset — multi-tenant isolation, least privilege, threat modelling
- Postgres and Redis under load, AWS strongly preferred
The process:
- Screening call
- Hiring manager conversation
- Technical interview, with roughly a 10% pass rate
- Final with the wider team
The bar is high, but if you find that motivating rather than off-putting, that's probably a good sign.
Site Reliability Engineer employer: Wave Talent
Join a dynamic and innovative startup that prioritises employee ownership and direct collaboration with founders, offering a unique opportunity to tackle real-world distributed systems challenges at scale. With a strong focus on observability and security, this role not only provides competitive compensation and equity but also fosters a culture of growth and learning in a supportive remote environment across Europe or London. Embrace the chance to make a significant impact while working alongside a small, dedicated team passionate about pushing the boundaries of developer infrastructure.
StudySmarter Expert Advice🤫
We think this is how you could land the Site Reliability Engineer role
✨Tip Number 1
Network like a pro! Reach out to current employees on LinkedIn or other platforms. Ask them about their experiences and the company culture. This can give you insider info and might even lead to a referral!
✨Tip Number 2
Prepare for the technical interview by brushing up on your distributed systems knowledge. Dive deep into topics like observability, scaling, and security practices. We recommend doing mock interviews with friends or using online platforms to get comfortable.
✨Tip Number 3
Showcase your projects! If you've worked on relevant projects, make sure to highlight them during interviews. Discuss the challenges you faced and how you overcame them, especially in areas like Terraform and Kubernetes.
✨Tip Number 4
Apply through our website! It’s the best way to ensure your application gets seen. Plus, it shows you're genuinely interested in joining our team. Don’t forget to follow up after applying; a little persistence goes a long way!
Some tips for your application 🫡
Tailor Your CV: Make sure your CV speaks directly to the job description. Highlight your experience with distributed systems, observability tools like OpenTelemetry, and any relevant coding skills in TypeScript or Go. We want to see how your background aligns with our needs!
Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're excited about the role and how you can tackle the challenges we face at StudySmarter. Be genuine and let your personality come through – we love that!
Showcase Your Projects: If you've worked on any projects that demonstrate your skills in scaling infrastructure or managing observability, make sure to include them. We’re keen to see real-world examples of your work and how you’ve solved complex problems.
Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to keep track of your application and ensure it gets the attention it deserves. Plus, it shows you’re serious about joining our team!
How to prepare for a job interview at Wave Talent
✨Know Your Tech Inside Out
Make sure you’re well-versed in the technologies mentioned in the job description, especially OpenTelemetry, TypeScript, and Go. Brush up on your experience with distributed systems and be ready to discuss specific challenges you've faced and how you overcame them.
✨Showcase Your Problem-Solving Skills
Prepare to talk about real-world scenarios where you tackled performance bottlenecks or scaling issues. Use examples that highlight your ability to think critically and act decisively under pressure, especially in production environments.
✨Demonstrate Your Security Mindset
Given the emphasis on security in the role, be prepared to discuss your approach to multi-tenant isolation and threat modelling. Share any relevant experiences where you implemented security measures in a cloud-native environment.
✨Engage with the Team's Culture
Since the team is small and collaborative, show your enthusiasm for working closely with others. Be ready to discuss how you’ve contributed to team dynamics in previous roles and how you can bring that same energy to their startup environment.