At a Glance
- Tasks: Own observability and design distributed systems for a fast-growing startup.
- Company: Dynamic developer infrastructure startup with a focus on innovation.
- Benefits: Competitive salary, equity options, and remote work flexibility.
- Other info: High standards and growth opportunities in a collaborative environment.
- Why this job: Join a small team tackling real-world challenges in a high-impact role.
- Qualifications: Experience in observability, distributed systems, and proficiency in TypeScript or Go.
The predicted salary is between £136,000 and £180,000 per year.
We're partnering with a fast-growing developer infrastructure startup on a senior SRE hire at a pivotal moment in their growth. The platform runs AI agents and background workflows in production at massive scale, handling hundreds of millions of executions per month on infrastructure they run themselves. The team is ~13 people. No engineering managers. Engineers own large parts of the system and work directly with the founders.
The core challenge right now is scale. Execution volume is growing faster than the team can build, which means the next hires are walking into genuine distributed systems problems — not a greenfield rebuild or a dashboard feature.
What you'll be working on:
- Owning observability across the platform: OpenTelemetry, metrics, logs, traces, and making them genuinely useful at 3am
- Designing and operating distributed systems primitives under real production load — queues, schedulers, checkpoints, backpressure
- Architecting and tuning auto-scaling infrastructure that runs untrusted customer code at high throughput
- Hardening multi-tenant sandbox isolation, secrets handling, network policy, and supply chain security
- Owning Terraform and IaC as a first principle across a cloud-native footprint
- Running on-call practice: SLOs, runbooks, blameless postmortems, paging hygiene
What they're looking for:
- Strong observability background: production experience with OpenTelemetry, Prometheus or equivalent
- Distributed systems experience: you've designed or operated systems with non-trivial failure modes
- Strong with TypeScript and/or Go; the codebase is TypeScript-heavy with Go emerging as a second language
- Self-managed Kubernetes in production, not just managed control planes
- Performance and scaling instincts; you've chased real bottlenecks across app, database, and infra layers
- Terraform as a first principle, run at meaningful scale
- Security mindset — multi-tenant isolation, least privilege, threat modelling
- Postgres and Redis under load, AWS strongly preferred
The process:
- Screening call
- Hiring manager conversation
- Technical interview with roughly a 10% pass rate
- Final with the wider team
The bar is high, but if you find that motivating rather than off-putting, that's probably a good sign.
Site Reliability Engineer employer: Wave Talent
Join a dynamic and innovative startup that prioritises employee autonomy and ownership, where engineers directly collaborate with founders to tackle real-world challenges in distributed systems. With a strong focus on personal growth and a culture that embraces high standards, you'll have the opportunity to work on cutting-edge technology while enjoying the flexibility of remote work across Europe or from London. The company offers competitive compensation, equity options, and a supportive environment that fosters creativity and problem-solving.
StudySmarter Expert Advice🤫
We think this is how you could land the Site Reliability Engineer role
✨Tip Number 1
Get your networking game on! Reach out to current employees or connections in the industry. A friendly chat can give you insider info about the company culture and maybe even a referral, which can seriously boost your chances.
✨Tip Number 2
Prepare for those technical interviews like a pro! Brush up on your distributed systems knowledge and be ready to discuss real-world scenarios. Practising with mock interviews can help you feel more confident when it’s showtime.
✨Tip Number 3
Showcase your problem-solving skills! During interviews, don’t just talk about what you’ve done; explain how you tackled challenges. Use specific examples that highlight your experience with observability tools and scaling issues.
✨Tip Number 4
Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are proactive about their job search!
Some tips for your application 🫡
Tailor Your CV: Make sure your CV reflects the skills and experiences that match the job description. Highlight your observability background and distributed systems experience, as these are key for us at StudySmarter.
Craft a Compelling Cover Letter: Use your cover letter to tell us why you're passionate about Site Reliability Engineering. Share specific examples of how you've tackled scaling challenges or improved system performance in your previous roles.
Showcase Your Technical Skills: Don’t shy away from detailing your technical expertise, especially with TypeScript, Go, and Terraform. We want to see how you’ve applied these in real-world scenarios, so be specific!
Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for this exciting opportunity with our growing team.
How to prepare for a job interview at Wave Talent
✨Know Your Tech Inside Out
Make sure you’re well-versed in the technologies mentioned in the job description, especially OpenTelemetry, TypeScript, and Go. Brush up on your experience with distributed systems and be ready to discuss specific challenges you've faced and how you overcame them.
✨Demonstrate Your Problem-Solving Skills
Prepare to talk about real-world scenarios where you've tackled performance bottlenecks or scaling issues. Use examples that highlight your ability to think critically and act decisively under pressure, especially in a production environment.
✨Show Off Your Observability Knowledge
Since observability is key for this role, be ready to explain how you've implemented metrics, logs, and traces in previous projects. Discuss how you’ve made these tools genuinely useful for your team, especially during high-stress situations.
✨Emphasise Your Security Mindset
Security is a big deal in this role, so come prepared to discuss your approach to multi-tenant isolation and threat modelling. Share any experiences where you’ve had to ensure security while maintaining performance and scalability.