Staff Site Reliability Engineer - Confluent Incident Management & Reliability in Markham

Markham | Full-Time | 80000 - 100000 € / year (est.) | Home office (partial)

At a Glance

Tasks: Drive proactive reliability improvements and teach teams incident response best practices.
Company: Join IBM Software, a leader in AI-powered, cloud-native solutions.
Benefits: Competitive salary, remote work options, and opportunities for continuous learning.
Other info: Be part of a global team with excellent career growth potential.
Why this job: Make a real impact on global digital transformation with cutting-edge technology.
Qualifications: 10+ years in SRE or incident management; strong cloud experience required.

The predicted salary is between 80000 and 100000 € per year.

At IBM Software, we transform client challenges into solutions, building the world’s leading AI-powered, cloud-native products that shape the future of business and society. Our legacy of innovation creates endless opportunities for IBMers to learn, grow, and make an impact on a global scale. Working in Software means joining a team fueled by curiosity and collaboration. You’ll work with diverse technologies, partners, and industries to design, develop, and deliver solutions that power digital transformation.

With Confluent, data doesn’t sit still. We put information in motion, streaming in near real time so organizations can react faster, build smarter, and deliver experiences as dynamic as the world around them.

Your Role And Responsibilities

Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in a multi-cloud streaming platform, they happen at scale: data in motion, exactly‑once semantics, and cascading failure modes that require deep systems thinking. We need an expert‑level engineer who can drive proactive reliability improvements that prevent these incidents before they occur.

This role combines hands‑on technical work with strategic program ownership. You’ll spend roughly 75% of your time on engineering: building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% is teaching and coordination: coaching teams through post‑mortems, training incident commanders, and evolving our incident response practices.

You’ll be part of a global team with follow‑the‑sun coverage, with clean handoffs that keep everyone working sustainable hours. This role sits within Cloud Architecture and Reliability – Supportability, a horizontal team that owns reliability standards and tooling across engineering. You’re the person who makes us need incident management less.

What You Will Do

Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
Own standards, practices, and continuous improvement of incident response across engineering
Edit and review customer‑facing incident documents (CRCAs) to ensure quality and clarity
Develop and deliver training programs; coach teams through post‑mortems
Partner with engineering leaders to elevate reliability practices org‑wide

What You Will Bring

Deep experience with observability: metrics, logging, tracing
Kubernetes and container orchestration experience
Understanding of CI/CD pipelines and release processes
Strong written communication (design docs, runbooks, post‑mortems)
Experience driving org‑wide process and cultural changes

Required Technical And Professional Expertise

10+ years of relevant experience in SRE, incident management, or reliability engineering
Cloud experience with at least one of AWS, GCP, or Azure (we run all three)
Experience navigating reliability/incident programs at organizations with 500+ engineers
Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
Strong understanding of distributed systems and failure modes at scale
Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

Preferred Technical And Professional Experience

Advanced Cloud Knowledge: Experience with cloud‑based infrastructure and its application in reliability and resiliency engineering.
Specialized Scripting Skills: Proficiency in scripting languages and automation tools to optimize system reliability and performance.

Employer: IBM

At IBM Software, we pride ourselves on fostering a culture of innovation and collaboration, where every employee is empowered to learn and grow. As a Staff Site Reliability Engineer, you will be at the forefront of cutting-edge technology, working in a dynamic environment that values continuous improvement and proactive problem-solving. With global reach and diverse opportunities for career advancement, IBM offers a unique platform for you to make a meaningful impact while enjoying a supportive work culture that prioritises work-life balance.

Contact Details:

IBM Recruiting Team

StudySmarter Expert Advice🤫

We think this is how you could land the Staff Site Reliability Engineer - Confluent Incident Management & Reliability role in Markham

✨Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have the inside scoop on job openings or can put in a good word for you.

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those related to SRE and incident management. This gives potential employers a taste of what you can do and sets you apart from the crowd.

✨Tip Number 3

Prepare for interviews by brushing up on your technical knowledge and soft skills. Practice common SRE scenarios and be ready to discuss how you've handled incidents in the past. Confidence is key!

✨Tip Number 4

Don't forget to apply through our website! We love seeing applications directly from candidates who are passionate about joining us at StudySmarter. It shows initiative and helps us get to know you better.

We think you need these skills to ace Staff Site Reliability Engineer - Confluent Incident Management & Reliability in Markham

Incident Management
Reliability Engineering
Cloud Experience (AWS, GCP, Azure)
Rootly Configuration
SLO/SLA Frameworks
Observability (Metrics, Logging, Tracing)
Kubernetes and Container Orchestration
CI/CD Pipelines
Strong Written Communication
Incident Management Tooling (Rootly, PagerDuty)
Distributed Systems Understanding
Kafka/Event Streaming Expertise
Scripting Skills
Automation Tools

Some tips for your application 🫡

Tailor Your Application: Make sure to customise your CV and cover letter for the Staff Site Reliability Engineer role. Highlight your experience with incident management, cloud platforms, and any relevant tools like Rootly or PagerDuty. We want to see how your skills align with what we’re looking for!

Showcase Your Technical Skills: Don’t hold back on your technical expertise! Detail your experience with Kubernetes, CI/CD pipelines, and observability tools. We love seeing candidates who can demonstrate their hands-on experience and deep understanding of distributed systems.

Communicate Clearly: Strong written communication is key for this role. Make sure your application materials are clear and concise. Use bullet points where necessary and avoid jargon unless it’s relevant. We appreciate clarity as much as you do!

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way to ensure your application gets into the right hands. Plus, you’ll find all the details about the role and our company culture there!

How to prepare for a job interview at IBM

✨Know Your Stuff

Make sure you brush up on your knowledge of SRE principles, incident management, and reliability engineering. Be ready to discuss your experience with cloud platforms like AWS, GCP, and Azure, as well as your familiarity with tools like Rootly and PagerDuty. The more you can demonstrate your expertise, the better!

✨Showcase Your Problem-Solving Skills

Prepare to share specific examples of how you've tackled systemic failures in the past. Think about incidents you've managed, the steps you took to resolve them, and how you implemented changes to prevent recurrence. This will show that you not only understand the theory but can apply it in real-world situations.

✨Communicate Clearly

Since strong written communication is key for this role, practice explaining complex concepts in a simple way. You might be asked to review or edit incident documents, so being able to articulate your thoughts clearly will set you apart. Consider preparing a few design docs or runbooks to showcase your writing skills.

✨Be Ready to Collaborate

This role involves coaching teams and working closely with engineering leaders. Be prepared to discuss how you've successfully collaborated in the past, especially in high-pressure situations. Highlight your experience in training others and evolving incident response practices to show you're a team player who can elevate reliability across the organisation.