At a Glance
- Tasks: Lead reliability engineering and AIOps to enhance cloud infrastructure and incident response.
- Company: Join a forward-thinking tech company focused on innovation and collaboration.
- Benefits: Enjoy competitive pay, health perks, remote work options, and growth opportunities.
- Other info: Dynamic role with excellent career advancement potential in a supportive environment.
- Why this job: Make a real impact by improving system reliability and automating recovery processes.
- Qualifications: 8+ years in infrastructure operations and strong leadership skills required.
The predicted salary is between £80,000 and £100,000 per year.
The Sr. Manager, Infrastructure Reliability and AIOps Engineering is accountable for improving reliability, observability, and automated recovery across Cloud Infrastructure, Networking, Enterprise Tools, and IAM. This leader builds and operates the Operations and Reliability Engineering function using AIOps practices and is accountable for day‑to‑day operational outcomes, including incident response, escalations, and restoration quality. The role leads Reliability Analysts and partners with domain teams, ITSM/Platform Enablement, and Security to prevent incidents, reduce alert noise, and improve recovery performance.
Scope and Accountability
- Operational ownership of event‑driven incidents, including active participation in incident response, ticket escalation management, and coordination through resolution and restoration.
- AIOps outcomes and governance for platform operations: event ingestion, normalization, correlation, alert quality, intelligent routing, and automated event‑to‑incident workflows.
- Reliability outcomes across Cloud Infrastructure, Networking, Enterprise Tools, and IAM (SLO attainment, improved availability/latency where applicable, MTTD/MTTR reduction, reduced repeat incidents).
- Signal quality management (alert hygiene, deduplication, suppression, threshold tuning, enrichment, and ownership mapping) to improve signal‑to‑noise and reduce operational toil.
- Event correlation standards and service impact intelligence (dependency mapping, CI/service association, and prioritization logic aligned to CMDB/ITSM).
- Automation quality and “production readiness” for self‑healing workflows across all platform domains (validation, rollback, auditability, and measurable success criteria).
- Reliability operating cadence (incident triage standards, major incident support model, post‑incident reviews, problem trend management, and reliability roadmap governance).
- Reliability standards for telemetry, runbooks, monitoring coverage, and operational readiness checks (aligned to ITSM practices and security/compliance needs where applicable).
- IT operations driven by predictive avoidance.
Key Responsibilities
- Reliability operations leadership. Own the reliability execution model from signal → event → incident → restoration, including active incident engagement, escalation management, and accountability for ticket progression and resolution quality.
- Operate and continuously improve the AIOps layer: event ingestion/normalization, correlation rule design, enrichment, de‑duplication, suppression, and noise reduction.
- Drive measurable improvements in operational performance through alert‑quality KPIs (false positives, duplicates, unassigned events, time‑to‑triage).
- Lead post‑incident reviews with a prevention mindset; convert lessons learned into problem records, reliability backlog items, and automation candidates with clear owners and due dates.
- Establish a consistent “incident learning → reliability backlog → automation delivery” feedback loop with Cloud, Network, Tools, and IAM teams.
- Define reliability measurement across services and platforms: SLIs, SLOs, scorecards, and operational thresholds tied to customer impact and business priorities.
- Ensure telemetry standards are implemented across domains (metrics, logs, traces where applicable) to enable fast correlation, accurate impact analysis, and actionable alerts.
- Mature service health views and early warning signals by improving dependency awareness and context enrichment (service, CI, owner, criticality, user impact).
- Partner with domain SMEs to identify “leading indicators” and implement proactive detection and prevention patterns.
- Build and execute a reliability automation roadmap that reduces manual intervention, accelerates recovery, and improves operational consistency.
- Ensure reliability workflows are validated prior to release, with clear rollback, verification steps, and success metrics (automation success rate, time saved, MTTR impact).
- Lead development of event‑triggered remediation and guardrails that safely automate recurring recovery actions, aligned with ITSM and change controls where required.
- Establish standards for runbooks and automated playbooks so recurring issues have a clear manual path and an automation path.
- Drive resilient operational patterns and standardized health signals for critical services in Cloud Infrastructure.
- Improve detection of degradation, accelerate isolation and restoration, and mature automated health validation, rollback, and recovery routines in Networking.
- Establish monitoring and reliability standards for enterprise platforms in Enterprise Tools.
- Ensure identity lifecycle automation is reliable and observable in IAM.
- Partner with ITSM/Platform Enablement to strengthen event‑to‑incident flows, categorization, routing, major incident engagement, and service mapping alignment.
- Partner with Security and Compliance to ensure reliability supports control execution.
- Establish reliability communication and governance cadence: weekly health review, monthly scorecard, quarterly roadmap outcomes.
- Align domain teams on reliability standards and adoption.
Required Qualifications
- 8+ years in infrastructure operations, SRE, reliability engineering, or platform operations (or equivalent experience).
- 5+ years leading teams in an operations, reliability, or engineering environment.
- Proven track record of designing, architecting, and building reliability through AIOps/event correlation, observability, automation, and incident learning.
- Experience building and operating alert/event management practices (signal quality, routing, enrichment, deduplication, suppression, and operational tuning).
- Working knowledge across cloud infrastructure concepts, enterprise networking fundamentals, enterprise tool operations, and IAM lifecycle concepts.
- Strong incident command and stakeholder communication skills, including executive‑ready reporting and post‑incident facilitation.
Preferred Qualifications
- Experience implementing practical SLOs/SLIs, error budgets (where appropriate), and operational scorecards tied to business impact.
- Experience with AIOps platforms and event‑to‑ITSM integration patterns (event ingestion, correlation, automated ticketing, and routing).
- Leadership in scripting/automation (PowerShell, Python, Ansible, Terraform, or similar) and operationalizing safe automation at scale.
- Familiarity with service mapping, CMDB dependency modeling, and operational governance practices.
- Experience establishing reliability standards across multiple infrastructure domains and driving adoption through governance and coaching.
Equal Opportunity Employment
Genesys is an equal opportunity employer committed to fairness in the workplace. We evaluate qualified applicants without regard to race, color, age, religion, sex, sexual orientation, gender identity or expression, marital status, domestic partner status, national origin, genetics, disability, military and veteran status, and other protected characteristics.
Senior Manager Reliability Engineering & AIOps employer: Genesys
Contact Details:
Genesys Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Senior Manager Reliability Engineering & AIOps role
✨Tip Number 1
Network like a pro! Attend industry meetups, conferences, or webinars related to reliability engineering and AIOps. It's a great way to meet potential employers and learn about job openings that might not be advertised.
✨Tip Number 2
Show off your skills! Create a portfolio or a GitHub repository showcasing your projects, especially those involving automation and incident management. This gives you a chance to demonstrate your expertise beyond just a CV.
✨Tip Number 3
Prepare for interviews by brushing up on common technical questions related to AIOps and reliability engineering. Practice explaining your past experiences with incident response and automation clearly and confidently.
✨Tip Number 4
Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who are proactive about their job search!
Some tips for your application 🫡
Tailor Your Application: Make sure to customise your CV and cover letter to highlight your experience in reliability engineering and AIOps. We want to see how your skills align with the role, so don’t hold back on showcasing relevant projects or achievements!
Showcase Your Leadership Skills: As a Senior Manager, you'll be leading teams, so it's crucial to demonstrate your leadership experience. Share examples of how you've successfully managed teams or projects, especially in high-pressure situations like incident responses.
Be Clear and Concise: When writing your application, clarity is key! Use straightforward language and avoid jargon where possible. We appreciate a well-structured application that makes it easy for us to see your qualifications at a glance.
Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way to ensure your application gets into the right hands. Plus, you’ll find all the details about the role and our company culture there!
How to prepare for a job interview at Genesys
✨Know Your AIOps Inside Out
Make sure you’re well-versed in AIOps practices and how they apply to reliability engineering. Brush up on event ingestion, correlation, and automated workflows, as these are crucial for the role. Be ready to discuss specific examples of how you've implemented these in past positions.
✨Demonstrate Leadership Experience
Since this is a senior manager position, highlight your leadership skills and experience managing teams. Prepare to share stories about how you've led incident response efforts or improved operational performance through team collaboration. Show them you can inspire and guide others.
✨Prepare for Technical Questions
Expect technical questions related to infrastructure operations, SRE, and reliability engineering. Brush up on your knowledge of cloud infrastructure, networking fundamentals, and IAM lifecycle concepts. Being able to speak confidently about these topics will set you apart from other candidates.
✨Showcase Your Problem-Solving Skills
Be prepared to discuss how you've tackled complex incidents in the past. Use the STAR method (Situation, Task, Action, Result) to structure your answers. This will help you clearly convey your thought process and the impact of your actions on improving reliability and reducing operational toil.