Director of Production Engineering (Reliability Platform Engineering) in North East

North East Full-Time No home office possible

Director of Production Engineering (Reliability Platform Engineering) – Durham, England, UK

We’re looking for a Director of Production Engineering who will own the engineering systems that make reliability, performance, correctness and release safety predictable across our global POS edge cloud and middleware platform.

This leadership role focuses on distributed system correctness, resilience and performance engineering. You partner closely with SRE and Operations leaders, but your charter is to engineer prevention by building production readiness standards, automated release gates and performance/resilience validation mechanisms that stop unsafe changes before they ship.

What Success Looks Like

  • Releases ship faster without increasing Sev‑1 / Sev‑2 incidents
  • Incident recurrence drops measurably due to enforced learning and prevention
  • Edge, store and cloud workflows behave safely under real failure conditions
  • Reliability is engineered, automated and enforced, not reactive
  • Teams clearly understand what is safe to release and pipelines enforce it

Responsibilities

  • Production Engineering & Release Safety
    • Own non‑functional release criteria and automated release gates for reliability, resilience, performance and correctness across complex release trains.
    • Define and enforce Production Readiness Reviews (PRRs) and platform‑wide engineering standards.
    • Establish objective, measurable safe‑to‑release signals consumed by CI/CD and release tooling.
  • Distributed Systems Correctness (Edge Cloud Commerce)
    • Partner with Architects and Principal Engineers to define failure modes, degradation behavior and system guardrails for distributed and eventually consistent workflows.
    • Ensure systems behave correctly during retries, partial outages, intermittent connectivity, degraded modes and recovery.
    • Lead initiatives that reduce the risk of data loss, duplication, corruption or inconsistent state across POS middleware and cloud services.
  • Incident Learning That Prevents Recurrence
    • Lead blameless incident reviews using formal analysis methods.
    • Ensure corrective actions are engineered into systems, validated, tracked and audited.
    • Institutionalize learning so failures do not reappear under new conditions or scale.
  • Resilience & Performance Engineering
    • Own platform‑level strategies for resilience, performance and scalability validation.
    • Drive chaos, failover, load, stress and soak testing focused on real failure modes, not synthetic demos.
    • Validate store‑mode behavior, payment workflows, edge‑device dependencies and multi‑service interactions.
  • Observability & Reliability Signals
    • Ensure high‑fidelity telemetry (logs, metrics, traces and business signals) that supports release gating, correctness verification and diagnosis.
    • Drive instrumentation standards that allow teams to prove reliability outcomes with data.
  • Cross‑Org Technical Leadership
    • Partner with Software Engineering, Architecture, Quality Engineering, Cloud Operations and TPM/TPO teams.
    • Build and lead senior technical managers and staff‑level engineers.
    • Set expectations for technical depth, ownership and execution quality.

Required Experience

  • Bachelor’s degree in Computer Science, Engineering or equivalent practical experience.
  • 10‑15 years building production engineering capabilities for distributed software platforms with direct accountability for production outcomes.
  • Demonstrated experience defining and enforcing production readiness standards and non‑functional release gates that prevent unsafe changes from shipping.
  • Proven ability to lead formal root‑cause / reliability analysis and ensure systemic fixes reduce recurrence.
  • Strong distributed systems fundamentals, including the ability to reason about:
    • Failure modes and degradation behavior
    • Dependency risk, retries and backpressure
    • Consistency trade‑offs and correctness under failure
  • Experience partnering deeply with Architecture and Software Engineering to embed reliability guardrails into design reviews, CI/CD pipelines and system standards.
  • Senior leadership experience building teams and influencing across large engineering organizations.

Preferred Experience

  • Designing reliability automation (release scoring, regression detection, incident pattern analysis).
  • Hybrid cloud edge architectures; Kubernetes/AKS; modern observability platforms.
  • Leading reliability transformations in large complex engineering organizations.

Why This Role Matters

  • Uptime is engineered, not reactive.
  • Development and QA operate at AI‑enabled speed.
  • The platform scales safely without sacrificing correctness.
  • Toshiba Global Commerce Solutions (TGCS) matches or exceeds best‑in‑class engineering organizations.

Benefits

  • Group health coverage (medical, dental & vision)
  • Employee Assistance Programs
  • Pre‑tax spending accounts
  • 401(k) plan with company match
  • Company‑provided life insurance
  • Pet insurance
  • Employee discounts
  • Generous paid holiday schedule, paid vacation & sick/personal days

EEO Statement

Toshiba Global Commerce Solutions is an equal opportunity/affirmative action employer that evaluates qualified applicants without regard to age, ancestry, color, religious creed, disability, marital status, medical condition, genetic information, military or veteran status, national origin, race, sex, gender, gender identity, gender expression, sexual orientation or any other protected factor. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.

Requests for Reasonable Accommodation

Individuals who need a reasonable accommodation because of a disability for any part of the employment process should email to request an accommodation.

Diversity, Equity & Inclusion

We firmly believe that our people are integral to the success of our customers. We are committed to Diversity, Equity and Inclusion for all our people, highlighted by our 5 Core Principles: Create Outreach, Foster Belonging, Unleash Opportunity, Diverse Cultural Engagement and Culture of Transparency. We’re passionate about our customers, the retail industry and becoming a more responsible company as we help create a brighter future.

Contact Detail:

Toshiba Global Commerce Solutions Recruiting Team
