Platform Reliability & Operations Lead

Full-Time | £60,000 - £80,000 / year (est.) | No working from home possible
Epsilon

At a Glance

  • Tasks: Lead the reliability and operations of production systems, ensuring stability and premium customer support.
  • Company: Join a forward-thinking company focused on innovation and excellence in tech.
  • Benefits: Competitive salary, flexible working options, and opportunities for professional growth.
  • Other info: Dynamic role with a focus on continuous improvement and cutting-edge technology.
  • Why this job: Make a real impact by driving system reliability and collaborating with diverse teams.
  • Qualifications: 5+ years in Site Reliability, strong leadership, and expertise in containerization and scripting.

The predicted salary is between £60,000 and £80,000 per year.

Requirements

  • At least 5 years of hands-on experience in Site Reliability focused positions.
  • Strong knowledge of containerization technologies (Docker, Kubernetes).
  • Experience with infrastructure as code (Terraform).
  • Solid understanding of networking, security, and system architecture.
  • Proficient in programming or scripting languages (Java, Golang, Python, Bash, or similar).
  • Experience with monitoring and observability tools (DataDog, Prometheus, Grafana).
  • Knowledge of database management systems (PostgreSQL, Bigtable).
  • Understanding of API and microservices architecture.
  • Strong people leadership skills with at least a year in leading and driving high-performance technical teams.
  • Experience working with Operations teams within enterprise environments, with knowledge of DevOps, ITIL, Cloud Services, and IT Infrastructure and Operations, supporting and maintaining production and development environments.
  • Experience with establishing Service Delivery strategies that align with new ways of working, including Agile.
  • Experience of establishing and delivering IT support services in a high availability (HA) environment such as 24/7 operations.

What the job involves

  • The System and Platform Operations Manager is a technical leadership role responsible for the support, reliability, and stability of Epsilon Retail Media production systems, environments, and offerings.
  • The team owns the reliability vision for the company, driving continuous improvement through a combination of development and operations initiatives as well as process excellence.
  • This position has solid-line responsibility for operations including the deployment, management, monitoring, reporting, troubleshooting, and repair of production systems.
  • Core to the success of the role is to provide a premium customer support experience focused on a “centre of excellence” that allows for a full-service delivery support cycle.
  • This role is responsible for managing the Platform Operations team, centralized within a single geo-region: orchestrating regional teamwork, providing both technical and professional support, and championing the company values.
  • The Platform Operations Engineer works closely with the Engineering team to ensure ongoing system stability and supports the Technical Account Managers from an environment's perspective.
  • The Platform Operations team is responsible for supporting all retailers once they are live.
  • Critically important is how this team collaborates and liaises with other teams such as Customer Support, Technical Account Management, Engineering, and Customer Success teams.
  • You'll establish and manage operational practices and ensure we design, implement, and operate a support model that is fit for purpose for our future.
  • Adopt a “Measure Everything” approach to ensure that internal service level objectives and customer service level agreements are exceeded, including executive-level reporting on operational health metrics such as SLAs, incident resolution, performance, availability, reliability, and capacity.
  • Take ownership of complex issues related to performance, reliability, and scalability and lead resolution of serious incidents and events including communications with customers and wider stakeholders.
  • Provide insight and expertise on how customers will perceive changes and their impacts, to drive customer organisational change management and communication.
  • Empower the Delivery teams to release new products, features, updates, and fixes quickly, while ensuring Platforms remain reliable and stable.
  • Work with the wider Engineering, Product, Delivery, and Security teams to ensure that appropriate attention is given to production/system reliability.
  • Identify the capabilities needed to meet the current and emerging business needs of a significant function.
  • As subject matter expert on the team, maintain understanding of current technology, database management, reliability practices, and future trends through ongoing education, conference attendance, and industry press.

Platform Reliability & Operations Lead employer: Epsilon

Epsilon Retail Media is an exceptional employer that prioritises a culture of collaboration and continuous improvement, making it an ideal place for professionals in the Platform Reliability & Operations field. With a strong focus on employee growth, we offer opportunities for technical leadership and innovation within a supportive environment, ensuring our teams are equipped to deliver premium customer experiences. Located in a dynamic region, our company fosters a work-life balance while championing cutting-edge technologies and practices that empower our employees to excel.

Epsilon

Contact Details:

Epsilon Recruitment Team

StudySmarter Expert Advice🤫

We think this is how you could land Platform Reliability & Operations Lead

Tip Number 1

Network like a pro! Attend industry meetups, conferences, or webinars related to Site Reliability and Operations. It's a great way to connect with potential employers and learn about job openings that might not be advertised.

Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects, especially those involving Docker, Kubernetes, or Terraform. This gives you a chance to demonstrate your hands-on experience and technical prowess.

Tip Number 3

Prepare for interviews by brushing up on common SRE scenarios and challenges. Be ready to discuss how you've tackled performance issues or improved system reliability in past roles. We want to see your problem-solving skills in action!

Tip Number 4

Don't forget to apply through our website! It’s the best way to ensure your application gets noticed. Plus, we love seeing candidates who are genuinely interested in joining our team and contributing to our mission.

We think you need these skills to ace Platform Reliability & Operations Lead

Site Reliability Engineering
Containerization Technologies (Docker, Kubernetes)
Infrastructure as Code (Terraform)
Networking and Security Knowledge
System Architecture Understanding
Programming and Scripting Languages (Java, Golang, Python, Bash)
Monitoring and Observability Tools (DataDog, Prometheus, Grafana)

Some tips for your application 🫡

Tailor Your CV: Make sure your CV is tailored to the Platform Reliability & Operations Lead role. Highlight your hands-on experience in Site Reliability, containerization technologies, and any relevant scripting languages. We want to see how your skills match what we're looking for!

Showcase Your Leadership Skills: Since this role involves leading high-performance technical teams, don’t forget to showcase your leadership experience. Share examples of how you've driven teams towards success and improved operational practices. We love seeing strong people leaders!

Be Specific About Your Experience: When detailing your experience, be specific about the tools and technologies you've used, like Terraform, DataDog, or Grafana. We appreciate candidates who can clearly articulate their expertise and how it relates to our needs.

Apply Through Our Website: Finally, make sure to apply through our website! It’s the best way for us to receive your application and ensures you’re considered for the role. We can’t wait to see what you bring to the table!

How to prepare for a job interview at Epsilon

Know Your Tech Inside Out

Make sure you brush up on your knowledge of containerization technologies like Docker and Kubernetes, as well as infrastructure as code tools like Terraform. Be ready to discuss how you've used these in past roles, and maybe even prepare a few examples of challenges you've faced and how you overcame them.

Showcase Your Leadership Skills

Since this role involves leading high-performance technical teams, be prepared to share specific instances where you've successfully managed a team or project. Highlight your people leadership skills and how you've driven results in a collaborative environment.

Understand the Bigger Picture

Familiarise yourself with the company's vision for reliability and stability. Think about how your experience aligns with their goals and be ready to discuss how you can contribute to their 'centre of excellence' approach in customer support and operational practices.

Prepare for Scenario-Based Questions

Expect questions that assess your problem-solving abilities, especially around performance, reliability, and scalability issues. Prepare to walk through your thought process in resolving complex incidents, and how you communicate with stakeholders during such events.