At a Glance
- Tasks: Resolve complex platform and hardware incidents while managing firmware lifecycle and server configurations.
- Company: Join a high-performing enterprise infrastructure team in Glasgow with a hybrid work model.
- Benefits: Competitive day rate, flexible working, and opportunities for professional growth.
- Other info: Collaborative environment with a focus on continuous improvement and operational excellence.
- Why this job: Make a real impact on critical platforms and enhance your Linux expertise.
- Qualifications: Strong Linux skills, incident management experience, and excellent communication abilities.
We are seeking a Senior Platform Reliability Engineer with deep Linux systems expertise and strong exposure to server hardware, firmware, and low-level infrastructure operations. This role sits within a high-performing enterprise infrastructure team responsible for maintaining and improving the reliability of critical platforms at scale.
The position is heavily focused on resolving complex platform and hardware-related incidents, particularly those escalated from L3 support, with an emphasis on firmware lifecycle management, disk encryption, logging, and server configuration (BIOS-level controls) across multi-vendor environments. This is a hands-off hardware role, requiring strong remote troubleshooting capabilities, excellent communication skills, and the ability to work closely with internal teams and external vendors to drive issues through to resolution.
Key Responsibilities
- Own and manage end-to-end incident resolution for platform and hardware-related issues, including triage, mitigation, escalation, and post-incident review.
- Diagnose and troubleshoot Linux OS-level issues arising from hardware faults, firmware changes, or configuration inconsistencies.
- Manage and support firmware lifecycle processes, including upgrades, validation, and issue remediation.
- Work with disk encryption technologies and logging frameworks, ensuring system integrity and auditability.
- Maintain and troubleshoot server configuration settings, including BIOS-level parameters across multiple hardware vendors (strong Dell focus).
- Utilize out-of-band management tools (e.g., iDRAC, iLO, RACADM, Redfish APIs) for remote diagnostics and recovery.
- Analyse vendor logs, support bundles, and telemetry data to identify root causes and remediation paths.
- Engage directly with hardware vendors and engineering teams, managing escalations and driving timely resolutions.
- Contribute to continuous improvement initiatives, reducing incident recurrence and operational toil.
- Produce and maintain high-quality documentation, including runbooks, troubleshooting guides, and knowledge base articles.
- Participate in post-incident reviews (RCA) and support improvements in reliability metrics (MTTR, MTTD, SLOs).
Skills and Experience
- Strong Linux administration and troubleshooting expertise, including process and service management, system logs and diagnostics, networking fundamentals, and package and configuration management.
- Solid understanding of server hardware and infrastructure, including disks, RAID/HBA controllers, NICs and firmware interactions, and hardware failure modes and OS-level symptoms.
- Proven experience with firmware management and upgrades, disk encryption and secure configurations, and BIOS/server configuration management.
- Hands-on experience with remote management and lights-out technologies, such as iDRAC, iLO, RACADM, Redfish or similar APIs.
- Strong track record of incident ownership, including triage and mitigation, cross-team coordination, stakeholder communication, and driving issues through to resolution.
- Experience working with vendor diagnostics, logs, and support bundles, as well as vendor escalation processes and engineering engagement.
- Excellent communication skills (written and verbal), with the ability to clearly articulate technical issues to both technical and non-technical stakeholders.
- Strong documentation skills, including creation of runbooks, procedures, and RCA reports.
- Scripting and automation experience (e.g., Python, Bash, Ansible).
- Familiarity with configuration management and automation frameworks.
- Exposure to virtualisation and containerisation technologies (VMware, KVM, Docker, Kubernetes).
- Experience with monitoring, observability, and alerting systems, including log analysis and alert tuning.
- Understanding of SRE principles and metrics, including SLOs, SLIs, error budgets, MTTR/MTTD.
Personal Attributes
- Methodical and detail-oriented approach to troubleshooting.
- Strong sense of ownership and accountability.
- Comfortable working in high-pressure, incident-driven environments.
- Collaborative mindset with the ability to work across global teams and vendors.
- Proactive approach to continuous improvement and operational excellence.
SRE (Linux, Firmware & Server Infrastructure) in Glasgow
Employer: Networking People (UK) Limited
At our Glasgow-based enterprise infrastructure team, we pride ourselves on fostering a collaborative and innovative work culture that empowers our employees to excel in their roles. As a Senior Platform Reliability Engineer, you will benefit from a hybrid working model, competitive day rates, and opportunities for professional growth through continuous improvement initiatives and hands-on experience with cutting-edge technologies. Join us to be part of a high-performing team dedicated to maintaining the reliability of critical platforms while enjoying a supportive environment that values your contributions.
Contact Detail:
Networking People (UK) Limited Recruiting Team
StudySmarter Expert Advice🤫
We think this is how you could land the SRE (Linux, Firmware & Server Infrastructure) role in Glasgow
✨Tip Number 1
Network like a pro! Reach out to folks in the industry on LinkedIn or at local meetups. You never know who might have the inside scoop on job openings or can put in a good word for you.
✨Tip Number 2
Prepare for those tricky technical interviews. Brush up on your Linux skills and be ready to troubleshoot on the spot. Practising common scenarios can really help you stand out!
✨Tip Number 3
Show off your problem-solving skills! When discussing past experiences, focus on how you tackled complex incidents and what impact your solutions had. Use these examples to highlight your ownership and accountability.
✨Tip Number 4
Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re genuinely interested in joining our team.
Some tips for your application 🫡
Tailor Your CV: Make sure your CV is tailored to the role of Senior Platform Reliability Engineer. Highlight your Linux expertise, server hardware knowledge, and any relevant experience with firmware management. We want to see how your skills match what we're looking for!
Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about this role and how your background makes you a perfect fit. Don’t forget to mention your experience with incident resolution and cross-team collaboration.
Show Off Your Documentation Skills: Since we value strong documentation skills, include examples of runbooks or troubleshooting guides you've created in the past. This will demonstrate your ability to communicate complex technical issues clearly, which is key for us.
Apply Through Our Website: We encourage you to apply through our website for a smoother application process. It helps us keep track of your application and ensures you don’t miss out on any important updates from us!
How to prepare for a job interview at Networking People (UK) Limited
✨Know Your Linux Inside Out
Make sure you brush up on your Linux administration skills. Be prepared to discuss troubleshooting techniques, system logs, and diagnostics. They’ll likely ask you about specific scenarios where you've resolved OS-level issues, so have some examples ready!
✨Familiarise Yourself with Hardware and Firmware
Since this role involves a lot of hardware interaction, it’s crucial to understand server components and firmware management. Review common failure modes and how they manifest at the OS level. Being able to articulate your experience with BIOS configurations and vendor-specific tools will set you apart.
✨Communication is Key
You’ll need to communicate complex technical issues clearly to both technical and non-technical stakeholders. Practice explaining your past experiences in a way that anyone can understand. This will show your ability to collaborate effectively across teams.
✨Prepare for Incident Management Scenarios
Expect questions around incident ownership and resolution processes. Think of times when you’ve triaged incidents or worked with vendors to resolve issues. Highlight your methodical approach and any improvements you’ve implemented to reduce incident recurrence.