At a Glance
- Tasks: Join a dynamic team to shape the future of AI infrastructure and solve complex technical challenges.
- Company: Carbon3.ai, an innovative start-up revolutionising AI solutions with renewable energy.
- Benefits: Competitive salary, flexible working options, and opportunities for professional growth.
- Why this job: Be at the forefront of AI technology and make a real impact in a fast-growing industry.
- Qualifications: Experience in systems engineering, technical support, and a passion for AI/ML concepts.
- Other info: Exciting career development in a collaborative environment with leading investors backing our vision.
The predicted salary is between £48,000 and £72,000 per year.
Lead Systems Engineer - HPC / AI at Carbon3.ai - Building the UK's AI Solution Platform
We are an emerging AI infrastructure start-up building next-generation data centres and high-performance compute environments for AI, LLM training, and cloud-scale workloads. Our platform is powered by renewable energy, rooted in sovereign capability, and designed to give enterprises and innovators the compute they need. Backed by leading investors, we are rapidly expanding our site development pipeline, engineering capabilities, and commercial partnerships.
We are looking for a Lead Systems Engineer to help shape the new Platform team. This customer-facing role involves technical troubleshooting and collaboration with vendor engineering teams to ensure seamless AI platform operations.
Key Responsibilities
- Escalate complex (L3) issues to vendor product/engineering teams, coordinate their resolution, and manage vendor responses.
- Monitor system health, alerts, and customer usage patterns.
- Document solutions and workarounds, create and maintain knowledge base content, and document support procedures.
- Automate common tasks and fixes.
- Configure and integrate tooling to support optimal operation of the platform, and support tool selection.
- Assist customers with platform configuration, onboarding, and usage best practices.
- Collaborate with platform and infrastructure support/engineering teams to resolve platform integration issues.
- Ensure SLAs and customer satisfaction targets are met.
- Provide L1 support for customer-reported issues and requests.
- Provide L2 support by diagnosing, replicating, and troubleshooting issues across platform and infrastructure.
- Work with customers and multiple stakeholders to understand requirements and challenges, and provide reporting on usage, workflow, and billing.
- Cluster infrastructure management: Manage the Nvidia GPU cluster.
- High availability and resilience: Implement failover strategies and manage maintenance events to minimise downtime.
- Resource allocation and optimisation: Resource partitioning (GPU resources), workload scheduling, capacity planning.
- Performance monitoring and troubleshooting: Performance analysis and real-time monitoring with available Nvidia and HPE tools.
- Incident response: Handle node failures, network issues, and driver issues; troubleshoot common problems and work with vendor support to resolve critical ones.
- Security and access control: Manage user permissions, RBAC, security hardening, data protection.
Qualifications
- Extensive experience in technical support, system engineering, or platform operations.
- Solid understanding of L1 and L2 support processes (ticketing, escalation, troubleshooting).
- Familiarity with cloud-based platforms, APIs, and distributed systems.
- Understanding of AI/ML concepts and tooling (model training, inference, data pipelines basics).
- Experience with monitoring/logging tools (e.g., Grafana, Kibana, Splunk).
- Excellent communication skills to interface with both customers and internal/vendor teams.
- Good understanding of the tooling requirements of ML engineers and data scientists, and how to optimise their experience.
- System administration experience with operating systems such as RHEL/CentOS and Ubuntu, including Linux kernel tuning.
- Proficiency with Ansible, the Nvidia and CUDA toolkits, Kubernetes, and container orchestration.
- Understanding of automation, monitoring and security with GPU as a service.
- Experience supporting HPE PCAI or other AI/HPC infrastructure and platforms.
- Experience with GPU resource allocation (across instances, GPU count, and time).
- Advanced networking skills, including high-performance networking, troubleshooting, and fine-tuning.
- Background in DevOps or SRE practices.
- ITIL familiarity.
What Success Looks Like
- Customers receive timely, effective support with minimal escalations.
- Issues are resolved or routed correctly with high-quality documentation.
- The platform maintains strong uptime and customer satisfaction.
Systems Engineer - (HPC & AI) in London employer: Carbon3.ai - Building the UK's AI Solution Platform
Contact Detail:
Carbon3.ai - Building the UK's AI Solution Platform Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Systems Engineer - (HPC & AI) role in London
✨Tip Number 1
Network like a pro! Attend industry meetups, webinars, or conferences related to AI and HPC. It's a great way to connect with potential employers and learn about job openings that might not be advertised.
✨Tip Number 2
Show off your skills! Create a portfolio showcasing your projects, especially those involving AI/ML concepts or system engineering. This gives you a tangible way to demonstrate your expertise during interviews.
✨Tip Number 3
Practice makes perfect! Prepare for technical interviews by brushing up on your troubleshooting skills and understanding of L1 and L2 support processes. Mock interviews can help you feel more confident when it’s time to shine.
✨Tip Number 4
Don’t forget to apply through our website! We’re always looking for talented individuals to join our team. Plus, applying directly can sometimes give you an edge over other candidates.
Some tips for your application 🫡
Tailor Your CV: Make sure your CV is tailored to the Systems Engineer role. Highlight your experience with HPC, AI, and any relevant technical skills that match the job description. We want to see how your background aligns with what we're looking for!
Show Off Your Communication Skills: Since this role involves customer interaction, it's crucial to showcase your communication skills. Use clear and concise language in your application, and don’t hesitate to include examples of how you've effectively communicated with customers or teams in the past.
Detail Your Technical Experience: We’re keen on your technical expertise, so be specific about your experience with system administration, monitoring tools, and any relevant platforms. Mention any hands-on experience you have with Nvidia GPU clusters or similar technologies to stand out!
Apply Through Our Website: Don’t forget to apply through our website! It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it gives you a chance to explore more about our company and culture!
How to prepare for a job interview at Carbon3.ai - Building the UK's AI Solution Platform
✨Know Your Tech Inside Out
Make sure you brush up on your technical skills, especially around systems engineering and AI concepts. Familiarise yourself with tools like Ansible, the Nvidia and CUDA toolkits, and Kubernetes, as well as the specifics of L1 and L2 support processes. Being able to discuss these confidently will show that you're ready for the role.
✨Showcase Your Problem-Solving Skills
Prepare to discuss past experiences where you've resolved complex issues or troubleshot technical problems. Think about specific examples where you coordinated with vendor teams or managed system health. This will demonstrate your ability to handle the responsibilities of the role effectively.
✨Understand Customer Needs
Since this role is customer-facing, be ready to talk about how you’ve previously interacted with customers to understand their requirements. Highlight any experience you have in providing support and ensuring customer satisfaction, as this will be crucial for success in the position.
✨Prepare Questions for Them
Interviews are a two-way street! Prepare insightful questions about Carbon3.ai's platform, their approach to AI infrastructure, and how they measure success. This shows your genuine interest in the company and helps you assess if it’s the right fit for you.