At a Glance
- Tasks: Maintain and enhance the reliability of CoreWeave's cutting-edge cloud infrastructure.
- Company: Join CoreWeave, a leading AI Hyperscaler™ transforming cloud services.
- Benefits: Competitive salary, equity awards, comprehensive health benefits, and tuition reimbursement.
- Why this job: Make a real impact in a high-performing team while growing your technical skills.
- Qualifications: 4+ years in cloud operations or site reliability engineering; strong communication skills.
- Other info: Hybrid work environment with excellent career growth opportunities.
The predicted salary is between £68,000 and £104,000 per year.
CoreWeave is the AI Hyperscaler™, delivering a cloud platform of cutting‐edge services that power the next wave of AI. Our technology provides enterprises and leading AI labs with the most performant, efficient, and resilient solutions for accelerated computing. Since 2017, CoreWeave has operated a growing footprint of data centers across the US and Europe, and the company was ranked as one of the TIME100 most influential companies of 2024.
As a Production Engineer, you will play a key role in maintaining the reliability and stability of CoreWeave's cloud infrastructure. You will work closely with the Production Engineer Team Lead and other engineers to support incident response, platform reliability, and operational improvements. This position is ideal for individuals eager to grow their technical skills, contribute to a high‐performing team, and make an impact on the operational excellence of CoreWeave's cloud services.
Key Responsibilities
- Assist in incident response by helping identify and resolve service disruptions quickly, under senior engineer guidance.
- Document incidents, support root‐cause analysis (RCA), and contribute to post‐incident reviews (PIRs) to capture lessons learned.
- Help develop and maintain incident response playbooks to ensure preparedness for various failure scenarios.
- Participate in communication during incidents, updating stakeholders and keeping clear incident records.
- Monitor system performance and health using tools like Prometheus and Grafana to identify performance issues or potential incidents.
- Implement automation and process improvements to enhance efficiency and reduce manual intervention in incident detection and recovery.
- Support the development of KPIs and SLAs for incident management and align them with team goals.
- Collaborate with engineers across teams to improve platform reliability, resilience, and disaster recovery.
- Work closely with other engineers to troubleshoot system issues, refine workflows, and support ongoing operational needs.
- Participate in knowledge‐sharing activities and learn from senior team members.
- Take part in training and mentorship opportunities to grow technical skills and progress to advanced responsibilities.
Qualifications
- 4+ years of experience in cloud operations, site reliability engineering (SRE), or related technical roles.
- Understanding of cloud platforms (e.g., Kubernetes, AWS, GCP) and basic knowledge of cloud infrastructure.
- Familiarity with incident management practices and frameworks (e.g., ITIL, SRE best practices).
- Experience with monitoring and alerting tools (e.g., Prometheus, Grafana) or willingness to learn.
- Basic experience with scripting or automation tools (e.g., Python, Bash, Terraform, Ansible).
- Strong communication skills and the ability to explain technical concepts clearly to both technical and non‐technical audiences.
- Ability to work in a fast‐paced, high‐pressure environment while learning and adapting quickly.
- Exposure to Kubernetes, containerization, and distributed systems.
- Familiarity with change management processes and post‐incident analysis.
- Experience with automated or self‐healing infrastructure.
- A desire to learn and grow in cloud operations, reliability engineering, and incident management.
In addition to a competitive salary (base £79,000 – £130,000), we provide a comprehensive rewards package that includes a discretionary bonus, equity awards, and a benefits program.
- Family‐level Medical Insurance
- Family‐level Dental Insurance
- Generous Pension Contribution
- Life Assurance at 4× Salary
- Critical Illness Cover
- Employee Assistance Programme
- Tuition Reimbursement
Work culture focused on innovative disruption. Benefits may vary by location.
Our Workplace
We prioritize a hybrid work environment; remote work may be considered for candidates located more than 30 miles from an office, based on role requirements. New hires will be invited to attend onboarding at a hub within their first month.
Equal Opportunity Employer
CoreWeave is an equal‐opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants will receive consideration without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information.
Export Control Compliance
This position requires access to export‐controlled information. Applicants must be U.S. persons per U.S. Government export regulations or otherwise eligible to access the information without export authorization.
Contact
To apply, submit your application through our career portal. Applications received after the closing date will not be considered.
Senior Production Engineer in London employer: CoreWeave
Contact Detail:
CoreWeave Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Senior Production Engineer role in London
✨Tip Number 1
Network like a pro! Reach out to current or former employees at CoreWeave on LinkedIn. A friendly chat can give you insider info and maybe even a referral, which can really boost your chances.
✨Tip Number 2
Prepare for the interview by brushing up on your technical skills. Make sure you can talk confidently about cloud platforms, incident management, and any tools mentioned in the job description. We want you to shine!
✨Tip Number 3
Show your passion for AI and cloud technology during interviews. Share your thoughts on industry trends or recent developments. This will demonstrate your enthusiasm and commitment to the field, making you stand out.
✨Tip Number 4
Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re serious about joining the CoreWeave team.
Some tips for your application 🫡
Tailor Your CV: Make sure your CV is tailored to the Production Engineer role. Highlight your experience with cloud operations and incident management, as these are key for us at CoreWeave. We want to see how your skills align with our needs!
Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're excited about the role and how you can contribute to our team. We love seeing genuine enthusiasm and a clear understanding of what we do.
Showcase Your Technical Skills: Don’t forget to mention your technical skills, especially with tools like Prometheus, Grafana, or any scripting languages. We’re looking for someone who can hit the ground running, so make sure we know what you bring to the table!
Apply Through Our Website: Remember to submit your application through our career portal. It’s the best way for us to keep track of your application and ensure it gets the attention it deserves. We can’t wait to hear from you!
How to prepare for a job interview at CoreWeave
✨Know Your Tech Inside Out
Make sure you brush up on your knowledge of cloud platforms like Kubernetes, AWS, and GCP. Be ready to discuss how you've used these technologies in past roles, especially in incident management and operational support.
✨Prepare for Incident Scenarios
Think about past incidents you've managed or been involved in. Be prepared to explain your role, the steps you took for resolution, and what you learned from the experience. This will show your understanding of incident management practices.
✨Showcase Your Automation Skills
If you have experience with scripting or automation tools like Python, Bash, or Terraform, be ready to share specific examples of how you've implemented these in your work. If you're still learning, express your eagerness to develop these skills further.
✨Communicate Clearly and Confidently
Practice explaining technical concepts in simple terms. You might be asked to communicate with both technical and non-technical stakeholders, so demonstrating your ability to bridge that gap will be crucial.