p=\âJoin to apply for the Fleet Reliability Operations Engineer role at CoreWeave.\â
p=\âCoreWeave is the essential cloud for AI™, delivering a platform that enables innovators to build and scale AI with confidence. Founded in 2017 and traded on Nasdaq as CRWV, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs.\â
p=\âWe are proud to be a Living Wage accredited employer. The Fleet Reliability Operations team manages the day-to-day provisioning, management and uptime of CoreWeave's expanding fleet of server nodes. The role focuses on configuration, updates, remote troubleshooting and ensuring the highest-tier supercomputing clusters operate at maximum capacity.\â
p=\âTwo shifts run each day between 7 am and 9 pm. Successful candidates will attend onboarding training at our US headquarters for up to two weeks within the first month of employment.\â
h3=\âKey Responsibilities\âul=\âConfigure and maintain large-scale, high-performance supercomputing clusters running state-of-the-art GPUs.
Troubleshoot hardware and software issues; coordinate with data center, network, hardware and platform teams to drive resolution.
Monitor and analyze system performance and take remediation actions to maintain cloud health.
Create and maintain documentation of team processes, knowledge and best practices for system management.
Collaborate with the team to improve processes and efficiency.
Participate in on-call rotations, including after-hours and weekend work.
\â
h3=\âRequired Skills & Experience\âul=\âAt least 2 years of experience troubleshooting or administering data center or on-prem infrastructure (servers, storage, network).
Strong Linux system administration and networking fundamentals.
Ability to perform consistent, reliable system maintenance and hardware/software troubleshooting.
Bachelor's degree in a related field or equivalent experience.
\â
h3=\âPreferred (but not required)\âul=\âExperience with Bash, Python, PowerShell or similar scripting languages.
Knowledge of observability tools such as Grafana and Prometheus, including PromQL.
Familiarity with data center environments, including HVAC and fiber trays.
Kubernetes administration skills.
GPU-based HPC workload experience.
\â
p=\âShort-notice business travel to the United States may be required. Applicants must be able to travel lawfully on short notice, holding the necessary U.S. authorization (e.g., ESTA or a B-1 visa).\â
h3=\âBenefits\âul=\âFamily-level medical, dental and vision insurance.
Generous pension contribution.
Life assurance at 4× salary.
Critical illness cover.
Employee assistance programme.
Tuition reimbursement.
Work culture focused on innovative disruption.
\â
p=\âCoreWeave is an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status or genetic information.\â
p=\âLegal compliance: The position requires access to export-controlled information and requires a U.S. person or a U.S. employee eligible to access such information. Applicants must meet the U.S. Government export regulations or obtain the required authorization.\â
Fleet Reliability Operations Engineer employer: CoreWeave
Contact Detail:
CoreWeave Recruiting Team