Senior AI Compute Infrastructure Engineer

Full-Time · £80,000 - £100,000 / year (est.) · Home office (partial)
Kraken

At a Glance

  • Tasks: Join a dynamic team to build cutting-edge AI compute infrastructure for Kraken.
  • Company: Kraken, a leading tech company in the crypto space.
  • Benefits: Competitive salary, flexible work options, and opportunities for professional growth.
  • Other info: Inclusive workplace that values diversity and offers ongoing learning opportunities.
  • Why this job: Make a real impact on AI technology while working with top experts in the field.
  • Qualifications: 5+ years in infrastructure engineering with GPU and ML experience.

The predicted salary is between £80,000 and £100,000 per year.

Kraken is building a dedicated AI Compute and Infrastructure team to power the next generation of model training, inference, evaluation, and experimentation across the exchange. This team sits within engineering leadership and owns the infrastructure layer that lets Kraken run AI workloads with control, speed, reliability, and cost discipline. The team is responsible for GPU and accelerator infrastructure, cluster operations, scheduling, model serving, observability, capacity planning, and cost‑efficient compute at scale.

This is the backbone that allows Kraken to train, serve, evaluate, and iterate on AI systems in‑house where it matters for privacy, latency, reliability, cost, or product differentiation. You will join a small, senior, high‑impact team working directly with AI/ML researchers, platform engineers, security teams, and product teams. The mandate is simple: make Kraken's AI ambitions real by building compute infrastructure that is fast, dependable, efficient, and production‑grade.

Responsibilities

  • Own and operate GPU and accelerator clusters used for training, inference, evaluation, and experimentation, including drivers, runtimes, kernels, device plugins, node configuration, scheduling primitives, and workload isolation.
  • Design infrastructure that enables Kraken teams to run models locally on GPUs where it is strategically and economically preferable, reducing unnecessary dependency on external providers and containing compute costs.
  • Build and improve scheduling, orchestration, placement, quota management, and utilization systems across heterogeneous accelerator environments.
  • Optimize inference pipelines for latency, throughput, reliability, memory efficiency, and cost using frameworks such as vLLM, Triton Inference Server, TensorRT, or equivalent serving stacks.
  • Partner with ML engineers and researchers to remove bottlenecks in training, evaluation, batch inference, online inference, deployment, and production debugging workflows.
  • Build observability for GPU utilization, memory pressure, queue depth, saturation, token throughput, request latency, failed workloads, capacity pressure, and spend.
  • Drive reliability, incident response, alerting, runbooks, and post‑incident improvements for always‑on AI compute infrastructure.
  • Evaluate and integrate new hardware, cloud instance families, specialized accelerators, runtimes, schedulers, and serving frameworks as the AI infrastructure landscape evolves.
  • Build tooling that makes GPU usage visible, accountable, and easier for internal teams to consume without needing to become infrastructure experts.
  • Contribute to long‑term architecture decisions that balance performance, cost efficiency, scalability, operational simplicity, and production safety.
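To make the scheduling and placement responsibilities above concrete, here is a toy GPU allocator. This is a minimal sketch, not how Kraken's systems (or any production scheduler) actually work; the `Gpu` and `Job` types, the memory-only constraint, and the greedy best-fit policy are all assumptions made purely for illustration:

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    name: str
    total_mem_gib: int
    free_mem_gib: int


@dataclass
class Job:
    name: str
    mem_gib: int


def place(jobs: list[Job], gpus: list[Gpu]) -> dict[str, str]:
    """Greedy best-fit placement: assign each job (largest first) to the
    GPU with the least free memory that still fits it, which tends to
    reduce fragmentation across the fleet."""
    placement: dict[str, str] = {}
    for job in sorted(jobs, key=lambda j: j.mem_gib, reverse=True):
        candidates = [g for g in gpus if g.free_mem_gib >= job.mem_gib]
        if not candidates:
            continue  # job stays queued; a real scheduler would retry or preempt
        target = min(candidates, key=lambda g: g.free_mem_gib)
        target.free_mem_gib -= job.mem_gib
        placement[job.name] = target.name
    return placement
```

Real placement systems also weigh interconnect topology, quota, priority, and preemption, but the bin-packing core looks roughly like this.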

Experience & Skills

  • 5+ years of infrastructure engineering experience, with significant time spent on GPU compute, ML infrastructure, distributed systems, high‑performance computing, or large‑scale production platforms.
  • Hands‑on experience operating GPU clusters or accelerator‑backed infrastructure in production or production‑like environments, including scheduling, orchestration, utilization monitoring, and cost optimization.
  • Strong systems engineering fundamentals across Linux, networking, storage, containers, Kubernetes, distributed runtimes, and production debugging.
  • Experience with ML serving frameworks such as vLLM, Triton Inference Server, TensorRT, TorchServe, KServe, Ray Serve, or equivalent systems.
  • Proficiency in Python for infrastructure automation, tooling, debugging, integration, and operational workflows.
  • Practical understanding of performance tradeoffs across batching, concurrency, memory usage, GPU utilization, model size, latency, throughput, availability, and cost.
  • Track record of optimizing compute costs while maintaining clear performance, reliability, and availability expectations.
  • Experience building observable systems with useful metrics, logs, traces, dashboards, alerts, and incident workflows.
  • Comfortable working in high‑stakes, always‑on environments where uptime, throughput, correctness, and operational discipline are critical.
  • Clear communicator who can translate infrastructure tradeoffs for researchers, product teams, platform engineers, security stakeholders, and engineering leadership.
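The batching/latency/throughput tradeoff listed above is ultimately back-of-envelope arithmetic. A sketch with entirely illustrative numbers (a real serving stack would measure these, and would also account for KV-cache memory and compute saturation): larger batches raise aggregate token throughput, but each decode step gets slower, so per-request latency rises too.

```python
def batch_tradeoff(batch_size: int,
                   base_step_ms: float = 15.0,
                   per_seq_ms: float = 0.5,
                   tokens_per_request: int = 100) -> tuple[float, float]:
    """Return (aggregate tokens/sec, per-request latency in seconds) under a
    simplified model where each decode step has a fixed cost plus a small
    per-sequence cost. All constants here are made up for illustration."""
    step_ms = base_step_ms + per_seq_ms * batch_size
    throughput_tps = batch_size * 1000.0 / step_ms
    latency_s = tokens_per_request * step_ms / 1000.0
    return throughput_tps, latency_s
```

With these numbers, going from batch 1 to batch 32 multiplies throughput roughly 16x while only doubling request latency, which is why continuous batching dominates modern serving stacks such as vLLM.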

Nice to Have Skills

  • Experience at a frontier AI lab, hyperscaler, high‑frequency trading firm, research platform, or high‑scale ML organization.
  • Familiarity with custom silicon or specialized accelerators such as TPUs, AWS Trainium, Gaudi, or similar platforms.
  • Background in capacity planning, procurement input, reserved capacity strategy, cloud accelerator economics, or GPU fleet cost management.
  • Experience with distributed training frameworks such as DeepSpeed, Megatron‑LM, FSDP, Ray, or equivalent systems.
  • Experience debugging CUDA, NCCL, kernel, driver, runtime, memory, networking, or low‑level performance issues.
  • Experience with Rust, C++, Go, CUDA, or other systems languages used for performance‑critical infrastructure.
  • Crypto, financial services, trading infrastructure, or security‑sensitive production infrastructure experience.
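The capacity-planning and fleet-cost items above reduce largely to break-even arithmetic between reserved and on-demand accelerator pricing. A sketch with hypothetical per-hour rates (real cloud prices vary by provider, instance family, region, and commitment term):

```python
def breakeven_utilization(on_demand_hr: float, reserved_hr: float) -> float:
    """Fraction of hours a GPU must be busy before a reservation (billed
    for every hour) beats paying on-demand for busy hours only."""
    return reserved_hr / on_demand_hr


def monthly_cost(busy_hours: float,
                 on_demand_hr: float,
                 reserved_hr: float,
                 hours_in_month: float = 730.0) -> dict[str, float]:
    """Compare the two billing models for a given utilization level."""
    return {
        "on_demand": busy_hours * on_demand_hr,
        "reserved": hours_in_month * reserved_hr,
    }
```

For example, at a hypothetical $4.00/hr on-demand rate and $2.40/hr reserved rate, reservations pay off only above 60% utilization; below that, on-demand is cheaper despite the higher unit price.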

Additional Information

Unless a specific application deadline is stated in the job posting, applications are accepted on an ongoing basis. Please note, applicants are permitted to redact or remove information on their resume that identifies age, date of birth, or dates of attendance at or graduation from an educational institution. We consider qualified applicants with criminal histories for employment on our team, assessing candidates in a manner consistent with the requirements of the San Francisco Fair Chance Ordinance.

We may ask candidates to complete job‑related skills or work‑style assessments as part of our hiring process. These assessments are designed to evaluate competencies relevant to the role and are applied consistently across candidates for similar positions. Assessment results are considered alongside other relevant information, such as experience and interviews, and are not the sole basis for any employment decision.

As an equal opportunity employer, we don't tolerate discrimination or harassment of any kind, whether based on race, ethnicity, age, gender identity, citizenship, religion, sexual orientation, disability, pregnancy, veteran status, or any other protected characteristic as outlined by federal, state, or local laws.

Senior AI Compute Infrastructure Engineer employer: Kraken

At Kraken, we pride ourselves on fostering a dynamic and inclusive work culture that empowers our employees to innovate and excel. As a Senior AI Compute Infrastructure Engineer, you will be part of a high-impact team dedicated to advancing AI technology in a collaborative environment, with ample opportunities for professional growth and development. Located in the heart of San Francisco, we offer competitive benefits, a commitment to work-life balance, and the chance to contribute to cutting-edge projects that shape the future of finance and technology.

Contact Detail:

Kraken Recruiting Team

StudySmarter Expert Advice 🤫

We think this is how you could land the Senior AI Compute Infrastructure Engineer role

✨Tip Number 1

Network like a pro! Reach out to folks in the industry, attend meetups, and connect with people on LinkedIn. You never know who might have the inside scoop on job openings or can put in a good word for you.

✨Tip Number 2

Show off your skills! Create a portfolio or GitHub repository showcasing your projects and contributions. This is a great way to demonstrate your expertise in GPU compute and ML infrastructure without just relying on your CV.

✨Tip Number 3

Prepare for interviews by brushing up on your technical knowledge and soft skills. Practice explaining complex concepts in simple terms, as you'll need to communicate effectively with both technical and non-technical teams.

✨Tip Number 4

Don't forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you're genuinely interested in joining our team at Kraken.

We think you need these skills to ace the Senior AI Compute Infrastructure Engineer role

GPU Compute Infrastructure
Cluster Operations
Scheduling and Orchestration
Model Serving
Observability
Capacity Planning
Cost Optimization
Linux Systems Engineering
Networking
Storage Management
Kubernetes
Distributed Runtimes
Python for Automation
Performance Trade-offs
Incident Response

Some tips for your application 🫡

Tailor Your CV: Make sure your CV speaks directly to the role of Senior AI Compute Infrastructure Engineer. Highlight your experience with GPU clusters, ML infrastructure, and any relevant projects that showcase your skills in a way that aligns with what Kraken is looking for.

Craft a Compelling Cover Letter: Your cover letter is your chance to shine! Use it to explain why you're passionate about AI infrastructure and how your background makes you a perfect fit for Kraken. Don’t forget to mention specific experiences that relate to the responsibilities listed in the job description.

Showcase Your Technical Skills: Be sure to include any hands-on experience you have with tools and frameworks mentioned in the job description, like Triton Inference Server or Kubernetes. This will help us see that you’re not just familiar with the tech, but that you can actually use it effectively.

Apply Through Our Website: We encourage you to apply through our website for the best chance of getting noticed. It’s straightforward and ensures your application goes directly to the right team. Plus, we love seeing candidates who take that extra step!

How to prepare for a job interview at Kraken

✨Know Your Tech Inside Out

Make sure you’re well-versed in GPU and accelerator infrastructure, as well as the specific frameworks mentioned in the job description like vLLM and Triton Inference Server. Brush up on your knowledge of scheduling, orchestration, and cost optimisation techniques to show you can hit the ground running.

✨Showcase Your Problem-Solving Skills

Prepare examples from your past experience where you’ve tackled complex issues in high-performance computing or distributed systems. Be ready to discuss how you’ve optimised compute costs while maintaining performance and reliability—this will demonstrate your ability to contribute to Kraken's AI ambitions.

✨Communicate Clearly and Effectively

Since you'll be working with various teams, practice explaining technical concepts in a way that non-technical stakeholders can understand. This will highlight your communication skills and your ability to bridge the gap between infrastructure and product teams.

✨Be Ready for Technical Questions

Expect to dive deep into your technical expertise during the interview. Prepare for questions about your hands-on experience with GPU clusters, Linux systems, and any relevant programming languages like Python. Being able to articulate your thought process will impress the interviewers.
