Site Reliability Engineer (Cambridge)

Cambridge Full-Time 48000 - 72000 £ / year (est.) Home office (partial)

RunPod offers GPU cloud computing for AI/ML, providing secure and community cloud options, on-demand and spot pods, and serverless GPU scaling.

The flexibility of remote work with an inclusive, collaborative team.

An opportunity to grow with a company that values innovation and user-centric design.

Generous vacation policy to ensure work-life harmony and well-being.

Contribute to a company with a global impact based in the US, Canada, and Europe.

Experience Requirements:

  • 5+ years of experience in Site Reliability Engineering or a similar role
  • 3+ years of experience in a technical leadership or management position
  • Deep understanding of Linux systems, containerization, virtualization, and networking technologies
  • Strong background in managing and monitoring large-scale distributed systems and bare-metal fleets
  • Expertise in infrastructure-as-code and configuration management tools

Responsibilities:

  • Lead and mentor a team of Site Reliability Engineers, fostering a culture of innovation, continuous learning, and technical excellence
  • Develop and implement strategic plans to enhance the reliability, scalability, and efficiency of our infrastructure
  • Collaborate with cross-functional teams to align SRE initiatives with broader organizational goals
  • Establish and maintain SLIs, SLOs, and SLAs for critical systems and services
  • Drive the adoption of best practices in automation, monitoring, and incident response

Software Engineer, Site Reliability Engineer.

Fireworks AI offers a fast and efficient platform for building and deploying generative AI applications with a focus on speed, value, and scalability.

Tyk AI Studio is an AI gateway and management solution that helps organizations harness AI's potential while ensuring governance, security, compliance, and control.

Experience Requirements:

  • Proven experience in a senior SRE role or similar.
  • Strong knowledge of cloud technologies and SLA, SLO, and SLI management.
  • Experience leading teams and implementing SCRUM processes.
  • Excellent communication and leadership skills.
  • Experience line managing, mentoring, and coaching.

Responsibilities:

  • Collaborate with the Principal SRE to shape and implement the SRE strategic plan.
  • Lead the SRE team in translating strategy into actionable plans, coordinating these through the SCRUM process.
  • Address wellbeing and performance concerns, fostering a positive and productive team environment.
  • Work with the Principal SRE and Scrum Master to analyze wellbeing survey outcomes and develop improvement plans.

Invisible AI is an on-premise computer vision platform for manufacturing that uses AI to improve worker productivity and safety by analyzing manual assembly work.

Education Requirements:

  • Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent experience.

Experience Requirements:

  • 5+ years of experience building and managing infrastructure at scale, particularly on the edge.
  • Proficiency in Python, Docker, Linux systems, and scripting (Bash, Python).
  • Strong expertise with infrastructure automation tools (Terraform, Ansible).
  • Experience managing observability and monitoring systems, particularly Prometheus.
  • Deep understanding of networking concepts and protocols.

Responsibilities:

  • Design, build, and maintain scalable and resilient infrastructure on the edge.
  • Develop automation and infrastructure-as-code solutions using Terraform, Ansible, and scripting languages (Python, Bash).
  • Deploy and manage containerized applications using Docker and related technologies.
  • Ensure system observability by building and optimizing monitoring systems, particularly using Prometheus.
  • Troubleshoot and optimize Linux-based systems (e.g., Red Hat, CentOS, Ubuntu).

xAI's Grok is a powerful, multilingual large language model available on X and via API, focused on accelerating scientific discovery.

Experience Requirements:

  • Expert in at least one programming language that compiles to machine code such as Rust, C++, or Go.
  • Expert knowledge of monitoring technologies such as Prometheus, Grafana, and PagerDuty.
  • Expert knowledge of deployment technologies such as Pulumi or Terraform.
  • Expert knowledge of Kubernetes.

Responsibilities:

  • Improving our observability by adding/adjusting metrics.
  • Building easily parsable dashboards.
  • Designing and overseeing our on-call rotations.
  • Improving our deployment process to increase reliability.

Luminance is an AI-powered legal tech platform that streamlines contract lifecycle management with features including AI-powered negotiation and an intelligent contract repository.

Education Requirements:

  • Bachelor's or Master's degree with a First or 2:1, preferably in a technical subject.

Other Requirements:

  • Excellent problem-solving skills, including diagnosing issues within complex systems.
  • Ability and desire to identify root causes of issues, and propose and implement structural improvements.
  • Strong communication skills and capability to perform in scenarios with urgency.
  • Knowledge of the design and operation of web-based software applications, based on technologies such as node.js, PostgreSQL, or Elasticsearch.
  • Knowledge of modern infrastructure and operational tooling within cloud-based architectures, such as Linux, Python, AWS, Ansible, Prometheus.

Senior Site Reliability Engineer (Remote)

Fathom is a free AI meeting assistant that records, transcribes, and summarizes your meetings, saving you time and improving productivity.

Experience Requirements:

  • 6+ years.

Responsibilities:

  • Scaling existing tools.
  • Enhancing automation for scaling infrastructure.
  • Playing a key role in diversifying and scaling platform.
  • Evaluating options to replace existing real-time data pipeline.
  • Providing platform support to engineering.

AppTek.ai provides AI-powered speech and language solutions including ASR, NMT, NLP/U, LLMs, and TTS, serving diverse industries globally.

Education Requirements:

  • BS in a field related to Computational Linguistics, Computer/Data Science.

Experience Requirements:

  • 2+ years of industry experience (desirable for Site Reliability Engineer role).

Other Requirements:

  • Strong knowledge of Linux.
  • Strong knowledge of AWS.
  • Docker.
  • Scripting languages (Bash, Python).
  • Familiarity with load-testing tools.
  • Must be U.S. citizen capable of obtaining a Secret clearance (for Computational Linguist and Linguist roles).

Responsibilities:

  • On-call first-level response.
  • Respond to customer issue reports.
  • Troubleshoot problems to maintain service SLAs.
  • End-to-end monitoring across infrastructure and services for metrics/alerts/logs.

Linc's CX automation platform uses AI to streamline retail customer service, boosting efficiency and delighting customers.

Education Requirements:

  • B.S. in Computer Science or a related field.

Experience Requirements:

  • 1+ years of site reliability engineering experience.

Other Requirements:

  • Familiarity with at least one cloud service provider, preferably AWS.
  • Familiar with basic SQL commands and Intent protocols.
  • Proficient in cloud application orchestration tools like Kubernetes, Helm.
  • Experience with monitoring stacks, preferably Datadog.

Responsibilities:

  • Collaborate with engineering teams to define and maintain service SLAs.
  • Monitor metrics, alerts, logs across infrastructure and applications.
  • Create and maintain tools to monitor the platform.
  • Respond to incidents, troubleshoot, investigate root causes.
  • Conduct post-incident investigation and report.

QED.ai provides AI-driven solutions for data scarcity in health and agriculture, offering tools for data digitization, geospatial mapping, and spectroscopy.

Travel to exotic places around the world.

Ask Sage is a versatile, secure Generative AI platform for government and commercial use, offering significant productivity improvements and LLM-agnostic support.

Experience Requirements:

  • 3+ years in site reliability engineering, Kubernetes administration, or related role.
  • Deep expertise in Kubernetes and containers.
  • Strong understanding of cloud infrastructure, automation tools, and best practices for high availability and performance.

Responsibilities:

  • Monitor system performance and reliability.

Hebbia is an enterprise-grade AI platform that empowers knowledge workers by automating complex tasks and providing insights from various data sources. It's designed for seamless integration and high security.

Experience Requirements:

  • 4+ years software development experience at a venture-backed startup or top technology firm.
  • Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role.
  • Strong expertise in managing CI/CD pipelines and deployment automation.
  • Proficiency in cloud platforms such as AWS, Azure, or Google Cloud (we are an AWS shop).
  • Solid understanding of containerization and orchestration technologies such as Docker and Kubernetes.

Contact Detail:

AI Tech Suite Recruiting Team

    AI Tech Suite

    50-100