Site Reliability Engineer


Full-Time, £60,000 - £80,000 per year (est.), no working from home possible
CMG (Capital Markets Gateway)

At a Glance

  • Tasks: Design and maintain monitoring solutions to ensure system reliability and performance.
  • Company: Join a fintech innovator transforming global equity capital markets.
  • Benefits: Unlimited PTO, equity, comprehensive benefits, and continuous learning opportunities.
  • Other info: Collaborative culture focused on innovation and diversity.
  • Why this job: Make a real impact in a fast-paced environment with cutting-edge technologies.
  • Qualifications: Experience as an SRE, strong programming skills, and cloud platform knowledge.

The predicted salary is between £60,000 and £80,000 per year.

About the Company

Capital Markets Gateway LLC (CMG) is a capital markets‑focused fintech transforming global equity capital markets (ECM) through data, technology, and connectivity. As the preferred source for ECM analytics and the first network connecting the buy‑side and sell‑side for ECM workflows, CMG is committed to reshaping how capital markets operate. Founded in 2017, CMG has completed three successful fundraising rounds and is backed by prestigious financial institutions. The CMG platform is currently relied upon by nearly 150 buy‑side firms representing $40 trillion in AUM and 22 global investment banks.

The Role

CMG is looking for a Site Reliability Engineer (SRE) with a strong focus on monitoring, observability, and alerting to ensure the reliability, performance, and scalability of our infrastructure and applications. You will design, implement, and maintain monitoring solutions to provide visibility into system health and performance, proactively detect anomalies, and reduce incident response time.

Engineering Team

The CMG engineering team consists of domain experts who work collaboratively within a culture of cross‑domain knowledge sharing. Engineers are encouraged to challenge the status quo, seek improvement, and explore solutions with bleeding‑edge technologies such as AI. The team values research, prototyping, and best practices from code review to production rollouts, including pull requests, test automation, code coverage, containerization, and one‑click deployments.

Responsibilities

  • Monitoring & Observability
    • Design, implement, and maintain monitoring and observability solutions using Prometheus, Grafana Stack (Loki/Grafana/Tempo/Alertmanager), Datadog, and OpenTelemetry.
    • Define and implement SLOs, SLIs, and error budgets to measure system reliability.
    • Develop and optimize dashboards, alerts, and reports for system performance and business metrics.
  • Alerting & Incident Management
    • Design actionable alerting strategies to minimize noise and improve MTTR.
    • Integrate alerting systems with Jira.
    • Establish and refine runbooks for on‑call teams to handle alerts efficiently.
    • Empower teams to maintain observability coverage and sound incident response practices.
  • Performance Optimization
    • Analyze system performance metrics, identify bottlenecks, and implement optimizations to improve system efficiency, scalability, and cost‑effectiveness.
    • Help conduct load testing and capacity planning to ensure systems can handle peak traffic loads.
  • Automation and Tooling
    • Identify opportunities for automation and develop tools to streamline operational processes, such as fail‑over, configuration management, and monitoring.
    • Implement monitoring and alerting systems within automations to detect and resolve issues proactively.
  • Collaboration and Communication
    • Collaborate closely with cross‑functional teams, including software engineers, operations, and infrastructure teams, to understand system requirements, provide technical guidance, and drive solutions.
    • Communicate effectively to stakeholders about system changes, incidents, and improvements.
    • Promote and spread SRE principles and practices across the company.

Qualifications

  • Must be based in Latin America.
  • English level: C1 or C2.
  • Proven experience as a Site Reliability Engineer or similar role.
  • Proficiency in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry).
  • Experience with cloud platforms (Azure preferred) and infrastructure‑as‑code tools (e.g., Terraform).
  • Strong programming and scripting skills (Python, Bash).
  • Proficiency in containerization technologies and orchestration tools (Docker, Kubernetes).
  • Understanding of Linux‑based systems, networking, and security principles related to containerized applications.
  • Strong problem‑solving and troubleshooting skills, with a passion for identifying and resolving complex technical issues.
  • Excellent communication and collaboration abilities.
  • Ability to thrive in a fast‑paced, constantly evolving environment.
  • Experience with PostgreSQL monitoring and optimization (optional / nice to have).

Tech Stack

  • Azure as an infrastructure provider.
  • Docker + Kubernetes for microservice orchestration, with an Istio service mesh.
  • PostgreSQL for relational DB, Elasticsearch for indexing, Redis for caching.
  • Datadog, Grafana, and OpenTelemetry for observability.
  • GitHub for version control and CI (with our own runners).
  • CD: Harness and FluxCD.
  • Terraform and Terragrunt as IaC.
  • Python and Bash for scripting infrastructure.
  • React – we maintain multiple single‑page React apps.
  • TypeScript – 99% of our codebase is TypeScript.
  • Latest .NET version for our backend services.
  • GraphQL – our standard for API communication.

Values

  • We innovate with purpose.
  • We focus on outcomes vs. output.
  • We believe diverse and inclusive teams fuel innovation.
  • We are humble yet candid.
  • We do right by the customer.

What We Offer

  • Equity.
  • Unlimited PTO (28 days including bank holidays + unlimited additional paid leave).
  • Comprehensive benefits program managed by Globalization Partners.
  • Premium life and income protection.
  • Top private medical and dental insurance.
  • Employee Assistance Program (EAP).
  • Pension contributions.
  • Hybrid work environment (initially remote until office setup is complete).
  • Education reimbursement.
  • Continuous learning opportunities.
  • Employee referral bonus.
  • Parental leave.

CMG embraces our ongoing commitment to building a culture reflecting the people, perspectives, and passions it represents. We will accept nothing less than equity, inclusion, and belonging for all. With the only constant in life being change, we will always listen, learn, and improve for the betterment of our teams, customers, and communities. CMG is proud to be an Equal Opportunity Employer.

Site Reliability Engineer employer: CMG (Capital Markets Gateway)

Capital Markets Gateway LLC (CMG) is an exceptional employer that fosters a collaborative and innovative work culture, particularly for Site Reliability Engineers. With a strong commitment to employee growth through continuous learning opportunities, unlimited PTO, and comprehensive benefits, CMG ensures that its team members thrive in a dynamic environment while contributing to the transformation of global equity capital markets. For candidates based in Latin America, this role offers a unique chance to work with cutting-edge technologies and be part of a diverse team dedicated to reshaping the future of finance.

StudySmarter Expert Advice 🤫

We think this is how you could land the Site Reliability Engineer role

Tip Number 1

Network like a pro! Reach out to current or former employees at CMG on LinkedIn. A friendly chat can give you insider info and maybe even a referral, which can really boost your chances.

Tip Number 2

Show off your skills in action! If you’ve got a GitHub or personal project that showcases your SRE skills, make sure to highlight it during interviews. It’s a great way to demonstrate your expertise beyond just words.

Tip Number 3

Prepare for technical interviews by brushing up on your problem-solving skills. Practise common SRE scenarios and be ready to discuss how you’d handle system outages or performance issues. We want to see your thought process!

Tip Number 4

Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, it shows you’re genuinely interested in joining the CMG team.

We think you need these skills to ace the Site Reliability Engineer role

Monitoring and Observability
Prometheus
Grafana Stack (Loki/Grafana/Tempo/Alertmanager)
Datadog
OpenTelemetry
SLOs, SLIs, and error budgets
Incident Management

Some tips for your application 🫡

Tailor Your Application: Make sure to customise your CV and cover letter for the Site Reliability Engineer role. Highlight your experience with monitoring tools like Prometheus and Datadog, and show us how you can contribute to our mission of transforming capital markets.

Show Off Your Skills: Don’t hold back on showcasing your technical skills! We want to see your proficiency in Python, Bash, and containerisation technologies. Include specific examples of how you've used these skills in past roles to solve problems or improve systems.

Be Clear and Concise: When writing your application, keep it clear and to the point. Use bullet points where possible to make it easy for us to read through your qualifications and experiences. We appreciate a well-structured application!

Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for the role. Plus, it shows us you’re keen on joining our team!

How to prepare for a job interview at CMG (Capital Markets Gateway)

Know Your Tech Stack

Familiarise yourself with the specific technologies mentioned in the job description, like Prometheus, Grafana, and Azure. Be ready to discuss your experience with these tools and how you've used them to solve real-world problems.

Showcase Your Problem-Solving Skills

Prepare examples of complex technical issues you've encountered and how you resolved them. Highlight your troubleshooting process and any optimisations you've implemented to improve system performance.

Understand SRE Principles

Brush up on Site Reliability Engineering principles, especially around monitoring, observability, and incident management. Be prepared to discuss how you would design actionable alerting strategies and define SLOs and SLIs.
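
As a warm-up for that discussion, here is a minimal Python sketch of how an availability SLI relates to an SLO target and its error budget. The function name `error_budget`, the `SloReport` fields, and the request counts are illustrative assumptions, not anything taken from CMG's stack or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float              # observed fraction of good events in the window
    allowed_failures: float  # failures the SLO target tolerates in the window
    budget_used: float      # fraction of the error budget consumed (can exceed 1)


def error_budget(good: int, total: int, slo_target: float) -> SloReport:
    """Compute an availability SLI and error-budget consumption for one window."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good / total
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    return SloReport(
        sli=sli,
        allowed_failures=allowed_failures,
        budget_used=actual_failures / allowed_failures,
    )


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 500 failed, 99.9% target.
    report = error_budget(good=999_500, total=1_000_000, slo_target=0.999)
    print(f"SLI: {report.sli:.4%}")
    print(f"Error budget used: {report.budget_used:.0%}")
```

In this hypothetical window, 500 failures against a budget of about 1,000 means roughly half the budget is spent. Being able to walk through that arithmetic, and what you would do as the budget nears exhaustion, is a good way to show you understand the concept rather than just the vocabulary.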

Communicate Effectively

Practise articulating your thoughts clearly and concisely. Since collaboration is key in this role, demonstrate your ability to communicate technical concepts to non-technical stakeholders during the interview.