At a Glance
- Tasks: Join us as a Site Reliability Engineer to automate and enhance system reliability.
- Company: Be part of a dynamic team in London focused on cutting-edge technology solutions.
- Benefits: Enjoy a collaborative work environment with opportunities for professional growth.
- Why this job: Make a real impact by reducing manual toil and improving system performance.
- Qualifications: 5-9 years of experience in SRE, automation tools, and cloud technologies required.
- Other info: This is a full onsite position, perfect for tech enthusiasts ready to innovate.
The predicted salary is between £48,000 and £72,000 per year.
SRE Expert (Full Onsite)
Location: London
Responsible for delivering end-to-end self-healing automation solutions to reduce manual effort (toil).
Technical skills: Ansible, Terraform, Python, DevOps, SRE, Docker, AWS (Atlas), ECS-based internal tooling, shell scripting, Linux. Monitoring tools: Datadog, Splunk, Dynatrace, Grafana, ThousandEyes, Gremlin, etc.
- 5 to 9 years of experience with automation principles and tools (Ansible, etc.). Should have worked on toil identification and quality-of-life automation.
- Advanced working experience with two or more of the following: Unix/Linux, Windows Server, Oracle, MSSQL, MongoDB.
- Experience with Python, Java, curl scripting, or any other scripting language.
- Experience with JIRA, Confluence, Bitbucket, GitHub, Jenkins, Jules, Terraform.
- Experience with two or more of the following observability tools: AppDynamics, Geneos, Dynatrace, ECS-based internal tooling, Datadog, CloudWatch, BigPanda, Elasticsearch (ELK), Google Cloud Logging, Grafana, Prometheus, Splunk, ThousandEyes, etc.
- Experience creating dashboards for infrastructure, APM, and end-to-end (E2E) workflows.
- Monitoring, logging, alerting, and error budgets (99.9%, 99.99%, 99.999%) for software, operations, and business.
- Define SLOs, SLIs, and SLAs with business, operations, and engineering teams.
- Experience with logging, monitoring, and event detection on Cloud or Distributed platforms.
- Experience creating and modifying technical documentation such as environment flow, functional requirements, nonfunctional requirements.
- Effective production management: incident and change management, production control, ITSM, ServiceNow; problem-solving and analytical skills, with the ability to turn findings into strategic imperatives.
- Technical operations experience in application support, stability, reliability, and resiliency.
- Minimum 4-6 years of hands-on experience in SRE implementation of monitoring systems, including dashboard development for application reliability using Splunk, Dynatrace, Grafana, AppDynamics, Datadog, and BigPanda.
- Experience working with Configuration as Code, Infrastructure as Code, and AWS (Atlas).
- Provides technical direction on monitoring and logging to less experienced staff, or develops highly complex original solutions. Acts as an expert technical resource for modeling, simulation, and analysis efforts.
- Overall, we are looking for an automation engineer who can reduce toil and improve system reliability and scalability.
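For candidates unsure what the error budget figures in the list above (99.9, 99.99, 99.999 %) mean in practice, the short Python sketch below converts an availability target into allowed downtime. It is only an illustration: the 30-day window and the function name `allowed_downtime_minutes` are assumptions, not something the posting specifies.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability target permits.

    The error budget is the fraction of the window that may be unavailable,
    i.e. 1 minus the availability target. The 30-day default window is an
    assumption made for this example.
    """
    error_budget_fraction = 1 - slo_percent / 100
    return error_budget_fraction * window_days * MINUTES_PER_DAY


if __name__ == "__main__":
    # The three targets named in the posting.
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability -> {minutes:.2f} minutes of downtime per 30 days")
```

Under these assumptions, 99.9% leaves roughly 43 minutes of downtime per 30 days, 99.99% roughly 4.3 minutes, and 99.999% under half a minute.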
Nature of the Job:
1. Collaborate with the production support team to identify existing manual activities and automate them.
2. Identify areas of toil that can be automated to avoid manual intervention.
3. Build a monitoring system and observability platform that provides richer stack traces, alerts, and dashboards.
4. Define SLAs, SLOs, and SLIs and implement them for better monitoring.
5. Scalability, reliability, and observability are the primary goals, with the aim of reducing MTTD and MTTR.
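As a rough illustration of the MTTD and MTTR metrics named in the last item, the Python sketch below averages detection and resolution times over a list of incidents. The incident timestamps are made up for the example, and measuring MTTR from incident start to resolution is an assumption, since teams define it differently (some measure from detection).

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (started, detected, resolved).
INCIDENTS = [
    (datetime(2024, 1, 10, 10, 0), datetime(2024, 1, 10, 10, 5), datetime(2024, 1, 10, 10, 45)),
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 14, 15), datetime(2024, 1, 12, 15, 0)),
]


def mttd_minutes(incidents) -> float:
    """Mean time to detect: average of (detected - started), in minutes."""
    return mean((detected - started).total_seconds() / 60
                for started, detected, _ in incidents)


def mttr_minutes(incidents) -> float:
    """Mean time to resolve: average of (resolved - started), in minutes.

    Measuring from incident start is an assumption made for this example.
    """
    return mean((resolved - started).total_seconds() / 60
                for started, _, resolved in incidents)


if __name__ == "__main__":
    print(f"MTTD: {mttd_minutes(INCIDENTS):.1f} min")
    print(f"MTTR: {mttr_minutes(INCIDENTS):.1f} min")
```

Tracking both numbers before and after an automation change is one simple way to show whether self-healing or better alerting has actually helped.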
Site Reliability Engineer employer: Mphasis
Contact Details:
Mphasis Recruiting Team
StudySmarter Expert Advice 🤫
We think this is how you could land the Site Reliability Engineer role
✨Tip Number 1
Make sure to showcase your hands-on experience with automation tools like Ansible and Terraform. Highlight specific projects where you've successfully reduced manual effort through automation, as this aligns directly with the role's focus on self-healing solutions.
✨Tip Number 2
Familiarize yourself with the observability tools mentioned in the job description, such as Datadog and Grafana. Being able to discuss how you've used these tools to create dashboards or improve monitoring will set you apart during discussions.
✨Tip Number 3
Prepare to discuss your experience with defining SLAs, SLOs, and SLIs. Be ready to articulate how you have implemented these metrics in previous roles to enhance system reliability and performance.
✨Tip Number 4
Be ready to share examples of how you've collaborated with production support teams to identify and automate toil areas. This collaborative mindset is crucial for the role and demonstrates your ability to work effectively within a team.
Some tips for your application 🫡
Tailor Your CV: Make sure your CV highlights your experience with automation tools like Ansible and Terraform, as well as your proficiency in Python and other scripting languages. Emphasize your hands-on experience with monitoring tools such as Datadog and Splunk.
Craft a Strong Cover Letter: In your cover letter, explain how your background aligns with the responsibilities of the Site Reliability Engineer role. Discuss specific projects where you reduced manual effort through automation and improved system reliability.
Showcase Relevant Experience: When detailing your work history, focus on your experience with incident management, production control, and your ability to define SLAs, SLOs, and SLIs. Use metrics to demonstrate your impact on system reliability and scalability.
Highlight Collaboration Skills: Since the role involves collaboration with production support teams, mention any relevant teamwork experiences. Describe how you identified manual activities and successfully automated them, showcasing your problem-solving skills.
How to prepare for a job interview at Mphasis
✨Showcase Your Automation Skills
Be prepared to discuss your experience with automation tools like Ansible and Terraform. Highlight specific projects where you successfully reduced manual effort and improved system reliability.
✨Demonstrate Your Monitoring Expertise
Familiarize yourself with the monitoring tools mentioned in the job description, such as Datadog and Grafana. Be ready to explain how you've used these tools to create dashboards and improve observability.
✨Discuss Your Experience with SRE Principles
Talk about your understanding of SRE principles, including defining SLAs, SLOs, and SLIs. Provide examples of how you've implemented these concepts in previous roles to enhance system performance.
✨Prepare for Technical Questions
Expect technical questions related to scripting languages like Python and shell scripting. Brush up on your knowledge of Unix/Linux systems and be ready to solve problems on the spot.