
Lead Site Reliability Engineer (Azure) - Remote
Remote in Canada


Description

Job #: 92518

We’re looking for an expert with a strategic view who can work closely with the client to build a baseline for the SRE team: identifying SLOs and SLAs, forming error budgets, defining system tolerance, establishing metric baselines, and so on. The role goes beyond ‘motivate and collaborate’: you will do hands-on work such as writing strategy documents and driving workshops with clients.



Responsibilities

  • Take responsibility for the technical solution by providing leadership to the customer, project manager, domain architects, domain specialists, and application engineers to advance and deliver solutions
  • Analyze, execute, and streamline DevOps practices
  • Automate processes with the right tools
  • Facilitate the development process and operations
  • Set up a continuous build environment to speed up software development and deployment
  • Architect comprehensive and efficient practices
  • Guide development and operations teams when issues arise
  • Monitor, review, and manage technical operations
  • Consult and inform architects to design and deliver solutions
  • Assess the merits of alternative technical approaches and gain consensus on the best approach
  • Learn, follow, promote, and improve recognized methodologies to design and deliver solutions
  • Ensure that non-functional requirements are satisfied, including but not limited to security, disaster recovery, availability, and performance
  • Mentor IT professionals
  • Work with Jira, Confluence, and Bitbucket


Requirements

  • Solid Linux/Unix systems administration background
  • E-commerce domain
  • Continuous Integration orchestration
  • Continuous Delivery and Continuous Deployment orchestration
  • Infrastructure as Code
  • Public Cloud: Azure Cloud
  • Container orchestration: Kubernetes (GKE), Docker Swarm
  • Docker, Docker Compose
  • Helm Charts
  • Configuration Management - Ansible
  • SCM - source control management
  • GitHub, GitHub Actions, Gitflow
  • Build tools: Ant, Maven, Gradle, Node
  • Java support and troubleshooting, Apache Solr, ZooKeeper, SAP Hybris (e-commerce), Tomcat
  • Artifacts management, Artifactory
  • SonarQube, quality gates, Veracode
  • Experience with load balancers / reverse proxies (nginx)
  • Networking and network troubleshooting


Benefits

  • Extended Healthcare with Prescription Drugs, Dental and Vision Insurance, and Healthcare Spending Account (Company Paid)
  • Maternity/Parental/Adoption Leave Top-up
  • Life and AD&D Insurance (Company Paid)
  • Employee Assistance Program (Company Paid)
  • Unlimited access to LinkedIn Learning solutions
  • Long-Term Disability
  • Registered Retirement Savings Plan (RRSP) with company match
  • Paid Time Off
  • Critical Illness Insurance
  • Employee Discounts
  • Employee Stock Purchase Program

About EPAM

  • EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multinational teams, contribute to a myriad of innovative projects that deliver creative, cutting-edge solutions, and have the opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
