
Lead Site Reliability Engineer - Remote
Remote, USA


Description

Job #: 83573
If you are looking for a high-impact Site Reliability role with a global leader in digital transformation, EPAM is the perfect next step in your career! As an EPAMer, you’ll have the opportunity to work with a supportive team on a variety of interesting projects for some of the biggest brands in the world. Ready to get started? Apply now!



Responsibilities

  • Lead development teams through architectural reviews and recommendations
  • Define what it means for a service to be available, then develop, monitor, and alert on SLIs/SLOs
  • Define, track, and enforce error budgets
  • Review code instrumentation with development teams and ensure the necessary dashboards are created to monitor SLIs/SLOs/SLAs
  • Establish, test, and tune alerting for varying tiers of applications
  • Participate in the on-call rotation
  • Document and maintain runbooks and procedures, automating as much as possible
  • Plan and execute periodic Disaster Recovery exercises including both tabletop and simulated failures (fault injection)
  • Perform periodic load and scalability testing to establish baselines, detect drift, and inform capacity planning
  • Design and implement peak readiness reviews for anticipated high-volume times
  • Lead weekly operational state reviews covering performance trends, anomalies, errors and other availability events with SREs, product owners, and development teams
  • Participate in quarterly business and operational reviews aligning on roadmaps, development velocity, efficiency, growth trends, etc.
  • Promote SRE culture across teams within the organization to publicize the value of SRE, and mentor and train other engineers in proactive reliability decision-making and planning


Requirements

  • 5+ years of SRE or Systems Engineering experience
  • 2+ years as a team lead or SRE champion
  • Bachelor's degree in Computer Science, similar technical field of study, or equivalent practical experience
  • Proven experience troubleshooting, mitigating, and resolving issues in a distributed system
  • Strong communication and collaboration skills for varying groups of stakeholders
  • Self-motivated, with the ability to prioritize effectively among competing demands
  • Experience implementing SRE practices for services and applications deployed in production in the cloud
  • Must understand most SRE concepts, including SLI/SLO/SLA, Error Budget, MTTD/MTTR/MTBF, Toil, Capacity Planning, Observability, Monitoring/Alerting, Release Engineering, and Incident Management/Blameless Post-Mortems


Benefits

  • Medical, Dental and Vision Insurance (Subsidized)
  • Health Savings Account
  • Flexible Spending Accounts (Healthcare, Dependent Care, Commuter)
  • Short-Term and Long-Term Disability (Company Provided)
  • Life and AD&D Insurance (Company Provided)
  • Employee Assistance Program
  • Unlimited access to LinkedIn learning solutions
  • Matched 401(k) Retirement Savings Plan
  • Paid Time Off
  • Legal Plan and Identity Theft Protection
  • Accident Insurance
  • Employee Discounts
  • Pet Insurance
  • Employee Stock Purchase Program

About EPAM

  • EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multinational teams, contribute to a wide range of innovative projects that deliver creative, cutting-edge solutions, and have the opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.


  • This position is remote, but you must live within driving distance of an EPAM office. Your recruiter will discuss specific details about work location during the initial interview process.
