WHAT IS SRE?
Site Reliability Engineering (SRE) originated at Google in 2004, when IT operations teams kept questioning traditional IT administration practices; that questioning evolved into SRE.
SRE is now considered a discipline that has reshaped IT culture. Google teams have formalized SRE concepts and methods and shared them with the wider IT community.
This paper gives a general overview of SRE. It is meant to help us and our collaborators understand the reasoning behind SRE practices and the recommended ways (the SRE way) to deploy and maintain IT systems and scale business processes.
It is also key to understand the differences and similarities between SRE and DevOps (SRE vs. DevOps) and related terms, and to apply them in your own context to improve team collaboration and communication.
SRE TEAMS BUILD AND OPERATE DISTRIBUTED SYSTEMS
SREs are engineers who apply software engineering and computer science practices to IT infrastructure. They improve the design and operation of distributed systems by managing those systems and automating operations.
SREs ship features and keep systems up and running while managing risk and aiming for a high level of reliability, a critical characteristic of distributed systems. They use service level indicators (SLI), or metrics, to set goals, and they expect and accept a certain amount of failure (risk tolerance), expressed as a percentage of those metrics.
SRE practices suggest adopting a reliability mindset from the start when building systems, to avoid scalability problems later.
IT SERVICE MANAGEMENT WITH SRE TEAMS
The traditional approach to running systems, with system administrators (sysadmins) and separate development (Dev) and operations (Ops) teams, produces conflicts that may already be familiar.
The traditional service management model becomes expensive and inefficient as the system grows in size and complexity. In addition, the operations and development teams have different skills and visions, and they have difficulty synchronizing their work (release dates, changes, etc.).
SRE teams are composed of software engineers (who write code) who also have the systems engineering skills needed to design and run operations (Ops). Having software engineers in operations leads to the automation of manual tasks.
SRE teams focus most of their effort on writing code that automates operations and on creating stable, reliable services. This practice keeps their operational workload below 50% of their time and frees the rest for building an automatic system, one that does not require human intervention.
A system that runs by itself can scale and handle growth without hiring more people to do repetitive tasks, so the team grows sublinearly with the system. Managers support teams and redirect excess operational work to development teams. This exposes developers to operations, which improves their skills and creates a productive working environment (supporting goals, reducing tensions, etc.).
SRE teams keep feedback mechanisms in place throughout the agile development process. They create postmortem reports that investigate the causes of events or incidents (track outages) and apply corrective actions to learn from them and avoid repeating the same issue.
Reliability engineering is about ensuring that the services or features that developers provide to their users work well. SREs are involved in the early stages of system design and continuously question how the system will be tested.
They share responsibility for code deployment, configuration, and monitoring. They also ensure that services in production are available and perform efficiently, which covers latency, change management, emergency response, and capacity planning, among others.
SRE teams do software engineering work: coding, writing scripts and infrastructure code, and adding service features. They also perform systems engineering tasks such as configuring production systems, load balancers, and servers, writing documentation, and setting up monitoring and alerting systems.
The SRE way promotes reliability during development while maintaining a high level of agility. It eliminates complexity and encourages developers to favor simplicity: simple patterns, minimal APIs, modularity, and small releases.
SREs are responsible for making services work for users while maintaining a balance between the pace of innovation and product stability.
Teams plan for failure and expect the systems to be less than 100% reliable. Using the error budget mechanism helps prevent conflicts between development and operations.
SRE teams establish the availability target by considering the level of availability that users find acceptable, the alternatives users have, and how users' use of the product changes at different availability levels.
The error budget is one (1) minus the availability target of the service, expressed as a percentage (%). For example, a service with an availability target of 99.0% may be unavailable 1.0% of the time; that 1.0% of allowed downtime is its error budget.
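The arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function names and the 30-day measurement window are assumptions made for the example.

```python
from datetime import timedelta


def error_budget(availability_target: float) -> float:
    """Return the error budget as a fraction: 1 minus the availability target."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be a fraction between 0 and 1")
    return 1.0 - availability_target


def allowed_downtime(availability_target: float, window: timedelta) -> timedelta:
    """Translate the error budget into downtime over a measurement window."""
    return window * error_budget(availability_target)


if __name__ == "__main__":
    target = 0.99                 # 99.0% availability target, as in the text
    window = timedelta(days=30)   # assumed measurement window
    print(f"error budget: {error_budget(target):.1%}")
    print(f"allowed downtime: {allowed_downtime(target, window)}")
```

With a 99.0% target, the 1.0% budget over an assumed 30-day window works out to 0.3 days, or about 7.2 hours of allowed downtime.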
The model gives development and operations a shared, neutral measure for deciding where to spend the error budget and for managing release velocity (rate of release). The error budget follows from the Service Level Objective (SLO): it is the downtime that the objective still accepts.
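One way to tie release velocity to the error budget is a simple release gate. The sketch below is illustrative only: the BudgetStatus type, the request-based accounting, and the policy of pausing releases once the budget is spent are assumptions for the example, not rules stated in this paper.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    """Hypothetical snapshot of one measurement window."""
    slo_target: float       # availability objective as a fraction, e.g. 0.99
    total_requests: int     # requests served in the window
    failed_requests: int    # requests that counted against the objective

    @property
    def allowed_failures(self) -> float:
        # The error budget, expressed as a number of requests.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # Share of the budget still unspent; negative when overspent.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def can_release(status: BudgetStatus) -> bool:
    """Assumed policy: keep releasing only while some budget remains."""
    return status.remaining_fraction > 0.0


if __name__ == "__main__":
    healthy = BudgetStatus(slo_target=0.99, total_requests=100_000, failed_requests=400)
    overspent = BudgetStatus(slo_target=0.99, total_requests=100_000, failed_requests=1_200)
    print(can_release(healthy), can_release(overspent))
```

Because the gate depends only on measured data, neither development nor operations has to argue the decision case by case.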
SRE practices balance the risk of unavailability against launching new features and running operations efficiently. Reaching high levels of reliability has a cost, however. It is therefore critical to weigh the cost of making the system more reliable against the features we provide to users and to settle on the right availability target.
SREs work with product managers to identify the appropriate level of risk tolerance, considering the target level of availability, the effects of failures, and other metrics. Risk tolerance varies by use case and differs between consumer services and infrastructure services.
Service Level Terms
SRE teams define the level of service, then measure and evaluate how the system behaves against it. The metrics are chosen according to the use case and expressed as percentages, rates, or numbers.
The category of service and its user base are key when deciding which indicators to choose.
• Service Level Indicators (SLI): quantitative measures of the service, for example:
⁃ Request latency: time to respond to requests
⁃ Error rate
⁃ System throughput
⁃ Availability (yield)
• Service Level Objectives (SLO): a target value for a service level, as measured by an indicator (SLI), that sets performance expectations. Services are managed by their SLOs.
• Service Level Agreements (SLA): a contract that explains the consequences and/or penalties for missing objectives (SLO).
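The relationship between indicators and objectives can be sketched as follows. This is a minimal example under stated assumptions: the function names are hypothetical, the request counts and latency samples are made up, and the nearest-rank percentile is only one of several common definitions.

```python
import math


def availability(successful: int, total: int) -> float:
    """Availability (yield) SLI: share of requests served successfully."""
    return successful / total if total else 1.0


def latency_percentile(latencies_ms: list[float], pct: int) -> float:
    """Latency SLI using the nearest-rank percentile of observed samples."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(measured: float, target: float, higher_is_better: bool = True) -> bool:
    """Compare a measured SLI with its SLO target."""
    return measured >= target if higher_is_better else measured <= target


if __name__ == "__main__":
    # Made-up measurements for one window.
    sli_availability = availability(successful=995, total=1000)
    sli_latency = latency_percentile([120, 80, 200, 95, 110, 300, 90, 105, 100, 85], 90)

    # Assumed objectives: 99% availability, 90th-percentile latency within 250 ms.
    print(meets_slo(sli_availability, 0.99))
    print(meets_slo(sli_latency, 250, higher_is_better=False))
```

The SLI is what gets measured and the SLO is the threshold it is compared against. An SLA would add contractual consequences for missing that threshold, which is outside what code like this decides.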
In SRE terminology, toil refers to manual, repetitive tasks that have no enduring value and can be automated. Left alone, toil scales linearly as services grow, so the SRE team’s goal is to minimize it. Teams track toil and reduce it to win back valuable engineering hours.
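Tracking toil can be as simple as comparing logged hours against the 50% ceiling on operational work mentioned earlier. The sketch below assumes a hypothetical time log keyed by work category; the category names and hours are invented for illustration.

```python
TOIL_CEILING = 0.5  # operational work should stay below 50% of team time


def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Return the share of logged hours spent on categories labeled as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for category, h in hours_by_category.items() if category in toil_categories)
    return toil / total


def exceeds_ceiling(fraction: float, ceiling: float = TOIL_CEILING) -> bool:
    """Flag a period in which toil has reached or passed the ceiling."""
    return fraction >= ceiling


if __name__ == "__main__":
    # Hypothetical one-week log for a single engineer, in hours.
    week = {"tickets": 12, "manual_deploys": 10, "automation": 14, "design": 4}
    fraction = toil_fraction(week, toil_categories={"tickets", "manual_deploys"})
    print(f"toil: {fraction:.0%}, over ceiling: {exceeds_ceiling(fraction)}")
```

A week that crosses the ceiling is a signal to automate the most expensive category or to redirect excess operational work, in line with the practices described above.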
SRE VS. DEVOPS
Businesses are continuously trying to improve their operations and systems. SRE and DevOps practices take similar approaches to common problems and to improving IT management.
DevOps practices emphasize organizational culture and collaboration between operations and product development. They also strive to reduce risk by introducing small changes, breaking down silos, speeding up the resolution of incidents, measuring outputs, using the same tools, etc.
SRE teams adopt DevOps practices and have a similar organizational model and operational approach. However, SREs put less emphasis on cultural change and implement DevOps principles and practices with a focus on solving operational problems through software engineering.
SRE is more detailed and precise about how services are operated, with more clearly defined responsibilities and end-user-oriented actions. SRE teams also benefit from the support of agile leadership, which helps them contribute to the agile culture and achieve high performance.
TOP SRE PRINCIPLES
1. Treat operations as software problems and solve them using software engineering.
2. Eliminate or reduce toil (i.e., manual and repetitive work).
3. Share ownership and skill set with product developers (shared ownership model).
4. Promote early discovery of problems to reduce the cost of failure.
5. Automate as much as possible.
6. Manage services to meet their service level objectives (SLO).
7. Use the same tools in operations and product development.