SRE Teams for Running Operations and Systems Effectively

by Jose Luis Amoros | May 23, 2023 | Cloud

Table of Contents

What is SRE?
SRE Teams Build and Operate Distributed Systems
IT Service Management with SRE Teams
SRE vs. DevOps
Top SRE Principles

What is SRE?

Site Reliability Engineering (SRE) originated at Google in 2004, when IT operations teams kept questioning traditional IT administration practices, and that questioning evolved into SRE.

SRE is now considered a discipline that revolutionized IT culture. Google teams have formalized SRE concepts and methods and shared them with the IT community.

This article provides a general overview of SRE. It helps us and our collaborators understand the reasoning behind SRE practices and the best ways (the SRE way) to deploy and maintain IT systems and scale business processes.

It is also key to understand the differences and similarities between SRE and DevOps, along with related terms, and to apply them in your context to improve team collaboration and communication.

SRE Teams Build and Operate Distributed Systems

SREs are engineers who incorporate software engineering and computer science practices into IT infrastructure and improve the design and operations of distributed systems—managing systems and automating operations.

SREs ship features and keep systems up and running, managing risk and aiming for a high level of reliability, a critical characteristic of distributed systems. They use service level indicators (SLIs), or metrics, to set goals, while expecting and accepting incidents (risk tolerance) as a percentage of those metrics.

SRE practices suggest creating a reliability support mindset from the start when building systems to avoid future scalability problems.

IT Service Management with SRE Teams

The traditional approach to running systems, with system administrators (sysadmins) and separate development (Dev) and operations (Ops) teams, creates conflicts that may already be familiar.

The traditional service management model becomes expensive and inefficient when the system grows in size and complexity. In addition, operations and development teams have different skills and visions, and they have difficulty synchronizing their work, such as release dates and changes.

SRE teams are composed of software engineers who write code and also have the systems engineering skills needed to design and run operations (Ops). Having software engineers in operations results in the automation of manual tasks.

SRE teams focus most of their efforts on developing code for automating operations and creating stable and reliable services. This practice reduces their operations workloads to less than 50% of their time and permits devoting more time to creating an automatic system—a system that doesn’t require human intervention.
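
As a rough illustration of the 50% guideline above, the sketch below computes the share of logged hours spent on operations and checks it against the cap. The category names, the `OPS_CAP` constant, and the sample week are assumptions made for this example, not part of any SRE tool.

```python
OPS_CAP = 0.5  # assumed cap: operations should stay below 50% of logged time


def operations_share(hours_by_category: dict[str, float]) -> float:
    """Return the fraction of logged hours spent on operations work."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    return hours_by_category.get("operations", 0.0) / total


# Hypothetical week for one engineer.
week = {"operations": 18.0, "engineering": 22.0}
share = operations_share(week)
within_cap = share < OPS_CAP
print(f"Operations share: {share:.0%}, within cap: {within_cap}")
```

In this made-up week, 18 of 40 hours (45%) go to operations, so the engineer stays under the cap. A share at or above the cap would be the signal to redirect work or invest in automation.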

A system that runs by itself can scale and handle growth without hiring more people to do repetitive tasks (the team grows sublinearly with the system). Managers support teams and redirect excess operational work to development teams, exposing them to operations, which improves their skills and creates a productive working environment (supporting goals, reducing tensions, etc.).

SRE teams keep feedback mechanisms in place throughout the agile development process. They create postmortem reports that investigate the causes of events or incidents (track outages) and apply corrective actions to learn from them and avoid repeating the same issue.

Reliability engineering is about ensuring that the services or features developers provide to users work well. SREs are involved in the early stages of system design and continuously question how the system will be tested.

They share responsibility for code deployment, configuration, and monitoring. They also ensure that services in production are available and performing efficiently, which involves managing latency, change management, emergency response, and capacity planning, among others.

SRE teams do software engineering work: coding, writing scripts and infrastructure code, and adding service features. They also perform systems engineering tasks such as configuring production systems, load balancers, and servers, writing documentation, and setting up monitoring and alerting systems.

The SRE way promotes reliability in development while maintaining a high level of agility. It eliminates complexity and encourages developers to favor simplicity: simple patterns, minimal APIs, modularity, and small releases.

Error Budget

SREs are responsible for making services work for users while maintaining a balance between the pace of innovation and product stability.

Teams plan for failure and expect the systems to be less than 100% reliable. Using the error budget mechanism helps prevent conflicts between development and operations.

SRE teams establish the availability target by considering the level of availability users find acceptable, the alternatives users have, and how they use the product at different availability levels.

The error budget is calculated by subtracting the service's availability target from one (100%). For example, a service with an availability target of 99.0% may be unavailable 1.0% of the time, which is the same as an error budget, or allowed downtime, of 1.0%.
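
The calculation above can be sketched in a few lines. The function names and the 30-day window are assumptions chosen for illustration; the arithmetic is simply one minus the availability target.

```python
def error_budget(availability_target: float) -> float:
    """Return the error budget as a fraction, given a target such as 0.99."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return 1.0 - availability_target


def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Translate the error budget into minutes of downtime over the window."""
    window_minutes = window_days * 24 * 60
    return error_budget(availability_target) * window_minutes


budget = error_budget(0.99)
downtime = allowed_downtime_minutes(0.99)
print(f"Error budget: {budget:.1%}")
print(f"Allowed downtime over 30 days: {downtime:.0f} minutes")
```

For a 99.0% target, the budget is 1.0%, which over an assumed 30-day window works out to about 432 minutes (7.2 hours) of downtime.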

The model creates a neutral incentive for the team to guide decisions about where to spend the error budget and manage release velocity (rate of release). The error budget follows from the Service Level Objective (SLO): it is the downtime or error rate the SLO allows.
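
One way to picture how a budget can steer release velocity is the sketch below. It measures the budget in failed requests instead of downtime, and the "release only while budget remains" rule is a hypothetical policy chosen for this example; the article does not prescribe a specific one. All names and numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float    # failed requests the SLO tolerates in the window
    observed_failures: int     # failed requests seen so far
    remaining_fraction: float  # share of the budget still unspent (negative if overspent)


def budget_status(slo: float, total_requests: int, failed_requests: int) -> BudgetStatus:
    """Compare observed failures with the failures the SLO allows."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed = (1.0 - slo) * total_requests
    remaining = 1.0 - failed_requests / allowed
    return BudgetStatus(allowed, failed_requests, remaining)


def can_release(status: BudgetStatus) -> bool:
    """Hypothetical policy: keep releasing only while budget remains."""
    return status.remaining_fraction > 0.0


healthy = budget_status(slo=0.999, total_requests=1_000_000, failed_requests=400)
exhausted = budget_status(slo=0.999, total_requests=1_000_000, failed_requests=1_200)
print(f"healthy: {healthy.remaining_fraction:.0%} left, release={can_release(healthy)}")
print(f"exhausted: {exhausted.remaining_fraction:.0%} left, release={can_release(exhausted)}")
```

With these made-up numbers, the first window still has roughly 60% of its budget and releases continue, while the second has overspent its budget and releases would pause until reliability recovers.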

Embrace Risk

SRE practices balance the risk of unavailability with launching new features and efficient operations. However, reaching high levels of reliability may have associated costs. Therefore, it’s critical to measure the risks of improving the system against the features we provide to users and come up with the right availability target.

SREs work with product managers to identify the appropriate level of risk tolerance and consider the level of availability, effects of failures, and other metrics. Risk tolerance varies by use case, consumer, and infrastructure services.

Service Level Terms

SRE teams define the level of service and measure and evaluate system behavior against it. The metrics are chosen according to the use case and expressed as percentages, rates, or counts.

The type of service and its user base are key when deciding which indicators to choose.

• Service Level Indicators (SLI): a quantitative measure of the service

⁃ Request latency: time to respond to requests

⁃ Error rate

⁃ System throughput

⁃ Availability (yield)

⁃ Durability

• Service Level Objectives (SLO): a target value or range for a service level, as measured by an indicator (SLI), that sets performance expectations. Services are managed by their SLOs.

• Service Level Agreements (SLA): a contract that explains the consequences and/or penalties for missing objectives (SLOs).
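
To make the relationship between the terms above concrete, the sketch below computes two SLIs (availability as yield, and the share of requests under a latency threshold) from a handful of request records and compares each with an SLO target. The `Request` type, the 300 ms threshold, the targets, and the sample data are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def availability_sli(requests: list[Request]) -> float:
    """Yield: fraction of requests served successfully."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests answered within the latency threshold."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the measured SLI reaches the target."""
    return sli >= slo_target


# Hypothetical sample: three successes, one failure, one slow response.
sample = [Request(120, True), Request(90, True), Request(450, True), Request(80, False)]
availability = availability_sli(sample)
fast_enough = latency_sli(sample, threshold_ms=300)
print(f"availability={availability:.2f}, meets 0.99 SLO: {meets_slo(availability, 0.99)}")
print(f"latency SLI={fast_enough:.2f}, meets 0.70 SLO: {meets_slo(fast_enough, 0.70)}")
```

Both indicators come out at 0.75 for this tiny sample, so the availability SLO of 0.99 is missed while the looser latency SLO of 0.70 is met. An SLA would then attach consequences to missing such objectives.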

Reducing Toil

In SRE terminology, toil refers to manual, repetitive tasks with no enduring value that can be automated. The SRE team's goal is to minimize toil because it scales linearly as services grow. Teams manage toil and reduce it to win back valuable engineering hours.

SRE vs. DevOps

Businesses are continuously trying to improve their operations and systems. SRE and DevOps practices are similar in that both tackle common problems and aim to improve IT management.

DevOps practices are strong on organizational culture and practices of collaboration between operations and product development. They also strive to reduce risk by introducing small changes, breaking down silos, speeding the resolution of incidents, measuring outputs, using the same tools, etc.

SRE teams adopt DevOps practices and have a similar organizational model and operational approach. However, SREs put less emphasis on cultural change and implement DevOps principles and practices focused on solving operational problems using software engineering.

SRE is more detailed and precise regarding operations services, with more defined responsibilities and end-user-oriented actions. SRE teams also benefit from the support of agile leadership to contribute to the agile culture and create high performance.

Top SRE Principles

1. Treat operations as software problems and solve them using software engineering.

2. Eliminate or reduce toil (i.e., manual and repetitive work).

3. Share ownership and skill set with product developers (shared ownership model).

4. Promote early discovery of problems to reduce the cost of failure.

5. Automate as much as possible.

6. Manage services by service level objectives (SLOs).

7. Use the same tools in operations and product development.

About Us

Krasamo is a cloud solutions provider helping businesses design, migrate, and optimize cloud architectures that are secure, scalable, and cost-efficient.


17 Comments

Arne Karlsson on September 9, 2025 at 11:04 am
I just wanted to add some extra info about durability in SRE teams! It’s super crucial for cloud infrastructure design, and cloud consulting services can also help with that! By implementing autoscaling and load balancing, you can ensure your app remains available even during traffic spikes. Thanks for the insightful post!
Arben Frei on October 2, 2025 at 5:48 pm
What a thought-provoking article! I completely agree with the importance of service level objectives (SLOs) in ensuring our systems run smoothly. Have you considered integrating cloud advisory services to help monitor and optimize SLOs? It would be fascinating to hear about your team’s experiences with reducing toil and its impact on overall efficiency.
Chantelle Cafferky on October 15, 2025 at 4:29 pm
I’d love to hear more about implementing SRE teams with AWS consultancy expertise.
- Ādams Cīrulis on October 22, 2025 at 4:20 pm
  I’ve read through the article about SRE teams, but I don’t see how implementing them with AWS consultancy expertise is a relevant topic. The post already provides an in-depth look at the benefits and practices of SRE teams. If you’re looking for cloud advisory services on how to implement SRE, perhaps consider hiring a consultant who can provide tailored advice rather than relying on a generic blog post?
- Marcella Lemmens on February 4, 2026 at 3:31 pm
  I completely agree that implementing SRE teams with AWS consultancy expertise can be a game-changer for organizations. By adopting this approach, teams can leverage automation, streamline operations, and improve overall system reliability. I’d love to see more discussion on how to integrate SRE best practices into DevOps workflows, particularly when it comes to leveraging cloud-based services like AWS.
- Victor Khurana on February 5, 2026 at 3:50 pm
  I’d love to see a follow-up post on implementing SRE teams with AWS consultancy expertise! The concept of embedding software engineering practices into IT infrastructure is fascinating, and I’m sure many readers would appreciate guidance on how to apply this in the cloud. Perhaps you could explore the role of cloud consulting services in supporting SRE teams?
Elif Malakova on November 24, 2025 at 3:43 pm
I’m not impressed by this list, all these SRE principles are just common sense. Anybody who’s worked with cloud services like AWS knows that automation is key. SRE teams should be using tooling like Terraform and CloudFormation to automate ops, duh!
John Linton on December 16, 2025 at 12:00 pm
I must say that I appreciate the effort to formalize SRE practices; however, I find the blog post to be somewhat superficial in its explanation of the discipline’s intricacies, especially for those familiar with DevOps and SRE teams.
Pedro Mccarthy on February 6, 2026 at 5:56 pm
I thoroughly enjoyed this blog post on the evolution of SRE and its impact on IT culture! As a software engineer, I’ve had the opportunity to work with several teams that have adopted SRE practices. It’s essential for organizations to leverage cloud advisory services to ensure seamless integration of SRE principles into their infrastructure. This enables them to achieve higher system reliability and scalability.
Deane Lahy on February 25, 2026 at 10:55 am
Honestly, I’m so down for this conversation about SRE vs DevOps! 🤓 As a data scientist, I’ve worked with both practices and can attest that they’re not mutually exclusive. In fact, many companies leverage cloud advisory services to integrate the best of both worlds. By adopting SRE principles, teams can improve operational efficiency while still benefiting from the cultural shifts promoted by DevOps. Great food for thought! 💡
Jimena Burgos on March 5, 2026 at 1:51 pm
Love this post! I completely agree that treating ops as a software problem is crucial for efficient system management. Additionally, leveraging DevOps practices can help SRE teams streamline their workflow and improve collaboration with product dev teams. By automating repetitive tasks and utilizing the same tools in both ops and development, we can truly elevate our service level objectives (SLO). Thanks for sharing!
Ambar Perez on March 13, 2026 at 9:50 am
Great post! 🤩 I’d like to add that our site reliability engineering team works closely with DevOps and cloud teams to ensure seamless integration of monitoring, logging, and alerting tools. This allows for a more comprehensive understanding of system behavior and enables data-driven decisions on risk tolerance and availability targets.
Thanks for sharing this insightful content! 😊
Reet Kalmus on March 17, 2026 at 1:17 pm
I couldn’t agree more with the importance of SRE teams in ensuring system reliability and scalability! I’d like to add that cloud advisory services can also play a crucial role in helping organizations navigate the complexities of cloud migration and deployment. By leveraging cloud expertise, companies can focus on innovation while leaving the operational heavy-lifting to experienced professionals. Thanks for sharing your insights!
Grant Sanders on April 15, 2026 at 11:51 am
I feel u, traditional IT ops is a real pain in the neck! I’ve had my fair share of dealing with ticket storms and outages. SRE teams are def the way to go! We just started implementing AWS consultancy best practices for our client and it’s already saved them so much time and headache. With automation and code-driven operations, you can scale your team and reduce opex. Great post, keep it up!
Alea Gasser on April 21, 2026 at 2:52 pm
I completely agree with your SRE principles, especially the shared ownership model where ops engineers work closely with product developers to tackle complex issues together. It is also important to note that having a good cloud consulting services provider can help teams implement these principles and achieve their desired service level objectives (SLO).
Saskia Deáková on April 23, 2026 at 3:47 pm
I’m totally stoked about this post! I’ve been trying to implement DevOps principles in my own projects, but I didn’t realize how similar they are to SRE practices. One thing that’s super important for companies is cloud advisory services – having the right experts on board can make all the difference in scaling and optimizing operations. Great job breaking down these complex concepts! Keep it up!
Ndumiso Gatsheni on May 13, 2026 at 2:15 pm
Hey there! I really enjoyed this post on SRE teams and their goals, especially the part about reducing toil. I’m curious though – how do you envision cloud consulting services playing a role in implementing these concepts? Can you elaborate more on how companies can leverage SRE principles to improve their ops and systems? Thanks!