
SRE's holy trinity decoded: SLIs, SLOs, SLAs let you tolerate just enough fuckups to deploy without lawsuits or refunds
Site reliability engineering (SRE) measures reliability with three related concepts: Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs). An SLI is a measurement of a service's behavior from the user's perspective, such as latency, error rate, or availability. An SLO is the target set for an SLI, chosen on the understanding that perfection is both unattainable and unnecessarily expensive; for instance, an SLO might require 99.9% of requests to complete within 500 ms. An SLA, by contrast, is a legally binding agreement with users that spells out refunds or penalties for violations, such as a refund if checkout API availability falls below 99.5%. Each SLO implies an error budget: the amount of failure the target still permits (a 99.9% SLO leaves 0.1% of requests free to miss). The budget is a decision-making tool for balancing deployment velocity against reliability: while budget remains, teams can take calculated risks and ship; once it is spent, reliability work takes priority. Used together, these concepts let SREs keep reliability manageable, prioritize user experience, make informed deployment decisions, and stay clear of the legal and financial consequences of breaching an SLA. The approach comes down to measuring what matters to users and accepting a degree of failure as part of the process.
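To make the error budget arithmetic concrete, here is a minimal Python sketch. It assumes a request-based SLI (good requests divided by total requests) over a single measurement window. The names `SloWindow`, `budget_remaining`, and `can_deploy`, along with the 10% deploy threshold, are hypothetical choices for illustration and are not taken from any particular tool or policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """One measurement window for a request-based SLI (illustrative only)."""
    slo_target: float    # e.g. 0.999 for "99.9% of requests are good"
    total_requests: int  # all requests observed in the window
    good_requests: int   # requests that met the criterion (e.g. under 500 ms)

    @property
    def sli(self) -> float:
        # With no traffic there is nothing to violate, so report a perfect SLI.
        if self.total_requests == 0:
            return 1.0
        return self.good_requests / self.total_requests

    @property
    def allowed_bad(self) -> float:
        # The error budget expressed as a number of requests allowed to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent; negative means overspent."""
        bad = self.total_requests - self.good_requests
        if self.allowed_bad == 0:
            # A 100% target (or an empty window) leaves no budget to spend.
            return 1.0 if bad == 0 else float("-inf")
        return 1.0 - bad / self.allowed_bad


def can_deploy(window: SloWindow, min_remaining: float = 0.1) -> bool:
    # Assumed policy: allow risky deploys only while at least
    # `min_remaining` of the budget is left. The 10% default is arbitrary.
    return window.budget_remaining >= min_remaining


if __name__ == "__main__":
    # 1,000,000 requests at a 99.9% target allows roughly 1,000 bad requests.
    window = SloWindow(slo_target=0.999, total_requests=1_000_000,
                       good_requests=999_400)
    print(f"SLI: {window.sli:.4%}")
    print(f"Budget remaining: {window.budget_remaining:.1%}")
    print(f"Deploy allowed: {can_deploy(window)}")
```

In this sketch, 600 bad requests against a budget of about 1,000 leaves a healthy share unspent, so the hypothetical policy allows the deploy. The same ratio could also be compared against an SLA threshold such as 99.5%, which is usually set looser than the SLO so the internal target is missed well before any refund is owed.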