[soft electronic music] Barry: Have you ever logged on to a website and come across one of these messages? "Page under construction," or "503 Service Unavailable," or something similar? The message may be the result of planned maintenance: the company wants to release updates to its website, so it needs to take the service offline while the changes are implemented. Alternatively, the message may be the result of an unexpected system failure, with engineers trying to fix the problem as quickly as possible. Unexpected or prolonged downtime can be irritating for end users and costly for businesses, including through the loss of customers. For this reason, IT leaders want to avoid service downtime.

But service downtime is unavoidable for IT teams, and it's also a source of two operational challenges. First, developers are expected to continuously improve customer-facing services. To do this effectively, they have to schedule system downtime on a monthly, quarterly, or yearly basis, which allows them to release new updates at a regular cadence. Even though system updates are typically scheduled outside regular business hours, in today's global digital economy, service downtime can still be disruptive for some users.

Second, if a service disruption happens unexpectedly, it may be the result of a team structure issue where developers and operators work in silos. This structure restricts collaboration and obscures accountability, and it doesn't help that the two groups tend to have competing objectives. Developers are responsible for writing code for systems and applications, while operators are responsible for ensuring that those systems and applications run reliably. Developers are expected to be agile and are often pushed to write and deploy code quickly: their aim is to release new functions frequently, increase core business value with new features, and ship fixes fast for an overall better user experience. In contrast, operators are expected to keep systems stable, so they often prefer to work more slowly to ensure reliability and consistency. Traditionally, developers would push their code to operators, who often had little understanding of how the code would run in a production, or live, environment. When a problem arises, it becomes very difficult for either group to identify its source and resolve it quickly. Worse, accountability between the teams is not always clear.

So for organizations to thrive in the cloud, they need to adapt their IT operations in two ways. First, they need to adjust their expectations for service availability from 100% to a lower percentage. Second, they need to adopt best practices from developer operations, or DevOps, and site reliability engineering, or SRE, so teams can be more agile and work more collaboratively with clearer accountability. I'll examine service availability now and present DevOps and SRE in the next video.

You might be thinking, "Why would any business leader adjust their expectations for service availability? Wouldn't they want customers to be able to access their online services 100% of the time?" As I covered earlier, 100% availability is a misleading goal. To roll out updates, operators have to take a system offline, and ensuring anything close to 100% availability is incredibly expensive for any business. At some point, the marginal cost of reliability exceeds the marginal value of reliability.
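To make that trade-off concrete, here is a minimal Python sketch. It is my illustration, not part of the video, and the availability targets are common industry examples rather than figures from this course. It converts an availability percentage into the downtime that target allows over a 30-day month:

    # Illustrative sketch: availability targets below are example values,
    # not figures quoted in the video.
    MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes

    def allowed_downtime_minutes(availability_percent):
        """Minutes of downtime a target permits in a 30-day month."""
        return MINUTES_PER_30_DAY_MONTH * (1 - availability_percent / 100)

    for target in (99.0, 99.9, 99.99, 100.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability -> {minutes:.1f} minutes of downtime per month")

Running this prints 432 minutes for 99%, 43.2 for 99.9%, 4.3 for 99.99%, and 0 for 100%. Each extra "nine" shrinks the allowed maintenance window by roughly a factor of ten, which is why the marginal cost of reliability climbs so quickly.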
To address this challenge, cloud providers use standard practices to define and measure service availability for customers. These practices include service level agreements, service level objectives, and service level indicators.

A service level agreement, or SLA, is a contractual commitment between the cloud service provider and the customer. The SLA provides the baseline level for the quality, availability, and reliability of the service. If the provider fails to meet that baseline, end users and end customers are affected, and the cloud provider incurs a cost, usually paid out to the customer.

A service level objective, or SLO, is a key element within the SLA. It's the goal for the cloud service's performance level, and it's shared between the cloud provider and the customer. If the service performance meets or exceeds the SLO, end users, customers, and internal stakeholders are all happy. If the service performance falls below the SLO but stays above the SLA, the baseline performance expectation, it does not directly affect the end user or end customer, but it does signal the cloud provider to reduce service outages and increase service reliability instead of pushing out new updates.

A service level indicator, or SLI, is a measure of the service provided; SLIs often cover dimensions such as reliability and errors.

This brings me to another important term: error budget. An error budget is the amount of error that a service provider can accumulate over a certain period of time before end users start feeling unhappy. You can think of it as the pain tolerance of end users, applied to a certain dimension of a service, such as availability, latency, and so forth. The error budget is typically the space between the SLA and the SLO, and it gives developers clarity about how many failed fixes they can attempt without affecting the end user experience.

By adjusting service performance expectations with SLAs, SLOs, SLIs, and error budgets, businesses can optimize their cloud environments and create better, more seamless customer experiences. In the next video, I'll explain what DevOps and site reliability engineering are and how organizations can adopt best practices from each to adjust service availability expectations and improve their teams' IT operations.
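To show how these pieces fit together, here is a short Python sketch of the error-budget idea as this section defines it: the space between the SLO and the SLA. The SLA, SLO, and downtime numbers are hypothetical values I chose for illustration, not figures from the video:

    MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes

    SLA_AVAILABILITY = 99.5  # contractual baseline, in percent (hypothetical)
    SLO_AVAILABILITY = 99.9  # shared performance goal, in percent (hypothetical)

    # Downtime each level tolerates over the month.
    slo_downtime = MINUTES_PER_30_DAY_MONTH * (1 - SLO_AVAILABILITY / 100)  # 43.2 min
    sla_downtime = MINUTES_PER_30_DAY_MONTH * (1 - SLA_AVAILABILITY / 100)  # 216.0 min

    # Per this section, the error budget is the space between the SLO and the SLA:
    # downtime past the SLO but still within the SLA breaches no contract,
    # so it can be "spent" on risky releases and failed fixes.
    error_budget = sla_downtime - slo_downtime  # 172.8 min

    observed_downtime = 60.0  # hypothetical SLI measurement for the month, in minutes
    burned = max(0.0, observed_downtime - slo_downtime)  # minutes spent beyond the SLO

    print(f"Error budget this month: {error_budget:.1f} minutes")
    print(f"Budget burned so far:    {burned:.1f} minutes")
    if burned >= error_budget:
        print("Budget exhausted: pause new updates and focus on reliability.")
    else:
        print(f"Budget remaining:        {error_budget - burned:.1f} minutes")

The final check mirrors the signal described above: while budget remains, developers can keep shipping updates; once it's exhausted, the provider shifts effort from new features to reducing outages and increasing reliability.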