[MUSIC] Welcome to the introduction to statistical forecasting module. During this section of the course, we will explore fundamental statistical methods that are useful for developing forward expectations, or forecasts, from data. Course participants are assumed to have had some previous exposure to statistics, though we will provide reference materials for the concepts presented in this module. We will explain the statistical concepts employed in the following analyses, and we encourage participants to further their study independently. We will spend a bit longer introducing concepts in this module than we have in the others. Given the statistical nature of these concepts, it's important that participants spend time understanding the statistical methods employed in this module, as they are powerful but nuanced. In order to use data to produce statistical forecasts, we need to understand regression analysis. Regression analysis is one of the most commonly used statistical methods for producing data-driven forecasts. Simple linear regression analysis uses one variable, the independent variable, to explain another variable, the dependent variable. For example, you might use a person's height to explain and predict that person's weight. A thorough exploration of linear regression is beyond the scope of this module, but we encourage course participants to study this powerful statistical method, or to take one of the many related courses on Coursera. There are a few concepts that you should understand, at least at a high level, before we proceed to their application in our Excel problem sets. We encourage you to spend time studying these concepts independently if you are unfamiliar with them. Standard deviation is a measurement of the average dispersion of values in a data set around their average value; that is, how spread out the data are from their average. It is related to variance, but it is more frequently used to describe the average dispersion of a data set. 
It follows that higher variance and higher standard deviation should imply lower confidence in the outputs of a statistical forecast using the data. Standard deviation is the square root of variance and, as such, relates more directly to the values in the dataset. Variance, as previously mentioned, also measures how far, on average, a set of data values is spread out from its average, or mean. Higher variance in your data should result in you being less confident in the accuracy of your prediction, because your data are so widely spread out around their average value. Again, variance has the same implication as standard deviation; it is simply squared, which amplifies dispersion from the mean value. Covariance is a measure of how two variables change together. Covariance is not normalized, meaning that there's no meaningful way to compare covariances across different variables. We need another measurement to compare the way two variables change together in order to draw meaningful conclusions. Correlation provides us this normalized measure of covariance, that is, of how two variables change together. Because it is normalized, correlation results in a value between -1 and 1, which gives an objective indication of the strength of the relationship between two variables that we can meaningfully compare across pairs of variables. It also tells us the direction of that relationship, positive or negative. Values close to zero indicate that the relationship is not very strong. R-squared, or the coefficient of determination, is a number that indicates the proportion of the variance in one variable that is predictable from the other variable. A higher R-squared value indicates a better fit of our statistical measurement of the relationship between the variables to the data itself. 
The definition above is important to note: R-squared has a similar interpretation to correlation, though, since it is squared, the direction of the relationship cannot be determined from it. Let's have a high-level overview now of linear regression. Using linear regression, we can quantify the relationship between changes in the independent, or input, variable and changes in the dependent, or outcome, variable. For example, let's look at the relationship among the variables Y, X, m, and B as shown on the slide: Y = mX + B. This relationship could be read as, Y is equal to m multiplied by X, plus B. You may recognize this as the classic slope-intercept form of the equation for a straight line, as we are all taught in algebra. Simple linear regression analysis of a dataset may result in a similar quantified relationship. For example, our regression analysis may find that m = 3 and B = 100. Using this quantified relationship, we can input values of X to predict values of Y that we don't have in our dataset. For example, if we input a value of 100 for X, this quantified relationship produces a value of 400 for Y. We will explore regression analysis in a simplified example. First, though, we must develop a thesis, or hypothesis, for our forecasting relationship. A few additional statistical forecasting concepts are important to understand. The Y-intercept is the point where the graph of a function, or in this case the graph of the relationship between our two variables, intersects the Y-axis. It is the portion of the dependent variable's value that is not explained by our regression analysis and remains constant regardless of the measured relationship between the two variables. It also represents the value of our dependent variable when the independent variable is equal to zero. The slope of a function is a number that describes both the direction and the steepness of the line graph of that function. 
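The slope-intercept prediction described above can be sketched as a tiny Python function, using the example coefficients m = 3 and B = 100 from the regression result mentioned in the lecture.

```python
# Predict Y from X using the slope-intercept relationship Y = mX + B,
# with the example coefficients m = 3 and B = 100 from the lecture.
def predict(x, m=3, b=100):
    return m * x + b

print(predict(100))  # prints 400, since 3 * 100 + 100 = 400
```

With X = 0, the function returns 100, the Y-intercept: the value of the dependent variable when the independent variable is zero.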
This tells us how much we expect our dependent variable to change with every one-unit change in our independent variable. We show two examples of data sets with different standard deviation, variance, correlation, and R-squared values. Notice that higher standard deviation and variance result in a much more dispersed, or spread out, data set around the mean value, while lower standard deviation and variance result in a data set more tightly clustered around the mean. Let's discuss our specific example. It will be simple, exploring the use of regression analysis to predict visits to a website based on the number of social media mentions of the site. Our thesis here is that it's reasonable to think that as mentions of a website on social media increase, the number of people who visit the site will also increase, resulting in increased web traffic, or page hits, to the site. Starting with a thesis like this is fundamental to regression analysis. This statistical method can be used to determine whether there is a correlative relationship between two variables; in our case, the relationship between the number of social media mentions, the independent variable, and visits to our website, the dependent variable. We have measures of the strength of this relationship, which we will discuss later. If these measures indicate a strong relationship, we can conclude that social media mentions and visits to our website are related. A strong relationship determined via linear regression analysis does not, however, imply a causal relationship between the two variables. Said differently, it does not imply that traffic to our website increased because of social media mentions, but rather that web traffic tends to increase along with social media mentions. This is an important distinction, though causal analysis is beyond the scope of this module. 
In order to complete the exercises in this week's problem set, you'll need to enable the Analysis ToolPak in Excel. We've included a link to instructions on how to do this in the reference materials. Please take a moment to ensure that you have the ToolPak enabled before continuing. You'll know that you've successfully enabled the ToolPak if you see it on the Data tab of the ribbon, as in our image below. [MUSIC]