[MUSIC] Our dataset is located conveniently on the Dataset tab. You'll notice here that the dataset is comprised of two columns: SocialMedia_Mentions and WebsiteVisits. Each row of data reflects counts of the number of mentions of our website on social media each day, and the number of visitors to our website on that same day. We have 100 observations, or rows, in our dataset. For convenience of reference we have named each of these two columns of data SocialMedia_Mentions and WebsiteVisits respectively. These named ranges will save us the hassle of having to click back and forth between the sheets when building our forecasting model. Let's get started on Problem 1A. You see that we are asked to calculate the standard deviation of the SocialMedia_Mentions in our data set. This is a common first step when exploring an unfamiliar dataset. To do this, we'll use Excel's standard deviation function, STDEV. As we are working with a sample set of data for our fictitious website, we'll use the sample variant of this function STDEV.S. The STDEV.S formula requires, as parameters, only the data you wish to calculate the standard deviation of. In this case, we wish to compute the standard deviation of SocialMedia_Mentions. We will parametrize STDEV.S with the named range SocialMedia_Mentions. We see now the standard deviation of SocialMedia_Mentions. I've calculated the average, or mean value of the dataset as 753.46. With a standard deviation of 269.63, how spread out are our data around their mean value? If we were to quickly calculate the ratio of these two values, standard deviation to mean, we'd get a value of about 36%. This suggests that our standard deviation is approximately 36% of our mean. Or that our data are somewhat dispersed, or spread out around the mean value. If this ratio was higher, it would suggest even more dispersion in our data. The higher this ratio is, the lower our confidence should be in the conclusions drawn from the dataset using linear regression. We'd like to better understand our dataset, though, in particular, its shape and distribution. To better understand our data, we will also calculate other's descriptive statistics. Let's calculate the minimum and maximum values in our SocialMedia_Mentions data in order to get a sense of the range of the values these data take on. To calculate the minimum value, we'll use Excel's MIN function. The MIN function requires, as parameters, the values of which you'd like to find the minimum. In this case, we'll parametrize MIN with the SocialMedia_Mentions named range, and set it to work. We can find the maximum value in a similar manner using Excel's MAX function. Similar to MIN, MAX requires, as parameters, the values of which you'd like the find the maximum. Let's parametrize MAX with our SocialMedia_Mentions named range, and set it to work. For further convenience of reference, we have created named ranges for each of the values and ranges we calculate as part of this problem. We have named the cells in which we've just calculated the minimum and maximum values, Min_SocialMedia_Mentions and Max_SocialMedia_Mentions, respectively. We will follow a similar intuitive process for creating named ranges throughout these exercises. When you see us type formula going forward, please note that the named ranges referenced have been created to assist our exercise. We suggest you use named ranges also when completing the exercises in this module and when conducting spreadsheet analysis of your own. It would be ideal to visualize the shape of the data to get a sense of how they are dispersed. One way to do this is to plot a histogram. Histograms show counts of data values which fall within equally-sized buckets across the range of values our data take on. This maybe a bit hard to follow just now, but as we work through this exercise, you'll develop a better understanding of histograms. We've also provided references to additional material on histograms, which we advise you to explore. In the meanwhile though, let's prepare to plot a histogram of our data. Our first step is to determine the number of buckets, or more formally bins, which we will use to collect or bucket values from our dataset. As we have only 100 observations in our dataset, a nice number of bins would be something between 10 and 15. Let's choose 12 as our number of bins, and enter this value into the spreadsheet. Next, we need to determine the size, or width of our bins. Again, think of these as buckets. Let's take our maximum value observed in the dataset and divide that by our desired number of bins to ensure we get a reasonable fit to our dataset. Let's use the INT function to round our result down to the nearest integer. You'll notice that our next highlighted range has automatically populated. These are the values which demarcate each of our 12 bins. Let's explore the calculations necessary to produce the values for each bin. You'll notice that the value of our first bin is 114. When we build our histogram, the number of observations in our dataset of SocialMedia_Mentions, which are less than or equal to 114, will be countered for this bin. We will effectively count the number of observations in the 0 to 114 inclusive bucket of our chart. It follows, then, that we want to increment the next bucket or bin boundary by our bin size of 114. This is exactly what we do to calculate the next bin value. You'll see here, that we take our previous bin value and add our bin size to it. Please note that this logic pervades throughout the balance of our bin calculations. You may wonder why, if we're bucketing our data, we don't calculate our last bucket as a value greater than the maximum value of our data set. We'll address this shortly when we calculate the values for our histogram. Now, let's use the histogram feature of the Analysis ToolPak to bin or bucket our SocialMedia_Mentions dataset, using the bins we've just calculated. To access the Histogram tool, click on the Data tab of the ribbon, and select the Data Analysis button in the Analysis grouping. Next, select the Histogram tool from the list box. This tool will guide us through the process of creating the data for our histogram. The first input to the tool is Input Range. This is the data in our SocialMedia_Mentions named range. You can either type the name of our named range into this field, or select the range from the Dataset tab. The next input to the Histogram tool is the range containing the values of our bins. Let's point the tool to the bin's SocialMedia_Mentions range either by typing the name of the named range, or by selecting the range using the range selection button. We've created a placeholder, which is also linked to a lovely chart, in the spreadsheet for Problem Number 1A. Lastly, let's tell the Histogram tool where to output the data for our histogram too. Let's tell the Histogram tool to output its calculations starting in the cell immediately below the row containing the bolded header, Data Analysis ToolPak Output. We encourage you to examine how this simple Excel column chart was created though, independently. We can determine by examining the histogram, that our data tend to be disbursed around the center-most value bin. It seems that there are fewer and fewer observations of data with values significantly different than the center-most value bin as we go to the left or right of it. In fact, this histogram rather resembles the famous bell curve with which you may be acquainted. The histogram is a very useful tool in developing a sense of the shape and distribution of a series of data. We can also see that values between 685 and 798 occur most frequently within this dataset. As the 798 bin is the tallest column when plotted in our histogram. You will recall our standard deviation of about 270. With a bin size of 114, we can see visually that one standard deviation is about two and a half bins to either side of the central bin. A histogram is a useful tool to allow us to quickly determine the shape of our data. Let's continue calculating descriptive statistics of our SocialMedia_Mentions dataset. You'll recall from our introduction to this module, that the variance is another measure of how spread out values are in a dataset from their mean value. Let's calculate the variance of the SocialMedia_Mentions series. Excel's built-in VAR function will calculate variance for us. Again, as we're working with a sample set of data, we'll use the VAR.S() function to calculate sample variance. VAR.S() requires, as parameters, the values you wish to calculate sample variance for. Let's parametrize VAR.S() with our SocialMedia_Mentions named range and set it to work. Notice that the value of the variance is quite large relative to the values in the SocialMedia_Mentions dataset. It is difficult to relate this value intuitively to the values in the dataset. For this reason, standard deviation, which is the square root of variance, is used more frequently as a measure of the dispersion of the dataset. Let's now examine how the two series in our dataset, SocialMedia_Mentions and WebsiteVisits, change or vary together. One measure of this is covariance. Similar to variance, covariance is a bit difficult to interpret, and so we will subsequently calculate the correlation of these two datasets. As we are working with sample data, we will calculate sample covariance. Excel's COVARIANCE.S function will calculate the sample covariance of two arrays. We will parameterize COVARIANCE.S with the named ranges of our two series of data, SocialMedia_Mentions and WebsiteVisits. Let's do so and set COVARIANCE.S to work. As we are working with sample data, we will calculate sample covariance of two arrays. We will parametrize COVARIANCE.S with the named ranges of our two series of data, SocialMedia_Mentions, and Website_Visits. Let's do so, and set COVARIANCE.S to work. Notice here, similar to our variance calculation, that the covariance number is quite large and difficult to intuitively relate to values in the data. It would be ideal to have a normalized expression of how these two variables, SocialMedia_Mentions and WebsiteVisits, change together. Fortunately, we have just that in correlation. You'll recall that correlation is a normalized measure between -1 and 1, which indicates the direction of the relationship between two variables as their values change. A correlation of 1 would imply that for every one unit of increase in SocialMedia_Mentions, we get one unit of increase in WebsiteVisits. Conversely, a correlation of -1 would imply that for every one unit increase in SocialMediaMentions, we get a one unit decrease in WebsiteVisits. A correlation close to 0 would imply that there is little change in WebsiteVisits, as SocialMedia_Mentions change, either positively or negatively. Let's calculate correlation between SocialMedia_Mentions and WebsiteVisits. Excel's CORREL function will calculate this for us. It requires, as parameters, the two arrays for which it will calculate correlation. Let's set it to work on our two variables. Notice that we have calculated the value of 0.79 as the correlation between SocialMedia_Mentions and WebsiteVisits. As this number is positive and close to 1, we can conclude that these two variables are fairly strongly positively correlated. Let's explore one final measure of the strength of the relationship between our two variables. R-Squared, or the coefficient of determination, is a measure of the proportion of variance in one variable which can be predicted by the other variable. It takes values between 0 and 1. An R-Squared of 1, for our two variables, could be interpreted as 100% of variance in WebsiteVisits can be explained by SocialMedia_Mentions. Conversely, an R-Squared value of 0 could be interpreted as 0% of variance in WebsiteVisits can be explained by SocialMedia_Mentions. Let's calculate R-Squared, or the coefficient of determination, for our two variables, SocialMedia_Mentions and WebsiteVisits. Excel's RSQ function will calculate R-Squared for us, and requires as parameters, the array of observations of each of SocialMedia_Mentions and WebsiteVisits. Note that it asks for known Ys first. So let's be sure to enter our WebsiteVisits, the dependant variable, as the first parameter. Let's set it to work. Notice that we calculate a value of 0.62 as our R-Squared, or co-efficient of determination. We can interpret this as 62% of changes in WebsiteVisits can be explained by SocialMedia_Mentions. This is not a particularly strong R-Squared, as we would prefer a value much closer to 1. Now it's your turn to demonstrate your understanding of standard deviation, variance, histograms, covarience, correlation, and R-Squared as you work through Problem 1B. [MUSIC]