In the last module we looked at comparing a single mean or proportion against a fixed value. In this module we'll extend this by comparing two means or two proportions against one another. In most settings, we wonder whether doing one thing differently will change an outcome versus not doing anything at all. Some examples: will you live longer if you exercise versus not exercising at all? Or will customers use their credit cards more if you give them cashback or points for free travel as a reward? Clearly, some, if not most of us, exercise because we want its health benefits. But how do we come to have such beliefs? Well, it's because scientists have compared the health and longevity of those who exercise versus those who don't. And why do companies decide to use one promotional tactic versus another? Because comparing results under each tactic has given them cause to think that one outperforms the other. That is what we will be learning in this module. Basically, we will learn how to correctly conduct comparisons between groups, and then how to decide whether we see significant differences between groups. I'm proud to tell you that what we are about to learn has deep roots at the University of Illinois, right here across from my office. From my office window, I can see the Morrow Plots, which look beautiful during the summer months with their growing field of corn, kind of like what you see in this picture. In 1876, George Morrow and another researcher, Manley Miles, used these plots to conduct experiments. They wanted to know how to increase the yield of corn. They applied fertilizer to some parts and not to others, and then compared the two to establish what type and how much fertilizer to use for a better harvest. They repeated the experimentation for other crops as well. They did many more experiments and analyzed all the results using statistics. And through their studies came many important findings that improved farming significantly.
Everything they did was based on two sciences: an understanding of plant science and statistics. Back in the 1800s, statistics was mostly used in the field of agriculture, and the statistical analysis they used more than a century ago is applied the same way in other disciplines today. So let's see how we do comparisons between two groups. You know by now that if I take a sample from a population and then take another sample from the same population, more than likely the two samples will have different means. So it comes as no surprise that when I take samples from two different populations, I will see a difference between them. The real question here is when a difference is a real difference and not just random, acceptable natural variation. We answer this by doing a hypothesis test about the observed difference at some given level of significance. When comparing the means of two populations through samples we have taken from each population, the process of hypothesis testing is the same as we saw in the last module. That is, we start by stating the null and alternate hypotheses, then specify the level of significance, alpha, calculate the p-value, and then, based on the results, we either reject or do not reject the null hypothesis. To take you through this lesson, we will begin with an example. A store brand battery claims that it lasts as long as a more expensive national brand. A consumer watchdog group wants to make sure that the claim being made by the store brand is not false, and it would like to test this at the 5% level of significance. Consumer Reports is one such organization, which tests many consumer goods and then publishes the results to help consumers find the best value for their budget when shopping for new products. Since here we are concerned about the mean time the batteries last, we can frame the problem as the difference between the two population means. Let's call the store brand A, and the national brand B.
Then mu sub A represents the mean time brand A lasts; we will record this in hours. And mu sub B represents the mean time brand B lasts. The claim that the store brand makes is that it lasts as long as the national brand. In other words, mu sub A is equal to mu sub B. To rewrite this as a set of null and alternate hypotheses, the null will be that the difference between the means is zero, and the alternate will be that the difference is not equal to zero. Just as it was for single-sample hypothesis testing, this is a two-tail test. We can also have a one-tail test, and I will do an example of that later on. For now, we need to look at the collected data to settle this question. The watchdog group wants to check on this at the 5% level of significance. Before we can move on to step three, we need to have data for comparing these two brands. Just a reminder that this part, gathering the data, is not a trivial task. You are comparing two different brands, and you want to make sure that, as they say, you're comparing apples to apples and not to oranges. You have to be very careful in collecting this data. For example, as best you can, the method used for testing should be the same for both brands, and the types of batteries tested, like AAA or D, should be about the same in each group. One thing that is not required is an equal number of observations in each sample. We can have samples of different sizes. But here, since the agency is doing the testing, they would use the same sample size for both brands. If you were giving a survey, you might not get the same number of responses back, and that would be okay. For now, we will assume that the data has been produced from batteries which were selected randomly and independently from the two populations. Like before, when we did hypothesis tests for one sample, the p-value is going to be the probability of finding sample results at least as extreme as the ones we have found, assuming the null hypothesis is true.
This probability, which we call the p-value, is found by knowing how many standard errors separate our estimate of the difference from the hypothesized difference. We refer to that number as the test statistic, and it is denoted by t. Remember that when we did the hypothesis test about the mean of one population, this is how we calculated the t value. When comparing two populations, we use the same structure, but rather than a single sample mean, it uses the difference between the two sample means. Let's take a moment to dissect the second equation and see how it relates to the first equation, which you have seen before. In the numerator, the first term is the difference between the two sample means, each coming from its respective population; it replaces the single sample mean which you see in the first equation. The second term in the numerator is D sub 0. This is the hypothesized difference between the two populations, and it replaces the hypothesized mean that you see in the first equation. In this example, we hypothesize this difference to be 0. The denominator is the standard error. Calculating the standard error can get quite complicated, but fortunately, software programs can do this with a few commands. So I'm not going to focus on how to calculate the standard error. Instead, I will rely on Excel to give me the answers, and I will use Excel outputs in these slides to solve our examples. And of course, you can watch the Excel demo videos later on to see how I get the outputs you see here. I just want you to see that the logical steps of comparing two populations are pretty much the same as the ones we followed when we had only one population and one sample. The math gets a little bit more complex, but that's it. Okay, now let's apply this to our example. Using the data we have, we use Excel to analyze the two brands. This is the output. The t value is highlighted here, and it is 1.547. Excel gives you the values for both kinds of test.
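The slide equations are not reproduced in this transcript. As a sketch in standard notation, the two test statistics compared above can be written as follows. The symbols (x-bar for a sample mean, s for a sample standard deviation, n for a sample size, mu sub 0 for the hypothesized mean) are assumptions about what the slides show, and the standard error uses the unequal-variance form, which matches the way it is described for the confidence interval later in this lesson.

```latex
% One-sample t statistic (last module)
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

% Two-sample t statistic (this module)
t = \frac{(\bar{x}_A - \bar{x}_B) - D_0}
         {\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}
```

In the battery example the hypothesized difference D sub 0 is 0, so the numerator reduces to the difference between the two sample means.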
That includes one-tail as well as two-tail values. You need to focus on what is appropriate for a given problem. Here, we are doing a two-tail test, so we will only focus on those values in the output. For a two-tail test, the p-value here is 0.123. So these are the numbers, but what do they imply? Let's better understand what we have here. Based on this result, the mean time brand A, the store brand, lasts is 9.99 hours. And for brand B, the national brand, it is 9.92 hours. Wow. Looks like the cheaper brand is the better brand. But wait. This could be only because of natural random variation. So don't pass judgment yet. The t value, which tells us how far the difference we observe is from the hypothesized difference of 0, is 1.547. Then, if the null hypothesis is true, what is the probability of observing a difference like the one in our sample? That is the p-value for the two-tail test, which is 0.123. So now let's go back to our formulation and make a decision. The p-value of 0.123 is greater than the alpha of 0.05, so we will not reject the null hypothesis. This means that our sample data did not produce a result that would lead us to think that the less expensive store brand is of lesser value and quality than the more expensive national brand. This is good news for consumers. They can save money while getting the same quality. Now, if you recall, the mean time of brand A was slightly higher than that of brand B. But based on our complete analysis, we find no meaningful difference between the two brands. The difference you see here is considered noise and is statistically insignificant. Which means don't go and advertise, "In national tests, our brand did better than the more expensive brand." That is not the case, at least not based on this data set. So, now let's practice. The manager of a store would like to know if the average weekly sales, measured in dollars, through her website are any different from her in-store sales.
The data has been collected and the analysis has been performed at the 5% level of significance. Help her understand the results. Start by stating the hypotheses and the significance level. Since the manager is interested in knowing whether these two sales channels are different or not, we will formulate the null and alternate hypotheses as a two-tail test. This means the null states that average sales online are no different from average sales in store, and the alternate states that they are different. And the significance level is 5%. Here's the analysis of her data. It has weekly sales for 52 weeks. What do you think? Will you reject the null hypothesis or not? What does it mean if you say reject or do not reject? So again, this is a two-tail test, and we should focus on the part of the analysis that is for the two-tail test. I am just going directly to the p-value and comparing it to the alpha of 0.05 to make a decision. Here the p-value is 0.0178, and that is less than our 0.05. So we reject the null hypothesis. This means that the average weekly sales are not the same for the two sales channels. In this example, we came to an interesting revelation. There is a difference between average weekly sales online and in store. Now one natural question will be, what is the difference? Most people are not satisfied just to know that there is a difference. They want to know the magnitude of the difference as well. Clearly this matters. Is it a dollar of difference? Or thousands of dollars of difference? So let's explore how we would answer this question. Looking at the output from our analysis, we see that average sales in store are a little bit higher, by $105.72, based on the sample data we have. But we know that if I took another sample, I would get different results. So a better way of estimating the difference is by using a confidence interval. We can develop a 95% confidence interval based on the results we have here.
Once again, I want you to relate what we need to do here, when we have two samples, to what we learned about creating the confidence interval for one sample. When we had one sample, the confidence interval was calculated by taking the sample mean and adding and subtracting the margin of error, which was represented by this equation. For two samples, the idea remains the same. That is, we are going to add and subtract the margin of error to and from the observed difference. And the equation will now look like this. Do you see the resemblance between the two? We are essentially doing the same thing. We'll now go back to our example and calculate the 95% confidence interval for the manager. We can find the values for all the notation in the same output and do the math to get the confidence interval. Please pay attention to the equation on the right as I substitute values for the notation. So first, the mean sales in each sample are substituted to get the mean difference. It doesn't matter which one is written first, by the way, as long as you keep track of the direction of the difference. Then, we need the value for t of alpha over 2, which is also known as the critical value; that is the terminology Excel uses in its output. In our case, we want a 95% confidence interval, so we use the critical t value for the two-tail test, which is 1.993. Now, the values for the standard error. The notation here means the variance of each sample divided by its sample size, added together, and then we take the square root of the sum. In this case, we are taking the variance of In-Store and dividing it by its sample size, and then the variance of Online divided by its sample size. Now we can proceed with performing the operations to get the final result. So if everything is done correctly, we get the mean difference of $105.72 plus or minus the margin of error of $86.93. This means the 95% confidence interval for the difference in average weekly sales, in store compared to online, runs from $18.80 to $192.65.
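The arithmetic just described can be checked with a short Python sketch. This is only an illustration, not the Excel procedure used in the lecture: the function names are hypothetical, and because the sample variances are not quoted in this transcript, the standard error is backed out from the quoted margin of error rather than computed from the data.

```python
import math


def standard_error(var1, n1, var2, n2):
    """Standard error of the difference between two sample means:
    each sample variance divided by its sample size, summed, then square-rooted."""
    return math.sqrt(var1 / n1 + var2 / n2)


def confidence_interval(mean_diff, t_crit, std_error):
    """Return (lower, upper) for mean_diff plus or minus t_crit * std_error."""
    margin = t_crit * std_error
    return mean_diff - margin, mean_diff + margin


# Values quoted in the lecture.
mean_diff = 105.72  # in-store minus online, dollars per week
t_crit = 1.993      # two-tail critical value from the Excel output
margin = 86.93      # margin of error quoted in the lecture

# The sample variances are not quoted, so the standard error is backed out
# from the margin of error instead of being computed with standard_error().
std_error = margin / t_crit

lower, upper = confidence_interval(mean_diff, t_crit, std_error)
print(f"95% CI: ${lower:.2f} to ${upper:.2f}")
```

With these rounded inputs the lower limit comes out at about $18.79 rather than the $18.80 quoted above; the one-cent gap is presumably because Excel works with unrounded values. If the two sample variances and sample sizes were available, `standard_error` could be used directly in place of the backed-out value.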
Any value in this interval is a plausible value for the true difference between the two sales channels. Now a manager may look at this and say, sure, there is a difference between the two. But if the true difference is only about $19, then I would personally consider that not that important. And that is what I want you to pay attention to. Just because we find a significant difference doesn't necessarily mean that we have found something substantive. A decision maker who understands statistics can now apply their own insight and decide for themselves what to do.
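For readers who want to see the test statistic computed from raw data outside Excel, here is a minimal Python sketch using only the standard library. It assumes the unequal-variance form of the standard error described in this lesson. The function name and the battery lifetimes below are hypothetical and are not the lecture's data. The p-value is not computed, because that needs the t distribution, which the standard library does not provide.

```python
import math
import statistics


def two_sample_t(sample_a, sample_b, hypothesized_diff=0.0):
    """Return (t, standard_error) for the difference between two sample means.

    Uses the unequal-variance standard error: each sample variance divided
    by its sample size, summed, then square-rooted.
    """
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    var_term_a = statistics.variance(sample_a) / len(sample_a)
    var_term_b = statistics.variance(sample_b) / len(sample_b)
    std_error = math.sqrt(var_term_a + var_term_b)
    t = ((mean_a - mean_b) - hypothesized_diff) / std_error
    return t, std_error


# Hypothetical battery lifetimes in hours, for illustration only.
brand_a = [10.1, 9.8, 10.0, 10.2, 9.9]
brand_b = [9.9, 10.0, 9.7, 10.1, 9.8]

t_value, se = two_sample_t(brand_a, brand_b)
print(f"t = {t_value:.3f}, standard error = {se:.3f}")
```

The sample sizes do not have to match, which is consistent with the point made earlier that the two samples can have different numbers of observations.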