When you have a large dataset, just looking at the data is not going to give you much insight. We can summarize the data so that we can quickly communicate some of its major characteristics. There are two major ways to summarize data. One is graphical methods, some of which we discussed in the previous module. The other is numerical methods; the average value is one example. In this lecture, we will focus on numerical summarization. The goal is not only to generate these summaries but also to learn how to use them effectively.

One very useful characteristic is the central tendency of the data. Measures of central tendency tell us where the center of the data tends to be. Sometimes we think of the central tendency as a typical value. However, as you will see, not all measures of central tendency are necessarily typical values. We will focus on two measures, the mean and the median.

The mean is simply the average. It is also known as the expected value. The population mean is represented by the Greek letter mu, while the sample mean is represented by the symbol x bar. The sample mean is a statistic that we use to estimate mu, which is a population parameter.

One thing that I wish to bring to your attention is the notation used in statistics. When we want to show that we are using the population to calculate these parameters, we use Greek letters for the parameters and a capital N for the population size. When you sample data to generate statistics, the notation is lowercase and in English letters.

While in this class I don't expect you to calculate anything manually, it is important to know how these values are calculated when you're using any software. For that purpose, I will share the equations with you so that you have a basic understanding of how the numbers are generated. The first equation you see is how we calculate the mean for a population, represented by the letter mu.
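In standard notation, the population mean and the sample mean can be written as follows. The index i, the summation limits, and the use of lowercase x for the individual population values are the usual textbook conventions and are assumed here; the transcript itself only names mu, x bar, capital N, and lowercase n.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```

Both formulas do the same thing: add up every value, then divide by the count. Only the symbols differ, depending on whether the values come from the whole population or from a sample.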
The mean is calculated by simply adding the value of the variable for each element of the population and then dividing by the size of the population, which is the capital N. If you had data from a sample study, the principle of the calculation stays the same. The notation changes to reflect the fact that the data was collected from a sample. The second equation is for datasets collected from a sample. In this case, the variable for each element of the sample is represented by lowercase x, and lowercase n represents the sample size. The sample mean is calculated by adding all of the x's and then dividing that sum by the sample size, lowercase n. The sample mean is then used as a point estimator for the population mean.

Suppose we want to measure the average amount of money spent by customers who come to our website. Data was collected from seven customers. What is the mean spending? Here, the variable x represents dollars spent by a customer. For example, x1 represents the money spent by customer one, x2 represents the money spent by customer two, and so on. The mean, based on this sample, is just the sum of the money spent by all seven customers divided by the number of customers in our data, seven in this case. The sample statistic shows average spending of $58.77 per customer.

The median is the middle value when the data is sorted in either ascending or descending order. It is sometimes represented by a capital M with a d subscript. There is no difference between population and sample symbols in this case. If the number of measurements is odd, the median is the middlemost measurement in the ordering. If the number of measurements is even, the median is the average of the two middlemost measurements in the ordering. To find the median, we first sort the data in ascending order. The data point in the middle, in this case $56.98, is the median. So about 50% of the customers spent more than $56.98, and about 50% spent less.
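The two calculations just described can be sketched in a few lines of Python. This is only an illustration: the function names and the small datasets are made up for the example and are not the seven customers' spending from the lecture, which the transcript does not list.

```python
import statistics


def mean(values):
    """Add up all observations and divide by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)          # step 1: sort in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: one middlemost value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values


# Hypothetical data, chosen only to exercise the odd and even cases.
odd_sample = [12, 7, 3, 9, 5]
even_sample = [12, 7, 3, 9]

print(mean(odd_sample))     # 7.2
print(median(odd_sample))   # 7
print(median(even_sample))  # 8.0

# The standard library gives the same answers.
print(statistics.median(odd_sample), statistics.median(even_sample))
```

In practice you would likely call `statistics.mean` and `statistics.median` directly; the hand-written versions are here only to show what the software is doing.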
Now that we know how to calculate the mean and median, you may be asking yourself, so what? Calculating these values is not just an intellectual exercise. There are times when one of these measures will be more useful than the other. Remember, we often think of this value as the typical value.

Let me give you an example. You're sitting next to nine of your friends, so there are ten of you in total. You all graduated from the same college with the same degree. It is three years later. You all have slightly different experiences, but you're making more or less the same salary, shown here in this table. The average salary here is $65,000. The median income, since we have an even number of data points, is the average of the two values in the middle. Those are 64,000 and 65,000, so your median is 64,500.

As you are all sitting there and talking, an old classmate walks in. This classmate also graduated with you and had the same degree. Upon graduation, Harbor was drafted by a professional basketball team, and now his salary is $8 million. He joins your group, and now there are 11 of you. What happens to your group's mean salary? Now the mean salary for your group is $786,000 and some. How typical is that value for the group? What happens to the median value for the group? It barely moves, from 64,500 to 65,000. Which value better represents the group's expected income? In this case, the answer is the median. Why? As you can see, the mean is sensitive to the newcomer's salary, but the median is not. So if your dataset has one or two extreme values, which we call outliers in statistics, the mean is less representative and the median is more robust.

Have you ever looked at home prices for a given market? If not, just pick a city or a neighborhood and see what you get. What gets reported is the median price of the homes. That way, a customer will know that if the median is at the top of their price range, about 50% of the homes are still priced below that value.
But if you rely on the mean, the value might have been pulled toward the low end because a few homes are dilapidated and very cheap, or it may look too expensive because of a few mansions that are very expensive.

So now let's practice. We have looked at five possible sites for our new business. The annual rents are as follows. Location A has an annual rent of 84,000, B an annual rent of 78,000, C an annual rent of 114,000, D an annual rent of 103,200, and E an annual rent of 93,600. What are the mean and the median for this dataset? The mean is the average of the five numbers, which is 94,560. The median, after ordering the numbers in ascending order, is at location E, with an annual rent of 93,600.

This graphical distribution shows the number of stocks that increased in value at the end of the trading day. This graph is fairly symmetrical around its peak, which means it has roughly the same mean and median. When we have outliers, values that go far to the right or left side, the shape of the curve will start getting skewed.

Let me show you what happens to the mean and the median as we start having skewness in our data. Here I have a graph that I generated from a dataset. At first, what you see is a fairly symmetrical histogram. In this case, the red bar represents the mean and the green bar represents the median. All the other observations fall on either side of the mean and the median; those are the blue bars. As you can see, when the data is fairly symmetrical, the mean and the median are on top of one another, which means they're about the same value. Let me show you how this changes as I use the spinner button to change the skewness of my data. As I go down, you will see that the data is pulling more and more to the left, with a long right tail, as we call it. The right tail is this skewness. So what happened to the mean?
As the right tail starts elongating, as the data becomes skewed to the right, the mean starts to pull toward the right. This is similar to what happened when your friend the basketball player joined your group. Your mean went up, but the median didn't move as much. The opposite happens if I spin it the other way and start having a left tail. You would see that, again, the mean and median start separating: the mean moves toward the tail and the median stays a little farther back. So the median is less likely to change as quickly as the mean does.
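To see the same effect in numbers, here is a short Python sketch using the standard `statistics` module. The ten individual salaries are hypothetical: the transcript does not reproduce the salary table, so the values below are simply chosen to match the stated mean of 65,000 and median of 64,500. The rents are the five figures from the practice example.

```python
import statistics

# Hypothetical salaries (assumption): picked so that mean = 65,000 and median = 64,500,
# matching the figures quoted in the lecture. The real table is not in the transcript.
friends = [60_000, 61_000, 62_000, 63_000, 64_000,
           65_000, 66_000, 68_000, 70_000, 71_000]

print(statistics.mean(friends))    # 65000
print(statistics.median(friends))  # 64500.0

# The basketball player joins the group: one extreme value (an outlier).
with_player = friends + [8_000_000]

print(round(statistics.mean(with_player), 2))  # 786363.64 -> mean jumps by more than 700,000
print(statistics.median(with_player))          # 65000     -> median moves by only 500

# Practice example: annual rents for the five candidate locations.
rents = {"A": 84_000, "B": 78_000, "C": 114_000, "D": 103_200, "E": 93_600}
rent_values = list(rents.values())

print(statistics.mean(rent_values))    # 94560
print(statistics.median(rent_values))  # 93600, which is location E
```

With no extreme values, as in the rent data, the mean and median sit close together. One outlier in the salary data is enough to pull the mean far into the right tail while the median stays near the middle of the group.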