Today I'm going to start a new topic, Knowledge Area 2, Probability and Statistics. Here is the outline for Probability and Statistics from the Reference Handbook. It covers four to six questions, with topics such as central tendencies and dispersion, estimation of single mean properties, regression and curve fitting, and expected values. Here is my somewhat simplified outline that I'll be following. First, we'll look at basic parameters, then permutations, combinations, probability, confidence levels, and hypothesis testing, and finally linear regression.
So, here is the outline again. In this segment, we'll start looking at some basic parameters. We will first look at some fundamental definitions of statistical properties, and then at particular ones: mean, standard deviation, and dispersion.
So first we'll learn some basic principles and definitions. Population refers to the entire collection of objects, in other words, all of them, whereas a sample is a subset of that population. A random variable is the variable of interest being measured, for example, the lifetime of light bulbs. The random variable can be either continuous or discrete: it can take on continuous values, or only discrete values. A parameter is a numerical attribute of the entire population. For example, the average or mean value of the population would be a parameter. A statistic, on the other hand, is a numerical attribute of the sample or the subsample. For example, the average value of some sample property is a statistic of that sample.
There are various measures of central tendency, which are illustrated in this extract from the Reference Handbook here. The most common, of course, is the sample mean or average, which we usually denote by x bar. It is defined as the sum of the values divided by the number of samples, where xi is the value of the individual sample data and n is the number of samples.
The population mean, which we'll usually denote by mu, is just the average, or the mean, of the entire population, where capital N is the number of items in the entire population. We also have a weighted arithmetic mean, defined as here, where the w are the weights of the individual sample values.
Dispersion is a measure of the scatter, or the spread, or the variability of the data. In this example, the data are measurements of velocity in a turbulent velocity field, where the velocity is in centimeters per second and the time is in seconds. Obviously, they have some seemingly random scatter to them. From this, we can easily compute the mean value, x bar, which turns out to be 5.82 centimeters per second, and which I've indicated by the red line here. As for a measure of the scatter, or the dispersion, the most common one is the standard deviation of the sample. If this is a sample of the larger record, it is defined as here: s is the square root of one over n minus one, times the summation of the value minus the mean value, squared. In this case, if we compute that value, it turns out to be 0.53 centimeters per second, which I've indicated by plus and minus levels around the mean value here. So, this is a measure of how scattered the data are from the mean value.
A related parameter is the sample variance, which is just the standard deviation squared. Another useful property is the coefficient of variation, written as either COV or CV in the Reference Handbook. That is defined as the standard deviation divided by the mean value. If we have a population, the standard deviation of the entire population, which we usually denote by sigma, has a slightly different equation: the square root of one over N, where N is the total number of items in the population, rather than one over n minus one, times the summation of the value minus the mean value, mu, of the population, squared. Similarly, the variance of the population is sigma squared. Among some other parameters which are given is the sample geometric mean, defined as here.
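The slide equations referred to above are not reproduced in this transcript. As a sketch only, assuming the usual notation (x_i for the individual values, w_i for the weights, n for the sample size, N for the population size), the definitions described so far can be written as:

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i
  && \text{sample mean} \\
\mu &= \frac{1}{N}\sum_{i=1}^{N} x_i
  && \text{population mean} \\
\bar{x}_w &= \frac{\sum_i w_i x_i}{\sum_i w_i}
  && \text{weighted arithmetic mean} \\
s &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}
  && \text{sample standard deviation} \\
\sigma &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i-\mu\right)^2}
  && \text{population standard deviation} \\
CV &= \frac{s}{\bar{x}}
  && \text{coefficient of variation} \\
\bar{x}_g &= \sqrt[n]{x_1 x_2 \cdots x_n}
  && \text{sample geometric mean}
\end{align*}
```

The sample variance is s squared and the population variance is sigma squared. The subscripted symbols for the weighted and geometric means are labels chosen here for illustration.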
A parameter that is more often useful than the geometric mean is the root mean square value, which is defined as the square root of the summation of the squares of the values, divided by n. Another one is the median value. The median value is the middle value when the data are ordered from smallest to largest. In other words, 50% of the values are smaller than the median value and 50% are larger. In particular, if the number of samples n is an odd number, it's equal to the value of the middle sample, or the (n plus one) over two-th value. If n is even, it is the average of the two middle values.
We also talk about percentiles. A percentile is the percentage of values above, or alternately below, some particular value. For example, taking the 20th percentile in the "above" sense, 20% of the values would be higher than the 20th percentile value; the more common convention counts the 20% of the values that fall below it. The median then is just equal to the 50th percentile value: 50% of the samples are below, and 50% are above. We also sometimes talk about quartiles. Quartiles correspond to the 25th, 50th, or 75th percentile values, either above or below some specified value.
So let's do an example on that. We have eight sample measurements of the waiting time for a bus, in minutes: 4, 2, 5, 6, 7, 10, 9, and 4. The first question is, the mean value is most nearly which of these four? From the basic definition, assuming this is a sample, x bar is one over n times the summation of xi. In this case we just add them all up, which gives 47, and divide by the number of samples, which is eight. The answer is 5.88 minutes. So the answer is B.
Next, we want to compute the median value. The median is most nearly which of these four? To do that, we rewrite the sequence in order from smallest to largest. In sequence, the numbers are 2, 4, 4, 5, 6, 7, 9, and 10. In this case we have an even number of samples, so the median value is the average of the middle two values here, which are five and six. The average of those two terms, the fourth and the fifth terms in that sequence, is 5 plus 6, over 2, which is equal to 5.5 minutes. So the answer is C.
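As a quick check of the two answers above, here is a minimal Python sketch using only the standard library. The variable names are chosen here for illustration and are not part of the lecture.

```python
from statistics import mean, median

# Eight sample waiting times for the bus, in minutes (from the example).
waiting_times = [4, 2, 5, 6, 7, 10, 9, 4]

# Sample mean: sum of the values divided by the number of samples (47 / 8).
x_bar = mean(waiting_times)

# Median: with an even number of samples, the average of the two middle
# values after sorting (here the 4th and 5th values, 5 and 6).
med = median(waiting_times)

print(f"mean   = {x_bar:.3f} min")
print(f"median = {med} min")
```

The mean comes out as 5.875 minutes, which rounds to the 5.88 quoted above, and the median is 5.5 minutes.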
The next question uses the same sequence. The eight sample measurements of the waiting time are the same numbers, but now we're going to calculate the standard deviation of those numbers. Which of these alternatives is it? Here's the standard definition of the standard deviation, assuming this is a sample of a much larger population: s is equal to the square root of one over n minus one, times the summation of xi minus x bar, squared. From the previous example, we've already computed the mean value: x bar was 5.88 minutes. Writing this out in all its detail, we have one over n minus one, which is one over seven, times the first value, four, minus the mean value, squared, and so on. Those are all the terms involved. Computing that out and taking the square root, the answer is 2.70 minutes. So the answer is B.
Now, suppose this had been a population. In other words, the bus only ran eight times in total and it's never going to run again, ever, so this is the entire population. Then the equation would be slightly different. We'd use the symbol sigma for the standard deviation, and the equation would be the square root of one over N, where N is the total number, times the summation of xi minus mu, squared, where mu is the average of the entire population. But the average, of course, is just the same: mu is equal to x bar here. So, in this case, the equation becomes one over eight, multiplied by the same parenthesis, all under the square root. The answer is then a little bit different: it's equal to 2.52 minutes. But, of course, if n becomes a very big number, then the standard deviation of the sample becomes essentially equal to the standard deviation of the entire population.
In the next question, we want to compute the root mean square value of the data. The root mean square value is which of these four alternatives? Here's our definition of the RMS value: the square root of the summation of the squares of the values, divided by n. This has the same form as the population standard deviation equation. The only difference is that we don't subtract the mean value from the numbers when we sum up the squares.
So it's equal to that, and the answer is 6.39 minutes. So the answer is D. Now, this was not asked for, but it's also of interest to calculate the coefficient of variation, which is the standard deviation divided by the mean value. In this case the standard deviation is 2.70 minutes and the mean value is 5.88 minutes, so the coefficient of variation is 0.46, or 46%. This concludes our preliminary discussion of basic parameters.
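The dispersion results for the bus example can be reproduced with a short Python sketch. This is only an illustration using the standard library; the variable names are assumptions made here, and `stdev` and `pstdev` are the library's n minus one and n forms of the standard deviation, respectively.

```python
import math
from statistics import mean, pstdev, stdev

# Eight sample waiting times for the bus, in minutes (from the example).
waiting_times = [4, 2, 5, 6, 7, 10, 9, 4]
n = len(waiting_times)

x_bar = mean(waiting_times)

# Sample standard deviation: divides the sum of squared deviations by n - 1.
s = stdev(waiting_times)

# Population standard deviation: divides by n instead, treating the eight
# values as the entire population.
sigma = pstdev(waiting_times)

# Root mean square: like the population form, but the mean is not
# subtracted before squaring.
rms = math.sqrt(sum(x ** 2 for x in waiting_times) / n)

# Coefficient of variation: sample standard deviation over the mean.
cv = s / x_bar

print(f"s     = {s:.2f} min")
print(f"sigma = {sigma:.2f} min")
print(f"RMS   = {rms:.2f} min")
print(f"CV    = {cv:.2f}")
```

Rounded to two decimals, these agree with the values worked out above: 2.70, 2.52, 6.39, and 0.46.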