So, I started this course with a classic example from history on using statistics in public health. In 19th century England, cholera killed thousands, and errors in data analysis and interpretation by the Registrar General led to further outbreaks and deaths. To do better, we need statistical thinking to overcome the key research challenges. Doing the first thing that comes into your head won't work in science, even though you might get away with it in real life - you need statistical building blocks. The three statistical building blocks I'm going to introduce are: types of variables, distributions, and sampling. The first building block or concept is types of variables. One of the tasks of learning about any science is to become familiar with its particular language. So, what do I mean by variable? A useful definition is actually harder to come up with than you might think. One way is to say that it's just the opposite of a constant, which is a mathematical entity that only ever has one value, like the speed of light in a vacuum, c in Einstein's famous equation E=mc squared. We're going to use the word variable to describe characteristics of people and their environment. So, these can change during a person's lifetime, they can also differ from person to person, and some variables, like vegetable consumption, can both change over time and differ between people. We want to know how such variables relate to some patient outcome - by outcome, I mean what happened to the patient. So, in the example we're focusing on in this course, the number and types of fruit and vegetables that someone eats are examples of variables, and getting cancer is the outcome of interest. The second building block is frequency distributions. As you'll see, there are different types of distributions, but also different ways of describing each one.
You may already have heard of the normal distribution. It's also known as the Gaussian distribution, after the hugely influential 19th century German mathematician, Carl Friedrich Gauss. We'll look at that one in more detail later in the course. Distributions describe the range of values the variable takes, and how common each of those values is. So, a simple one is the number of kidneys a person is born with. This can be zero, one, two, three, or even four, but the vast majority of people have two kidneys, with very small proportions having three, and at most one in a million having four. So, knowing the distribution that a given variable has, or can be assumed to have, is important for two main reasons. It helps you decide how common or unusual a given value is, and it helps you make predictions. So for instance, many blood test results are interpreted by comparing a patient's results with the distribution of results across a huge number of people. Rather than having to know all those individual values and frequencies that make up that distribution, it's much easier if we know the formula that describes that distribution. For the kidneys example, say, there is no such formula, but we know from countless observations that most people have two. However, common distributions like the normal have nice formulae, and can be described simply by their mean and standard deviation. The third building block is sampling. So, everything we know in medicine, we've learned from a sample of people. We then extrapolate the results of that sample to the whole population. But this is only valid if that sample is representative of the population in key ways, like mix of age and gender. By extrapolate, I mean that we assume that the results for that small sample are also true for the whole population. So, for instance, suppose we're interested in how popular broccoli is.
We ask 100 people whether they have eaten broccoli, say at least once in the past month, and 20 of them say yes, let's suppose. We would then assume that 20 percent of the whole population have eaten broccoli in the past month. Now, going out and recruiting a representative sample of people isn't always easy, and assumptions, which are critical in statistics, are not always valid in practice. So to sum up, we cannot do good public health research ourselves, or correctly interpret other people's work, without understanding the fundamentals of types of variables, distributions, and sampling. We'll look at each of them in more detail throughout this course.
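If it helps to see the broccoli and blood test ideas as arithmetic, here is a minimal Python sketch. It's only an illustration under stated assumptions: the population size of one million, the blood test mean of 100 and standard deviation of 15, and all the function names are made up for this example and don't come from any real survey or laboratory. The only library call is `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist


def sample_proportion(yes_count: int, sample_size: int) -> float:
    """Share of the sample that answered yes."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return yes_count / sample_size


def extrapolate(proportion: float, population_size: int) -> int:
    """Assume the sample proportion also holds for the whole population.

    This is only reasonable if the sample is representative.
    """
    return round(proportion * population_size)


def fraction_below(value: float, mean: float, sd: float) -> float:
    """Fraction of a normal distribution lying below `value`.

    The distribution is described entirely by its mean and standard deviation.
    """
    return NormalDist(mu=mean, sigma=sd).cdf(value)


# Broccoli example from the lecture: 20 of 100 people said yes.
p = sample_proportion(20, 100)

# Hypothetical population size, chosen only for illustration.
HYPOTHETICAL_POPULATION = 1_000_000
estimated_eaters = extrapolate(p, HYPOTHETICAL_POPULATION)

# Hypothetical blood test: assumed mean 100, standard deviation 15.
# A result of 130 sits two standard deviations above the mean.
share_below_130 = fraction_below(130, mean=100, sd=15)

print(f"Sample proportion: {p:.0%}")
print(f"Estimated broccoli eaters: {estimated_eaters}")
print(f"Share of results below 130: {share_below_130:.3f}")
```

The extrapolation step is just a multiplication, so everything rests on the sample being representative. The code can't check that for you.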