Congratulations, you are now the proud owner of a dataset of patient information. What do you do now? One of the first things is to understand the nature of each variable in your new dataset. You need to get to know them as individuals, with their strengths and their quirks. So let's look at how to get to know a dataset.
The documentation that comes with it should at least list the variables and tell you what the categories mean. However, as is common with a lot of healthcare information, particularly that from government websites, the documentation may not tell you anything else. So, open up your dataset and just look at a few rows. Get a sense of which variables are continuous and which are discrete, which consist of text and which of numbers.
Depending on how many rows your dataset has, the next thing to do is to tabulate each variable separately. For categorical variables, this is fine however many records you have. But if you have thousands or millions of records and a continuous variable, tabulating it isn't a good idea, as you'll just fill up your output window and waste time. In the tabulations, look at what values are common and what values are rare. Are there any missing values? Or any values that are set apart from the others? For instance, values such as 9 and 99 are often used to indicate unknown or other, so gender might be coded as 1 for male, 2 for female, and 9 for other or unknown.
For continuous variables, or for integer variables with lots of values, where tabulation isn't useful, you need summary measures like the mean, the median, and the range, that is to say, the minimum and the maximum. It's also particularly useful to plot continuous or integer variables in a histogram. This is a frequency tabulation displayed as a graph that shows how common each value is, like this. The plot gives you an immediate, though approximate, sense of whether the variable is roughly normally distributed, skewed, or has some other shape entirely.
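To make those first steps concrete, here is a minimal sketch in Python using only the standard library. The records, the variable names `gender` and `age`, and the 10-year bin width are all invented for illustration; they are assumptions, not part of any real dataset.

```python
from collections import Counter
import statistics

# Hypothetical toy records (invented for illustration only).
# gender: 1 = male, 2 = female, 9 = other or unknown; age may be missing.
records = [
    {"gender": 1, "age": 34},
    {"gender": 2, "age": 51},
    {"gender": 2, "age": 47},
    {"gender": 9, "age": 62},
    {"gender": 1, "age": 29},
    {"gender": 2, "age": None},
]

# Tabulate the categorical variable: how often does each code occur?
gender_counts = Counter(r["gender"] for r in records)
print("gender counts:", dict(gender_counts))

# For the continuous variable, count missing values and summarise the rest.
ages = [r["age"] for r in records if r["age"] is not None]
n_missing = len(records) - len(ages)

summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "min": min(ages),
    "max": max(ages),
}
print("missing ages:", n_missing)
print("age summary:", summary)

# A crude text histogram: count ages in 10-year bins and draw one '#' per record.
bins = Counter((age // 10) * 10 for age in ages)
for start in sorted(bins):
    print(f"{start}-{start + 9}: {'#' * bins[start]}")
```

With a real dataset you would more likely use a statistics package and a proper plotting tool, but the questions are the same: which codes occur, how many values are missing, and what shape the distribution has.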
Having plotted it, how should you summarise the values? If it looks about normal, then you can use the mean and standard deviation. But if it looks a bit or even very skewed, you should get and report the median, the lower quartile, and the upper quartile. If you're not familiar with those terms, let me demonstrate.
Take the numbers 1, 2, 4, 5 and 48. The average of these five numbers is 12. Now remember what the main job of the mean is: it's to summarise the heart, the main chunk of a distribution, in one single number. But in this case four of the five numbers are quite a bit less than 12, and one number is a lot more than 12. So this mean of 12 is just not up to the job.
The median, however, is the middle value. More precisely, if the variable is properly continuous, and values can have decimal places, it's the value at the mid-point of the distribution, such that half of the values lie above it and half lie below it. What would you say is the median of 1, 2, 4, 5 and 48? It's 4. None of these five numbers has decimal places, so you're looking for the middle value. With five numbers, two of them, 5 and 48, lie above it, and two, 1 and 2, lie below it. Job done. If you have an even number of observations, you'll need to take the average of the two middle ones. For instance, if your numbers are 1, 2, 4, 5, 11 and 48, then the two middle numbers are 4 and 5, and their average gives you 4.5 for the median of these six numbers.
Now, another term for the median is the 50th percentile. Percentiles, also called centiles, break the distribution into 100 equal chunks. You may have seen these on children's growth charts. A percentile is a number where a certain percentage of values for that variable fall below that number. So, being the middle value, the median has 50% of values above it and 50% below it, which makes it the 50th percentile. The 25th percentile is the number below which a quarter of the distribution lies, which is why it's more commonly called the lower quartile.
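The mean and median examples above can be checked with a few lines of Python. This is just a sketch using the standard `statistics` module; the variable names are my own.

```python
import statistics

five = [1, 2, 4, 5, 48]
six = [1, 2, 4, 5, 11, 48]

# The mean is dragged upwards by the single large value, 48.
mean_five = statistics.mean(five)

# The median is the middle value of the sorted numbers.
median_five = statistics.median(five)

# With an even number of values, the median is the average of the two middle ones.
median_six = statistics.median(six)

# How many values lie below and above the median of the five numbers?
n_below = sum(v < median_five for v in five)
n_above = sum(v > median_five for v in five)

print("mean of five:", mean_five)
print("median of five:", median_five)
print("median of six:", median_six)
print("below / above the median:", n_below, "/", n_above)
```

Four of the five values sit below the mean of 12, whereas the median of 4 splits them evenly, with two below and two above.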
Similarly, the 75th percentile is otherwise known as the upper quartile, as 75% of values lie below it. Another term you'll see is the interquartile range. This is just the difference between the upper and the lower quartiles. It is usually reported by giving the median and then the two quartiles, like this example for how many days patients stay in hospital.
The definitions of all these percentiles apply no matter how skewed or wacky the distribution is. That's why, for non-normal continuous variables, we generally want to give the median and the lower and upper quartiles in order to summarise the variable's distribution.
So with any dataset, you need to do these kinds of basic calculations before you do anything more sophisticated. You have to pick the right statistical test for each type of variable. Get it wrong, and all your results could be invalid and, worse than that, so could any public health policy that's based on them. So don't be tempted to skip these steps.
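As a rough illustration of the quartile summary described above, here is a sketch in Python. The lengths of stay are invented numbers, not real patient data, and `statistics.quantiles` is used with its default method; other software may use a slightly different quartile convention and so give slightly different values.

```python
import statistics

# Hypothetical lengths of stay in days (invented for illustration); note the long right tail.
length_of_stay = [1, 1, 2, 2, 3, 3, 4, 5, 7, 10, 21]

# n=4 splits the distribution into quarters and returns the three cut points:
# lower quartile, median, upper quartile.
lower_q, median, upper_q = statistics.quantiles(length_of_stay, n=4)

# The interquartile range is the difference between the upper and lower quartiles.
iqr = upper_q - lower_q

mean = statistics.mean(length_of_stay)

print(f"median {median:g} days (lower quartile {lower_q:g}, upper quartile {upper_q:g})")
print(f"interquartile range: {iqr:g} days")
print(f"mean for comparison: {mean:.1f} days")
```

In this made-up example the mean is pulled above the median by the one 21-day stay, which is the kind of skew that makes the median and quartiles the safer summary.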