So, before we actually analyze the data, we need to do some cleaning, and for this I am going to present some methods related to data normalization. In biology, when we repeat experiments and measure exactly the same thing, we rarely get exactly the same result; there is scatter among the repeated results. So the data that represent the measurement of a protein, mRNA, or gene have a center, or average, and some scatter around it. In some cases the source of this scatter is biological, because of the inherent noise within the biological system, and in some cases it is the experimental procedure or an instrumental error. If it is the latter, we want to normalize the data so that the experimental bias is removed as much as possible. So now we are going to go over several methods for normalizing data, trying to remove this shift in the mean of the data.

First, how do we quantify the center and the spread of the data? These can simply be computed using the mean and the standard deviation. The mean is simply the sum of all the repeated values that were measured, divided by the number of measurements. To compute the standard deviation, we subtract the mean from each value, square those differences to make them positive, divide the sum of the squared differences by the number of measurements, and then take the square root of the result. This gives us a quantification of the spread in the data.

The first method of normalization is called Z-score normalization. What Z-score normalization does to the data is make the mean become 0, by centering the data on 0, and make the standard deviation the same everywhere: after Z-score normalization the standard deviation becomes 1. To do that, we first compute the mean and the standard deviation of each row. Then we subtract the mean from each element in the row and divide by the standard deviation. The final product is a new data set with a centered mean and an even standard deviation.

Not all data measured in biology follow the normal distribution, which I am going to mention later. Quantile normalization is less sensitive to the type of distribution the data have, and it is also a common method used to normalize data from gene expression microarrays. With quantile normalization we normalize the columns. First, we sort each column by the values, keeping track of which row each value came from. After we sort the values, we compute the average of each row of the sorted matrix, and then we replace the values in each row with that row's average. Finally, we put the values back into the original rows they came from. This makes the new, normalized data matrix have the same sum across all columns.

A third normalization method is called median polish normalization, and it is also known as the last step of RMA normalization applied to gene expression microarrays. The median polish normalizes both the rows and the columns at the same time, and it follows several steps. In the first step, we identify the median of each row; in the example on the slide, the median of the first row is 4. We then subtract that median from all the elements of its row, producing a new data matrix that now holds the differences between the median and the other values. We take that resultant difference matrix, identify the medians of its columns, and subtract those column medians from each value in the corresponding column, resulting in another data matrix. Then we repeat this process, again looking for the medians of each row, and continue until we converge to medians of 0 across the rows and the columns. Once the algorithm has converged, we subtract that newly generated residual matrix from the original data to normalize the original data set, and what we are left with is a normalized matrix. The averages of each row are now considered robust means of the RMA-normalized data matrix.
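To make these methods concrete, here are minimal Python/NumPy sketches of the three normalizations, starting with row-wise Z-score normalization; the matrices and values in all three are hypothetical, not the data from the slides.

```python
# A minimal sketch of row-wise Z-score normalization with NumPy.
# The matrix is hypothetical: rows stand for genes/proteins,
# columns for repeated measurements.
import numpy as np

data = np.array([[4.0, 6.0, 8.0],
                 [1.0, 2.0, 3.0],
                 [10.0, 20.0, 30.0]])

# Mean of each row: the sum of the measured values divided by
# the number of measurements.
row_means = data.mean(axis=1, keepdims=True)

# Standard deviation of each row: subtract the mean, square the
# differences, average them, and take the square root.
row_stds = data.std(axis=1, keepdims=True)

# Z-score: center each row on 0 and scale it to standard deviation 1.
z = (data - row_means) / row_stds

print(z.mean(axis=1))  # ~0 for every row
print(z.std(axis=1))   # 1 for every row
```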
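The quantile normalization recipe can be sketched the same way. This naive version ignores ties between equal values; real implementations (for example limma's normalizeQuantiles) handle ties more carefully.

```python
# A minimal sketch of quantile normalization with NumPy,
# using hypothetical values; columns are samples.
import numpy as np

data = np.array([[5.0, 4.0, 3.0],
                 [2.0, 1.0, 4.0],
                 [3.0, 4.0, 6.0],
                 [4.0, 2.0, 8.0]])

# 1. Sort each column, remembering which row each value came from.
order = np.argsort(data, axis=0)
sorted_data = np.take_along_axis(data, order, axis=0)

# 2. Average each row of the sorted matrix.
rank_means = sorted_data.mean(axis=1)

# 3. Replace every value with the average of its rank, putting the
#    values back into the rows they originally came from.
normalized = np.empty_like(data)
for j in range(data.shape[1]):
    normalized[order[:, j], j] = rank_means

print(normalized.sum(axis=0))  # the same sum in every column
```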
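And here is a minimal sketch of the median polish loop just described, again with a hypothetical input matrix; real RMA implementations (for example in the affy Bioconductor package) differ in bookkeeping details.

```python
# A minimal sketch of median polish: alternately subtract row and
# column medians until both are (close to) zero.
import numpy as np

def median_polish(data, max_iter=20, tol=1e-6):
    residuals = data.astype(float).copy()
    for _ in range(max_iter):
        # Subtract the median of each row from that row...
        residuals -= np.median(residuals, axis=1, keepdims=True)
        # ...then the median of each column from that column.
        residuals -= np.median(residuals, axis=0, keepdims=True)
        # Stop once every row and column median is (close to) 0.
        if (np.abs(np.median(residuals, axis=1)).max() < tol and
                np.abs(np.median(residuals, axis=0)).max() < tol):
            break
    return residuals

data = np.array([[4.0, 7.0, 4.0],
                 [2.0, 4.0, 9.0],
                 [6.0, 1.0, 3.0]])

residuals = median_polish(data)
# Subtracting the converged residual matrix from the original data
# leaves the normalized matrix; its row averages are the robust
# means mentioned above.
fitted = data - residuals
print(fitted.mean(axis=1))
```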
Another very common data cleaning strategy is log transforming the data. Many times, when we measure a lot of variables, we are faced with extreme values. Here, in this particular example, the data set comes from phosphoproteomics experiments, and we see that we have extreme values: when we plot the raw values, we can see a sharp peak in the center, while single values are very high or very low. Typically, what we do to avoid that dominance of extreme values is log transform the data. This is this particular phosphoproteomics data set before and after the log transform, which makes it look more normal (a short sketch of this step appears below).

What we looked at just now is a histogram of all the values of a specific experiment. In the first few slides, we talked about the normal distribution. This is the most famous distribution, the bell-shaped curve. It has a defined mean, and the standard deviation from the mean specifies the spread of the data, as well as the probability of observing extreme values. In biology, we face many other types of distributions. This is an example of a bimodal distribution that we came across when we analyzed single-cell gene expression data measured in mouse embryonic stem cells. Here, for many of the genes that are expressed in single stem cells, we identified bimodality: a bimodal distribution of expression values. This means that many of these genes act as switches that turn on and off.

Another very common distribution, observed typically in networks, is the power-law distribution. In this particular example, we are looking at the connectivity distribution of the human kinome as it was collected from the literature. If you plot the raw values of the connectivity distribution of this network, you can see that there are many nodes with low connectivity, while there is a significant number of hubs. When you transform the data on both axes into a log scale, the data fit a straight line (also sketched below). This distribution will be discussed later on, in the seventh module of the course. It is also called the scale-free distribution, because it does not have a characteristic average and standard deviation.

So this is it for data normalization. There is much more to explore about data normalization: there is a course offered on Coursera by Jeff Leek and colleagues from Johns Hopkins, and I recommend it for going into more detail about cleaning data. Next, we will talk about finding differentially expressed genes.
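Here is the log transform step sketched minimally; the lognormal toy data are a hypothetical stand-in for the phosphoproteomics values shown on the slides.

```python
# A minimal sketch of log transforming skewed, positive raw
# intensities (hypothetical toy data, not the slide data).
import numpy as np

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)  # heavy right tail

# Log2 is a common choice for intensity data; a pseudocount of 1
# guards against log(0) when zeros can occur.
logged = np.log2(raw + 1.0)

# Extreme values dominate the raw scale but not the log scale.
print(raw.max() / np.median(raw))
print(logged.max() / np.median(logged))
```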
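And a minimal sketch of inspecting a connectivity distribution on log-log axes; the synthetic degrees are hypothetical, not the actual human kinome network.

```python
# Plot a degree distribution on raw and log-log axes to see the
# straight-line signature of a power law.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
degrees = np.round(rng.pareto(a=2.0, size=5_000) + 1).astype(int)

values, counts = np.unique(degrees, return_counts=True)

fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(8, 3))
ax_raw.plot(values, counts, "o")    # raw: many low-degree nodes, few hubs
ax_log.loglog(values, counts, "o")  # log-log: roughly a straight line
plt.show()
```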