Hi. My name is Qiaonan. I'm a PhD student in Dr. Avi Ma'ayan's lab at the Icahn School of Medicine at Mount Sinai. In this lecture, I will show you how to make PCA plots in MATLAB. PCA is the abbreviation for Principal Component Analysis. By now you have already learned what PCA is from Neil's lecture and have some idea of its applications, but I still want to reiterate its merits. So why do people like to make PCA plots in data analysis? First of all, it is a powerful tool for visualizing high-dimensional data. It also shows quantified differences among observations, it is used to assess data quality, and it helps discover relationships between data points. To get a concrete idea of these advantages, let's look at some examples first.
Example 1 is a PCA plot of gene expression data from patient tumor cells of different subtypes. Each dot is the gene expression status of a tumor cell from a patient, colored by its subtype. The three axes are the first three principal components, and the numbers in parentheses give the percentage of variance captured by each component. You can see that the first component, PC1, captures the most variance, 54%. The second and third capture only small amounts, 8% and 5%. In this figure, dots of the same subtype tend to cluster together, which means tumor cells of the same subtype have similar transcription profiles. The other thing that can be interpreted from this figure is that subtype 1 and subtype 2 are more similar to each other than either is to subtype 3. This is because the difference between subtype 1 and subtype 2 lies mainly on the second component, which captures only 8% of the variance, while the difference between subtype 3 and the other two lies on the first component, which captures 54% of the variance. So distances between dots on different axes should not be treated equally: differences on the first component should be given more weight.
Example 2 is gene expression data simulated with random numbers, mimicking the first example. I put this figure here to show what a random data set looks like in a PCA plot. One outstanding feature is that dots of different classes are all mixed together. The other feature is that the first three principal components capture almost equal, and small, amounts of variance. If the gene expression data we got from tumor cells looked like Example 2, we could say that the gene expression profiles of different subtypes are not distinct from each other, or that subtype has no influence on the tumor cell transcriptome. Here are two more examples.
Example 3 is gene expression data from drug-treated cells. Brown dots are untreated cells, pink dots are cells treated for 48 hours, and blue dots are cells treated for 72 hours. Dots of the same color are biological replicates of the same treatment. From the figure we can see that there is a major difference between treated and untreated cells. There is only a very small difference between treated cells at different time points, because that difference lies mainly on the second and third components, which capture very little variance. Another thing we can observe is that the untreated replicates pack together tightly, while the replicates of treated cells tend to scatter. This reflects that the transcriptome of treated cells has larger variance.
Example 4 is also gene expression data from drug-treated cells. There are four biological replicates of drug-treated cells, measured on four different microarray plates. Each plate has about 20 control replicates. Pink, blue, brown, and red dots are the control replicates on plates one to four. Light yellow dots are the replicates of drug-treated cells on the four plates. From the figure we can see that control dots from the same plate cluster together. This is certainly a technical artifact and makes no biological sense. This data set should be re-normalized, or processed plate by plate, in further analysis.
I will use microarray gene expression data as an example of how to make a PCA plot in MATLAB. Gene expression data is usually stored in a tab- or comma-delimited text file, with an extension such as .csv, .soft, or .xls. Use Excel or Sublime Text to open and preview the file. Gene expression values must be normalized before PCA plotting. Okay, we will do the demo now.
Now we will begin our PCA plotting in MATLAB. First, import the data from a CSV file. Click the Import Data button, select the CSV file, and wait for it to load. We first import all the numeric values as a matrix: we select Matrix, name it expressions, and click Import. In the next step we import the column labels. We name them subtypes and import them as a cell array, with the data type set to text, and then click Import. Now, in our workspace, we have two variables, expressions and subtypes. You can double-click expressions to browse it. It is a 55 by 56 matrix, so it holds 56 gene expression profiles of 55 genes. Double-click subtypes to browse it. There are 56 subtype labels, matching the 56 columns of the matrix. There are three subtypes in this data set: subtype 1, 2, and 3.
Now that we have all the data, we will begin our plotting. First, I open the script I wrote to do the whole thing, and I will explain it line by line. The first step is to perform principal component analysis. There is a single function in MATLAB, called princomp, that will do the job. I've copied the command here. The apostrophe after expressions means transpose. I need to transpose the expression matrix because, by design, the princomp function requires the rows of the input matrix to be observations and the columns to be variables, which means the rows should be gene expression profiles and the columns should be genes. The function has three outputs. The first output is the coefficient matrix, which we won't use here, so it is replaced with a tilde. In MATLAB, you can put a tilde in place of any output you do not need.
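As a rough sketch, the call described so far might look like the following. The random matrix is only a stand-in for the imported `expressions` variable, and the output names `scores` and `pcvars` follow the names used in this lecture. One assumption to flag: the sketch calls `pca`, the newer Statistics Toolbox function, in place of `princomp`, because `pca` takes the same input and returns the same first three outputs in the same order; the `princomp` form from the lecture is shown in a comment.

```matlab
% Stand-in for the imported data: 55 genes (rows) x 56 profiles (columns).
% Replace this with the real 'expressions' matrix from the import step.
expressions = randn(55, 56);

% The PCA function expects observations in rows and variables in columns,
% so the matrix is transposed: rows become profiles, columns become genes.
% The tilde discards the first output (the coefficient matrix).
[~, scores, pcvars] = pca(expressions');

% Form used in the lecture (older releases):
% [~, scores, pcvars] = princomp(expressions');

% One row of scores per profile; pcvars holds one variance per component.
disp(size(scores));
disp(pcvars(1:3)');
```

With real data, only the first line changes: drop the `randn` stand-in and use the imported matrix.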
The second output is scores, which holds the coordinates of the data after the PCA transformation. The third output, pcvars, stores how much variance each component captures. If a function has multiple outputs, MATLAB requires square brackets around them. Press Enter to run the command. Let's first look at pcvars. You can see the values are in descending order, so the first several components capture most of the variance in the data. Since we will plot our data in three dimensions, we will only use the first three components. The scores matrix is arranged like the transposed expression matrix: rows are gene expression profiles, but the columns are now principal components rather than genes. We will pick the first three columns, namely the first three components. I will let the first component be the x-axis, the second the y-axis, and the third the z-axis. The syntax here selects the specified column of the scores matrix; we select columns one, two, and three. Now we have the x, y, and z coordinates.
The next step is to plot them. I use the gscatter3 function. It is not a MATLAB built-in; I downloaded it from the File Exchange. The File Exchange is the official MATLAB forum run by MathWorks, where MATLAB users share their code. You can download the function from this URL, or search Google for its name. The first three arguments of the function are the x, y, and z coordinates. The fourth argument is the so-called group variable; here it is the subtypes variable, which specifies the group of each data point. We have three groups here, subtypes 1, 2, and 3. The next argument specifies the color for each group. We have three groups, so we specify three colors. Each color is represented by a single character, and the characters are wrapped in a cell array. The curly brackets indicate that it is a cell array. MATLAB recognizes eight colors by a single-character name. Here b is for blue, g is for green, and m is for magenta. The following argument specifies the marker type for each group.
Since we want the markers of all groups to be filled circles, we put three dots here; in MATLAB plotting, a dot marker is drawn as a filled circle. Then the final argument gives the size of the markers; I choose 15 here. For detailed usage of the function, open it in a text editor to see the documentation. Press Enter, and we get our figure.
The next step is to add some annotations. Use the title function to add a title to the figure: just pass the title text, as a string, to the function. Here I will use the title gene expressions of cancer subtypes. Now let's look at the figure. You can see it has a title. The next step is to annotate the axes. Recall that the pcvars variable stores the variance of each component. To calculate the percentage of variance the first component captures, we divide the first component's variance by the total variance, which is the sum of the variances of all components, and then multiply by 100. Press Enter and we get this number. We do not need that much precision, so we round the number to 54 using the round function. Then we convert the number into a string using the num2str function. Then we concatenate this number string with the other parts of the annotation, which are string literals, using square brackets. Then we use the xlabel function to add the annotation to the figure. Look at the figure again, and we see the annotation on the x-axis. Oh, I made a mistake here: it should be PC1, and I only typed PC. Never mind. If you put all the previous steps in one line, it looks like this: you take the variance of the first component, divide it by the total variance, multiply by 100, use the round function to get an integer, use num2str to change it to a string, and concatenate it with the other string literals in the annotation. We repeat this step for the y-axis and the z-axis.
OK, we have finished our figure. Want to change the orientation of the figure? Click this button, and you can rotate the box in any direction. Once the figure is in a suitable orientation, it is ready to export.
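The plotting and annotation steps above can be sketched roughly as follows. Several things here are assumptions rather than the lecture's exact script: the random matrix and the made-up labels stand in for the imported `expressions` and `subtypes`, `pca` is used in place of `princomp`, and a loop over the built-in `scatter3` stands in for the File Exchange function `gscatter3`, so that the sketch runs without a download. The axis label format is also only a guess at what the figure shows.

```matlab
% Stand-in data; replace with the imported 'expressions' and 'subtypes'.
expressions = randn(55, 56);
subtypes = [repmat({'1'}, 1, 20), repmat({'2'}, 1, 18), repmat({'3'}, 1, 18)];

% pca is used here in place of princomp (same first three outputs).
[~, scores, pcvars] = pca(expressions');

% First three components become the x, y and z coordinates.
x = scores(:, 1);
y = scores(:, 2);
z = scores(:, 3);

% Built-in stand-in for gscatter3: one scatter3 call per group,
% filled markers of size 15, one single-character color per group.
groups = unique(subtypes);
colors = {'b', 'g', 'm'};
figure;
hold on;
for k = 1:numel(groups)
    idx = strcmp(subtypes, groups{k});
    scatter3(x(idx), y(idx), z(idx), 15, colors{k}, 'filled');
end
hold off;
view(3);
grid on;
legend(groups);

title('Gene expressions of cancer subtypes');

% Percentage of variance per component, rounded to an integer.
pct = round(pcvars / sum(pcvars) * 100);

% Concatenate string literals with the number converted by num2str.
xlabel(['PC1 (', num2str(pct(1)), '%)']);
ylabel(['PC2 (', num2str(pct(2)), '%)']);
zlabel(['PC3 (', num2str(pct(3)), '%)']);
```

Computing `pct` once for all components avoids repeating the divide, multiply, and round steps for each axis.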
You can export the figure in different formats by clicking Export. I usually export it as EPS, so I select the EPS format and name the file PCA cancer subtypes. One last thing. Most of the time, you may want to apply z-score normalization to your x, y, and z coordinates so that they fall on the same scale. To do this, use the zscore function before plotting. Let's replot the figure. I copy the whole code and paste it here. Now the axes of the normalized figure all range from about -2 to +2, so they are on the same scale. OK, we have finished our PCA plotting in MATLAB for today. I hope you enjoyed it.
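The z-score step mentioned near the end could be sketched like this. As before, the random matrix is only a stand-in for the real `expressions` data, and `pca` is used in place of `princomp`. `zscore` comes from the Statistics Toolbox and works column by column.

```matlab
% Stand-in data; replace with the imported 'expressions' matrix.
expressions = randn(55, 56);
[~, scores, ~] = pca(expressions');

% zscore standardizes each column separately, so each of the first
% three components ends up with mean 0 and standard deviation 1.
xyz = zscore(scores(:, 1:3));
x = xyz(:, 1);
y = xyz(:, 2);
z = xyz(:, 3);

% Quick check that the three coordinates now share the same scale.
disp(mean(xyz));
disp(std(xyz));
```

The normalized `x`, `y`, and `z` can then go into the same plotting call as before. Note that after this step, distances along the three axes no longer reflect how much variance each component captures, so the percentages in the axis labels matter even more when reading the plot.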