Hello everyone. In this video we're going to talk about clustering algorithms. Clustering is one of the most popular unsupervised learning methods. Compared to the PCA algorithm that we talked about last time, which finds a low-dimensional representation for dimensionality reduction, clustering focuses on finding subgroups in the dataset. What can clustering be useful for? By doing clustering we can get some insight into the structure of the data, or we can use clustering before prediction to identify subgroups and then build a different model for each subgroup, giving a better prediction model.

Let's talk about application examples of clustering. Marketing and sales use clustering a lot. For example, customer segmentation finds subgroups of people based on their age, occupation, income, and so on, to predict what kind of product categories they would need and the pricing they're willing to pay. As you might have experienced with advertising on social media or on the Amazon website, these days machine learning models use clustering to find subgroups of people who might be likely to click an advertisement and then purchase a product.

Clustering algorithms are also very popular in bioinformatics and medicine. For example, we can use them to discover disease subtypes, such as subtypes of diabetes or cancer. Clustering is also used a lot in genomics and other bioinformatics research: by looking at subgroups of genes and proteins that are related to each other, we can get some insight into which subgroups of genes or proteins are responsible for a certain disease.

We can also use clustering for grouping documents, movies, and music, so we can identify groups of similar movies, music, or documents. Clustering can also be used for image segmentation or preprocessing of image data: with certain types of clustering you can do segmentation better.
All right, so we're going to talk about two popular clustering algorithms. One is called K-means clustering, which uses a distance metric between the data points: we cluster data points based on their distance, so data points that are close together will be clustered as a group. The other is hierarchical clustering, which clusters data points in a hierarchical manner. You might remember seeing images of upside-down tree diagrams; those come from hierarchical clustering. We'll talk about it in more detail later.

All right, let's talk about K-means clustering. Clustering is assigning labels to each data point when there are no labels in the training data. By eyeballing, you might say these data points go together and form one cluster, and those data points go together and form another cluster. As a result we can assign a different color to each cluster: this is one cluster, and this is another cluster. Then how do we assign these labels to unlabeled data? We use a distance metric to measure the distance from each point to a reference point of the cluster, which is called the centroid. The centroid is essentially the average position of all the points assigned to that cluster. So this is the average position for the red cluster, and this is the average position for the blue cluster.

With that in mind, the K-means algorithm works like this. At iteration zero, we assign random labels, because we don't know where the clusters are and therefore we cannot calculate the centroids yet. So we just randomly initialize the labels of the data points, and then we calculate the centroids. As you can see, the centroids calculated with these initial random labels all end up near the middle, but it's very rare that they overlap perfectly, so there are small differences between these initial centroids.
Because of those small differences, let's say the yellow centroid ended up here, the pink centroid here, the green centroid here, and the blue centroid here. In the next iteration, all of these points are closest to the yellow centroid, so they get assigned yellow; these green ones are closest to the green centroid, so they get assigned green; and so forth. Then we recalculate the centroids based on this new assignment: the yellow centroid moves here, the pink centroid moves to this position, the green centroid moves to this position, and the blue centroid moves to this position. In the next iteration we assign labels again, so all the points closest to this centroid are assigned yellow, and so forth. If we keep iterating, the centroids move around until, at some point, the assignments of the points stop changing and the centroids stay where they are. In that case we say the algorithm has converged, and when convergence happens, the K-means algorithm ends. We can also stop by setting a maximum number of iterations, in case convergence takes too long.

All right, distance metrics. Two popular metrics used with the K-means algorithm are distortion and inertia, and they are essentially the same thing. Distortion is the average of the squared Euclidean distances between the points in a cluster and the centroid of that cluster: this distance squared plus this distance squared, and so on, then take the average. Inertia is the sum of those squared distances between the points and the centroid.

K-means has a hyperparameter K; the name "K-means" comes from this hyperparameter, which represents the number of clusters. So how do we know in advance how many clusters to choose? We don't know in advance, so what we do is try different values starting from K equals 1, and measure the distortion or inertia, whatever our metric of choice is.
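The loop just described (randomly initialize the labels, compute centroids, reassign each point to its nearest centroid, repeat until the assignments stop changing) can be sketched in a few lines of NumPy. This is a minimal illustration for the lecture, not the sklearn implementation; the only edge case it handles is reseeding a cluster that ends up empty. It also computes the distortion and inertia metrics defined above.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: random label init, then assign/update until convergence."""
    rng = np.random.default_rng(seed)
    # Step 0: random initial labels, since we don't know the clusters yet
    labels = rng.integers(0, k, size=len(X))
    for _ in range(n_iter):
        # Centroid = average position of the points assigned to each cluster
        # (if a cluster is empty, reseed its centroid with a random data point)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Reassign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing: convergence
        labels = new_labels
    # Inertia: sum of squared distances to the assigned centroid;
    # distortion: the average of those squared distances
    inertia = float(((X - centroids[labels]) ** 2).sum())
    distortion = inertia / len(X)
    return labels, centroids, inertia, distortion
```

On two well-separated groups of points, this recovers one cluster per group regardless of the random initial labels.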
With our metric of choice in hand, we then draw the elbow plot. This plot is called the elbow plot because it has the shape of an elbow, and the optimal value is this one, because it gives the largest reduction in distortion; adding more clusters beyond this point doesn't reduce the distortion very much. So K equals 3 in this case would be the optimal choice.

Let's talk about some properties of K-means clustering. As we just mentioned, we need to know the number of clusters in advance, and the elbow plot is one way to determine the number of clusters K. K-means has a drawback: it can suffer from the curse of dimensionality. The curse of dimensionality means it becomes difficult to get an accurate clustering assignment when the dimensionality is high, because meaningless dimensions in the features can make the distance metric inaccurate. In that case we can use PCA to reduce the dimensionality, so preprocessing with PCA may help.

K-means has the good property that it will always find a solution given enough time. However, convergence can be really slow if there are too many data points or the features are very high dimensional. And the solution K-means finds is not the global minimum but a local minimum, which means the results can differ depending on the initialization.

You can find K-means in sklearn: the KMeans class in the sklearn.cluster package offers initialization options. If you set init='k-means++', it will tend to converge faster. In case there are so many data points that convergence is very slow, you can also use MiniBatchKMeans instead of plain KMeans. This algorithm uses mini-batches, meaning it randomly samples the data points instead of using all of the data to assign the clusters, so it helps reduce the computation time. All right, that's it for K-means clustering.
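Putting the sklearn pieces together, here is a rough sketch of the workflow from the lecture: fit KMeans over a range of K values and record the inertia_ attribute for an elbow plot, then switch to MiniBatchKMeans when the dataset is large. The three-blob data is made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

# Synthetic data: three well-separated blobs (made up for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

# Elbow-plot data: fit K-means for K = 1..6 and record the inertia
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
# The inertia drops sharply up to K = 3 (the true number of blobs) and then
# flattens out, so K = 3 is the elbow.

# MiniBatchKMeans fits on random mini-batches of the data instead of all of it,
# trading a little accuracy for a much shorter computation time.
mbk = MiniBatchKMeans(n_clusters=3, random_state=0, n_init=3).fit(X)
```

Plotting `range(1, 7)` against `inertias` gives the elbow plot described above.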
In the next video we're going to talk about another clustering method, called hierarchical clustering, which can address some of the drawbacks of K-means clustering.