Hello everyone. Today we're going to continue with our discussion of data mining methods. Previously we have already talked about frequent pattern analysis and also classification. Today we're going to continue with another core aspect of data mining methods, which is clustering. So in this module, what we're trying to learn is about clustering algorithms: how they work, and how we apply the different types of methods to clustering problems. And of course we'll also talk about how you can evaluate and compare different methods. So what do we mean by clustering? Generally with clustering, you have some objects, usually described by multiple attributes. But the one key characteristic here is that we don't have any predefined classes, so we don't know which classes the objects belong to, or whether there is any nice separation. Typically what we're looking for in terms of clusters are groups of densely or closely connected points. So basically there are two key properties. Within a cluster, the intra-cluster similarity should be high, because you want the objects that are within the same cluster to be similar to each other. That's the intra-cluster similarity. And then if you're comparing across clusters, the fact that those objects belong to different clusters is because they are different. That's the inter-cluster similarity, and we want that to be low. So that's really the key notion of clustering: we're trying to identify clusters that have these two main properties. So then when we talk about cluster analysis, that just means we're trying to identify those clusters. Compared to classification, the supervised learning approach we mentioned earlier, we can say that this is unsupervised learning, because we don't have any predefined classes. Our goal is really just to take the objects, identify whether there are clusters, and divide the objects into the different clusters.
To do that, you need to look at similarity, because you need to know how you define similarity. This connects back to what we covered in course one about data understanding: objects are represented by different attributes, the attributes have different types, and on top of that you have different ways of defining similarity or distance for the different types of attributes. Once you have a reasonable similarity measure for your particular data points, the question is really about how you actually do the clustering. So this is the clustering method angle. We'll be covering quite a few different methods for identifying clusters. But always keep in mind, when you're talking about a specific clustering method, that we need to think about quality, because of course you want to find good clusters. And then also efficiency, because you want to be able to find them quickly; you don't want it to take too long if you have a large dataset. And also this incremental clustering angle is important. This is the scenario where you have new data coming in. Does the method support incremental data? That is, do you need to redo everything, or can you just build upon the clusters you already have? All right. So specifically when you talk about clusters, you want to think about how you evaluate the results. There are different methods and different problem settings, so how do you know whether you're doing a good job? There are a few different types of questions we may ask when it comes to cluster evaluation. The first one is about clustering tendency. What this question is looking at is really: do you even have clusters in your dataset? Because as we said, there are no predefined classes; we don't know what the data looks like.
So you get the data, you want to do your initial analysis and ask: do the points actually tend to cluster, or are we dealing with a roughly uniform or really random distribution, with no real clustering tendency in the dataset? That is very important to understand. In many real world datasets we do see clusters, or clustering tendency, but there are also scenarios where you just don't have reasonable clusters. So that's actually the first question. The next one gets to more of the specifics of the clusters, the definition of clusters. Remember earlier we were talking about intra-cluster similarity and inter-cluster similarity. This really gets to the notion of cluster cohesion and separation. Cohesion refers to within-cluster: within the same cluster, the objects should be coherent. That's the high intra-cluster similarity angle. While separation says the clusters should be different: if they are different clusters, then apparently you should see them reasonably separated. That's the inter-cluster similarity. There are some other important questions which we'll cover later, but one key aspect is the number of clusters. You will see later that, again because we don't have predefined classes, one question is: how many clusters are we considering, and how do we even determine that number? There are various methods which evaluate how different clusterings may be better for your particular application scenario. One good example is the silhouette coefficient. What it is basically doing is looking at each object. First you compare within the object's own cluster: if I know this object belongs to cluster i, then I look at the distance from this object to the other objects in the same cluster. The mean of those distances is one value. And then you also compare across all the other clusters.
So you ask, for this particular object, what is the mean distance to each of the other clusters? Taking the smallest of those, the distance to the nearest other cluster, gives you another value. So that's b, and a is the within-cluster value. You're basically looking at the difference, typically normalized as (b - a) / max(a, b). The general idea of course is that you want the difference to be big: that means I have a small distance to objects within my own cluster and a large distance to all the other clusters. There's another aspect that's particularly useful: in many real world settings, it's not like you're doing your clustering without any knowledge. Actually there is often pretty good domain knowledge or external knowledge you may be able to leverage. It may already give you some sense of how many clusters you should consider, or even roughly what would be some good clusters to start with. Those are actually very useful pieces of information. And then there's another angle on cluster evaluation: you may have different methods or different approaches that have created two different sets of clusters for the same dataset. The question of course is: is one better than the other? Because remember, here again we don't have ground truth, right? No predefined classes. So instead you're saying: here is one approach and the clusters it has identified, and here is another set of clusters. Do you see one being better than the other? Even some kind of combined version may also be useful. So comparing across the different sets of clusters can be very useful. All right. Just keep all this in mind as we go through the detailed discussion of the different methods and the various choices. And we'll also come back later and talk about how you may use different types of constraints to further refine the kind of clusters you're looking for, all right.
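Before moving on to the methods, the silhouette coefficient described above can be sketched in a few lines. This is a minimal illustration of my own, assuming simple one-dimensional numeric objects with distinct values; a is the mean distance to the object's own cluster, and b is the mean distance to the nearest other cluster.

```python
import numpy as np

def silhouette(point, own_cluster, other_clusters):
    # a: mean distance from this object to the other members of its own cluster
    a = np.mean([abs(point - o) for o in own_cluster if o != point])
    # b: mean distance to each other cluster; keep the nearest one
    b = min(np.mean([abs(point - o) for o in c]) for c in other_clusters)
    # silhouette value in [-1, 1]; close to 1 means the object is well placed
    return (b - a) / max(a, b)

# object 2 sits tightly inside [1, 2, 3] and far from [8, 9, 10]
print(silhouette(2, [1, 2, 3], [[8, 9, 10]]))  # ~0.857
```

Averaging this value over all objects gives one overall score per clustering, which is how it is typically used to compare different choices of k.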
So, there are many clustering methods, and we will try to cover quite a few of them. They are roughly divided into the following categories. The first one is partitioning methods. As the name suggests, you're trying to divide up your dataset: I may have n objects, and I'm trying to assign those objects to different clusters. That's referred to as the partitioning approach. The next one is hierarchical methods. Again, as the name suggests, with hierarchical methods you're constructing clusters at different levels. You may be dividing things up, or merging things, as you move up and down the levels of the hierarchy. The next one is grid-based methods. The general idea is that instead of starting from the objects, you start by dividing up your space. You take your object space, look at the different dimensions and the range of each dimension, and divide it up into grid cells. Then you look at which objects fall within which grid cells. The next one is density-based methods. The notion of density, of course, is about how packed or dense the objects are within a certain area. There are different ways of defining density, but once you have a way of identifying the denser areas, you can identify those as clusters, because clusters tend to be these more concentrated areas of objects. And we're also going to talk about probabilistic methods. In this kind of approach, instead of a definite assignment of one object to a particular cluster, you look for ways to have a somewhat probabilistic membership.
The idea is that each object may belong to a particular cluster with a certain probability. Of course the probability can be one, or zero; that's the absolute case. But generally you want a somewhat distributed scenario, because that allows you to handle some of the fuzziness when you're trying to determine cluster membership. So all those methods are widely used. Of course they work very differently, but they all generally tend to give you good results. As we go through those methods, we want to understand how each of them works, what the high-level idea is, and also how they differ. And then of course: given your particular problem, is one type of method better than the others? All right. Let's start with the first category: partitioning methods. The high-level notion of partitioning is just that I have n objects and I'm already told that I need to divide them up into k clusters. That's an important assumption: you're being provided with this k. And now of course our job is to take those n objects and figure out a way to partition them. Of course you could try a brute-force approach: you have n objects and k clusters, each object can go into one of the k clusters, so you could try all the possible combinations and then use some cluster evaluation metric to say which one is the best. That works only if you have a small n and a small k. But remember, this is a very large number of combinations, and it doesn't scale well whenever you have even just a medium-sized dataset. So this method of course would give you the optimal solution, because you are trying out all the possible combinations, but usually it's not feasible just because of the complexity.
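Just to illustrate the combinatorial explosion mentioned above: each of the n objects can be placed in any of the k clusters, so an exhaustive search examines on the order of k to the power n assignments, which is hopeless even for a tiny dataset.

```python
# Brute-force partitioning: every one of n objects can go to any of k clusters,
# so the search space is on the order of k**n assignments.
n, k = 25, 3
print(k ** n)  # 847288609443 -- hundreds of billions for just 25 objects
```

This is why the practical partitioning methods that follow rely on heuristics instead of exhaustive search.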
So there are other, more heuristic-based methods. They try to find some heuristic to determine which assignments may be better without checking everything. There are two widely used methods. One is k-means clustering, which is widely used; you have probably heard the name or used it in some settings. k-means generally uses this centroid notion: the centroid is basically the average, the mean value, of all the objects within a particular cluster. The related one is called k-medoids. It's very similar to k-means in its design; both are partition-based. But the key difference is that instead of using a centroid (remember, the centroid may not be an actual object in your cluster), k-medoids uses this medoid notion, which is always an object. It's an actual object, but it's central, meaning it somehow has the most central position compared to the other objects in the same cluster. So let's look at the details of how this works. First, let's talk about k-means clustering. The method of k-means clustering is actually pretty simple. It's an iterative process, and it's fairly easy to implement and use. It has only a few steps. To start, importantly, you need your initial centroids. I already know I have n objects and I need to divide them up into k clusters, so let me start with k initial centroids. Of course, if you don't have any other information, you can just pick randomly: this is my object space, I just randomly pick k centroids. Or if you do have some domain knowledge, you can utilize that, so you can already roughly pick a reasonably good set of centroids. But either way:
You start with those k initial centroids. Now, once I have my initial centroids, the next step is that I need to do the partitioning. What I do is go through all my n objects, and for each object I compare it with each of the k centroids. Since I have my distance function defined, I can calculate which centroid is the nearest to the particular object I'm considering. So all I need to do is assign this object to the nearest centroid. That's pretty simple. All right. Once I have finished this assignment, I have now divided my n objects among the k centroids: each object goes to its nearest centroid. Now, for each centroid, each cluster, the membership may have changed: you may have new objects being assigned to you, or some previous objects are no longer in your cluster. So now we get to the step where we need to update our centroids. What I do is take the current, updated object membership (I know which objects are now assigned to my cluster) and compute the new centroid by just taking the average, the mean value, of all the objects in the same cluster. So that is one iteration: steps 2 and 3 basically let you do the partitioning and update your centroids. In this case the centroids may have changed, so you now need to go back to step 2 and redo the assignment. The general notion is that as you iterate through this process, you are likely to get a better and better approximation of those centroids. You just repeat this process until your centroids are stable. Stable means they don't change, or they change only slightly and that's good enough. Usually you can also set a maximum number of iterations, so you say: I run up to that many.
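The steps just described, initialize, assign each object to the nearest centroid, recompute the means, and repeat until stable, can be sketched as follows. This is a minimal illustration of my own, assuming numeric data and that no cluster ever becomes empty during the iterations.

```python
import numpy as np

def kmeans(data, centroids, max_iters=100):
    """Plain k-means sketch: data is (n, d), centroids is (k, d)."""
    centroids = centroids.astype(float).copy()
    for _ in range(max_iters):
        # step 2: assign each object to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its current members
        new_centroids = np.array([data[labels == j].mean(axis=0)
                                  for j in range(len(centroids))])
        if np.allclose(new_centroids, centroids):  # stable -> stop
            break
        centroids = new_centroids
    return labels, centroids

# two obvious groups near (0, 0) and (10, 10)
data = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
labels, cents = kmeans(data, np.array([[0., 0.], [10., 10.]]))
print(labels)  # [0 0 1 1]
```

The two nested computations, n objects against k centroids inside each of t iterations, are exactly where the O(nkt) complexity discussed next comes from.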
And then that will usually give you a reasonably good set of centroids, and that of course defines your clusters. Now just quickly, in terms of the complexity: here, as we said, we have n objects, we need to partition them into k clusters, and we run multiple iterations, so t is the number of iterations. If we look at the complexity of k-means clustering, it is O(nkt). You have t iterations; you can think of that as your outer loop. Within each iteration, you go through each of the n objects to partition it, and to do that you go through each of the k centroids to determine which one to assign the object to. So that's t times n times k. Usually it's fairly efficient: as we said, the process is simple, it's easy to implement, and it works well in many real world settings. Now let's look at a concrete example. My objects are simple; I only have one-dimensional values. Here I have 10 objects, each with a specific value. Also, I'm provided with two initial centroids; let's say 30 and 60. Now of course in my first round, what I need to do is the assignment. I have two centroids, and I need to figure out which object should go to which centroid, so I'm looking at the distance. Let's just consider the L1 distance; that's the absolute difference between any two values. Since the data is one-dimensional, it's usually easier to sort the values first. Then you can quickly determine the splitting point: anything below it goes to one centroid and anything above it goes to the other centroid.
Okay, all right, so you can maybe work on this example and figure out your first round of object assignment. I have two centroids, 30 and 60; which objects will be assigned to which centroid? So this is my first round of assignment. You can see that, since the values are sorted, 35 and 48 straddle the splitting point: 35 is still closer to 30, so it goes to that side, while 48 is actually closer to 60. That's why you see this nice split of the 10 objects into the two clusters. Now I have my new cluster assignment, and because the memberships are different, I need to update my centroids. What I do is that within each cluster, I sum up all the values and calculate the average. For cluster one, corresponding to the initial centroid 30, I take all five members, sum the five values, divide by 5, and get 18.2; that's the new centroid value. Similarly for the second cluster: add them up, divide by 5, and you have 66.6. So those are my new centroids, and this is one iteration. Now, because my centroids have changed, I need to go back and reiterate this process to see whether I may have a new membership, a new object assignment. So I do the process again. In round 2, I use 18.2 and 66.6, the two updated centroids, and again go through the objects and decide which object goes to which centroid. In this particular case it turns out that my membership doesn't change: the same sets of objects belong to the same clusters. Because of that, this is when we say it's stable: the centroids do not change.
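The walkthrough above can be reproduced in code. The actual ten values from the slide are not shown in the transcript, so the values below are hypothetical, chosen so the numbers match the lecture: initial centroids 30 and 60, updated centroids 18.2 and 66.6, and a stable membership after the second assignment round.

```python
import numpy as np

# hypothetical 1-D values (the slide's actual values are not available here)
data = np.array([5, 10, 15, 26, 35, 48, 60, 70, 75, 80], dtype=float)
centroids = np.array([30.0, 60.0])  # the two given initial centroids

for _ in range(10):
    # assign each value to the nearest centroid (absolute / L1 distance)
    labels = np.abs(data[:, None] - centroids[None, :]).argmin(axis=1)
    new_centroids = np.array([data[labels == k].mean() for k in range(2)])
    if np.allclose(new_centroids, centroids):  # membership stable -> stop
        break
    centroids = new_centroids

print(centroids)  # [18.2 66.6]
```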
And this is where I can stop. All right. Okay, so let's look at some of the key features. We've talked about how the k-means clustering method works; here are some key things to consider when you think about using k-means clustering for your problem. First, k-means clustering is widely used; it's probably the go-to method whenever people need to do some kind of clustering. It's pretty efficient and gives good results in many real world settings. So definitely consider using k-means clustering if you're just getting started and trying to figure out the properties of your dataset. But you do need to know that k-means clustering requires you to specify k, so you need to figure out how many clusters. Sometimes you know already, based on your domain knowledge; sometimes you don't. What people typically do is try different k values and then use some measure to determine which one actually gives the better clustering. The silhouette coefficient we mentioned earlier is widely used for measuring which k is a good one. But either way, you need to pick a k before you can run the k-means clustering method. The other consideration is also very important: you need to be able to define centroids. Remember, the centroid is defined as the mean value of the objects in that particular cluster. So think about your attributes and their types. If it's a numerical value, you usually have a natural way of computing the average. But if it's a categorical value, say different majors, or different types of products, how do you define the centroid?
So either you say, okay, I need to come up with a specific way of defining centroids in those cases, or you conclude: well, this may not be the best solution for me, because I really don't have a good way of defining a centroid. That's very important to consider when you want to use k-means clustering. Also, you need to figure out how you choose your initial centroids. As I said, you can try just random centroids, which usually works, or you can leverage your domain knowledge. But also keep in mind that depending on how you choose your initial centroids, they could really skew your results. If your initial centroids are good, the process will converge quickly and give you good results. But if you somehow happen to pick, for example, two random centroids that are very close to each other, then you don't have this nice separation power, and you may end up in a kind of local optimum, so you don't actually find a good result. So many times in a real world setting, either you have good domain knowledge to pick the initial centroids, or you just try multiple sets of initial centroids and then see whether they converge to the same or similar clustering results. So this is very important: the impact of your initial centroids. The next consideration is also very important, because it's about what types of clusters your method can find. Think about k-means: with k-means, you always assign an object to its closest centroid. Because of that, you tend to get convex-shaped clusters, meaning the objects always belong to the closest centroid. But in your real world setting, you may expect some non-convex shapes.
So think about a cluster that is roughly a rounded shape but has a concave area. Think about a doughnut shape: you have an inner circle and you have an outer ring. If those two are two different clusters, then k-means clustering would not work, because remember, you are always assigning each object to its nearest centroid. So with this concave shape, it will not work. That's important to consider given your particular types of applications and the particular types of clusters you expect to find. Another point is that k-means is generally more sensitive to noise or outliers in your dataset. Why is that? Remember, the way we compute the centroid is by taking the average of the objects in a particular cluster. So if you look at my simple example here, you have these nicely clustered objects, but somehow you have an outlier point that's further away. If this point is also assigned to the same cluster, then think about what happens when you take the average of all the objects: your centroid, because it is calculated as the average of everybody, could actually be pulled out and skewed towards that outlier point in the upper right. And that may not be good, because this one outlier or noise point could have a pretty significant impact on the overall centroid, and thus on the cluster itself, because it will also impact your later assignments: the objects are assigned based on the distance to the centroid. So to address that problem, one approach is: instead of using the centroid, let's use actual objects.
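The outlier sensitivity just described is easy to see numerically. A tiny one-dimensional illustration of my own, not from the slides:

```python
import numpy as np

cluster = np.array([1.0, 2.0, 2.0, 3.0, 4.0])
print(cluster.mean())  # 2.4 -- the centroid sits inside the tight group

with_outlier = np.append(cluster, 50.0)
print(with_outlier.mean())  # ~10.33 -- one outlier drags the centroid far away
```

A centroid of roughly 10.3 no longer represents any of the five original objects, which is exactly the skew the lecture is warning about.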
In this case, I don't compute a centroid. Instead, among all the objects that belong to the same cluster, I want to find one object that is central, meaning it's closer, or roughly equally close, to all the other objects. If you do that, in my example here, you may actually get this green point. This object, as you can see, is an actual object, and it is still roughly central among all the objects in this cluster, but it's not as skewed compared to the red point. The red point is pulled away by the outlier, while the green one, being an actual object, is much less affected by the outlier. So that's a key aspect of this k-medoids clustering. If you look at the whole process, it is very similar to k-means: you're partitioning objects into different clusters, and you iterate through multiple runs so that you can refine your partitioning results. But because I'm using this medoid notion, the medoid has to be an object, and it is defined as the most central object. You're basically looking at the distance from this medoid to all the other objects, and you want to pick the one that minimizes the combined distance to the other objects. That's the central object we're looking for. Because of that, the medoid is generally less sensitive to noise or outliers, which is particularly useful if you're dealing with scenarios where you may have more outliers and you really want to reduce their impact. But one limitation of the k-medoids method is that, because I'm using actual objects, I need to find this medoid object whenever I'm readjusting my clusters.
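The medoid-selection step just described can be sketched as follows, again for one-dimensional objects (my own minimal illustration). The nested pairwise comparison is also exactly where the extra cost discussed next comes from.

```python
import numpy as np

def find_medoid(cluster):
    # the medoid is the actual object whose summed distance to all other
    # objects in the cluster is smallest; this pairwise scan is O(m^2)
    # per cluster, versus O(m) for a simple mean
    cluster = np.asarray(cluster, dtype=float)
    costs = np.abs(cluster[:, None] - cluster[None, :]).sum(axis=1)
    return cluster[costs.argmin()]

print(find_medoid([1, 2, 3, 4, 50]))  # 3.0
```

Note the contrast with the mean: for these values the mean is 12, pulled toward the outlier 50, while the medoid 3 stays in the middle of the tight group.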
So it's actually much more computationally expensive compared to the simple calculation of a centroid. With a centroid, all you need to do is take all the objects and compute a simple average, while with the medoid you have to iterate through all the candidate objects in your cluster and figure out which one is most central. So it adds to the complexity significantly. That also limits its scalability: if you're dealing with a large number of objects, this method doesn't work well because it's too slow. There are some approaches to make it more efficient. The general notion is around samples, or randomized samples. The idea is that instead of trying out everybody, you pick a random sample, so you're dealing with a smaller set; or you perturb within a neighborhood, meaning that if I need to update the medoid, it's unlikely that my central object will jump dramatically to the other end of the cluster, so you basically check within your neighborhood. That's again a heuristic; it doesn't guarantee a globally optimal solution. But typically those randomized samples can significantly speed up the process and still give you good results. All right, so we have talked about the partitioning-based methods, and we've covered two representative methods: k-means clustering, based on the centroid notion, and also k-medoids, using this central-object, medoid notion. The next category is hierarchical clustering. As the name suggests, here we go beyond a single-level partition, where the objects are just separated at the same level.
With hierarchical clustering, we're instead trying to go up and down a few levels. Here's a quick illustration of how hierarchical clustering works. At the very bottom, of course, you have your individual objects. Then we use this dendrogram, which shows how the objects are organized into clusters at the different levels. Typically the objects correspond to the x dimension, and the y-axis usually refers to some similarity or distance measure. You tend to see that as you go up, you have a larger distance, which means the merged clusters are more different, because now you have bigger clusters; as you go down to the lower levels, you tend to have smaller distances, so the objects are closer to each other. The dendrogram gives you the very specific, detailed process of how objects get merged, or divided, at the different levels. Depending on how you work with hierarchical clustering, you can go either top-down or bottom-up. If you're using the agglomerative approach, that's basically merging, so you repeat your merging process. You start with the individual objects: at the very beginning, each cluster contains a single object. Then at each round, at every level, you're trying to pick two clusters to merge, so you basically calculate the distance between all those different clusters and you pick the pair that are closest and merge them. That is what we're showing with this merging process, and you can repeat that.
At the very end, of course, you get one cluster which contains everybody. So that's the bottom-up, merging process. Or you can do it the other way: go top-down. I start with one cluster containing everybody, and each time I try to separate things: I look within my cluster for objects that tend to be further away from each other and try to divide them into two sub-clusters. So as you go from the top level downward, you are reducing the distance between the clusters. The method is reasonably straightforward, and it has some very nice features. Generally, hierarchical clustering is particularly useful in many real world applications, because many times, as I said, I don't have a good sense of how many clusters to use. Hierarchical clustering allows me to play with that a little bit: I can adjust and look at different levels. And many times this multi-level clustering is actually useful: the patterns you're looking for, the clusters you're looking for, may occur at different levels. Think about your customer population: you may be looking at different granularities, like everybody living in a particular state, or you can get down to the city level. Or if you're looking at product categories, you can also have different granularities depending on the specific products, or go one or two levels up. So you have different granularities when you're dealing with your clusters. The method goes all the way from individual objects as separate clusters to all objects in a single cluster.
it really lets you explore the potential number of clusters and what would be a good level at which to separate them. You don't need to specify how many clusters in advance; this method naturally builds up the whole tree, showing how the clusters can be merged at different levels. But there's one important thing you need to consider, and that is the definition of cluster distance. Remember, the merging or splitting depends on the actual calculation of the distance between clusters. Of course, depending on your objects and your attribute values, you may have very different ways of computing cluster distance, but there are a few key choices to consider. For example, say you have one cluster here and another cluster there, each containing some objects. When we talk about cluster distance: if I have centroids defined, I can just measure the distance between the centroids, and that's easy. Or, if I don't have centroids defined but I do have an object-to-object distance, I can do the pairwise computation, this object to that object, as long as they are in two different clusters, and then use the minimum distance between the two clusters, or the maximum distance, or some kind of average. As you can see, there are already several variations in how you define cluster distance, and they of course have different impacts on your actual clustering results. So you do want to think about what kind of cluster distance is appropriate for your particular application scenario. Also, as we've mentioned, hierarchical clustering supports multi-level clustering, which is very useful, particularly if your problem scenario tends to have clusters at different granularities.
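The cluster-distance variants just mentioned, centroid distance plus minimum, maximum, and average pairwise distance, can be sketched directly. The 1-D clusters and absolute-value distance here are my own toy setup:

```python
# Sketch of the cluster-distance definitions discussed above,
# for 1-D clusters with absolute-value distance (toy setup).

def single_link(a, b):    # minimum pairwise distance ("single linkage")
    return min(abs(x - y) for x in a for y in b)

def complete_link(a, b):  # maximum pairwise distance ("complete linkage")
    return max(abs(x - y) for x in a for y in b)

def average_link(a, b):   # average over all cross-cluster pairs
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))

def centroid_dist(a, b):  # distance between the cluster centroids
    return abs(sum(a) / len(a) - sum(b) / len(b))

c1, c2 = [1.0, 2.0], [4.0, 6.0]
print(single_link(c1, c2))    # 2.0
print(complete_link(c1, c2))  # 5.0
print(average_link(c1, c2))   # 3.5
print(centroid_dist(c1, c2))  # 3.5
```

Note how the same pair of clusters yields four different distances, which is exactly why the choice of definition changes the clustering result.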
One other thing to keep in mind: because you're building this hierarchical structure, either top-down or bottom-up, once you make a decision at one level, whether a merge or a split, and you go to the next iteration, you basically have to work with what has already been decided at the previous level. You don't get to do the kind of global reshuffling that partitioning methods do; those can potentially reassign everything at each iteration. With the hierarchical approach, you work within the branch you're already in, because that decision was made in the previous round. All right. So we have talked about partitioning methods and hierarchical methods. The next one is grid-based clustering. As the name suggests, we're using grids, or grid cells. The key idea is a multi-resolution grid structure. The main point is that I don't start by partitioning or joining my objects; instead, I divide up my space. Think about your original space, as shown in the top figure. I can do a simple two-by-two split, so now I have four grid cells, but I can split further: if I take each cell and recursively do another two-by-two split, I end up with four-by-four grid cells, and I can continue further. So naturally you have this multi-level, multi-resolution structure, and at each level you're looking at the grid cells. Once you have those grid cells, all you need to do is figure out which objects belong to which grid cells, and count how many fall within each cell.
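The cell-assignment-and-counting step can be sketched as follows. The unit-square space, the power-of-two resolution levels, and the density threshold are my own assumptions for illustration:

```python
from collections import Counter

# Sketch: assign 2-D points in the unit square to a (2^level x 2^level) grid
# and count objects per cell. Space, level, and threshold are toy assumptions.

def cell_of(point, level):
    n = 2 ** level                       # n x n grid cells at this level
    x, y = point
    # min(...) keeps points sitting exactly on the upper boundary in-grid
    return (min(int(x * n), n - 1), min(int(y * n), n - 1))

def grid_counts(points, level):
    # Count how many objects fall within each grid cell.
    return Counter(cell_of(p, level) for p in points)

def dense_cells(points, level, threshold):
    # Cells holding at least `threshold` objects are candidate cluster cells.
    return {c for c, k in grid_counts(points, level).items() if k >= threshold}

pts = [(0.1, 0.1), (0.15, 0.2), (0.2, 0.1), (0.9, 0.9)]
print(dense_cells(pts, level=1, threshold=3))  # {(0, 0)}
```

Notice that once the grid is fixed, the per-cell counts are simple statistics, which is what makes the incremental updating discussed below cheap.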
And once you have a reasonable number of objects in a particular grid cell, that cell can be considered a cluster. Your cluster may cover just a single grid cell, or you can connect neighboring grid cells, since their contents are all similar to each other. In my example here, the orange grid cell, containing multiple objects, represents a cluster by itself. And then I have the green ones: three grid cells that are each dense, with a certain number of objects in them, and also connected, so together they give you another cluster. But one thing to keep in mind is the cluster boundary. Because I'm using grid cells, I'm dividing up my space with horizontal and vertical splits at particular values along each dimension. Because of that, your cluster boundaries are naturally defined by these horizontal and vertical edges, so you don't get flexible shapes for your cluster boundaries. That may still work in many settings, but you need to know that this is what you get, and it may not work in certain application scenarios. Also, think about the main process: I design my grid structure, and once I have it, I just assign the objects to those grid cells. Because of that, the complexity of this approach actually depends on the number of grid cells rather than the number of objects. Even before you see any objects, if you know the object space already, you can go ahead and set up your grid structure. And when you're checking for clusters, you're working on the grid cells, so it's the number of grid cells that drives the complexity. And because you have this nicely split space,
it's actually very easy to parallelize the computation, because each grid cell can basically handle its own contents, and then you do a bit of aggregation when you move from the lower-level, finer-granularity grid structure up to the next level. Within the same level it's easily parallelizable, and that can make the process a lot more efficient. Another benefit of this kind of calculation is that you're basically maintaining statistical information within each grid cell, for example how many objects there are, or perhaps the average and the standard deviation. Now if new objects get assigned to a particular grid cell, you can easily just update those statistics rather than recomputing everything, because the statistical information can be incrementally updated with the information from the new objects. So this is particularly useful for incremental processing. Keep that in mind if your application scenario calls for a highly efficient, incremental design and can also work with these horizontal and vertical cluster boundaries. All right, so the next category is referred to as density-based clustering. As the name suggests, we're talking about density. So what do we mean by density? In my simple example here, I have objects in my space, and you tend to see that the clusters correspond to areas that are denser, dense with more objects. The question, of course, is how you define density. You're not looking at the global structure, or even the distance between one cluster and another;
rather, you just look at your local neighborhood, and you ask: within this local area, am I dense or not, and how dense am I, to decide whether I'm in a cluster? There are different approaches to how you define density and how you use it. The DBSCAN method, which is widely used, is based on defining a neighborhood: my density is based on how many objects I have within that neighborhood. As long as I'm above a certain density threshold, I can keep growing my neighborhood: my neighborhood is dense, so I keep going, still dense, until I reach the boundary area, where the density drops below my threshold. That's how you identify your clusters: by growing your local neighborhood as long as it stays above your density threshold. The other approach, called DENCLUE, is also based on density, but it uses a more formal definition of a density function. The general idea is that each object has a local influence function, and when you aggregate the influence functions across multiple objects, you get your density attractors, which we'll talk about a little later. As you can see, both methods are based on density, but the notions are slightly different; the high-level goal is still the same: we're trying to find dense areas. Now, some of the key features. Because density is defined purely by how dense a region is, there's no specific constraint on shape: as long as an area is dense, it's considered part of the cluster. Because of that, this approach is very flexible in identifying clusters of arbitrary shape. That is important because, even just in physical space,
think of urban planning scenarios, or how some animal species migrate: there may not be a nice, rigid structure to the clusters. Instead you're looking at the density, and that allows you to capture potentially any shape, as long as there is enough density. It's also noise tolerant. That's important: in the previous calculations, the distances or cluster similarities are all based on the objects, so each individual object is considered, and if you have noisy data, those noise objects could really distort your distance calculations. Here, if I'm talking about density, individual objects may be off a little bit, but that probably doesn't change my density calculation, because as long as there are a certain number of objects in my neighborhood, I'm still in a cluster. So density-based methods can handle potentially noisy data, as long as there is good density information to be leveraged. Also important: the previous methods usually involve an iterative or hierarchical process, but with the density approach you basically just grow. You look at your neighborhood and keep expanding until you reach the boundaries, so it's really a single scan: you do one pass and you're done with your clusters. You can also use your density definition and its parameters to adjust how dense an area needs to be, which gives you some flexibility in identifying clusters under different density thresholds. So let's look at the specific methods. The first one is DBSCAN. The general notion in DBSCAN is the neighborhood and the density defined within that neighborhood.
It uses two key parameters. The first one is epsilon, which defines a radius: for any object p, you draw a circle of radius epsilon around it, and that is p's epsilon-neighborhood. Once you have that neighborhood, you just check how many objects are in it; remember, you have a distance function defined anyway, so you can ask: within my epsilon-neighborhood, how many objects are there? The second parameter is MinPts, the minimum number of points you should have in your epsilon-neighborhood for you to be considered a core object, that is, a point in the middle of a dense area. Naturally, that lets you move around the object space, identify areas that are dense, connect them to neighboring areas if those are still dense, and stop when you're at a boundary. So the whole clustering process is about growing your neighborhood as long as you keep finding core objects, that is, as long as you're still in a dense area. Border objects refers to points that still belong to your cluster but are already at the edge: they are reachable from core objects, but they sit at the boundary. In my example here, I have this red object: within its neighborhood there are quite a few points, so it's considered a core object. Now look at this green point: it's connected to a dense area, and its neighbors are likely all core objects. But if you move beyond the green one, you can see there aren't many points left; such a point is reachable from core objects, but it is not itself a core object.
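The growing process just described can be sketched as a minimal DBSCAN over 2-D points. The Euclidean distance and the toy epsilon/MinPts values are my own assumptions:

```python
# Minimal DBSCAN sketch for 2-D points (epsilon/MinPts values are toy choices).

def region_query(points, p, eps):
    # All points within the epsilon-neighborhood of p (including p itself).
    return [q for q in points
            if ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 <= eps]

def dbscan(points, eps, min_pts):
    labels = {}                       # point -> cluster id; -1 marks noise
    cid = 0
    for p in points:
        if p in labels:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = -1            # noise for now (may become a border point)
            continue
        cid += 1                      # p is a core object: start a new cluster
        labels[p] = cid
        seeds = list(neighbors)
        while seeds:                  # grow the cluster through core objects
            q = seeds.pop()
            if labels.get(q, -1) == -1:
                labels[q] = cid       # unvisited or noise point joins cluster
                q_neighbors = region_query(points, q, eps)
                if len(q_neighbors) >= min_pts:
                    seeds.extend(q_neighbors)  # q is core: keep expanding
    return labels

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(dbscan(pts, eps=0.3, min_pts=3))
```

Here the four nearby points form one cluster (they are all core objects), while the isolated point at (5, 5) ends up labeled as noise.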
Points like that are what we consider border objects, and they naturally let you delimit the extent of your cluster. All right, so this is the general notion of density-reachable and density-connected: the idea is whether you are still connected through the dense area. I can have a chain of multiple objects, and as long as each step stays within the dense region, through core objects, the whole chain is density-connected. It's only when you reach the edge and cannot go further that you stop and say: beyond this point, it is no longer density-reachable. All right, the other density-based method is called DENCLUE. DENCLUE, as we said, uses a formal definition of a density function. It has a couple of key notions. The first is the influence function. Think of a particular object space as a field with objects in it; each object has some influence function, and this influence function usually depends on the neighborhood. That means you typically have a stronger influence on nearby objects, and as you go further away, the influence weakens. It can also vary by direction: in one direction an object may have a bit more influence, in another direction maybe not as much. So you can customize the influence function depending on your problem, but it is defined per object. Now, if you look at any point in this field, what you see is really the aggregated influence from multiple objects, with each object a different distance away from that particular location. So what we do is simply combine them:
we sum up the influence functions of all the objects in the space. Once you have that, you look at the overall density, and the overall density is what allows us to identify the density attractors. The idea is that if you look at the whole field, some areas are higher, because the combined influence functions there are much stronger, while other areas are like valleys: they don't receive much impact from their neighbors. That naturally gives you the concentrated areas where there is more significant influence, or density. All right, so far we have talked about four different categories of methods: partitioning methods, hierarchical methods, grid-based, and density-based. There's one more category, probabilistic methods, which we'll continue with in the next lecture.
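Before we close, the summed-influence idea behind DENCLUE can be sketched briefly. The Gaussian influence function, the bandwidth sigma, and the 1-D toy data are my own assumptions for illustration:

```python
import math

# Sketch of DENCLUE's aggregated density: each object contributes a Gaussian
# influence, and the overall density at a location x is the sum of all
# influences. Bandwidth sigma and the 1-D toy data are illustrative choices.

def influence(x, obj, sigma=1.0):
    # Gaussian influence of one object on location x (1-D for simplicity).
    return math.exp(-((x - obj) ** 2) / (2 * sigma ** 2))

def density(x, objects, sigma=1.0):
    # Overall density = sum of every object's influence at x.
    return sum(influence(x, o, sigma) for o in objects)

objs = [1.0, 1.1, 0.9, 5.0]
# Density is higher near the concentration around 1.0 (a density attractor)
# than out in the "valley" around 3.0.
print(density(1.0, objs) > density(3.0, objs))  # True
```

Locations where this summed density peaks are the density attractors; following the density uphill from each object toward its attractor is what groups objects into clusters.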