Hello everyone. Welcome to the second course in the Data Mining specialization. This course is titled Data Mining Methods. Today we're going to start with an introduction to data mining methods. In this course, our goal is really to identify the core functionalities of data modeling in the data mining pipeline. If you remember, in Course 1 we covered the data mining pipeline and talked about the specific components in that pipeline. In this course in particular, we'll focus on the data modeling aspect and then look at the various methods. Today we're also going to start talking about frequent itemset mining; in particular, we'll be talking about the Apriori algorithm.

If you recall what we learned in Course 1 about what data mining is generally about, we typically use four views to characterize a data mining project. The starting point is, of course, data: What kind of data do you have? What attribute types are you talking about? What characteristics do they have? Then you look at the application domain. Many times there is important domain-specific information that bears on the data and on the kind of analysis you would like to do. Then you look at the specific type of knowledge you would like to learn, because you have the data, you know your application scenario, and you ask, "What kind of knowledge am I trying to discover from my data?" Of course, you need the fourth component, which is the technique, because you need to figure out how you actually do it. We also talked about the data mining pipeline in Course 1, starting with data. If you have your raw data, you want to spend some good time and effort just trying to understand your data. The next step is preprocessing, which is about preparing your data. The data warehousing part is focused on managing the data and on doing multi-dimensional data analysis using the common data cube representation. In this course, we are going to focus on the methods. That is particularly relevant to the data modeling part, and related to that is the evaluation aspect.

From the technique view, there are quite a few things we can do on the data mining side. Specifically, we'll be talking about frequent pattern analysis, classification and prediction, clustering, anomaly detection, and trend and evolution analysis. Let's quickly look at what we mean by those different techniques. As a starting point, let's look at frequent pattern analysis. As the name suggests, we're looking for patterns that occur frequently in your dataset. Depending on the types of data you have, you may be looking at frequent itemsets, frequent sequences, or frequent structures. Let's look at a particular example. Here I have bird migration information. If you're looking at the itemset part, that's basically about which items occur together frequently. In this dataset, you'd be asking, "Which bird species tend to co-occur?" This could be restricted to a particular region or a particular time window, but the focus is on the itemset: the co-occurrence of some bird species. Then if you look at the sequential information, because your dataset naturally gives you trajectories describing how birds migrate from one location to another, you could have frequent sequences that capture how a bird goes from point A to point B to point C in order. Also, if you look at your map, it's not just a sequence; you actually have a 2D structure.
You could have a graph structure representing the physical space, and it then shows not only that the birds travel in a particular sequence, but also that there may be additional structure: for example, there may be a densely connected area here, then usually just a single route going one way, but once you get to another site there is a bit more variability. That naturally gives you a structural view of the pattern. As you can see, it's really about what kind of data you have and what kind of frequent patterns you're looking for.

Once you have identified those frequent patterns, there are two more types of analysis you can do on top of them. One is association rules; the other is correlations. Generally, association rules look at: if A has happened, what is the likelihood of B occurring? That's very useful for making predictions, or for designing strategies that leverage the information that A is occurring. Think about recommender systems: if you know your customer has purchased certain items, and you know that when those items are purchased there is a high chance of some other items being purchased, then you can recommend those other items. Correlation goes one step further. It says not only that A is occurring and there is a certain probability of B occurring, but it asks whether B is more likely to happen given that A is happening, less likely to happen, or whether the two are independent. Basically, you are looking for attributes or events that are actually relevant to each other. If you recall what we talked about in Course 1 when we discussed understanding data and correlation analysis, it is important to keep in mind that correlation does not imply causality. These two are not the same, so just be careful when you're doing this kind of analysis. As an example, many times if you have many dimensions in your data, you can use a grid, with one row and one column per attribute, to represent the pairwise correlations. Here I use color: the warmer colors indicate more positive correlations, and the bluish colors indicate negative correlations. You can quickly see which attributes are correlated and to what extent.
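As an aside (this is not from the lecture slides), a minimal sketch of building such a pairwise correlation grid in Python might look like the following; the attribute names, data, and color choices are all made up for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data with a few numeric attributes (names and values are made up).
rng = np.random.default_rng(0)
temp = rng.normal(20, 5, 200)
df = pd.DataFrame({
    "temperature": temp,
    "ice_cream_sales": 2.0 * temp + rng.normal(0, 3, 200),  # positively correlated
    "heating_cost": -1.5 * temp + rng.normal(0, 3, 200),    # negatively correlated
    "noise": rng.normal(0, 1, 200),                         # roughly independent
})

corr = df.corr()  # pairwise Pearson correlations, one row and column per attribute

# Warmer colors for positive correlations, cooler colors for negative ones.
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar(label="correlation")
plt.tight_layout()
plt.show()
```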
We will come back and talk about more details when we discuss frequent pattern analysis. Another technique used very broadly in data mining is classification. You may have heard of this in many other settings, but typically, by classification, what we mean is that you have predefined classes. You already know which classes you care about, and the goal is to look at the particular attributes of an object and determine which class it belongs to. This naturally calls for training data, because I need objects whose attributes and class labels are both known. On top of that, I can build a model that learns the relationship between the object attributes and the class labels, so that I can determine the class labels for new objects. Related to classification is the term prediction. When we say prediction, this typically refers to numerical data. With classification you have a finite number of classes, while with prediction we're talking about numerical, usually continuous, values. Think about predicting weather: that could be temperature, wind speed, or precipitation. Or you can look at stock prices, or traffic, which could be traffic volume or commute time. All of those are numerical values, and the goal is to predict those numerical values.

The next one is clustering. Clustering is different from classification because you don't have any predefined classes. Here you have data points, which may be two-dimensional or high-dimensional, and you're trying to determine whether there are some natural groupings. Those groupings are the clusters we're trying to discover. What do we mean by a cluster? Typically, within a cluster, the objects should be similar: the intra-cluster similarity should be high. Between clusters, they should be different: the inter-cluster similarity should be low. That's the general notion, and of course we will discuss specifically how we actually identify those clusters.

Another one that is very important is anomaly detection. Many times in data mining analysis you are looking for general patterns, patterns that apply to the majority of your data or scenarios. But in many cases you actually look at the other side: the rare cases, things that just don't look similar, or look quite different from the others. That's the general notion of an anomaly or outlier. Depending on your scenario, you're asking, "What is the expected scenario, and which things look different?" or "This is so rare, this is unexpected." Here I have an example with time series data; this is actually remote sensing data. You can see that there is a general pattern, even a seasonal fluctuation, but then you see spikes, or this hump over here that just looks different, and also a pretty significant level shift. All of those are potential anomalies that should be looked at. Anomaly detection is very useful because these could be indicative of something significant, a rare but important event, or they could be fraudulent activities, or errors or noise in your data.

The other one is trend and evolution analysis. This is particularly related to changes over time. The previous types of modeling we talked about can be viewed as static, but you can also look at how those models change over time. Typically, when you have temporal information, you look for the longer-term trend, but a lot of our data also has seasonal or periodic patterns, and then you look for anomalies, which can also be noise. Basically, you take the time series and try to decompose it into these different types of sub-patterns.

In this course, we are going to dive deep and learn about the specific methods that have been proposed for various kinds of modeling needs. We'll be covering frequent pattern analysis, classification, clustering, and outlier analysis. These are core methods that you will see widely used in many data mining application scenarios. Let's start with frequent pattern analysis. The origin, or starting point, of frequent pattern analysis was actually motivated by market basket analysis. Think about retailers: they have a lot of items in their store, and they have customers coming into the store and purchasing items. Typically we represent this as a transaction table.
You can view that as just a list of transactions, each identified by a transaction ID. For each transaction, you have the information about which items were purchased. When you have that, one natural question to ask is, "Which items are purchased frequently?" or "Which itemsets?", meaning which sets of items are purchased together frequently. The notion of an itemset is just a set of items. But when we say frequent, we need some measure: what do we mean by frequent? Typically we use the term support. Support is the probability of the itemset occurring in your dataset. Then, to define what counts as frequent, you need a threshold, which is referred to as the minimum support. With that, you can say, "In my dataset, if I look at this particular itemset, I can calculate the number of occurrences, the frequency, of that itemset. If it is above my minimum support, it is considered a frequent itemset." That's the general notion. This is particularly useful from a retailer's point of view, or really in any scenario where you have subsets of objects or items occurring together and you want to understand what occurs regularly and frequently in your dataset.

Now the question, of course, is: how do we find frequent patterns? The straightforward, brute-force approach is to just try out all the possible combinations. If you have a certain number of items, you enumerate all the possible combinations of those items, count the number of occurrences of each, and you're done. That's doable for a small number of items, but the number grows very quickly. Think about my simple case here with just 100 items; note that in any real retail setting you're talking about a lot more than that, easily thousands or millions of items. But even with just 100 items, think about all the possible combinations. You can count them as 100 choose 1, the single-item case, plus 100 choose 2 for the two-item case, plus 100 choose 3, and so on, summing over all the possible sizes. Another way to look at it is that each item can either be chosen or not, so you have two choices per item; that is 2 raised to the power of 100 for 100 items, and the minus 1 removes the case where none of the items is selected. That is, of course, a very large number, which shows that this kind of brute-force approach is not feasible. We need something much more efficient.
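To make the notion of support and the brute-force blow-up concrete, here is a minimal sketch in Python; the transactions, item names, and the `support` helper are assumptions made purely for illustration, not part of the lecture:

```python
from itertools import combinations

# Hypothetical transactions, each a set of purchased items (for illustration only).
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

min_support = 0.5
print(support({"milk", "bread"}, transactions))                    # 0.5 -> frequent
print(support({"bread", "butter"}, transactions) >= min_support)   # True

# Brute force: enumerate every non-empty itemset over all items and count each one.
items = sorted(set().union(*transactions))
all_itemsets = [c for k in range(1, len(items) + 1)
                for c in combinations(items, k)]
print(len(all_itemsets))  # 2**len(items) - 1; with 100 items this would be 2**100 - 1
```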
Before we talk about the specific methods, there are two important notions that are very useful when you talk about frequent patterns. The first one is called a closed pattern. The notion basically says that as we consider different itemsets, if X is my current itemset, I want to see whether there is a superset of X, that is, a set Y containing X plus something else, with the same support. If such a Y is frequent, then X is of course frequent as well. So I keep growing, looking at larger and larger super-patterns, until I reach a point where there is no super-pattern with the same support value. That itemset is then referred to as a closed pattern. There is another related term called a max-pattern. This is based on the scenario where you have a threshold: once you're above my threshold you are considered frequent, and I really don't care whether your support is 0.7 or 0.8, as long as you're above the threshold. As you can see, the two notions really differ on the support part. With a closed pattern, the super-pattern needs to have exactly the same support for the sub-pattern to be subsumed, while with a max-pattern, as long as your super-pattern is frequent, you're good; it doesn't need to have the same support as the sub-pattern.

Let's look at a concrete example of why this is useful. Here I have two transactions. The first transaction contains all 100 items, a_1 through a_100; the second transaction contains only the first 50, a_1 through a_50. With these two transactions and a minimum support of 0.5, so 0.5 out of two transactions, an itemset needs to occur at least once. Now, which patterns are frequent? Actually, all of them: every possible combination of the 100 items has occurred at least once, so all possible combinations are considered frequent, and as we showed previously, that is a huge number. Now let's look at the closed patterns. Remember, for a closed pattern you keep looking at super-patterns and checking whether a super-pattern has the same support value as the sub-pattern; you keep growing until there is no super-pattern with the same support. If you do that here, you will find exactly two closed patterns. The itemset a_1 through a_100, containing all 100 items, has occurred once, in the first transaction. The itemset of the first 50 items, a_1 through a_50, has occurred twice, because the first 50 items appear in the first transaction as well as the second, so its support, its frequency, is two. Because these two have different support values, a_1 through a_50 is still a closed pattern: no super-pattern has the same support. Now look at the max-patterns. As long as you are frequent, meaning above the threshold, in this case occurring at least once, you're good. Our two closed patterns both occur at least once, so they are both frequent. If I ignore the difference in support, one versus two, then a_1 through a_100 is a super-pattern of a_1 through a_50 and both are frequent, so the only max-pattern is a_1 through a_100. As you can see, this can significantly reduce the number of frequent patterns we have to look at: we basically just go for the biggest pattern that is frequent. That of course makes the process a lot more efficient, and I don't need to try all the possible combinations.
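Here is a scaled-down sketch of that example, assuming a_1 through a_6 instead of a_1 through a_100 so that brute-force enumeration stays cheap; the `count`, `closed`, and `maximal` names are made up for illustration:

```python
from itertools import combinations

# Scaled-down version of the example: a1..a6 instead of a1..a100.
t1 = frozenset(f"a{i}" for i in range(1, 7))   # a1 .. a6
t2 = frozenset(f"a{i}" for i in range(1, 4))   # a1 .. a3
transactions = [t1, t2]
min_count = 1  # min support 0.5 of 2 transactions -> at least one occurrence

def count(itemset):
    return sum(itemset <= t for t in transactions)

items = sorted(t1)
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if count(frozenset(c)) >= min_count]

# Closed pattern: no strict superset has the same support.
closed = [x for x in frequent
          if not any(x < y and count(y) == count(x) for y in frequent)]
# Max-pattern: no strict superset is frequent at all.
maximal = [x for x in frequent if not any(x < y for y in frequent)]

print(len(frequent))                 # 2**6 - 1 = 63 frequent itemsets
print([sorted(x) for x in closed])   # [a1..a3] and [a1..a6]
print([sorted(x) for x in maximal])  # only [a1..a6]
```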
This leads us to the Apriori algorithm, which has been widely used for frequent itemset mining. The idea is actually very simple; the key idea is Apriori pruning. What that means is that, as you are trying to grow your patterns, if you know that at some point X is not frequent, meaning it is below your threshold, then anything containing X, any superset of X, cannot be frequent either. Why is that? If an itemset contains X and X is already below your threshold, then X plus something else cannot be above the threshold, because it can only occur in transactions that also contain X. So you can safely remove such patterns once a sub-pattern falls below your threshold.

The Apriori algorithm works like this. It starts with the single-item case, the easiest one. You look at the individual items: I go through my dataset, count the number of occurrences of each individual item, and the ones above my threshold are the frequent 1-itemsets. The ones below my threshold either didn't occur or didn't occur enough times, so they are removed, because by the Apriori pruning principle they cannot be part of any frequent pattern. Now I have my frequent 1-itemsets, the itemsets containing a single item that are frequent. From there, I generate my candidates for size k+1. If I start with single items and A is frequent and B is frequent, then AB can be a candidate, meaning AB may be frequent. But if A is frequent and C is infrequent, then AC cannot be frequent, so I will not generate AC as a valid candidate. In general, I take all the frequent k-itemsets and grow each by one item to get the candidate (k+1)-itemsets. They are only candidates at this point: they may or may not be frequent. So with the candidate (k+1)-itemsets, I go back to my database, the list of transactions, and do another scan. This time I'm counting the support for the candidates, and based on those support values I decide whether they are frequent or not. The infrequent ones are removed, pruned out. I repeat this process until there are no more frequent itemsets or candidates; I stop at the largest k for which I still have frequent itemsets.

Let's look at a particular example. Of course, in the real world you would have many more transactions and items, but for our small example I have five transactions. Each transaction contains some items, and A, B, C, D, E are the possible items. My minimum support is 0.6, meaning I need a 0.6 probability of seeing a particular itemset for it to be considered frequent. Because I have five transactions, 0.6 translates to three: each itemset needs to occur at least three times to be considered a frequent pattern. Let's start with the first round. In the first round, we go through the single items: I go through my dataset and count the support, the frequencies, of the single items. You have A, B, C, D, E, and I'll give you a little time to calculate the number of occurrences for each of those items. Now, A, B, C, D, E are the 1-itemsets, and if you go through the transaction table you can count their occurrences, shown in the second column: 2, 4, 3, 3, 5. As I said, given the minimum support of 0.6, I need at least three occurrences. In this case, A would be grayed out, or pruned: A cannot be part of any frequent pattern.
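As a rough sketch of this first pass: the exact transactions from the slide are not reproduced in the transcript, so the five transactions below are a hypothetical set chosen only to match the stated counts (A: 2, B: 4, C: 3, D: 3, E: 5):

```python
from collections import Counter

# Hypothetical five transactions chosen to reproduce the counts from the lecture;
# the exact transactions shown on the slide are not in the transcript.
transactions = [
    {"A", "B", "C", "D", "E"},
    {"B", "C", "E"},
    {"B", "C", "E"},
    {"B", "D", "E"},
    {"A", "D", "E"},
]
min_support = 0.6
min_count = min_support * len(transactions)  # 0.6 * 5 = 3 occurrences

# Round 1: count each single item, then prune the ones below the threshold.
counts = Counter(item for t in transactions for item in t)
print(dict(counts))         # {'A': 2, 'B': 4, 'C': 3, 'D': 3, 'E': 5} (order may vary)
frequent_1 = {item for item, c in counts.items() if c >= min_count}
print(sorted(frequent_1))   # ['B', 'C', 'D', 'E']; A is pruned
```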
Now, with that, I know that B, C, D, E, as 1-itemsets, are frequent. The next step is to generate the candidate 2-itemsets, the ones that could be frequent. Basically, you take the combinations: B and C, so BC would be a candidate; B and D, so BD would be a candidate; and so on. Also note that because these are itemsets, the order doesn't matter: BC and CB are the same. You just use some fixed ordering to ensure that you don't generate duplicates. Now let's generate the candidate 2-itemsets and then do another database scan so we can get the number of occurrences of each candidate. Generating the candidates, I have BC, BD, BE, CD, CE, and DE; those are the candidate 2-itemsets. You go back to your transaction table, count the occurrences, and you get the counts shown in the table. Again, the ones that are grayed out are below my threshold: BD and CD are no longer kept because they cannot be frequent, and anything containing BD or CD cannot be frequent either. With that, we're ready to generate the candidate 3-itemsets. I take the frequent 2-itemsets, combine them to see which candidates are generated, and count their support again. Doing that, we can show that BCE is a valid candidate 3-itemset, and when I check its support in my transaction table, it has three occurrences, so BCE is indeed a frequent 3-itemset. If you put all of this together, the frequent itemsets are: the frequent 1-itemsets B, C, D, E; the frequent 2-itemsets BC, BE, CE, DE; and the frequent 3-itemset BCE. Take all those together, and that's the complete list of frequent itemsets for this particular dataset under this minimum support threshold.

If you tried this yourself, you may have noticed something interesting: I'm not generating BDE or CDE as candidate 3-itemsets. You might wonder why, because if I take BE and DE, I could have BDE as a potential candidate, so how come it's not included? The reason is that BD is below my threshold, and if BD is below my threshold, then BDE cannot be a valid candidate. Similarly, if you look at CDE, I have CE and DE and could have combined those two to get CDE as a candidate, but CD was removed; CD is not frequent, so CDE cannot be frequent.

That leads us to two important details when using the Apriori algorithm. The first one is called self-joining: when you grow from your frequent k-itemsets to the candidate (k+1)-itemsets, you only join two k-itemsets if their first k−1 items are the same. This is related to the ordering I mentioned: since these are itemsets, the actual order doesn't matter, so we fix one particular order and use it to make sure we don't generate duplicate candidates. The other part is pruning: you may have generated a candidate, but you then need to check whether all of its subsets are frequent. If any subset is not frequent, the candidate should be removed; it cannot be a valid candidate.
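A minimal sketch of candidate generation using these two rules might look like the following, assuming itemsets are kept as sorted tuples (the function name and representation are my own, not from the lecture):

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Generate candidate (k+1)-itemsets from frequent k-itemsets.

    Itemsets are sorted tuples. Self-joining: only join two k-itemsets whose
    first k-1 items agree. Pruning: drop a candidate if any of its k-item
    subsets is not frequent.
    """
    frequent_set = set(frequent_k)
    k = len(frequent_k[0])
    candidates = []
    for a, b in combinations(sorted(frequent_k), 2):
        if a[:k - 1] == b[:k - 1]:                       # self-joining rule
            cand = tuple(sorted(set(a) | set(b)))
            if all(sub in frequent_set                    # pruning rule
                   for sub in combinations(cand, k)):
                candidates.append(cand)
    return candidates

# Frequent 1-itemsets from the running example: B, C, D, E (A was pruned).
freq_1 = [("B",), ("C",), ("D",), ("E",)]
print(generate_candidates(freq_1))
# [('B', 'C'), ('B', 'D'), ('B', 'E'), ('C', 'D'), ('C', 'E'), ('D', 'E')]
```

The same function can then be applied level by level, alternating with a database scan that counts candidate supports and prunes the infrequent ones.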
Let's look at a concrete example. This is at level 3, meaning I have my frequent 3-itemsets: abc, abd, acd, ace, bcd. These are 3-itemsets that I have checked; they are all above my threshold. Now I'm ready to take these and generate my candidate 4-itemsets. What do I do? I can take abc and abd. This satisfies the self-joining rule because the first k−1 items, in this case the first two items, are the same: ab and ab. So abc and abd can be combined to generate abcd. But I cannot join abd and acd under the self-joining rule; again, this is to avoid duplicates, since joining abd and acd would also generate abcd as a potential candidate, so I don't do that. So from abc and abd I generate abcd, and the next step is the pruning part. I now need to check whether all the subsets are frequent: for abcd to be considered a valid candidate, abc needs to be frequent, abd needs to be frequent, acd needs to be frequent, and bcd needs to be frequent. Looking at our list, I started from abc and abd, and acd and bcd are also in our frequent list, so all the subsets are frequent. In this case I used the self-joining rule and checked that the other subsets are frequent, so abcd is added as a valid candidate 4-itemset. Now for the other example, I have acd and ace. The first rule, self-joining, is satisfied, because the first two items are the same: ac and ac. So I can combine them to get acde; that is a potential candidate. But when I check the subsets, I know that acd is frequent and ace is frequent, but cde is not in my frequent list, and you can also see that ade is not frequent. That basically says that acde cannot be a frequent itemset anyway, because some of its subsets are not frequent. So you prune it out: acde could have been a candidate, but because its subsets are not all frequent, it cannot be a valid candidate. You prune out this candidate before you even go to the support-counting stage. These are just some important details; make sure you understand them, and use the example to figure out what would be included as a valid candidate and what would be pruned as not a valid candidate.
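For reference, here is a small self-contained sketch of this level-3 example, again assuming itemsets are sorted tuples; it simply reproduces the two rules on the itemsets listed above:

```python
from itertools import combinations

# Frequent 3-itemsets from the example, as sorted tuples.
frequent_3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
              ("a", "c", "e"), ("b", "c", "d")]
frequent_set = set(frequent_3)
k = 3

for a, b in combinations(sorted(frequent_3), 2):
    if a[:k - 1] != b[:k - 1]:
        continue  # self-joining rule: first k-1 items must match
    cand = tuple(sorted(set(a) | set(b)))
    missing = [sub for sub in combinations(cand, k) if sub not in frequent_set]
    if missing:
        print(cand, "pruned; infrequent subsets:", missing)
    else:
        print(cand, "kept as a candidate 4-itemset")

# Output:
# ('a', 'b', 'c', 'd') kept as a candidate 4-itemset
# ('a', 'c', 'd', 'e') pruned; infrequent subsets: [('a', 'd', 'e'), ('c', 'd', 'e')]
```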