Hello, everyone. Today we're going to start discussing outlier analysis. Our goal is to learn what outliers are, how we can do outlier analysis using different types of techniques, and, of course, to explain how those methods work and be able to evaluate and compare them. Let's start with the general notion of an outlier, also referred to as an anomaly. The high-level idea is that in data mining we are looking for patterns. Previously, we talked about looking for general patterns — patterns that apply to the majority of our dataset. Frequent patterns, classification, clustering: many times they are used to find general patterns, or you could say the normal patterns, what we expect to see with high probability. But on the other side, we can look for things that are just different. That's actually a very interesting angle in data mining: besides the general patterns, you also want to see whether there are things that simply differ from the norm. Those are usually referred to as anomalies, because they are abnormal. The key purpose of outlier analysis, also referred to as anomaly detection, is to find those outliers and be able to understand what they are.

There is a key notion to keep in mind as we talk about outliers or anomaly detection. You might think that all outliers are errors caused by various issues in the data, so you can just find and remove them. But that is not exactly what we want. In many real-world settings, those outliers are actually significant events, or suspicious or fraudulent activities: you want to detect them, understand what they are, and be able to act upon them. So keep in mind that the purpose of outlier analysis is not just to find things and remove them so you can focus on the normal data; many times the outliers are the main focus of your data mining process, because you want to find things of potentially high significance.

There are many different types of outliers. Let's see an example. This is remote sensing data: satellites go around the earth measuring various types of properties. Here we're looking at surface temperature — the satellites look down at a particular region and measure how the surface temperature changes over time. You can see the temporal dimension over a few years, and you can probably see a seasonal pattern, because every year it gets warmer and colder, so the temperature changes. You can also already see some spikes and a shift; those are, of course, outliers in this particular example. What kinds of outliers do we see? Typically you start with the simple case, usually referred to as global outliers. These appear on a point basis, meaning you only need to look at a particular data point or data value to decide whether it is an outlier. The general rule says that anything above a certain value, or below a certain value, is an outlier. In our example, there is a spike in the middle that rises way above the rest of the dataset. You can set a threshold and say anything above it is an outlier, so you only need to look at the individual value.
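As a quick illustration of this point-based, global case, here is a minimal sketch (not tied to the actual remote-sensing data) that flags values lying more than a fixed number of standard deviations from the mean of a toy temperature series; the series and the 3-sigma cutoff are assumptions made up for the example.

```python
import numpy as np

# Toy "surface temperature" series: a seasonal cycle plus noise, with one
# artificial spike injected (all values here are made up for illustration).
rng = np.random.default_rng(0)
t = np.arange(365 * 3)
temps = 15 + 10 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 1.5, t.size)
temps[500] += 20  # inject a global (point) outlier

# Global outlier test: flag any value beyond mean +/- 3 standard deviations.
# The 3-sigma cutoff is just a common choice, not a rule.
mu, sigma = temps.mean(), temps.std()
outliers = np.where(np.abs(temps - mu) > 3 * sigma)[0]
print("Flagged time steps:", outliers)
```

Note that this purely global threshold ignores the season, so a reading that is unusual only for winter would slip through — which is exactly the contextual case we turn to next.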
But many times you'll say: this value may still be within the general range, but given the particular scenario, it seems high or otherwise different. This is what we generally refer to as contextual outliers. In many real-world settings, most outliers need to be contextualized, because you need to know: what is this value, what is this attribute, and what ranges do we tend to see in specific scenarios? That's really the context. A lot of times when you look at an outlier, you say it is abnormal given a specific context. Then the question is, what do you mean by context? That varies with your actual application scenario. In our case, for example, we expect seasonal changes: the typical values differ between winter and summer. If you see a value that is unusually high for winter, or unusually low for summer, then given the context of that season, the value is considered abnormal.

There are also cases that go beyond a single value: you don't look at one value and decide whether it is an outlier given its context. Instead, sometimes outliers show a collective effect. That means you're not looking at a single object being different from the others, but at multiple objects being different together. If you look at the example here, there is a small hump that is different from the rest. Each individual value is only slightly different — maybe still within the normal fluctuation for that context — but quite a few of them are different together. That could be a collective outlier, because all members of that group of objects deviate from the general pattern we expect. You can also see the larger one, a pretty significant level shift: compared to the first half, the second half looks quite different collectively. Again, that is not about a single value or a particular context; a group of them is changing in a certain way, and that is the collective difference. When you look at a real-world application scenario and you're looking for outliers, think about these different cases, because they represent different things, and depending on how you pose and interpret the question, you may find different types of outliers.

Now let's look at a few more examples, starting with environmental data. We were just talking about remote sensing data, but think about other scenarios. One is a general temperature measurement — you can use my earlier example, or just think about the temperature in your neighborhood. You know how the temperature changes over the year, during the day, and across seasons; you have a general pattern. Then you can ask what would be considered normal or abnormal if the temperature is particularly high or low for that particular context. Also think about commute time — the time people take to go to work or come back from work. Depending on where you live and where you work, there is some general pattern in how long your commute takes.
You can add in context: for example, whether it's rush hour, whether it's a rainy day, or whether there's construction going on. Those are different contexts you can add. But also think about how you might see global outliers, contextual outliers, and collective outliers in this setting. Another example is wind power generation. If you're looking at a wind farm, generation will depend on how much wind you're getting and from which direction — that's the context. If you then see abnormal wind power generation, it could be indicative of a potential failure in one of your wind turbines, or even several. This is a very useful scenario where you want to quickly identify things that are not right.

Let's consider another scenario: social data. Think about online social networks — you have friends, people posting content, people liking content, so they're interacting with various information and users. If you look at, say, the number of friends, the number of posts, or the number of likes, you know they vary: some people are simply more popular or more active, and some posts receive a lot more likes than others. You can probably capture some general pattern, and then look at how things differ at certain time periods, for certain users, or on specific topics. All of those are things to look at when thinking about potential outliers. Also, don't just consider the number of likes at a particular time; look at how it tends to change. Most posts on a certain topic follow a typical pattern, but posts that are particularly popular may have a much higher rate of increase, or may flatten out and then jump again because of certain events. If you look at this temporal information, you can also identify patterns that are abnormal compared to the general pattern. Beyond that, you can look at the network structure — how things are connected. In social networks, you expect most people to be connected in certain ways, but you may see a very different type of structure within a subset of users. It has been shown that such unusual network structure can sometimes be indicative of fake accounts or fraudulent activities.

Let's look at one more example, this time relating to the financial market. Think about stock prices. Stock prices fluctuate a lot, but many times you look at one and say: I expect some fluctuation, but this is really extreme, very different from the others. With stock prices, many times you can also see collective changes: if something is happening, you may see not just a single stock making a sudden change, but a subset of stocks changing in some very significant way. Then there are credit card transactions — I'll use that example for fraud detection. If you look at a customer's general transaction pattern, fraudulent activity will usually show up as something quite different from that norm. You can also add the spatial and temporal angle, because a credit card transaction depends not only on what you're purchasing, but also on when and even where you're making the purchase.
All of that information can be used as context and helps you identify things that are different from the rest. These are just quick examples, but in almost any field where you analyze data, as you think about the general patterns, always think about the other side, where you may find outliers, and think about the different types of outliers.

All right, so we have talked about the different types of outliers or anomalies in various problem settings and said that it's really useful if you can detect those anomalies. But this is not an easy problem; there are actually quite a few significant challenges. The starting point is the very vague notion of being normal or abnormal. Intuitively, yes, outliers or anomalies are things that differ from the norm — but what does that mean exactly? Many times there is no clear definition. If I could give you a clear rule — if the temperature is above a certain value, it's an outlier — that would be clear and easy. But in many real-world settings, you really need to see what the more frequent or general pattern is, and then decide how things are considered different from it. There are also many different application scenarios; some may have a very specific definition, and some may have no clear definition at all, so the notion of normal and abnormal differs across applications. On top of that, most real-world datasets are not perfect — they are noisy in many ways. Think about credit card fraud: some activity could be considered fraudulent, but a regular, normal user may still make a purchase that's quite different from their usual pattern. How do you handle that? It blurs the boundary between normal and abnormal cases. All of these challenges make this genuinely difficult in any real-world problem setting. This is also why we always emphasize that in data mining you don't just work with the data: you need to understand your problem setting — what application you're in, what problem you're looking at, what data you're dealing with, and how to interpret it. That gives you a reasonably good understanding of what outliers we are talking about, and then of how to find them. That's really the starting point.

The next part is efficiency. You want to find anomalies, but in many real-world settings you care about efficiency. Think about detecting credit card fraud: latency matters. You want to detect it right away if you can, or at least very quickly, so you can prevent it or act on it soon, rather than waiting too long before realizing something suspicious happened. Scalability also matters: these days you're dealing with a lot of data, monitoring a lot of information while trying to find things that look suspicious or different. You are naturally handling huge amounts of data while trying to make a quick detection, so scalability is very important. A further challenge is that you need to be adaptive, because we don't have a clear definition of normal versus abnormal activities, and the patterns change.
You want to have a general understanding of your typical patterns, but also recognize that the underlying patterns may be changing, and being able to adapt to that quickly is important. The last challenge, which is increasingly important, is that not only do we want to find outliers, we want to be able to interpret or explain the finding — why we think this one is an outlier or abnormal. If I'm flagging a credit card transaction, or flagging a particular wind power generation scenario as abnormal, what does that mean? Why did I come to that decision? There needs to be some explainability or interpretability, so that you're not just throwing the data into a black box that tells you, "I flag this one as an outlier," while you have no idea why. All of these are important challenges, and there is still active research trying to address each of them. Keep them in mind as we go through the different methods.

When we're choosing a method or designing a technique to find anomalies, there are a few different scenarios. The first question is really about ground-truth labels: are you provided with examples or cases that are already known to be normal or abnormal? If you have labels, great — you can do supervised learning. You know which cases are abnormal or normal, so, for example, you can build a classifier. But in many real-world settings you just don't have clearly defined labels. You can say, okay, this is the data I'm seeing, but I really don't have any ground-truth label telling me which cases are normal and which are abnormal. That generally calls for unsupervised learning methods, where you don't have predefined labels about being normal or not. In practice, many times you may use a semi-supervised learning method. That refers to the scenario where you have some labels — a few cases are provided, but not a lot — and how you leverage those few labeled cases can be very useful.

The other angle is about what you are assuming in terms of normal versus abnormal, because that's really the key notion. You say these are the outliers I'm trying to look for — but what does that mean? Depending on the assumptions you make, there are different types of methods. One general approach uses statistical models: the idea is that I can represent or capture the normal versus abnormal cases using statistical models. For example, Gaussians with different means and standard deviations may capture the normal and abnormal scenarios, or you can use other types of distributions. The general assumption is simply that you can model your normal or abnormal cases with some statistical model. Another assumption, or notion of abnormal versus normal, is majority versus minority: if you're a normal object, you belong to the majority. In many problem settings I assume most of my cases are normal, while the small number of minority cases are usually the abnormal ones. That holds in many settings, but for your particular problem, check whether it is true or reasonable.
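To make the statistical-model idea concrete, here is a minimal sketch, assuming a single numeric attribute and a Gaussian model of normal behavior; the data and the tail-probability cutoff are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Assume most observations come from "normal" behavior...
values = rng.normal(loc=50.0, scale=5.0, size=1000)
# ...plus a few points generated by something else entirely.
values = np.concatenate([values, [85.0, 12.0, 95.0]])

# Fit a Gaussian to the data; in practice you might fit it on
# known-normal data only, if you have that.
mu, sigma = values.mean(), values.std()

# Score each point by how unlikely it is under the fitted model.
# The 1e-4 two-sided tail cutoff is an arbitrary choice for this sketch.
tail_prob = 2 * norm.sf(np.abs(values - mu) / sigma)
print("Flagged values:", np.round(values[tail_prob < 1e-4], 1))
```

The same template works with other distributions; the modeling choice is the assumption you are making about what "normal" looks like.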
There is also a very important notion: proximity. If you assume that normal versus abnormal is really about proximity, then if I'm close to a lot of other cases, I'm presumably part of the normal group, while if I'm far away from most others, I'm just different and could be considered an abnormal case. Proximity is a very useful notion for normal-versus-abnormal scenarios, and we will come back later to the difference between using distance and using density as the proximity measure.

Let's look at each of those methods in more detail. The first is the classification-based approach. In this case you have labels, so it is a supervised learning approach: you have been provided with labeled cases. Ideally, your labeled cases contain both normal and abnormal examples. If not, you may only have examples of normal cases, or only examples of abnormal cases. Those are different scenarios, but in all of them you have some labels. Using those labels, you can apply general classification methods: you build a classifier so you have a way to assign objects to each of those classes. Keep in mind that you may have more than one normal or abnormal class. If you think about your customers based on their purchase behavior, you may be looking at several types of customers — they purchase different things, but they're all considered normal. Similarly, the abnormal cases may all be suspicious activity, but suspicious in different ways. So you may not be dealing with just two classes, normal and abnormal; you may have multiple classes within each broad category.

Supervised learning can be very useful if you have ground-truth labels, but it does have a few challenges when applied to anomaly detection. The first relates to the class imbalance problem. Since we're talking about normal versus abnormal, many times the normal cases are the majority — that class, or those few classes, can be very large — while you have far fewer abnormal cases. Think about faults in your wind turbines: most of the time everything is running smoothly and correctly, and only a few cases involve a failure. Your data is naturally skewed: a lot more normal cases, and a much smaller number of abnormal ones. If you want to use classification for anomaly detection, make sure the method you choose can handle such imbalanced classes. The second challenge is new patterns, a general limitation of supervised learning: you cannot capture things you haven't seen yet. You build your model on the training data you have, and you separate or classify the different classes based on what you have seen there.
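Before elaborating on that second challenge, here is a small hedged sketch of how the first one, class imbalance, is often handled on the supervised route, using scikit-learn's class weighting on synthetic data; the roughly 1% anomaly rate and all parameters are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(2)
# Synthetic, heavily imbalanced data: ~99% normal, ~1% abnormal.
X = np.vstack([rng.normal(0, 1, size=(5000, 4)),
               rng.normal(3, 1, size=(50, 4))])
y = np.array([0] * 5000 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# class_weight="balanced" up-weights errors on the rare class so the
# classifier does not simply predict "normal" for everything.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```

Even with the weighting, the caveat just mentioned still applies: a classifier trained this way only recognizes the kinds of anomalies it has already seen.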
Coming back to new patterns: this is particularly true for outliers, because an outlier just means something different, and things can be different in many different ways. If you haven't seen that new pattern or scenario, your pre-trained classifier may not be as effective. This gets back to the adaptability angle. Think about security attacks: attackers are always coming up with new methods — it's an arms race. You come up with a strategy to detect currently known attacks, but then a new attack is designed and shows up, and you haven't seen it yet. This is particularly challenging for classification-based methods, because they rely on ground-truth labels and may not be able to capture new patterns.

The other category is clustering-based methods. As we have discussed previously, clustering is unsupervised learning: it doesn't assume any provided labels about which class objects belong to, but rather just tries to find objects that cluster well together. That allows it to be more flexible in many real-world settings. We generally leverage the assumption of majority versus minority: if you belong to a large cluster, the assumption is that this is a normal class and you are one of the normal cases, while if you belong to a smaller cluster, or you don't belong to any cluster, you would be considered an abnormal case. Again, this naturally supports multiple clusters for the normal and abnormal cases, because you can have several clusters that all represent normal cases. In my simple example here, the blue points cluster together and form a reasonably large cluster, and the green points also cluster pretty well, so those could be considered the normal clusters, whereas the red point is really far from the other clusters — it's really different — so it can be considered an outlier.

One benefit of the clustering-based approach is that it is fairly generalizable: it just uses the notion of clusters, you don't need predefined labels, and in general it is more flexible across different application domains. But remember, when choosing your clustering method, think about which specific method to use — we have talked about different clustering approaches; they find different shapes of clusters and behave differently. There is also a key requirement: you need to define your similarity measure, because clustering naturally finds things that are more similar within a cluster and different across clusters. That leads to our discussion of proximity-based methods. When we use proximity for outlier analysis, the general assumption is that if you're far away from others you're considered abnormal, and if you're close to many other points you're considered normal, because you likely belong to a normal group. That's the general notion, but you'll see that things can differ depending on whether you're talking about distance or density. Let's look at the general setting first. Many times people use distance: you define a distance function and then look at the distance from a particular point to its neighbors.
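As a small aside, here is a minimal clustering-based sketch along the lines of the blue/green/red picture, using DBSCAN as one possible choice (it also previews the density notion coming up): points that end up in no cluster, or in very small clusters, are treated as candidate outliers. The coordinates and the eps / min_samples / size settings are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
# Two "normal" groups plus one isolated point, roughly in the spirit of
# the blue/green/red picture on the slide (coordinates are made up).
blue = rng.normal([0.0, 0.0], 0.5, size=(100, 2))
green = rng.normal([5.0, 5.0], 0.5, size=(80, 2))
red = np.array([[2.5, 7.0]])
X = np.vstack([blue, green, red])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# DBSCAN gives points in no cluster the label -1; on top of that, flag
# clusters much smaller than the rest (the minority assumption).
sizes = {c: int(np.sum(labels == c)) for c in set(labels) if c != -1}
small = [c for c, n in sizes.items() if n < 0.05 * len(X)]
outliers = np.where((labels == -1) | np.isin(labels, small))[0]
print("Cluster sizes:", sizes, "| flagged indices:", outliers)
```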
With the distance approach, typically you look at the k-nearest-neighbor distance: for my 10 nearest neighbors, how far away are they from me? This is a notion of absolute proximity, because you calculate a distance measure and then apply some threshold to decide whether points should be considered close or far away. But now consider this scenario. Compared to my initial example, the blue points are more spread out, though they still form a cluster, and the green ones are similar. If you look at absolute distance, the red point is actually not that far from the green points — probably closer to them than the blue points are to each other. If you use an absolute proximity measure based purely on distance, with some distance cutoff, you may end up considering the red point part of the green cluster, because it is within that threshold. But if you visualize this, you'll probably say, "Well, maybe not." Even though the red point is close in terms of distance, it sits in a sparse area by itself, whereas the green points are much more tightly clustered, in a denser area.

This is what density-based methods try to capture. Instead of a distance cutoff, we look at density: if you belong to a dense area, you're in a normal neighborhood. Remember that in the DBSCAN method for density-based clustering we used the notion of an epsilon neighborhood — you look at the density within a particular neighborhood and determine whether you belong to a dense neighborhood or not. This gives a notion of relative proximity: not just the absolute distance value, but whether your density is similar to that of your neighbors. Depending on the application scenario, both may be applicable, or one may be more suitable than the other. We just want to make sure we understand the difference between distance and density, and for your particular problem setting, think about which one is more suitable.

We have talked about classification as a supervised method, clustering as an unsupervised method, and the different proximity measures. Let's talk a little more about semi-supervised methods. As mentioned earlier, the semi-supervised scenario is one where you have some labels — not a lot, maybe just a little. You have some partial labels telling you: here are some cases I know are normal or abnormal. Think about credit card fraud detection: you may have examples of fraud cases, even though they don't capture every possible fraud, or you may have a few labeled normal cases. Again, they don't capture all the normal cases, but they give you some information to leverage so that you don't have to rely on a purely unsupervised method. To do this, we can combine clustering, which is unsupervised, with classification, which is supervised. The starting point is clustering, because remember, with clustering I don't assume anything.
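Before working through that semi-supervised combination, here is a small sketch contrasting the two proximity scores just discussed: an absolute k-nearest-neighbor distance versus scikit-learn's Local Outlier Factor, a density-based (relative) score. The layout loosely mimics the spread-out blue cluster, the tight green cluster, and the red point; all coordinates and parameters are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

rng = np.random.default_rng(4)
blue = rng.normal([0.0, 0.0], 2.0, size=(100, 2))   # spread-out cluster
green = rng.normal([8.0, 8.0], 0.3, size=(60, 2))   # tight, dense cluster
red = np.array([[8.0, 6.5]])                        # near green, but in a sparse spot
X = np.vstack([blue, green, red])
k = 10

# Absolute (distance-based) score: mean distance to the k nearest neighbors.
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
knn_score = dists[:, 1:].mean(axis=1)               # column 0 is the self-distance

# Relative (density-based) score: LOF compares a point's local density
# with that of its neighbors; higher means more outlying.
lof = LocalOutlierFactor(n_neighbors=k).fit(X)
lof_score = -lof.negative_outlier_factor_

red_idx = len(X) - 1
print("red point rank by kNN distance:", int((knn_score >= knn_score[red_idx]).sum()))
print("red point rank by LOF score:   ", int((lof_score >= lof_score[red_idx]).sum()))
```

On a layout like this, the raw distance score often ranks several points from the spread-out blue cluster above the red point, while the density-based score typically singles the red point out — which is the distance-versus-density distinction from the slide.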
Returning to the semi-supervised combination: with clustering, I take all the objects, find similar ones, and group them into clusters. That's very helpful, because you take a huge dataset and can say, "Now I've found a few clusters, and this is what they look like." With that set of clusters, you can plug in the partial labels you have. If you're in a cluster where several of your co-cluster points are labeled as normal, then that cluster can generally be considered normal. We can also leverage the majority-versus-minority idea: larger clusters, especially those containing some labeled normal cases, are considered normal clusters. On the other side, if you're in a small cluster, or you don't belong to any cluster, you are usually considered abnormal; likewise, if you're in a cluster where several points are labeled abnormal and you're similar to them because you're in the same cluster, you may be considered abnormal as well. That gives us the notion of normal versus abnormal by leveraging both the clustering results and the partial labels we have. Once we have that, we have built up a much larger labeled set, and in this set I know which cases are normal and which are abnormal, because once you label a whole cluster, all its objects carry the corresponding label. That then allows us to do classification, because now we have a sufficient number of labels for the abnormal and/or normal cases. I can go ahead and build a classifier, which should give good performance when detecting anomalies.

So far we have talked about some general methods, which are typically used to find global anomalies: globally, I have a way of separating the normal cases from the abnormal cases. But as mentioned earlier, there are three different types of outliers — global, contextual, and collective. Now let's look at how we can tackle contextual and collective anomalies given the general methods we have already discussed. As mentioned, contextual anomalies are things considered abnormal within a particular context. This naturally gives us the notions of context and behavior: you take your original set of attributes and determine which ones are contextual attributes and which are behavioral attributes. This is usually application-specific, but in your domain, in your particular application, you can probably get a reasonable understanding of what counts as a contextual attribute. Using my solar farm or wind farm scenario, the behavioral attribute would be the amount of power generated during some time period, say daily power production. That clearly needs to be contextualized, because a windy day versus almost no wind, or a sunny day versus a rainy day, significantly impacts your power generation. So typically you look for context that captures the weather conditions, or whatever factors impact your power generation. Once you can identify that context — say, all of these are sunny days — you know what you are comparing against.
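Before going into the two steps of contextual detection in more detail, here is a hedged sketch of the cluster-then-classify semi-supervised pipeline described above; the synthetic data, the choice of k-means with three clusters, the handful of labels, and the 5% size cutoff are all assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (500, 3)),    # normal behavior A
               rng.normal(6, 1, (400, 3)),    # normal behavior B
               rng.normal(-6, 1, (20, 3))])   # rare, abnormal pattern
# Only a handful of points carry labels (0 = normal, 1 = abnormal).
known = {10: 0, 200: 0, 600: 0, 910: 1}

# Step 1: unsupervised clustering over everything.
cluster = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: propagate the partial labels to whole clusters (majority vote),
# falling back to "small cluster => abnormal" when no label is available.
y = np.zeros(len(X), dtype=int)
for c in range(3):
    members = np.where(cluster == c)[0]
    votes = [known[i] for i in members if i in known]
    if votes:
        y[members] = int(np.mean(votes) >= 0.5)
    else:
        y[members] = int(len(members) < 0.05 * len(X))

# Step 3: with the cluster-propagated labels, train an ordinary classifier.
clf = RandomForestClassifier(random_state=0).fit(X, y)
print("Points predicted abnormal:", int(clf.predict(X).sum()))
```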
Back to the context: you build patterns for sunny-day power generation, windy-day generation, rainy or cloudy days, low wind, or different wind directions. Once you have the context, you can build the expected pattern for that particular context and then determine what counts as different. With contextual anomalies, you usually take two main steps. The starting point is identifying your context, because once you have the context, it's easier to focus on that subset and find anomalies within it. There are various ways to identify a context. This relates to what we talked about earlier with high dimensionality: you have many attributes, but which ones are most relevant? You can use frequent pattern analysis — frequent patterns show you typical scenarios, attribute values that tend to occur together — which gives you a notion of context and an expected normal pattern. Or you can use subspace approaches: remember, in subspace clustering the idea is that you may not need all the attributes or dimensions to define a cluster; a subset of the dimensions may be more relevant and work better. It's a similar notion here: you can use a subspace approach to identify a context that lives in a subset of the dimensions rather than all of them. And of course there is domain knowledge: you probably have good domain knowledge to leverage to define or identify the contexts that are most relevant, and you know which factors should or shouldn't be considered in your particular contextual anomaly detection setting.

All right. Once we have the context, the next step is actually easy: you pick that context and look only at the subset of cases that fit within it. I can ignore the cloudy or rainy days, focus only on the sunny days, establish the general pattern for sunny days, and then identify the abnormal cases. This allows us to transform the contextual anomaly detection problem back into global outlier detection, where the different methods we have already discussed can be used. The key is identifying your contexts — leverage whatever domain information you can — but once you have them, the problem transforms easily into the global outlier detection scenario.

Next, let's look at collective anomalies. This is the case where it's not a single object being abnormal but a group of objects: they all deviate from the norm, and collectively they represent something more significant or different. Think about a distributed denial-of-service attack. If you just look at the requests to some individual machine, there's a bit of fluctuation, but it could still be within normal fluctuation — just one node being particularly active or busy. But if all of a sudden a large number of your machines are busy or acting differently, that aggregate, collective signal is much stronger evidence that something's not right. The question then is, how do you identify that sub-group? You could try all possible groupings.
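To make that two-step contextual recipe concrete before we continue with collective anomalies, here is a minimal sketch: the context is a categorical weather attribute, the behavior is daily power output, and within each context group we fall back to a simple global-style z-score test. The data, the single context attribute, and the 3.5-sigma cutoff are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 600
# Contextual attribute: weather; behavioral attribute: daily power output.
weather = rng.choice(["sunny", "cloudy", "windy"], size=n)
base = {"sunny": 100.0, "cloudy": 40.0, "windy": 70.0}
power = np.array([base[w] for w in weather]) + rng.normal(0, 5, n)
# Inject a reading that is low for a sunny day yet globally unremarkable.
weather[0], power[0] = "sunny", 45.0

df = pd.DataFrame({"weather": weather, "power": power})

# Step 1: identify the context (here, simply the weather attribute).
# Step 2: within each context group, apply a global-style test (|z| > 3.5).
z = df.groupby("weather")["power"].transform(lambda s: (s - s.mean()) / s.std())
df["contextual_outlier"] = z.abs() > 3.5
print(df[df["contextual_outlier"]])
```

The injected value of 45 would pass the purely global threshold from earlier — it looks like an ordinary cloudy-day reading — but it stands out once you condition on the sunny-day context.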
Trying every possible grouping, though, is not feasible for any reasonably large problem setting — you're talking about exponential growth. Instead, you need a reasonably efficient way to identify some structural relationship among your objects: a way of saying that this group of objects seems related. That could be by region, by type of object, or by time duration — some way of grouping them because they are related in a certain way. Consider another example with transactions: you have purchases, your customers are purchasing certain items, and they may do so at certain locations or during certain time periods. Once you have that linkage, you may see not just one purchase that's different, but a few purchases that are all different and that happen at a particular location during a short time period. That may be indicative of something suspicious. Again, the idea is the collective pattern. There are different ways to define the structural relationship — you may have domain knowledge, or you may just need to take your different dimensions and see whether there is some structure or grouping of your objects. Once you have that structural grouping, the next step is easier. We use the notion of a super object: we take the original objects, the individual cases, and combine them into groups — each group is a super object. Once I have those super objects, I can transform this back into global outlier detection. Remember, with global outlier detection we look at differences at the individual-object level; here I'm operating at the super-object level and saying: as a group of objects, this is the pattern I expect, and some super objects just look quite different from that. You're operating at the super-object level, but you can use the general methods designed for global outlier detection.

All right. Next, let's look at a few concrete examples. I have used my remote sensing data example quite a bit; this is actually a research project we have worked on. In this case we are looking for spatiotemporal anomalies, and the approach is purely unsupervised, because with remote sensing data the challenge is that you have a huge amount of data but very limited information about what is normal or abnormal. The question is: can we identify spatiotemporal anomalies in an unsupervised fashion, without assuming any prior knowledge of what is normal and what is abnormal? This is a diagram showing the steps involved. We're not going to go through all the details, but at a high level, you start with the raw data, the remote sensing data. First you do feature extraction, because the individual pixels are your basic unit, so you need to get their features. This is also where you need to do pre-processing, which is very important for anomaly detection — we have talked about pre-processing as a key step in the data mining pipeline.
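Before continuing with this pipeline, here is a small sketch of the super-object idea from a moment ago: individual readings are grouped into fixed-length time windows (the super objects), each window is summarized, and a global-style test runs on the summaries. The series, the window length, and the cutoff are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
# A request-rate style series: one stretch where many consecutive readings
# drift upward together, while most readings alone stay fairly plausible.
series = rng.normal(100, 10, size=1000)
series[600:650] += 20   # the collective shift

# Form super objects: non-overlapping windows of 50 readings, each
# summarized by its mean (other summaries, e.g. slope, would also work).
window = 50
super_objects = series[: len(series) // window * window].reshape(-1, window)
summaries = super_objects.mean(axis=1)

# Global outlier detection, but at the super-object level (simple z-score).
z = (summaries - summaries.mean()) / summaries.std()
print("Flagged windows:", np.where(np.abs(z) > 3)[0])
```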
Coming back to the pipeline: if you don't do a good pre-processing step here, you may end up with a lot of noisy data, and your anomaly detection approach will not be effective. Anyway, you take the original data, get the pixel-level information, and remove missing data and other noise. Then you get to the object level. If you look at a satellite image, you tend to see neighborhoods that are similar in their characteristics; depending on which sensors you're using, this could be surface temperature or other measurements at different frequency channels. So you look at neighborhoods and build initial clusters of pixels that are similar in some way, based on your features and your similarity measure. And it's not just space: with remote sensing data you also have the temporal dimension. You can say this neighborhood looks similar in the spatial setting, but you also want to see whether it changes similarly across time, and that gives the notion of ST — spatiotemporal — outliers. You're trying to find things that are similar in space and time; if they are not, those are the outliers — either they look different in space, with quite different features, or they are similar objects that behave differently over time. A further step gets to the collective outlier angle: not only do you look at individual objects, you also group them. You might say that within a certain time frame you see several anomalies, and that gets you to anomalous events. If something happens — think about a significant shift — then all of a sudden a lot of objects can look different. This general framework allows us to do purely unsupervised anomaly detection, and by leveraging different dimensions and granularities we can group things to identify not only individual pixels that are different, or objects that look different within a certain context, but potentially collective anomalies as well.

Let's look at another example: a wind farm. We have looked at wind farm data. The purpose is anomaly detection, but the basic target is faults — you want to see whether some wind turbines are becoming faulty, so there is a fault prediction angle: you look for things that seem different compared to the normal operating scenario. Once you have identified a potential fault, you also want to be able to diagnose it: not only do I know this turbine could fail, but what kind of failure it may be, because that is very helpful for scheduling maintenance and for your control and operational decisions. This naturally calls for combining an unsupervised approach and a supervised approach. In this kind of setting, most wind farms have SCADA data — control and condition monitoring data — telling you how the systems are operating.
That data can be fed into your analysis. Again, keep in mind the pre-processing step, which is very important: you sample the data at the right granularity and remove noisy or incorrect data. Once you have that, the unsupervised branch takes a clustering-based approach. I don't know which cases are faulty or not, but I expect the majority to be normal, and normal operation should show contextual similarity: under the same context, turbines should have similar performance. The ones that are likely to fail I expect to be different, and using clustering I may be able to identify the ones that look different. That's the unsupervised fault detection part. On the other side, you may have some domain knowledge about what types of fault there may be — a much smaller labeled set, but at least it gives you some notion of the different fault types. You can then use supervised learning to classify the cases you have identified: you have clusters flagged as unusual, you have labels for some of those unusual cases, and that allows you to classify specifically what kind of fault each one is.

One more example: a solar farm. It's similar, but also different in its own setting. One interesting part of this problem is that we leveraged a hierarchical anomaly design. Why? If you look at a solar farm, it has a hierarchical structure. You start with individual PV panels, and the panels are connected via a PV string — these are like your neighbors, the panels next to each other on the same string. The strings are connected to a combiner box and an inverter, and then to the utility grid. So there is a natural hierarchical structure in the physical design of the individual components. Now, if you're looking for anomalies, what do you compare against? You can naturally compare against your neighbors: for a particular PV panel or PV string, I want to look at the ones connected to me, my close neighbors, because they are at a similar location, so the sun angle and cloud coverage should be very similar for that local area. That gives you a context to compare with. By using this hierarchical structure, you are comparing normal versus abnormal within your hierarchical units, and that allows us to effectively identify potential anomalies: if a component in one of those substructures is failing, everybody else looks similar but that one is different. That's the hierarchical structure. But we actually went further and found that you can leverage not only your local neighbors in the hierarchy, but also panels across the whole solar farm. There may be panels on the other side of the farm that, at certain times, have a very similar context to this neighborhood.
This is the idea of collaborative fault detection: we look for neighbors that are not just physically next to us, but neighbors with similar contextual properties. That's the high-level notion. Without going into the details, what we do is look across all the PV strings, using their current measurements, and perform a similarity calculation — a more global similarity calculation. This borrows from collaborative filtering, which is widely used in recommender systems: you want to find things that are similar. They may not be physically next to you, but they may show very similar patterns in similar contexts. Once you have identified those similar neighbors — even ones further away — you can jointly predict the expected pattern: if these units are similar, then in this particular context they should behave like this. That is the prediction. You then compare the prediction with the actual measured data, and the residuals capture the differences. If you differ substantially from the expected current output predicted from your similar neighbors, you are performing quite differently, and that's how we identify anomalies.

Today we have covered various aspects of outlier analysis. To summarize, we started with what we mean by outliers, or anomalies, and the different types: global anomalies are simply different from the rest of the dataset; contextual anomalies are different given a specific context; collective anomalies refer to a group of objects that together deviate from the norm. We then looked at the different methods, which depend on whether you have labels and what kind of labels you have: supervised, unsupervised, or semi-supervised. Once you have the general methods, you can convert a contextual or collective anomaly problem into a global anomaly detection problem, by identifying your contexts and by identifying structural relationships so you can group objects into super objects. Finally, I want to emphasize the importance of interpretation. Outlier analysis is very useful in many real-world settings, but be careful about how you interpret the outliers. The key message is that outliers are not always errors, and the goal is not just to detect them and then simply ignore or remove them. You really need to look hard at the outliers and determine whether they are actual errors or significant events that should be examined further. All right, that's all for today. Thank you.