Welcome back. Today we're stepping into machine learning with a discussion about image classification. We'll talk about the data, the training data set and the testing data set, we'll talk about loss functions, and also the dangers of poor generalization in machine learning. Enjoy. Object recognition and image classification both work with neural networks. Last time, we explored a core computer vision task, object recognition. We said that object recognition is a central task in computer vision research; it is the foundation on which other tasks and decisions are built, in areas such as robotics and self-driving cars, autonomous vehicles, autonomous drones, other autonomous machinery, just to name a few. But at the same time we often hear about image classification, right? So, what is image classification? Sometimes the terms are used almost interchangeably: object recognition, image classification. How are they different? In image classification, what we're basically doing is applying a label. We're classifying the image by giving it a label. Can we apply multiple labels to an image? Sure, of course we can. So it's a label or labels, right? We can say one image is classified as being an image that has in it a cat, and a doll, and a book, and a bed, right? These are all labels that we can give to this image. So if you think about it, our task is the task of building a classifier, right? What is a classifier? What exactly are we trying to do? A classifier is basically a procedure. Here it is. It's a procedure, right? And the way this procedure works is that it has to have some input data. You can think about the data as a set of features, and maybe the set of features comes from an image. That would be roughly how we think about our pipeline, right? The input is the image, or a set of features, or the pixels in the image, however we want to think about it.
That's going to be the input, and our procedure needs to come up with a class label, right? So we can have different procedures for different class labels. All right, so one procedure for one label, one procedure for a second label, right? So we can say, this is the classifier for the cat label, and it produced a yes, there is a cat. This is the classifier for the doll label, let's say, and it produced the label doll. But maybe there's another procedure, let's say for dog, and in that case it didn't produce a dog label; it produced a no, if you want to think about it that way. So yes cat, yes doll, no dog, right? This is our idea of classification. What do we need in order to build a classifier? We have the following terminology from the Forsyth and Ponce textbook. We think about the data as a set of labeled examples. So we think about it as pairs (x_i, y_i), where x_i is the feature vector from our image, or maybe even the image itself. You can think about it as being the raw image with pixel intensity values, right? And then the other part of the pair is the label. So our data set has many images, and every one of them comes as this pair (x_i, y_i), where x_i is whatever we extract from the image, whether it's the raw data, or some features, or maybe the responses to certain filters. Let's call them filters; we'll see that this is something we do in neural networks, right? We'll see that in a bit. But our goal is to come up with this rule or procedure, and we call this rule or procedure the prediction function. Our prediction function is going to take x_i (the image, the features from the image, or the filter responses from the image) and it's going to generate the label. So let's say I have an image. Okay, let's say it's a nice image here.
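To make the "one procedure per label" idea concrete, here is a small hypothetical sketch. The weights, threshold, and feature values are all made up for illustration; they are not from the lecture, and a real classifier would learn them from data.

```python
# Hypothetical sketch: a "classifier" is just a procedure that maps
# a feature vector to a yes/no decision for one label.
def make_threshold_classifier(weights, threshold):
    """Return a procedure that decides whether one label applies."""
    def classify(features):
        # Weighted sum of the input features, compared to a threshold.
        score = sum(w * x for w, x in zip(weights, features))
        return score > threshold  # True -> the label applies
    return classify

# One procedure per label, as in the lecture: cat, dog, and so on.
# The weights below are arbitrary stand-ins, not learned values.
is_cat = make_threshold_classifier([0.9, -0.2, 0.1], 0.5)
is_dog = make_threshold_classifier([-0.3, 0.8, 0.4], 0.5)

image_features = [1.0, 0.1, 0.2]  # made-up feature vector for one image
labels = [name for name, f in [("cat", is_cat), ("dog", is_dog)]
          if f(image_features)]
print(labels)  # for this made-up input: ['cat']
```

The point is only the shape of the pipeline: features in, a yes/no per label out, and an image can collect several labels at once.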
Okay, of a cat, and if I apply my prediction function to this image, it should come up with the label cat, right? f(x) = y, where x, the data, is the raw image, or the feature vector, or the filter responses, we said, and y will be the label. That's our goal: the prediction function. Now, let's look a little bit more in depth at the data. We have seen this slide before in one of the lectures in the first module, right? We said we have a data set, that's all the data we have, and our goal is to train a model for a classification task. So we will divide the data into two parts, the training data and the testing data, right? We're going to hold out a little bit of data so that when we finish training our model, we can run it on the testing data. But what we want to do is keep iterating to find the best parameters for our model. And in order to be able to do this iteration, we're actually going to take the training set and hold out another piece that we call the validation set. Sometimes we call it the holdout set, the cross-validation set, or the development set. And what we're basically going to do is keep iterating between those two: build a model on the training data, test it on the validation data, make modifications, train it again, test it again on the validation data, right? And when we're finally done with it, we can test it on the testing data. Our goal, we said, is to come up with this prediction function, and we want to minimize the prediction error. The prediction error arises when our model assigns the wrong label, right? So mislabeling the data is going to generate an error. For example, suppose we have 1,000 images in our training data set, all of them labeled dog, and our model classifies only 950 of them as dog. Then we say the training error is, let's see, 5%, right?
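Here is a minimal numeric version of the split and of the 5% training-error example. The split percentages (20% test, then 20% of the remainder for validation) are assumptions for illustration; the error-rate numbers come from the lecture's example.

```python
# Stand-ins for 1,000 labeled (x_i, y_i) pairs.
data = list(range(1000))

# Assumed split sizes: hold out 20% for testing, then hold out part
# of the remaining training portion as a validation set.
test_set = data[:200]
validation_set = data[200:360]
training_set = data[360:]

# The lecture's error example: 1,000 training images, all truly "dog",
# and the model labels 950 of them "dog" and 50 "not dog".
true_labels = ["dog"] * 1000
predicted_labels = ["dog"] * 950 + ["not dog"] * 50

errors = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
training_error = errors / len(true_labels)
print(training_error)  # 0.05, i.e. a 5% training error
```

The same `error_rate` computation would then be repeated on the validation set while tuning, and on the test set exactly once at the end.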
The training error is 5%, and then we can do the same computation on the validation data and on the testing data. We usually talk about training error and testing error. Our goal is to come up with a model that definitely has a small training error but, most importantly, also has a small testing error. Now let's look at a couple more things that are really important before we actually get to build our first neural network. We need to recap a few elements that we need to be aware of when we set up an experiment, any experiment. In our case, when we train our model, what we're doing is truly just running an experiment, hypothesizing that the model we're building will be a good fit for our data: for the data in our training set, but also for real-world data, right? So let's assume that we have a two-class classifier. You can think about it as dog versus cat, or as dog versus not dog, right? So one image is labeled as having a dog in it, and another is labeled as not having a dog in it. The terms here apply to any experiment, but in our case, because the training data is labeled, we know the actual condition of each image. The positive images are the dog images; they carry the label dog. The negative samples in our data are the ones with no dog in them; their actual condition is negative, there is no dog in that image. Now, the predicted condition is what our model predicts, what our model labels the images as. So let's say we are at the testing stage: we have a model, we look at the images in the testing data set, and we see what our model does. Our model is going to label some images with the label dog, and some with no dog, right?
So let's see how our model does. From the images that our model labeled as dog, some of them really do have dogs in them according to the original label, right? The label that we received when we collected and built our data set, that's the original label. If the original label agrees with the predicted label, and they are both positive (this is a dog, labeled as a dog), then we have a true positive; those are the true positives. The images that were labeled originally as dog and that our model also labeled as dog. False positives are images with no dog that ended up being labeled as dog. So there was no dog in the image, but our model said there is; those are false positives. False negatives: there was a dog, but it was labeled as not dog. True negatives: not dog, labeled as not dog, okay? Many times when we run experiments, people look at these counts and ask: how many true positives? What was the true positive rate? But the raw count is not exactly the rate the way we think about it, so I'm going to introduce two more terms that are really important. First, sensitivity. The sensitivity of our model says: out of all the actual positive samples, how many got labeled as positive? So that's the true positives divided by the total positives. That is sensitivity, and we want sensitivity to be high, especially in certain situations like, for example, medical imaging. When we're trying to detect cancer, we want all the images that show cancer to be labeled as cancer; we want sensitivity to be high.
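As a quick numeric sketch of the definition just given, with made-up counts (100 scans truly positive, of which the model catches 90):

```python
def sensitivity(true_positives, false_negatives):
    """Of all actual positives, the fraction labeled positive: TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical counts: 100 scans truly show the disease, the model flags 90.
print(sensitivity(90, 10))  # 0.9
```

In the cancer-screening scenario from the lecture, we would want this number as close to 1 as possible, since every false negative is a missed disease case.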
And then we have specificity. Specificity is basically the reverse, if you think about it that way: how many of the negative samples were labeled as negative? With the same medical imaging example: of the images that do not show cancer, how many were actually labeled as not having cancer? This is important, because if you have a patient that doesn't have cancer, but your algorithm, your neural network, labeled a medical image of that patient as having cancer, that's really bad; it's going to trigger a whole chain of events down the road. Maybe the patient will be asked to do a lot more investigations, to pay a lot of money. And you can't even begin to talk about the mental health toll of thinking they have the disease when they actually don't, right? So this is labeling the negatives as negatives: the true negatives. Sensitivity and specificity: important terms to know when you're designing an experiment, when you're trying to build a neural network model. Now, we talked about things being labeled a certain way, and this labeling basically carries a cost. So what we are going to establish is a loss function: what happens if we have an object of type i classified as type j, right? If an object of type i is classified as an object of type j, we should establish a loss function: how much cost do we want to put on mislabeling? Now, if an image is classified correctly, then the loss, the cost, is zero; everything was done right, so there should be no cost, no penalty. But what do we do if something is labeled incorrectly? This is somewhat of an open question. Different models, different researchers have different strategies, and we're not necessarily in a two-class classifier; we can be in a multi-class classifier.
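The mirror image of the sensitivity sketch, again with made-up counts (1,000 healthy scans, of which 60 are false alarms):

```python
def specificity(true_negatives, false_positives):
    """Of all actual negatives, the fraction labeled negative: TN / (TN + FP)."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical counts: 1,000 scans with no disease; 940 correctly cleared,
# 60 false alarms.
print(specificity(940, 60))  # 0.94
```

Every false positive here is a healthy patient sent for needless follow-up, which is exactly the cost the lecture describes.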
In a multi-class setting, maybe if the dog is classified as a wolf, the penalty, the loss, is smaller, but if the dog is classified as a bicycle, then yes, of course the loss maybe should be higher, right? But regardless of all these possibilities, what you basically want to calculate for your model is a risk function: what is the total risk for a particular classification strategy? It's the expected loss when using that strategy, and you have to think about all the possibilities. So in our case, for a two-class classifier, the total risk has to take into account the probability that a dog will be labeled as a cat using the strategy s, times the loss for having a dog classified as cat; but we also have to add the other term: the probability that a cat is classified as a dog using the strategy s, multiplied by that loss. You might notice that the only situations where loss is incurred are mislabelings: a false positive (an image classified as dog even though it wasn't) or a false negative (classified as not dog when it actually was a dog), right? Now, coming up with a suitable model is definitely not easy. We want a model with a very small training error, but also a very small testing error, and sometimes we run into trouble: the model can fail to generalize. There are two failure modes here; we can run into overfitting and underfitting. Let's look at an example. Say we have a function; the one depicted here is f(x) = (6 - x^2)/8. If you plot that function, you get the red line. The red line is the perfect fit, but the thing that gives us trouble in real life is that the data is not perfect, right?
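The two-term risk just described can be sketched directly. The probabilities and unit losses below are made-up numbers purely for illustration:

```python
# Sketch of the lecture's two-class total risk:
#   R(s) = P(dog labeled cat | s) * L(dog -> cat)
#        + P(cat labeled dog | s) * L(cat -> dog)
# Correct classifications contribute zero loss, so only the two
# mislabeling terms appear in the sum.
def total_risk(p_dog_as_cat, loss_dog_as_cat, p_cat_as_dog, loss_cat_as_dog):
    return (p_dog_as_cat * loss_dog_as_cat
            + p_cat_as_dog * loss_cat_as_dog)

# Made-up example: each mistake happens sometimes, with unit loss each.
risk = total_risk(0.05, 1.0, 0.10, 1.0)
print(round(risk, 2))  # 0.15
```

With asymmetric losses (say, a missed dog costs more than a false alarm), the same formula lets two strategies with identical error counts have very different risks.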
Images can be corrupted, for instance by noise. So consider these blue points: the blue dots are basically the data plus some noise, so now they don't fall exactly on the line, but our goal is to find a function that fits the points, right? Now, we can divide the data. Let's say we take an arbitrary line here, the green line, and we say that everything to its left is the training data, and the points to its right are the testing data. Based on the training data, we need to find our function. If we look at just the data to the left of the green line and we want to find a fit, one of the things we can find pretty easily is a linear function. Look at it, it's not too bad; you can see it fits the training data okay. But what happens when we look at the testing data? Based on the training data, we came up with the purple line, a linear function, but on the testing data we have a huge error. These points are far from the line; it's not a good fit. We call this an example of underfitting. Our model is too simple, has too few parameters, doesn't have enough flexibility. We tried to find too simple a model, and it wasn't a good fit. Let's look at another example: how would overfitting look? Let's split the data again with the green line: this side is the training data, that's all we have to come up with the function, and the remaining three points are the testing data. Now with overfitting, what's going to happen is that we find something that is too complicated. That's basically what overfitting is: a model that is too complicated, a model that is trying really hard to make the error, the loss, be zero or almost zero for every training sample.
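Both failure modes can be reproduced numerically. The function form f(x) = (6 - x^2)/8, the noise level, and the 9/3 train/test split below are assumptions modeled loosely on the lecture's figure, not its exact numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of the assumed underlying function f(x) = (6 - x**2) / 8.
x = np.linspace(-3, 3, 12)
y = (6 - x**2) / 8 + rng.normal(0, 0.1, size=x.shape)

# "Green line" split: first 9 points train, last 3 points test.
x_train, y_train = x[:9], y[:9]
x_test, y_test = x[9:], y[9:]

def fit_and_errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

# Degree 1 underfits (a line through a parabola); degree 2 matches the
# true model; degree 8 interpolates all 9 training points, chasing the
# noise, so its training error is tiny but its test error explodes.
for degree in (1, 2, 8):
    print(degree, fit_and_errors(degree))
```

The degree-8 fit shows exactly the lecture's point: near-zero training error, enormous testing error, because the extra parameters are spent fitting the noise rather than the underlying function.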
I'm going to sketch something here that looks like an overfitting function, right? And maybe this model continues in some strange way beyond the training data. It is not a good fit for the testing data. What ended up happening is that we built a model with too many parameters, a model that is trying way too hard. It's trying to fit the noise, right? There is noise in the data, but we are trying very, very hard to have every point sit perfectly on the curve; that is overfitting. The model tries too hard to match the training data, and that is not good: we'll see that such a model may have a very, very small training error, but a really big testing error. Now that we've figured out how things work and how we evaluate how a model runs, let's look at classification in the traditional approach. This figure is inspired by a figure from the LeCun paper, the first convolutional neural network paper, where LeCun compares the traditional approach with the neural network approach. In traditional classification in computer vision, you need a module that performs feature extraction. Feature extraction means finding SIFT features, or histogram of oriented gradients (HOG) features, corners, all sorts of things like that, right? Then, based on these feature vectors, we train our model; this is where the model is built, and then, based on the model, every image is assigned a class label. Now, feature extraction techniques are constantly improving, but it's really hard to find a model that accounts for all the variation of an object class. That's really, really hard, not to mention that if you have more data, the feature extraction and the training are going to be exceptionally slow.
It's going to be really hard to find something that agrees well with all the examples, right? We talked about all the changes in viewpoint, in scale, the deformations, all sorts of things like that. By comparison, what happens in machine learning? Can we try to learn these features? Rather than learning them all at once and building the model in one shot, can we have a layer, and then a hierarchy of layers, where every layer learns from the output of the previous one? Then, when we're training, rather than doing a first module that learns basic features, then another module that learns more complex features, then another that learns combinations of those features, in machine learning we have these layers that all train jointly. All trained together, all trained at once, right? And at the end, the model comes up with the class label. Now, one question to ask is: how many layers? This is the distinction between shallow and deep architectures. A shallow architecture has a few layers; a deep architecture has more and more layers. We'll see in a moment how many, and how this works. So this was a good introduction to classification, which is the meat of machine learning for computer vision and deep learning for computer vision. Join us next time as we take a more in-depth look at neural networks for image classification. Thank you.