So in this lecture, I will introduce the concept of a Support Vector Machine, which is an alternative scheme, as opposed to logistic regression, for doing binary classification of examples. And we'll try and look at some of the relative merits of the different approaches and the reasons why, for instance, you might choose a Support Vector Machine over logistic regression or vice versa. Again, this is still a fairly advanced lecture, looking at the background behind the Support Vector Machine, really just to motivate why you might choose one type of model over another. Later on, in next week's lectures, we'll see all the library functions that actually implement these types of algorithms in practice. So in the previous lecture we looked at logistic regression, which was our first attempt to solve the classification problem. It took this form, where we say we'd like the prediction to be labeled true if the output of our predictor is greater than 0, and false if the output of our predictor is less than or equal to 0. So that was logistic regression, but Support Vector Machines are another instance of exactly this same type of predictor, making decisions according to this rule. So it may not have been clear, or it may have been very subtle, but when we designed logistic regression, we were making certain modeling assumptions, and there could be other ways to build a model based on this principle. So SVMs are just one such alternative. So the goal of the Support Vector Machine is to build a classifier that tries directly to minimize the number of classification errors. This is a bit different from what we were attempting when we introduced logistic regression, where we tried to maximize some kind of probabilistic expression. It's probably not clear yet exactly what the difference is. When I described maximizing a probabilistic expression, it all sounded reasonable: we want the probability to be large for positive things and small for negative things. 
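Just to make the shared decision rule concrete, here is a minimal sketch in Python: both logistic regression and an SVM ultimately predict by thresholding a linear score at zero, and they differ only in how the parameters are trained. The weights, threshold, and points below are invented purely for illustration, not taken from the lecture's slides.

```python
import numpy as np

def predict(theta, alpha, x):
    """Label a point by the sign of the linear score theta . x - alpha.

    Both logistic regression and an SVM predict with exactly this rule;
    the two methods differ only in how theta and alpha get trained.
    """
    return np.dot(theta, x) - alpha > 0

theta = np.array([1.0, -2.0])   # hypothetical learned weights
alpha = 0.5                     # hypothetical learned threshold

print(predict(theta, alpha, np.array([3.0, 0.0])))   # score 2.5  -> True
print(predict(theta, alpha, np.array([0.0, 1.0])))   # score -2.5 -> False
```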
So how is that actually different from trying to minimize the number of errors the model makes? To give an example, let's think about how logistic regression might classify the points shown in this picture. Here, we have two-dimensional data where we just have two features. Some of the points, on the left-hand side, are labeled as positive and are drawn using plusses. Others, on the right-hand side, are labeled as negative and are drawn using stars. We'd like to come up with some boundary that says whether a new point should be labeled as positive or negative. So that's what this line b does. It says anything to the right of that line should be given a negative label, and anything to the left of that line should be given a positive label. Okay, so we can see that b is not a perfect solution to this problem: it sits somewhere in the middle, but it makes a bunch of errors. And somehow, this line a would be a better solution to this problem. So the question is, which of these lines would an algorithm like logistic regression be more likely to choose? I don't know if the answer is obvious, but in practice you might get a solution that looks more like line b if you were to train logistic regression. The way I described it was to say that we want all positive examples to have a very high probability, or a very large prediction, and all negative examples to have a low probability, or a very large negative prediction. Another way of thinking about that is to say that every single data point is trying to push this line as far away from itself as possible, because it wants its prediction to be as confidently positive or as confidently negative as possible. So if every single data point is pushing this line as far away from itself as possible, you're going to end up with a solution that is somewhere in the middle of these two clusters of data points. 
There's a very small number of data points that are really ambiguous cases in the middle of this space, and there's a large number of data points that are easy to classify as positive or negative examples. And the logistic regression solution may well let those easy-to-classify examples dominate the predictions it makes. Okay, so just to add a bit more detail. We have these easy-to-classify points on the left and right, and we have these hard-to-classify points in the middle. The point being that logistic regression is not paying special attention to these hard-to-classify points, and it may let the easy-to-classify points dominate the solution. All right, so just to clarify that: what I'm really saying here is that logistic regression is not directly optimizing the number of mistakes made. There's no special attention paid to those difficult instances. In other words, every instance influences the model to some degree. So we saw an instance here where the easy instances could potentially affect the model in a bad way and give us a less than optimal solution, just because we're letting the easy instances have power over the final predictions. So our goal here, the goal of SVMs, is going to be to develop a classifier that directly tries to minimize the number of mislabeled examples. Okay, so the basic idea is going to be the following. Again, we're trying to come up with some line that separates our two sets of data points, our positive versus our negative instances. So you can define a line by an expression like this, even in many dimensions. Usually you would say a line is just y = mx + b or something; in a high-dimensional space, we can describe a line in n dimensions by saying the dot product of theta and x, minus alpha, is equal to 0. Okay, so there could be many such lines which correctly classify our data points into positive and negative groups, and not all of these lines are going to be equally good. How do we choose one of these lines as being better than another? 
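To make the "many possible separating lines" point concrete, here is a small sketch, on an invented toy dataset, that checks whether a given hyperplane theta . x - alpha = 0 puts every labeled point on its correct side. The data, weights, and thresholds are all made-up illustration values.

```python
import numpy as np

def separates(theta, alpha, X, y):
    """True if the hyperplane theta . x - alpha = 0 puts every labeled
    point strictly on its correct side (labels are +1 / -1)."""
    scores = X @ theta - alpha
    return bool(np.all(y * scores > 0))

X = np.array([[2.0, 1.0], [3.0, 0.5],      # positive cluster
              [-2.0, -1.0], [-3.0, 0.0]])  # negative cluster
y = np.array([1, 1, -1, -1])

# The vertical line x1 = 0 separates the clusters...
print(separates(np.array([1.0, 0.0]), 0.0, X, y))  # True
# ...but the horizontal line x2 = 0 does not (a point lies on it).
print(separates(np.array([0.0, 1.0]), 0.0, X, y))  # False
```

Many choices of (theta, alpha) pass this check for a separable dataset, which is exactly why we need a principle for picking the best one.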
So one principle, and the one that Support Vector Machines use, is to say that the best classifier, or the best line, is going to be the one that's furthest from the nearest points. In other words, we want there to be as few ambiguous cases as possible. If we had a line that was very close to one of our data points, we would not be very confident about giving that point the correct label, and maybe we wouldn't expect the line to work so well on unseen data. But if we have a line that's very far from all of our data points, then we think that if we see a new data point, it will probably lie on the correct side of this line. Okay, so I won't go through all of the mathematical details. But essentially, you can use equations that describe the distance between a point and a line. And you can say that if your equation for the line is theta dot x minus alpha equals 0, and you make a few other assumptions, then the distance of the line to the nearest point is going to be given by 1 divided by the norm, or magnitude, of your parameter vector theta. So to maximize that distance, that's equivalent to minimizing the norm of theta such that every single point is correctly classified. That's essentially what's going on in this equation on the bottom right-hand side of the slide. Okay, so that's the pure version of the problem. In general, this is not always going to be possible. In the case of this data, there are many points that are sort of ambiguous, or in the difficult-to-label region, and there will be no way of coming up with a line that correctly classifies all the data points. Also, maybe this idea is not perfect, okay? So in this data set, we can find a line which correctly classifies every point, but that line may not be the best one. Looking at this data set, a solution where the line is roughly in the middle would maybe be a better solution even though it makes one mistake, because it would perhaps have fewer ambiguous cases on unseen data. 
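The margin computation above can be sketched numerically. This toy example, with invented data and a hypothetical canonically scaled hyperplane, just checks that the nearest-point distance min_i |theta . x_i - alpha| / ||theta|| agrees with the 1/||theta|| formula from the slide; it is not the lecture's actual optimization code.

```python
import numpy as np

def margin(theta, alpha, X):
    """Distance from the hyperplane theta . x - alpha = 0 to the
    nearest point in X: min over i of |theta . x_i - alpha| / ||theta||.

    Under the SVM's canonical scaling (nearest points score exactly +-1),
    this reduces to 1 / ||theta||, which is why maximizing the margin is
    the same as minimizing the norm of theta.
    """
    scores = X @ theta - alpha
    return float(np.min(np.abs(scores)) / np.linalg.norm(theta))

X = np.array([[2.0, 0.0], [4.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])

# Hyperplane x1 = 0, canonically scaled with theta = [0.5, 0]:
# the nearest points score +-1, so the margin is 1/||theta|| = 2.
print(margin(np.array([0.5, 0.0]), 0.0, X))  # 2.0
```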
Now this picture maybe looks a bit unrealistic, but this could happen in a real instance. For instance, if we had fairly clean, easy-to-separate data where points with positive and negative labels have very different features, but just one of those data points was erroneously mislabeled, that could correspond to this picture. And in that case, it would be bad to ruin our classifier by letting that one erroneously labeled point determine where the classification boundary lies. So we'd still like to come up with a solution something like this. So one way that Support Vector Machines are usually adapted in practice is to say that we want the margin to be as wide as possible, just like before, but we're allowed to make some errors, and if we make an error we're penalized according to how wrong the classifier was. In other words, how far is this point from the boundary? If it's very close to the boundary, okay, it was only a little bit ambiguous and we made a small mistake. If it's very far from the boundary and on the wrong side of it, then we were very confident about making a wrong prediction, and we should therefore be penalized more. Okay, so this is the way this is usually formulated. Again, the mathematical details are not so important as the actual concepts. But all I've done is modify the previous equation to introduce this Greek letter xi, which basically says, for those points that were incorrectly classified, by how much were they incorrectly classified? Were they incorrectly classified by a small amount, in which case I pay a small penalty, or were they misclassified by a large amount, in which case I pay a large penalty? Okay, so to summarize this idea: what we're trying to do is to find a hyperplane, which in two dimensions is just a separating line, that ultimately separates two classes of points, and the best classifier is going to be the one that classifies all points correctly such that the nearest points are as far as possible from the boundary. 
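The soft-margin penalty can be sketched as follows. This is a minimal illustration of the standard slack-variable objective, ||theta||^2 plus a constant C times the total slack, where each slack xi_i = max(0, 1 - y_i (theta . x_i - alpha)); the data and parameter values are invented, and the trade-off constant C is an assumption not named in the lecture.

```python
import numpy as np

def soft_margin_objective(theta, alpha, X, y, C=1.0):
    """Soft-margin SVM objective: ||theta||^2 + C * (total slack).

    Slack xi_i = max(0, 1 - y_i * (theta . x_i - alpha)) is zero for
    points safely on their correct side of the margin, and grows
    linearly with how far a point sits on the wrong side -- small
    mistakes cost a little, confident mistakes cost a lot.
    """
    scores = X @ theta - alpha
    slack = np.maximum(0.0, 1.0 - y * scores)
    return float(theta @ theta + C * np.sum(slack))

X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1, -1, -1])   # the last point sits on the wrong side of x1 = 0

# Slacks are [0, 0, 1.5]; ||theta||^2 = 1, so the objective is 2.5.
print(soft_margin_objective(np.array([1.0, 0.0]), 0.0, X, y))  # 2.5
```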
And we can extend that idea to say the classifier is allowed to make some mistakes, as long as those mistakes aren't too large, or rather, we get penalized according to how large the mistakes are. Okay, so to summarize: in this lecture, we introduced the basic concept behind Support Vector Machines, and I just sketched an outline of how this is actually implemented in practice. I also discussed some of the advantages of, and the modeling assumptions behind, different classifiers, and why we might want to choose one over another.