For this lecture, we'll introduce a model called logistic regression, which is a modification of the regression algorithms we've already seen, in order to handle classification problems. We'll show how this regression can be solved using the gradient descent approaches introduced in the previous lecture. Again, this is a fairly advanced lecture. Logistic regression is a technique we use all the time, but mostly we'll use it via library functions. The purpose of this lecture is just to give you the background of how it actually works, and how we would implement that kind of algorithm from scratch. So previously, we saw regression approaches, and linear regression in particular, where we were trying to make predictions according to a linear equation, which says the label y should be predicted according to the inner product between some feature vector Xi and a parameter vector theta. We want to extend those kinds of algorithms to handle classification problems, where the output is not a real value but a binary variable: true or false, yes or no, one or zero. We'd like to make predictions according to the following rule: if the response of our predictor, Xi dot theta, is greater than 0, we predict positive, and if Xi dot theta is less than or equal to 0, we predict negative. So somehow we want to train a predictor, in other words come up with a value of theta, so that the predictions will be large for positive instances and small for negative instances, and logistic regression is one such technique for achieving this. Okay, so from the previous slide, we saw the difference between trying to predict a real-valued quantity and a binary quantity. We want to use the same kind of predictor to do both, one that estimates a value according to Xi dot theta. So how can we convert between one and the other?
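As a rough sketch of that thresholding rule (the function and variable names here are illustrative, not from the lecture):

```python
import numpy as np

def predict(X, theta):
    """Predict binary labels from a linear score: positive if X_i . theta > 0."""
    scores = X @ theta   # inner products X_i . theta, one per row of X
    return scores > 0    # True where the score is positive

# Tiny illustrative example
theta = np.array([1.0, -2.0])
X = np.array([[3.0, 1.0],    # score = 3 - 2 =  1 -> positive
              [1.0, 1.0]])   # score = 1 - 2 = -1 -> negative
print(predict(X, theta))     # [ True False]
```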
How can we make a model which predicts a real-valued quantity instead predict a binary quantity? One solution to this, which logistic regression uses, is what's called the sigmoid function. This function is given by the following formula: sigmoid of an argument t is 1 divided by 1 + e to the -t. If you can't follow exactly what that means, this plot makes it a little bit clearer; it's a function with the following shape. On the x axis here is our argument t, and on the y axis is the response of the function. The essential shape should be clear: it's a function that takes an argument between negative infinity and infinity, that is, a real-valued quantity, and converts it to a number between 0 and 1. Which is almost what we would like, if we want to train a classifier rather than a regressor. So just to break down those parts: on the x axis here, we have our prediction Xi dot theta, just like we had in our regression model, and on the y axis, we have the sigmoid of that prediction, which takes it from being a real-valued quantity to a number between 0 and 1. And in the middle, we have what's called our classification boundary. Basically, what that says is if Xi dot theta is greater than 0, we will predict positive, and if it's less than 0, we will predict negative. Really, what our predictor outputs is a number greater than 0.5 if Xi dot theta is positive, and less than 0.5 if Xi dot theta is negative. So 0.5 is kind of the threshold or cutoff that we use to determine whether a prediction is true or false. So what we would like is to train a model that looks like the following: Xi dot theta should be very large when we have positive instances, and it should be very small, or very negative, when we have negative instances. That would mean that after we pass it through the sigmoid function, it'll be a value very close to 1 for positive instances and very close to 0 for negative instances.
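A minimal sketch of the sigmoid function as described, 1 / (1 + e^(-t)):

```python
import numpy as np

def sigmoid(t):
    """Sigmoid: maps any real-valued t to a number between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-t))

# The classification boundary sits at t = 0, where sigmoid(0) = 0.5
print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # very close to 1 -> confident positive
print(sigmoid(-10.0))  # very close to 0 -> confident negative
```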
So there's this fairly complex equation writing down that objective. But all this equation really says is that for any observation whose label is true, that is yi = 1, we want the sigmoid to have a large value, and if it's false, we want the sigmoid to have a small value. So I've written this down using probabilities: basically, we can think of the sigmoid as representing the probability that the label is true. If Xi dot theta is very large, that probability will be close to 1. If Xi dot theta is very small, or very negative, that probability will be close to 0. And if Xi dot theta is somewhere in the middle, so the prediction is close to 0, we'll have a probability close to a half; we're not really sure whether this particular data point should be considered a positive or a negative instance. Okay, just to summarize, all we're trying to do is have large values of Xi dot theta, or high probabilities, for positive instances, and small values of Xi dot theta, or low probabilities, for negative instances. So how do we optimize this? To do so, we'll build on the gradient descent, or gradient ascent, ideas we saw in previous lectures. Here, we take the logarithm of this equation, just to convert these products into sums, we compute the gradient of the equation, and we solve using gradient ascent. The difference between gradient ascent and descent is very small; it's just a question of whether you're trying to make an expression large or make it small. In the previous lecture on gradient descent, we were talking about minimizing a sum of squared errors; we wanted that expression to be as small as possible. In this case, we're talking about maximizing a probability, so we want this expression to be as large as possible; that's the only difference. Okay, so the first thing we want to do is compute the logarithm of this expression. We have a log of some products, which is going to be equal to a sum of logarithms.
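This objective, the probability of the observed labels, can be sketched as follows in log form (the function names are mine; the `np.where` selects the positive-instance term or the negative-instance term per data point):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(X, y, theta):
    """Log of the product of probabilities: log sigmoid(X_i . theta) for
    positive instances (y_i = 1), log (1 - sigmoid) for negative ones."""
    p = sigmoid(X @ theta)                 # probability that each label is 1
    return np.sum(np.where(y == 1, np.log(p), np.log(1 - p)))
```

Each probability is at most 1, so each log term is at most 0; maximizing this sum pushes the probabilities of the observed labels toward 1.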
So we have a sum over the positive instances, yi = 1, of the log of our sigmoid function, and that function looks like the following: 1 over 1 + e to the -Xi dot theta. And we have another sum over the negative instances, of the log of 1 minus 1 over 1 + e to the -Xi dot theta. Okay, we can simplify this a little bit: the expression inside the second logarithm can be rewritten as follows. Okay, now we can simplify this further. Here, we have a log of a ratio, so that's going to be equal to the log of the numerator minus the log of the denominator. And in both these expressions, the left-hand and the right-hand side, we have the same denominator, okay? So no matter what the label is, summing over all i, we have the negative log of the denominator. Next, we have the logs of the numerators. From the left-hand side, we have just the log of 1, which is equal to 0, so only the right-hand side remains. And the log of the numerator there, well, that's e to some power, so the log and the e cancel out, and we're just left with -Xi dot theta. Okay, so that's the logarithm of the expression we're trying to optimize. Here's the same thing written out a bit more neatly. The next thing we have to do is compute the derivative of this expression in order to perform gradient descent or ascent. Okay, so how do we take the derivative? On the left-hand side, we have a sum of expressions, and the derivative of a logarithm is equal to the derivative of its argument divided by the argument. Taking the derivative of the argument brings down a factor of -Xi k, and the two negatives cancel out, so we get Xi k times e to the -Xi dot theta, divided by the argument. Okay, and the second term here is much easier to differentiate: it's just going to be equal to -Xi k. All right, so we can simplify that a little bit further. The left-hand side actually looks very similar to the sigmoid function, which may look familiar from the previous slide.
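Written out in symbols (reconstructed from the steps just described; the notation is an assumption, since the original slides aren't shown here), the simplification and the derivative are:

```latex
% Log-likelihood, before and after simplification:
\log L(\theta)
  = \sum_{i\,:\,y_i = 1} \log \frac{1}{1 + e^{-X_i \cdot \theta}}
  + \sum_{i\,:\,y_i = 0} \log\!\left(1 - \frac{1}{1 + e^{-X_i \cdot \theta}}\right)
  = \sum_{i} -\log\left(1 + e^{-X_i \cdot \theta}\right)
  + \sum_{i\,:\,y_i = 0} -X_i \cdot \theta

% Its partial derivative with respect to theta_k:
\frac{\partial \log L}{\partial \theta_k}
  = \sum_{i} X_{ik}\,\frac{e^{-X_i \cdot \theta}}{1 + e^{-X_i \cdot \theta}}
  - \sum_{i\,:\,y_i = 0} X_{ik}
```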
It can be rewritten as a sum over all data points of Xi k times 1 minus sigmoid of Xi dot theta, plus the sum over only the negative instances of -Xi k. Okay, so just to summarize and write things out a bit more neatly, we have our log likelihood and our derivative, which should match the expressions on the previous slides. All right, so to summarize: here, we introduced the concept of logistic regression. We showed how we can solve it using gradient ascent, and we went through it all from first principles, to say, how do we actually do that? How do we take the logarithm of the objective we're trying to optimize, and compute its gradient? Of course, later on, we'll mostly use library functions to do this, but it can be useful to know how to do it from first principles.
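Putting the pieces together, a from-scratch gradient ascent loop might look like the following sketch (the learning rate, iteration count, and toy data are arbitrary choices of mine, not from the lecture):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient(X, y, theta):
    """Gradient of the log likelihood:
    sum_i X_ik * (1 - sigmoid(X_i . theta))  minus  sum over negatives of X_ik."""
    residual = 1.0 - sigmoid(X @ theta)   # one term per data point
    grad = X.T @ residual                 # sum over all instances
    grad -= X[y == 0].sum(axis=0)         # subtract X_ik over negative instances
    return grad

def train(X, y, lr=0.1, iters=1000):
    """Gradient ascent: step *along* the gradient to maximize the likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += lr * gradient(X, y, theta)
    return theta

# Toy linearly separable data: label is 1 when the first feature is positive
X = np.array([[2.0, 1.0], [3.0, 0.5], [-2.0, 1.0], [-3.0, 0.5]])
y = np.array([1, 1, 0, 0])
theta = train(X, y)
print((X @ theta > 0).astype(int))   # should recover the labels [1 1 0 0]
```

In practice a library routine would add regularization and a convergence check, but this is the whole algorithm as derived above.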