So, that was for simple linear regression. Now let's talk about something else. For multiple linear regression, pretty much everything is the same. Okay, so what is this? Precisely speaking, we again have several data points, but for each data point we now have x one, x two, and so on up to x p. Whenever I write x i with i in the superscript, that means it is a vector: the first vector, the second vector, the i-th vector. Okay, so when you see x i with a superscript, that's a vector. When you see x one, or x i where i is in the subscript, then this is a scalar, a real number. So whenever you see, for example, x i j, where i is in the superscript and j is in the subscript, this means the j-th element of the i-th vector. Okay, so I hope this is not too confusing for you. So now for each data point we get p variables, or p independent variables, and we want to use them to get some kind of predictive or explanatory model to explain the differences in y. So now our straight line becomes some kind of hyperplane. And we get a coefficient alpha as a constant, and then beta one, beta two, and so on up to beta p, one for each independent variable. Okay, so given all of this, we now want to find several values, and this number is now greater than two: it is p plus one. We want to find p plus one values to minimize the sum of squared errors. Okay, so that's pretty much all we want to do for multiple regression. If you want to derive a formula for that, you are still able to do so, under some conditions, and then you're done. So, when we have a summation like this, beta one times x i one plus beta two times x i two and so on and so on, we may make it simpler: we may simply write it as beta transpose x i. So, what's this? Beta is a vector containing beta one up to beta p, and x i is another vector, right? So this is just an inner product. I hope this is reasonable for you.
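The setup above can be sketched in a few lines of code. This is a minimal illustration on hypothetical data, not the lecture's own derivation: it stacks a column of ones so that alpha is treated like one more coefficient, and then finds the p + 1 values minimizing the sum of squared errors via a standard least-squares solver.

```python
import numpy as np

# Hypothetical data: n data points, each with p = 2 independent variables.
rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))  # row i is the vector x^i = (x^i_1, ..., x^i_p)
y = 3.0 + X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=n)

# Minimize the sum of squared errors over (alpha, beta_1, ..., beta_p):
# a column of ones lets alpha be handled like just another coefficient.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
alpha, beta = coef[0], coef[1:]

# The fitted value for point i is alpha + beta^T x^i -- the inner
# product "beta transpose x^i" mentioned above.
y_hat = alpha + X @ beta
```

With the small noise used here, the recovered alpha and beta land close to the values (3.0, 1.5, -2.0) that generated the data.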
Okay, so this is just one way to consider linear regression: we consider linear regression as an optimization problem, as a nonlinear program. Okay, and you may also take other courses, like your statistics course or your linear algebra course. There are several other ways to consider the linear regression problem or to derive the formula. Maybe you will see in some textbook that we are actually projecting a vector onto a specific vector space, and so on and so on. But anyway, at least now you have one way to explain to yourself why the formula looks the way it does. Also, I want to mention one thing: maybe when you were learning statistics you had the following question. We define the errors as squared errors. Maybe your statistics professor also mentioned this to you, saying that we choose to define the error terms as squared terms so that positive terms and negative terms cannot cancel each other. But if that's the case, absolute errors should also make sense, right? We may simply take the difference between alpha plus beta times x and y, and then take absolute values. That makes sense. Okay, if we formulate our problem like this, the problem makes sense. But the thing is that if you want to solve this problem, it is more difficult to solve from the perspective of optimization. And pretty much now you have a feeling about why: because this function cannot be differentiated everywhere, right? If you have a squared function, you may differentiate it. If you have a kinked function like the absolute value function, it's not so easy to deal with. You may still use your algorithms, whatever, to search for an optimum, but yeah, why do we have to do that? So pretty much that's one reason why people accept this setting, why people accept using squared terms to define fitting errors. Lastly, I want to spend a few minutes to talk about some other regression models.
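The differentiability point can be checked numerically. This small sketch (my own illustration, not from the lecture) compares the squared loss, which has a well-defined slope at zero, with the absolute-value loss, whose one-sided slopes at zero disagree, which is exactly the kink that makes the absolute-error formulation harder to optimize.

```python
# Squared loss f(e) = e^2 is differentiable everywhere: f'(e) = 2e.
# Absolute loss g(e) = |e| has a kink at e = 0, so g'(0) is undefined.
def squared(e):
    return e ** 2

def absolute(e):
    return abs(e)

h = 1e-6

# Central-difference slope of the squared loss at 0 is well defined (0).
slope_sq = (squared(h) - squared(-h)) / (2 * h)

# For |e|, the one-sided slopes at 0 disagree: -1 from the left,
# +1 from the right, so no single derivative exists there.
left = (absolute(0) - absolute(-h)) / h
right = (absolute(h) - absolute(0)) / h
```

Since every term of the sum of squared errors is smooth, the whole objective is smooth, which is why we can set its gradient to zero and derive a closed-form formula; the absolute-error objective offers no such shortcut.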
So one thing is that some of you have learned this in other courses: when we want to apply linear regression to do prediction and we want to avoid overfitting, we may do regularization. Well, what's that? The idea is very simple. We are given several variables, x one up to x p. But in many practical applications, the number of variables you may obtain for a prediction is huge, maybe in the millions, and you have no idea whether any one of them is really useful for prediction. So you let the model choose for you among all the variables you get; you don't choose them by yourself. You let the model choose for you. So, how to do that? Basically, you still formulate a least-squares problem. You still want to choose alpha and beta (of course, here beta is a vector) to minimize the sum of squared errors. But now you add a penalty for using variables: you plus a penalty term here. The idea is kind of like Lagrangian duality or Lagrangian relaxation, but anyway, let's ignore that for a while. This term has some impact. What's that? Basically, it says we want to minimize the sum of everything. So ideally we hope all these beta j are zero. But if you really set all the beta j to zero, then of course your sum of squared errors would be large. So you need to carefully choose a subset of variables and make their betas positive or negative; you need to choose the most effective beta j to make positive or negative, so that you minimize the sum of squared errors while not getting too large a penalty. Okay, that's indeed an optimization problem, a nonlinear optimization problem, right? You try to get some balance among all possibilities. So, when you do this with a squared penalty, we call it ridge regression. Similarly, some people use LASSO regression, and the only difference is that the squared term in the penalty becomes the absolute value function.
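As a small illustration of the balance just described, here is a ridge-regression sketch on hypothetical data. The penalty weight, called `lam` below, is my own name for the parameter trading off fit against penalty; because the squared penalty keeps the objective differentiable, ridge still has a closed-form solution.

```python
import numpy as np

# Hypothetical data where only two of five variables actually matter.
rng = np.random.default_rng(1)
n, p = 40, 5
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Ridge: minimize ||y - X beta||^2 + lam * ||beta||^2.
# The squared penalty is differentiable, so setting the gradient to
# zero gives the closed form beta = (X^T X + lam I)^{-1} X^T y.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# LASSO would replace lam * ||beta||^2 with lam * sum |beta_j|; the
# kink in |.| removes the closed form, and solvers instead use
# iterative methods such as coordinate descent.
```

Notice how the penalty shrinks every coefficient a little toward zero: the useful betas stay clearly nonzero while the useless ones stay near zero, which is the "let the model choose the variables" behavior described above.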
So, I'm not going to go into details about when you should use ridge, when you should use LASSO, or the different properties of the formulas you obtain through ridge or LASSO. Please refer to other courses. I just want to show you that these very popular regression models are, again, unconstrained convex programs, right? You have variables. You don't have constraints. You have a convex term here and another convex term here. That's how you know these problems are solvable. That's how you know formulating a problem like this makes sense, because a model is really useful only if you can do something with it. If you formulate a model that cannot be solved, that is not really useful. That's how we use the optimization perspective to look at ridge and LASSO.