In this video, we'll describe and visualize the least squares method for estimating the regression model parameters, and we'll also state the important assumptions that go into the least squares method. So the goal of regression is to use the measured response and measured predictors to estimate the parameters of the model. In linear regression, we assume that the form of the model is linear, and so we're estimating an intercept parameter and a slope, or a set of slope parameters, depending on whether we have one or more than one predictor.

Now, once we estimate the model parameters, we would then use the regression model to do one of three things. The first would be to further validate the model and test assumptions about it, to see whether the model fits properly, whether it would work on future data sets, things like that. Once we're reasonably sure that the model fits well and does a good job of accounting for the trends in our data, we could then use the model to explain the relationship, so we could take explanations from the regression model and basically infer conclusions about the broader population from which the data came. And the third thing that we could do is make predictions, so we might predict a future value of the response based on a set of predictor values.

One thing that's really important to note about doing least squares is that at this point we're assuming that we've collected data. The response variable, which at first we treat as a random variable, is fixed once we collect the response data along with the predictor data. So in some contexts in regression, we treat the response as a random variable, which means that if we resampled from the population many different times, we would get different values of the response for a fixed set of predictors. But once we actually have data in hand, we think about the response as one set of realizations from a set of random variables. And once we have those realizations, we have some fixed quantities, and we use those fixed quantities, namely the fixed response and the fixed set of predictors, to estimate our model parameters.

So really, the goal of this video is to conceptualize the estimation process, and we'll look at a couple of visualizations of that process. The next video, and subsequent videos in this module, will cover the mathematical underpinnings, basically the formal, rigorous framework for least squares. Here we're just trying to understand what's happening conceptually.

Let's start with the simplest case, what you could call simple linear regression, namely a regression with a response and just one predictor variable. In this case, the linear regression equation for the i-th unit in the sample would be y_i = β_0 + β_1 x_i + ε_i, where ε_i is the error term. This is the equation for the population, and our goal is to estimate those population-level parameters β_0 and β_1 from sample data. The way that we'll choose, or estimate, β_0 and β_1 will be to find the line that minimizes the sum of the squared vertical distances between the data points and the line, and this is what we call the line of best fit. So you can think about many possible lines that we're trying to fit to our data.
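To make that criterion concrete, here is a minimal sketch in Python using numpy and simulated data; the simulated values and variable names such as beta0_hat are illustrative, not from the lecture. It computes the intercept and slope that minimize the sum of squared vertical distances, using the standard closed-form expressions for simple linear regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: a true line plus random error epsilon_i
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=n)   # true beta_0 = 2.0, beta_1 = 1.5

# Closed-form least squares estimates for simple linear regression:
#   beta1_hat = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   beta0_hat = ybar - beta1_hat * xbar
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# The quantity being minimized: the sum of squared vertical distances
rss = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)
print(beta0_hat, beta1_hat, rss)
```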
So in this slide, we have some data points that are roughly linear, although of course they fall off of an exact linear trend due to the error term that we assume to be present in the model, and we can think about fitting many possible lines through the data. There are several light gray lines, which are possible fits, possible lines that we could use to explain the relationship between the response and the predictor, and then there's the gold line, which represents the line of best fit. By best fit, we mean thinking about taking each data point and dropping a vertical line from the data point down to the gold line, and that gold line is the line that minimizes the sum of the squares of those vertical distances. So let's try to visualize the process.

Imagine you have a set of (x_i, y_i) pairs, a set of data points, and they look roughly linear. What we mean by that is that we think these data points were generated from a linear model, but there was some error in the measurement process, so they don't fall exactly on the line; they fall off of the line based on the random variable ε_i. Now, there are many possible lines that we could draw through this data, and many of them are intuitively bad. If you look at the data here on this slide, a roughly vertical line seems like a bad fit to the data, right? It wouldn't do a good job of explaining the relationship or predicting y from x. But some of these lines appear to be reasonable: some of the gray lines may appear reasonable, and intuitively, the gold line looks reasonable also.

Now, the gold line is, in fact, the least squares line, so it's the line of best fit through the data, and that line is found by thinking about taking each data point and dropping a vertical line from the data point down to the gold line. That gold line is special in the sense that if you take all of those vertical distances, square them, and sum them up, and you did that for the gold line but also for all of the gray lines, the gold line would have the smallest sum of the squares of those deviations. That's what we mean by the best fitting line to the data: the least squares line, the line that minimizes the sum of the squares of those vertical distances. So again, we can think of the least squares procedure as a procedure that has an infinite number of lines available to it, an infinite number of slopes and intercepts, and picks out the one that is best in the sum of squared deviations sense.

Now, for the multiple linear regression case, the case where we have more predictors than just a single predictor, the process is similar, but it's harder to visualize, and impossible to visualize in the strictest sense in dimensions higher than two predictors. So let's try to visualize, in some sense, the fit of a multiple linear regression model. Recall that a multiple linear regression model can be written in matrix-vector form. Here on this slide, the response is represented by y, which is equal to the design matrix X, that is, the matrix with a column of ones and each subsequent column holding the measurements of one predictor, multiplied by a vector β containing the parameters that we want to estimate, plus the random error term: y = Xβ + ε. Here again, the goal of the least squares procedure is to choose the β vector.
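As a small illustration of that matrix-vector form, here is a sketch, again with made-up data and names, that builds a design matrix X with a column of ones and two predictor columns, and then evaluates the sum of squared deviations y - Xβ for candidate β vectors. Different candidates give different sums of squares, and least squares picks out the β that makes this quantity as small as possible.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two measured predictors and a response for n sampled units (simulated here)
n = 30
x1 = rng.uniform(0, 5, size=n)
x2 = rng.uniform(0, 5, size=n)
y = 1.0 + 0.8 * x1 - 0.5 * x2 + rng.normal(0, 0.5, size=n)

# Design matrix: a column of ones, then one column per predictor
X = np.column_stack([np.ones(n), x1, x2])

def rss(beta):
    """Sum of squared vertical deviations for a candidate parameter vector."""
    residuals = y - X @ beta
    return np.sum(residuals ** 2)

# Two arbitrary candidate beta vectors; least squares searches over all of them
print(rss(np.array([0.0, 1.0, 0.0])))
print(rss(np.array([1.0, 0.8, -0.5])))   # close to the values used to simulate y
```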
So that's β_0, β_1, β_2, and so on, all the way through β_p, chosen so that the systematic part, the predictor piece, the X matrix, explains as much of the response as possible without overfitting the data, which means without trying to model the random error that is caught up in the ε term.

For two predictors, we can still visualize this process. Here we would be selecting, instead of a line, a plane. The plane would sit in three-dimensional space, two predictors and a response, and we would be choosing that plane from among infinitely many planes. It would be the plane that minimizes the sum of the squares of the deviations, but in three dimensions instead of two.

In matrix-vector form, the quantity to be minimized is the square of the two-norm of the difference between the response and the systematic piece, namely ||y - Xβ||², and we minimize that over β. So β is the variable that you're choosing to minimize the square of the two-norm, and the other pieces, namely the response y and the predictors in the X matrix, are fixed. What we do to find the least squares solution is select the β vector that minimizes this quantity.

Another visualization may help. Consider the design matrix X, and let's assume that X has more rows than it does columns. Statistically, what that means is that there are more units in the sample, more individual things that you've measured the response and the predictors on, than there are predictors that you've actually measured. Another way to say that is that there are more sample points, more units in your sample, than there are parameters in the model, so you have a tall matrix X instead of a wide matrix. That will mean that the linear system y = Xβ, coming from the model y = Xβ + ε, is overdetermined, and an overdetermined linear system does not have an exact solution.

Visually, what that means is that you have this gray plane, which represents the column space of X: all of the possible linear combinations of the columns of X are represented by the plane, and y does not fall within the column space of X. So y is the vector sticking out of the plane, not lying in the plane, and that's the visual representation of there being no exact solution to the linear equation. But what we can do is find an estimated solution, something like an approximate solution, and that approximate solution is represented in this figure by ŷ. Now, ŷ does lie in the plane, so it's part of the column space of X, and it's not y, since y does not lie in the column space of X, but it's somehow close to y. It's, in some sense, the best approximation to y, and the way that you get that approximation is by the least squares procedure. Visually, you can think of it as an orthogonal projection of y onto the plane, namely onto the column space of X.

So hopefully this video has given you a sense of how to conceptualize and visualize the least squares solution, and in the next few videos, we will work on the mathematics of least squares.
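As a final sketch, and only as an illustration with simulated data, the snippet below uses numpy's built-in least squares solver (the formulas behind it are what the later videos derive) to find the β minimizing ||y - Xβ||², forms ŷ = Xβ̂, and checks the projection picture numerically: the residual y - ŷ is orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(2)

# Tall design matrix: more rows (units) than columns (parameters)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(0, 0.3, size=n)

# Least squares solution: the beta minimizing ||y - X beta||_2^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# y_hat is the point in the column space of X closest to y
y_hat = X @ beta_hat
residual = y - y_hat

# Orthogonal projection picture: the residual is (numerically) orthogonal
# to each column of X, so X^T (y - y_hat) is essentially the zero vector
print(X.T @ residual)
```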