How does Zillow estimate home prices? How does Netflix know what I would like to watch? How does a financial institution know if someone is likely to default on a loan? We asked similar questions in the last module, when you learned about simple linear regression. In this module, we will learn about multiple regression, which takes our analysis one step further. The main difference between multiple regression and simple linear regression is the ability to have more than one explanatory variable in the analysis. We still have one response variable y, but now we can have many explanatory variables, our x's. There is no limit to the number of independent variables a model can use, which means multiple regression models can handle more complex situations than simple linear regression models. A study done on newborns established a predictive equation for childhood and adolescent obesity. The goal of the study was to know which babies were at the greatest risk, so preventive measures could change this outcome for the newborn. The study looks at the probability of a baby becoming obese based on several factors: mother's BMI, father's BMI, number of people living in the household, mother's professional category, whether the mother smoked during pregnancy, and finally, the baby's birth weight. So in total there are six independent variables being used to predict the response variable. Some of these variables are numerical, for example the number of people in the household or the birth weight, and some are categorical, like whether the mother smoked or not. In this module, we will learn how to find which variables are good to have in the prediction model, as well as how to develop the formula that will allow us to make the prediction. By the way, this is a really fun calculator to use; the link is provided in the reference section. Go ahead and play with it, you may be surprised by the results. 
At least for me, some were counterintuitive, and that's what made it fun. Multiple regression also uses the least squares method to find the regression equation, but now there is a slope for each x variable. Y hat is the point estimate of the mean value of the dependent variable when the values of the independent variables are x1, x2, x3, and so on. Remember, y hat is an estimate, so you will have some error, just as we discussed for simple linear regression. The assumptions for simple linear regression, which we learned in the last module, must hold for multiple regression as well. The validity of the assumptions is checked by analyzing the residuals, the errors in the prediction, just as we did for simple linear regression. There was one element of the Excel output in the top table that I never reviewed when we were learning about simple linear regression, and that is adjusted r square. We have learned that r square is the proportion of the total variation that is explained by the overall regression model. In multiple regression, the value of r square will increase even if the new variables being added to the model have no relationship to y. Adjusted r square corrects this tendency in r square. As you can see here, adjusted r square is smaller than r square, and this will always be the case. While we used r square in simple linear regression to know what percent of the variation is explained by the regression model, in multiple regression we will use adjusted r square: its value tells us how good the regression model is at explaining the variation. The significance of each independent variable is assessed separately. In multiple regression, it is customary to test the significance of every independent variable used in the model. Again, the null hypothesis for each independent variable states that no relationship exists between that independent variable and the response variable. 
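The adjustment mentioned above has a simple closed form: adjusted r square equals 1 minus (1 minus r square) times (n minus 1) over (n minus k minus 1), where n is the number of observations and k the number of independent variables. Here is a minimal Python sketch; the sample values in the example call are made up for illustration, not taken from any dataset in this lecture.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Penalize R^2 for the number of predictors k, so that adding
    a variable unrelated to y no longer inflates the statistic."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical example: R^2 = 0.75 from n = 50 observations, k = 5 predictors.
print(round(adjusted_r_squared(0.75, 50, 5), 4))  # → 0.7216
```

Notice that the result is always smaller than the plain r square whenever k is at least 1, which is exactly the behavior described above.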
Therefore, if the p value is less than the significance level, we reject the null hypothesis, meaning that there is a significant relationship between the independent variable and the response variable. Let's begin with an example. A regional manager of a large pharmaceutical company wants to evaluate the performance of the company's sales force. The general belief is that marketing and advertising is the most important factor for sales. Data on 50 sales agents has been collected. The study objective is to find the impact of the promotional budget on yearly sales. So we have one explanatory variable here, and we will run a simple linear regression. The p value for promotional budget is very small, thus there is a significant relationship between promotional budget and sales. However, only 40% of the variation is explained by this variable. This is not satisfactory, so the general manager has asked his analyst to develop a better model. Based on input from the sales representatives, four other variables are added to the model: the sales agent's time with the company, market potential in dollars, market potential change since last year in dollars, and the average customer rating of the sales representative. So now y is assumed to be influenced by five different independent variables. We now have to collect data for all these variables, and then we can run the multiple regression model. Here's the output for the multiple regression. Here the adjusted r square is 0.70, which means the model can now explain 70% of the variation seen in sales, compared to the simple linear regression where we could only explain 40% of the variation. So the new model is definitely better than the simple linear regression model. Note that I'm using the adjusted r square; as I explained earlier, r square will increase even if the new variables being added to the model have no relationship to y. Adjusted r square corrects this tendency in r square. 
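The lecture uses Excel's regression tool, but the same least squares fit with five predictors can be sketched in a few lines of Python. The data below is synthetic, generated just to show the mechanics; it is not the pharmaceutical company's dataset, and the coefficient values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 5                # 50 sales agents, five candidate predictors

# Synthetic stand-in data; the real study values are not reproduced here.
X = rng.normal(size=(n, k))
beta_true = np.array([10.0, 2.9, 0.5, 1.2, 0.0, 4.0])  # intercept b0 + 5 slopes
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=1.0, size=n)

# Least squares: prepend a column of ones so the first estimate is b0.
A = np.column_stack([np.ones(n), X])
b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(b_hat, 2))   # estimates land close to beta_true
```

Note the fourth slope was set to zero: just as in the lecture's example, a variable with no real relationship to y still gets a (small, nonzero) estimated coefficient, which is why we then check each variable's p value rather than trusting the raw fit.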
While the model is better, we still need to make sure that we only have variables in our model that have an effect on the response variable. This means that we need to evaluate each independent variable's p value to establish its significance. To do this we will focus on the third table, and on the p values of each variable. Any variable with a p value less than 0.05 will be considered significant. We will use this as the level of significance here; you could use 0.01 or anything else. If the p value is greater than 0.05, then the variable doesn't have a significant relationship with the response variable, sales. In this case, the variable market share change has a p value that is greater than 0.05, thus it's not significant. When we find a variable that is not significant, we remove it from the model and run the analysis again. These are the results of the analysis without market share change; the adjusted r square has not changed much. Actually, it has improved a bit compared to before, it's 0.702. So the regression model has not suffered from removing this variable, because this variable was not useful in explaining the variation in sales. Now look at the p values for all the remaining variables. Every one now has a significant relationship with the response variable, sales. Now that we are satisfied with this model, we can write the regression equation by looking at the numbers to place in b0 and all the coefficients of the independent variables, which we will find in the third table of the Excel output. Thus this is our regression equation for sales and its relationship with the length of the sales agent's tenure, the promotional budget, market potential, and the average customer rating of the sales agent. Now you can use this equation to make predictions. What is the point prediction of sales for a salesperson with 5 years of experience, a promotional budget of $5,000, a market potential of $8 million, and an average rating of 4.3? 
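The remove-and-refit step just described can be sketched as a small helper: find the variable with the largest p value above the significance level, drop it, and refit. The p values below are illustrative placeholders, not the actual numbers from the Excel output.

```python
def variable_to_drop(p_values: dict, alpha: float = 0.05):
    """Return the least significant variable (largest p value above alpha),
    or None if every remaining variable is already significant."""
    name, p = max(p_values.items(), key=lambda kv: kv[1])
    return name if p > alpha else None

# Illustrative p values only (the lecture's exact values are not shown here).
p_vals = {
    "time with company": 0.001,
    "promotional budget": 0.004,
    "market potential": 0.002,
    "market share change": 0.42,   # above 0.05, so not significant
    "customer rating": 0.0003,
}
print(variable_to_drop(p_vals))    # → market share change
```

In practice you drop one variable at a time and rerun the regression, because removing a variable changes the p values of those that remain.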
X1, the sales rep's promotional budget, is 5,000. X2 is the number of months the representative has been employed, and it's 60 months, because we recorded it in months. The market potential for the sales agent is 8 million, and the sales rep has a rating of 4.3. Now place these values in the regression equation. Given the regression equation and the values of the independent variables, the estimated value of y, sales, is 1,229.89583; just remember to express it in the correct way. We had recorded and analyzed the data by expressing sales in thousands of dollars, so we need to multiply the predicted value by 1,000, which gives us $1,229,895.83, and this is the midpoint of the confidence interval. Now let's practice. Given the regression equation we just developed, what would be the expected increase in sales if the promotional budget is increased by $10,000? X1 is the promotional budget and has a positive coefficient of 2.9. To answer the question, a $10,000 increase will be entered as 10, and that would increase sales by $29,000. Now you may ask yourself, what factors have the greatest impact on sales? We can find this out by looking at the coefficients. Focusing on the coefficients of the independent variables, we see that customer rating has the largest coefficient, followed by the time the agent has stayed with the company. If an agent's rating improves by 1 point, then average sales will increase by 50.324, or $50,324. The analysis has made us aware that a better customer rating has the most positive impact on the performance of the agent, more than the promotional budget, which originally was thought to be the most important factor. Furthermore, we see that as an agent's length of stay increases, so does their ability to sell more. 
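The marginal-effect arithmetic above is worth making explicit, since the unit conversion (sales recorded in thousands of dollars) is where mistakes happen. This sketch uses only the two coefficients quoted in the lecture, 2.9 for the promotional budget and 50.324 for customer rating.

```python
# Coefficients quoted in the lecture; sales are recorded in thousands of dollars.
b_budget = 2.9      # effect per unit of budget (entered in thousands)
b_rating = 50.324   # effect per 1-point increase in customer rating

# A $10,000 budget increase is entered as 10 (thousands of dollars):
extra_sales = b_budget * 10 * 1_000
print(f"${extra_sales:,.0f}")          # → $29,000

# A 1-point improvement in customer rating:
print(f"${b_rating * 1_000:,.0f}")     # → $50,324
```

The multiplication by 1,000 at the end converts the model's output back into dollars, the same conversion used to turn the predicted 1,229.89583 into $1,229,895.83.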
Looking at this data, it might be a good idea to provide a work environment where sales agents are happy employees who stay on, and whose professional satisfaction results in superior customer service, all of which will be very good for the company's bottom line. These are the types of revealing insights we get from multiple regression: the ability to identify which variables influence the response variable, and to what degree. As always, please watch the external videos for more examples.