Up to this point, we've talked about multiple regression analysis with quantitative explanatory variables and with binary explanatory variables, which are categorical variables with two categories. But we haven't yet discussed what to do when a categorical explanatory variable has more than two categories. It is not uncommon to have categorical explanatory variables with three or more categories. Fortunately, it is relatively simple to incorporate these types of explanatory variables into a multiple regression analysis.
There are a lot of different methods for examining differences among explanatory variable groups on a response variable. The type of comparison depends on how we choose to code our explanatory variable. The process of coding categorical explanatory variables is called dummy coding or parameterization, and these methods can produce group comparisons ranging from very simple to very complex. For example, if our response variable is number of nicotine dependence symptoms, we might want to compare the number of symptoms for one group to the average number of symptoms for the other groups combined. This type of comparison is called effect coding, or effect parameterization.
In this course, we're going to use one of the most basic parameterizations, which is called reference group coding or reference group parameterization. This method is very similar to the post hoc pairwise comparisons that you may have conducted as a follow-up to an analysis of variance in the second course of this Specialization, Data Analysis Tools. That is, reference group coding allows us to compare each group of our explanatory variable, referred to as the comparison groups, to another group, which is referred to as the reference group.
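As a rough sketch of what reference group coding means inside the regression equation, assume a categorical explanatory variable with k categories, one of which is designated the reference group, plus a single quantitative explanatory variable. The notation here (Y, X, D_j, and the coefficient symbols) is illustrative and is not taken from the course materials.

```latex
% Reference group coding with k categories; category 0 is the reference group.
% D_j = 1 if an observation belongs to comparison group j, and 0 otherwise.
% X is one quantitative explanatory variable with coefficient \gamma.
\[
\hat{Y} = \beta_0 + \gamma X + \sum_{j=1}^{k-1} \beta_j D_j
\]
```

Observations in the reference group have every D_j equal to 0, so each coefficient \beta_j can be read as the expected difference in the response between comparison group j and the reference group, holding X constant.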
For example, if our response variable is number of nicotine dependence symptoms, reference group coding allows us to compare the number of nicotine dependence symptoms for each group of our categorical variable to a designated reference group. However, unlike an analysis of variance post hoc test, for which we conduct the comparisons after the overall ANOVA test, here the comparisons are part of the estimation of the multiple regression model. This allows us to examine explanatory variable group differences on the response variable after adjusting for the other explanatory variables in the model.
To demonstrate how to analyze a categorical explanatory variable with three or more categories, we will return to our NESARC data and the multiple regression analysis predicting number of nicotine dependence symptoms from multiple explanatory variables. This time, we also add an ethnicity race explanatory variable. Our ethnicity race variable has 4 categories, coded 0 = Hispanic, 1 = non-Hispanic White, 2 = non-Hispanic Black, and 3 = non-Hispanic Other ethnic racial group. In this example, what we want to know is whether Hispanic individuals have more or fewer nicotine dependence symptoms compared to individuals from the other three ethnic racial groups. That is, we want to compare Hispanic individuals, the reference group, to individuals from the other ethnic racial groups, the comparison groups, on number of nicotine dependence symptoms after controlling for the other explanatory variables in the model.
To do this, we will use the same GLM procedure that we used to test our earlier multiple regression model. However, this time we're going to add a line of code to tell SAS that the ethnicity race explanatory variable is a categorical variable. We do this using the class command.
We type class, then the name of our categorical explanatory variable, then in parentheses we type ref, which tells SAS to use the reference group parameterization for comparing our groups, then an equal sign, and, in quotes, the group that we want to designate as our reference group. In this example, we want to compare the Hispanic group to the other three ethnic racial groups, so Hispanic will be our reference group. If you remember, our ethnicity race variable is coded 0 for Hispanic, so we put the value 0 in quotes after the equal sign.
If we did not specify the reference group parameterization and the reference group of interest, SAS, by default, would use reference group parameterization and would designate the last group as the reference group. So the SAS default would be to use the group representing other ethnic or racial groups as the reference group. This is because it has the highest numerically coded value of the four groups, with a value equal to 3, so SAS, by default, will consider it the last group of the explanatory variable. If we use the default, it's important to know the default parameterization, because it can be different for other SAS regression procedures and will have an impact on how we interpret the group comparisons. If we hadn't used a class command at all, SAS would have assumed that our ethnicity race variable is a quantitative variable, and the resulting single regression coefficient would make no sense. Finally, we simply add our ethnicity race variable, which is named ethrace, to the list of explanatory variables in the model command.
Here's the output. Basically, it is the same output that we saw before with the GLM procedure, but if we look at our table of parameter estimates, we see that there are three estimated regression coefficients for our categorical ethnicity race variable. Note first that our Hispanic reference group, coded 0, has a regression coefficient of 0 and no estimate of the standard error or p-value. This is because it is our reference group.
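For reference, the class and model commands described above could be sketched as follows. This is only an illustration: the data set name new and the variable names ndsymptoms, majordeplife, and numbercigsmoked_c are assumptions made for this example, and only ethrace is named in the lecture.

```sas
/* Sketch only: the data set name (new) and the variable names ndsymptoms,
   majordeplife, and numbercigsmoked_c are assumed for illustration.
   Only ethrace is named in the lecture. */
proc glm data=new;
  /* Treat ethrace as categorical, with the group coded 0 (Hispanic)
     as the reference group */
  class ethrace(ref="0");
  /* The solution option requests the table of parameter estimates */
  model ndsymptoms = majordeplife numbercigsmoked_c ethrace / solution;
run;
quit;
```

Leaving out the parenthesized ref option would fall back on the default reference group discussed above, so it is usually clearer to state the reference group explicitly.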
The other three regression coefficients compare our other ethnicity race groups to the Hispanic group. So ethrace 1, the dummy code for the non-Hispanic White group, compares non-Hispanic White to Hispanic, ethrace 2 compares non-Hispanic Black to Hispanic, and ethrace 3 compares the non-Hispanic Other ethnic racial group to Hispanic. We can see that none of these three groups were significantly different from the Hispanic group in number of nicotine dependence symptoms, because the p-values all exceed our alpha level of 0.05. As with the previous regression analysis, we see that lifetime major depression and number of cigarettes smoked are positively associated with number of nicotine dependence symptoms.
If we wanted to make other comparisons, for example, to compare non-Hispanic White to non-Hispanic Black, then we would simply change our reference group from 0 to 1, that is, from Hispanic to non-Hispanic White, in the SAS code and rerun the analysis. This would provide a comparison of the three other ethnic racial groups to the non-Hispanic White group. Here's an example of the code in which we change the reference group from 0 to 1. You can see that it's the same code with the exception of changing ref="0" to ref="1".
And here's the output. Now, the group coded 1 has a parameter estimate equal to 0 because it is the reference group, and the other coefficients compare each of the other three groups to the non-Hispanic White group. Participants in the non-Hispanic Other ethnic racial group had a greater number of nicotine dependence symptoms compared to non-Hispanic White participants. There were no significant differences for Hispanic and non-Hispanic Black participants compared to non-Hispanic White participants.
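The rerun with non-Hispanic White as the reference group could look like the sketch below. As before, the data set name new and the variable names ndsymptoms, majordeplife, and numbercigsmoked_c are assumed for illustration; only ethrace and the ref values come from the lecture.

```sas
/* Sketch only: data set and variable names other than ethrace are assumed. */
proc glm data=new;
  /* The only change: the group coded 1 (non-Hispanic White)
     is now the reference group */
  class ethrace(ref="1");
  model ndsymptoms = majordeplife numbercigsmoked_c ethrace / solution;
run;
quit;
```

Changing the reference group only changes which pairwise comparisons appear in the table of parameter estimates. The overall fit of the model is the same under either coding.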