Shortly after starting a position with a new employer, I proposed a research project to my supervisor that included a linear regression model. The supervisor explained I could not proceed with the project as designed because the group I was working in did not provide forecasts or predictions. My proposal was to use a regression model to understand relationships between variables, not to forecast. But being relatively new to the group, I did not push the issue further and changed the research proposal.
I realized at that moment that academics often teach linear regression as a prediction tool - at least they did when I was last in the classroom. But the question of how to describe linear regression models as a way to understand data continued to nag me until I found my “aha” moment looking through a regression analysis explanation in a data science context. I hope the following discussions will provide an “aha” moment for you as well (I wish I had discovered this when studying econometrics!) and will help you better understand linear regressions and better explain them to others.
Part I of this blog discusses a couple of simple examples of linear regression models where the independent variables ‘sex’ and ‘marital status’ are categorical and enter the model as dummy variables with only two values: 0 or 1. I will simply focus on extracting group means from the data and will not get into technicalities of the data, such as heteroskedasticity, which is commonly an issue with income data. That will come later. For now, my goal is to build a practical understanding of regression analysis on which to build.
A Naive Model: The Mean
Using data from the Panel Study of Income Dynamics from the University of Michigan, we will look at several factors related to the wage earned by the head of household in the survey.
The most basic understanding of our dependent variable, ‘wage’, is a measure of central tendency (i.e. mean, median, mode, etc.). Since we will explore linear regressions here, we will use the mean of ‘wage’ as a naïve model: sum(wage) / number of respondents. The mean wage of the group is $55,460. It is ‘naïve’ because we are making absolutely no assumptions about either the distribution of the data or its dependencies on any other data.
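One way to see why the mean counts as a model at all: it is the single constant that minimizes the sum of squared errors, which is the same criterion a linear regression uses. The sketch below checks this on a handful of made-up wages. The numbers and the helper name `sse` are assumptions for illustration and are not taken from the PSID data.

```python
from statistics import mean

# Hypothetical toy wages in dollars (NOT the PSID sample).
wage = [60000.0, 70000.0, 65000.0, 30000.0, 40000.0, 35000.0, 38000.0]

# The naive model: one number for everybody.
naive = mean(wage)

def sse(constant):
    """Sum of squared errors when every wage is predicted by `constant`."""
    return sum((w - constant) ** 2 for w in wage)

# Moving the guess away from the mean in either direction makes the fit worse.
print(naive, sse(naive), sse(naive - 100), sse(naive + 100))
```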
A Univariate Linear Regression Model: Controlling for Gender
But we know we have additional pieces of information to consider in the data we have collected, so let’s start with looking at the mean wages of males and females:
sum(wages, males) / number of male respondents = $64,107 and sum(wages, females) / number of female respondents = $35,431.
We can also use a simple univariate linear regression model to calculate the mean wages for males and females. I will forgo the academic specification of the model here, since several textbooks and online sources already cover it, and focus instead on the model results and their interpretation:
The resulting coefficients from the model allow us to calculate the mean wages for each group. Defining the gender variable “sex” as a factor, the linear regression model interprets gender as a dummy variable where 0 = Male and 1 = Female. So interpreting the model coefficients, the constant is the mean wage for males, and adding the (negative) Female estimate to the constant leaves us with the mean wage for females.
Recall the simple univariate linear regression formula \( y_{wage} = \beta_0 + \beta_1 \cdot sex + \epsilon \), which can now be written as wage = $64,107 − $28,677 × sex. Replacing sex in this formula with a 0 for male leaves only the constant coefficient, or the mean wage for men, while replacing sex with a 1 for female gives us the mean wage for females: $64,107 − $28,677 = $35,430, which matches the $35,431 calculated above up to rounding.
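To make the link between the coefficients and the group means concrete, here is a minimal sketch that fits the one-variable regression by its closed-form least squares solution. The toy data and the function name `simple_ols` are assumptions for illustration; this is not the PSID sample or the software used for the results above.

```python
from statistics import mean

# Hypothetical toy data (NOT the PSID sample).
# sex is coded as in the text: 0 = male, 1 = female.
wage = [60000.0, 70000.0, 65000.0, 30000.0, 40000.0, 35000.0, 38000.0]
sex = [0, 0, 0, 1, 1, 1, 1]

def simple_ols(x, y):
    """Closed-form least squares fit of y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = simple_ols(sex, wage)

# Group means computed directly, for comparison.
male_mean = mean([w for w, s in zip(wage, sex) if s == 0])
female_mean = mean([w for w, s in zip(wage, sex) if s == 1])

print("constant:", round(b0), "male mean:", round(male_mean))
print("constant + coefficient:", round(b0 + b1), "female mean:", round(female_mean))
```

On this toy data the constant equals the male mean and the constant plus the coefficient equals the female mean, the same pattern as in the wage model.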
The difference in the means is large, and with 6,815 people in the data set we would expect it to be statistically significant and not just the result of random errors when selecting the data. We can check this by looking at the t-values, which are included in the linear regression output of most statistical software packages.
Interpretation of the t-values is straightforward in this case. The benchmark in econometrics is that coefficients with a t-value probability, ‘Pr(>|t|)’, less than 0.05 are statistically significant. But statistically significant from what? Well, recalling that the constant is the mean wage for males in our data, its t-value tests whether the true population mean could be $0: if it were, the probability of seeing an estimate this far from $0 would be less than 1 percent. More interesting in this example is the meaning of the t-value probability for female. It tests whether the wages of females differ from the wages of males: if there were no difference in the population, the probability of a gap this large arising from random chance alone turns out to be much less than 1 percent.
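For readers who want to see where the t-value of the dummy coefficient comes from, the sketch below computes it by hand using the usual homoskedastic standard error. Everything here is an assumption for illustration: the sample is synthetic (drawn with a fixed seed, not PSID data), and the probability uses a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math
import random
from statistics import NormalDist, mean

# Hypothetical synthetic sample (NOT PSID): 200 males, 200 females.
rng = random.Random(0)
wage = ([rng.gauss(64000, 20000) for _ in range(200)]
        + [rng.gauss(35000, 15000) for _ in range(200)])
sex = [0] * 200 + [1] * 200  # 0 = male, 1 = female

n = len(wage)
x_bar, y_bar = mean(sex), mean(wage)
sxx = sum((x - x_bar) ** 2 for x in sex)

# Least squares coefficients for wage = b0 + b1 * sex.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(sex, wage)) / sxx
b0 = y_bar - b1 * x_bar

# Residual variance with n - 2 degrees of freedom (two estimated coefficients).
residuals = [y - (b0 + b1 * x) for x, y in zip(sex, wage)]
sigma2 = sum(e ** 2 for e in residuals) / (n - 2)

# Standard error and t-value of the coefficient on sex.
se_b1 = math.sqrt(sigma2 / sxx)
t_b1 = b1 / se_b1

# Two-sided probability, normal approximation (an assumption; software
# packages use the t distribution with n - 2 degrees of freedom).
p_b1 = 2 * (1 - NormalDist().cdf(abs(t_b1)))

print("coefficient:", round(b1), "t-value:", round(t_b1, 2), "Pr(>|t|) approx:", p_b1)
```

The t-value produced this way is the same number a pooled two-sample t-test of the two group means would give, which is another way of seeing that the regression is comparing means.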
So recapping, we can use this information to explain a couple of aspects of the data. On average, wages for males are $28,677 higher than average wages for females. Additionally, there is a very low probability that the gender difference in wages is because of random errors. We can be pretty confident that the difference in the survey data exists in the broader population.
A Multivariate Linear Regression Model: Controlling for Marital Status
Now extend the above example to a more complex set of factors in the data: marital status.
Here, we have several more potential influences on wages than whether the survey respondent is male or female. Unlike the previous univariate linear regression model, the model used here will be a multivariate model, but the same concepts will apply.
Similar to the univariate regression model, the multivariate model has a constant, which here represents married respondents with a mean wage of $74,207. The coefficient for each marital status category is its difference from the constant mean for married respondents, i.e., never married respondents earn, on average, $36,431 less than married respondents, widowed $38,218 less and so on. It is important in this model to know the meaning of the constant value, as most software programs will not include the ‘Married’ label (I added it manually) but will just refer to it as a ‘Constant’ or ‘Intercept’ value. The t-value probability interpretation is like that of the univariate model above. In this multivariate model, the mean wage of each individual marital status is statistically different from (and much lower than!) the mean wage of our base case scenario - married respondents. Keep in mind, this model does NOT inform us whether the difference in mean wages between Never Married and Widowed respondents, which are quite similar, is statistically significant.
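The same group-mean reading can be checked for a category with several levels. The sketch below builds one 0/1 dummy per non-base category, solves the least squares normal equations, and compares the coefficients with the group means. The toy data, the category list, and the helper `solve` are assumptions for illustration, not the PSID sample or the output of any particular statistics package.

```python
from statistics import mean

# Hypothetical toy data (NOT the PSID sample).
status = ["Married", "Married", "Never Married", "Never Married",
          "Widowed", "Widowed", "Divorced", "Married"]
wage = [80000.0, 70000.0, 40000.0, 36000.0, 35000.0, 37000.0, 50000.0, 75000.0]

base = "Married"                       # base case absorbed by the constant
others = sorted(set(status) - {base})  # one dummy per remaining category

# Design matrix: a column of ones plus a 0/1 dummy for each non-base category.
X = [[1.0] + [1.0 if s == c else 0.0 for c in others] for s in status]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Normal equations: (X'X) beta = X'y.
k = len(X[0])
XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
Xty = [sum(row[i] * y for row, y in zip(X, wage)) for i in range(k)]
beta = solve(XtX, Xty)

# Group means computed directly, for comparison.
group_mean = {c: mean([w for s, w in zip(status, wage) if s == c])
              for c in [base] + others}

print("Constant (Married):", round(beta[0]), "vs mean", round(group_mean[base]))
for c, coef in zip(others, beta[1:]):
    print(c, "coefficient:", round(coef),
          "vs mean difference", round(group_mean[c] - group_mean[base]))
```

Each coefficient is only a comparison with the base category, so a different choice of `base` would change every coefficient while leaving the implied group means the same.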
One last point. You may have heard the terms ‘all else equal’ or ‘ceteris paribus’ used in regression model discussions. Here, this simply means that we are only comparing individual marital status groups to our base married status. It makes intuitive sense in this model since each respondent can only be in one of these categories at a time.