Davin Cermak

Demystifying Linear Regression… and How to Explain the Results Simply (Part II)

Using a Linear Regression with Continuous Variables


The first post in this series, “Demystifying Linear Regression” (https://www.cermak-consulting.com/post/demystifying-linear-regression-and-how-to-explain-the-results-simply-part-i), described how to interpret linear regression models with binary (0,1) variables. This post explains how to interpret similar models that use continuous variables instead.

Consider how household income affects household expenditures. We assume households decide how much they will spend based on their expected income (even though some households can spend more than their incomes). Conversely, we assume households do NOT base their incomes on how much they choose to spend. It may sometimes happen, but we assume it is not frequent enough to worry about with our model design.


In theory, we could use a binary approach to measure expenditures by coding each unique income level as its own binary (0,1) variable. Think of this process as generating an Excel spreadsheet with 5445 columns, one for each unique income level. A column would take on a value of 1 for a household with that income and 0 otherwise. While theoretically possible, such a model would yield very poor results in practice. Nearly all columns would contain only a few 1s, which would make most estimates unreliable. Even with this many independent variables, many dollar values would have no data at all from which to estimate expenditures.
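To make the sparsity problem concrete, here is a minimal sketch using made-up data (not the survey data from this post). It assumes 118 hypothetical households with randomly drawn incomes and builds one binary column per unique income level. The variable names are illustrative only.

```python
import random

random.seed(0)

# Hypothetical data: 118 households with incomes drawn at random
# between $50,000 and $100,000 (whole dollars). Not the survey data.
incomes = [random.randint(50_000, 100_000) for _ in range(118)]

# One binary column per unique income level: the column is 1 for a
# household with exactly that income and 0 for every other household.
levels = sorted(set(incomes))
dummies = [[1 if income == level else 0 for level in levels] for income in incomes]

# Count how many 1s each column contains.
ones_per_column = [sum(row[j] for row in dummies) for j in range(len(levels))]
single_obs_columns = sum(1 for n in ones_per_column if n == 1)

print(f"households: {len(incomes)}")
print(f"binary columns: {len(levels)}")
print(f"columns with a single 1: {single_obs_columns}")
```

With data like this, almost every column describes a single household, so a coefficient estimated for that column rests on one observation.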


The Benefit of Using Continuous Variables

A linear regression model with continuous variables allows us to use all the existing data to estimate household expenditures at any level of income. We can think of the model we are going to develop as ‘filling in the blanks’ for incomes that are missing from the data or that have too few observations to give reliable results.

Understanding Continuous Data Such as Household Expenditures and Income

It is important to understand the limitations of the data before trying to interpret model results! From the Summary Statistics table, notice that I have limited household incomes to between $50,000 and $100,000. I have also chosen only households that have expenditures within ±5% of income. These limitations leave us with 118 households to examine out of the 9420 households in the full set of surveys. The survey data chart shows the ‘holes’ in the expense and income data. It also shows a strong linear relationship between the two values. These restrictions on the data make the linear model results straightforward to understand.
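The two restrictions described above can be sketched as a simple filter. The records below are hypothetical, and the function name `in_subset` and its parameters are my own. Treating the range endpoints and the 5% boundary as inclusive is also an assumption.

```python
# Hypothetical survey records: (household_id, income, expenditures) in dollars.
households = [
    ("A", 42_000, 41_000),    # income below the range
    ("B", 60_000, 61_500),    # kept: spending is 2.5% above income
    ("C", 75_000, 90_000),    # spending is 20% above income
    ("D", 88_000, 84_000),    # kept: spending is about 4.5% below income
    ("E", 120_000, 118_000),  # income above the range
]

def in_subset(income, expenditures, low=50_000, high=100_000, tolerance=0.05):
    """Return True if the household passes both restrictions."""
    income_ok = low <= income <= high
    spending_ok = abs(expenditures - income) <= tolerance * income
    return income_ok and spending_ok

subset = [h for h in households if in_subset(h[1], h[2])]
print([h[0] for h in subset])  # ['B', 'D']
```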



NOTE: The data used is a select subset of the original data that fulfills the required assumptions of a linear regression model (see https://www.statology.org/linear-regression-assumptions/). I will not go into the details of these assumptions in this post, other than to say that they are often NOT met when modelling a full data set. Because I chose the subset to show how to interpret a simple linear regression model, the model would not be a reliable estimator for the entire data set. I will address how to deal with and interpret linear regression models without these limitations in a future blog post.

A Simple Univariate Linear Regression Model: Household Expenditures and Income

Let’s look at the simple univariate model results from the data subset described above:






We can interpret the model coefficients similarly to the previous post’s binary model, with one important difference: the independent variable need not be a 0 or 1 value. We instead treat it as continuous, which allows it to take on any value of income. The simple univariate linear regression formula for the current model is:

    Expenditures = Constant + β(income) × Income

which can be expanded to

    Expenditures = -134 + 1.05 × Income
The constant of the model represents the estimated value of the dependent variable when the independent variable is 0. So, families with no income have estimated household expenditures of -$134. This estimate is not realistic: even households with no income would likely report some expenditures, paid for out of household savings, government benefits, etc. Because I limit the data to households with incomes between $50,000 and $100,000, the model is likely not an accurate predictor of household expenditures when incomes are outside of this range. The β(income) coefficient, 1.05, tells us that, on average, every additional $1.00 of household income is associated with a $1.05 increase in household spending. This coefficient also applies only within our data limitations.
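As a worked example, the sketch below plugs a few incomes into the fitted line using the constant (-134) and income coefficient (1.05) reported above. The function name and the range check are my own additions for illustration. The check simply flags incomes outside the $50,000 to $100,000 range the model was fit on.

```python
INTERCEPT = -134.0                # constant reported in the post
SLOPE = 1.05                      # income coefficient reported in the post
INCOME_RANGE = (50_000, 100_000)  # income range of the data subset

def estimate_expenditures(income):
    """Return (estimate, within_range) for a household income in dollars."""
    estimate = INTERCEPT + SLOPE * income
    within_range = INCOME_RANGE[0] <= income <= INCOME_RANGE[1]
    return estimate, within_range

for income in (0, 60_000, 75_000, 150_000):
    estimate, ok = estimate_expenditures(income)
    note = "" if ok else "  (outside the data range; treat with caution)"
    print(f"income ${income:>9,.0f} -> estimated expenditures ${estimate:>11,.2f}{note}")
```

For a household earning $75,000, the estimate is -134 + 1.05 × 75,000 = $78,616. The estimates for $0 and $150,000 are flagged because they extrapolate beyond the data.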


A Visual Look at the Model

A visual helps us better understand the univariate linear regression model. In the chart below, the black points are the individual survey responses from the subset of households chosen for this model. As noted above, many household income and expense pairs do not exist in the data; these appear as gaps between points. There are also many income levels with only one observation. The linear regression model, represented by the blue line, allows us to estimate household expenditures at any income level. This gives us much more information about our households than if we had treated each income in the data set as its own binary variable.












The red lines on the chart show the distance between each actual data pair and the model estimate. Linear regression algorithms choose the slope and intercept that minimize the sum of the squares of these distances. As long as our data meets the assumptions mentioned in the note above, the resulting coefficients generate the most accurate linear estimate of household expenses given household incomes. Any combination of slope and intercept values other than those of the blue line would produce less accurate estimates than the linear regression.
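The idea that no other line does better can be checked numerically. The sketch below assumes ordinary least squares, which minimizes the sum of squared vertical distances, and uses made-up data rather than the survey data. It fits the line with the closed-form formulas for a single predictor, then nudges the intercept and slope to show that every nudge increases the total.

```python
import random

random.seed(1)

# Hypothetical data: spending roughly equal to income plus random noise.
incomes = [random.uniform(50_000, 100_000) for _ in range(118)]
expenses = [1.0 * x + random.gauss(0, 2_000) for x in incomes]

def fit_line(xs, ys):
    """Closed-form least-squares intercept and slope for one predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_distances(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

intercept, slope = fit_line(incomes, expenses)
best = sum_squared_distances(incomes, expenses, intercept, slope)

# Nudge the fitted line in a few directions; every nudge should do worse.
for d_intercept, d_slope in [(500, 0), (-500, 0), (0, 0.01), (0, -0.01)]:
    other = sum_squared_distances(
        incomes, expenses, intercept + d_intercept, slope + d_slope
    )
    print(f"intercept {d_intercept:+}, slope {d_slope:+}: increase of {other - best:,.0f}")
```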

Additional Model Statistics are Important When Using Continuous Variables

When using just the model coefficients to explain the relationship between variables, there are several statistics that you, as the analyst, need to understand and always be aware of:

  • Confidence Intervals and Level of Significance

  • R² and Adjusted R²

  • Observations

  • Residual Standard Error

  • F Statistic







These will be explained in my next blog post.
