Image by Gerd Altmann from Pixabay (Pixabay license)

A Tutorial on Generalized Least Squares Regression Using Python and Statsmodels

We’ll learn how to use the GLS estimator to fit a linear model on a real world data set

Sachin Date
10 min read · Nov 20, 2022


The Generalized Least Squares (GLS) estimator is an effective alternative to the Ordinary Least Squares (OLS) estimator for fitting linear models on data sets that exhibit heteroskedasticity (i.e., non-constant variance) and/or auto-correlation.

In a previous article, we detailed the motivation for the GLS estimator and described how it is built. In this article, we’ll use the GLS estimator to fit a linear model on a real world data set.

While it is possible to learn how to use GLS by simply following the tutorial shown in this article, I would strongly recommend reviewing the following article first:

A brief introduction to GLS

When we use the Ordinary Least Squares estimator to fit a linear model on a data set, we implicitly assume that the model’s errors are homoskedastic (constant variance) and uncorrelated. But this assumption often breaks down for real world data sets. In the face of heteroskedastic, auto-correlated errors, the OLS estimator remains consistent and unbiased, but it loses its efficiency. Specifically, OLS no longer produces coefficient estimates with the least possible variance.

The issue is that when using OLS, the formula for calculating the variance of the coefficient estimates assumes homoskedastic, uncorrelated errors.

For a linear model given by the following equation:

The linear model: y = Xβ + ϵ (Eq 1) (Image by Author)

An Ordinary Least Squares estimation of β yields the following OLS estimator:

The vector of coefficients estimated using ordinary least squares: β_cap = (X'X)⁻¹X'y (Eq 2) (Image by Author)

In the above formula, β_cap is the vector of coefficients estimated by OLS, X is the matrix of regression variables (including a column of 1s for the intercept), and y is the vector of values of the response variable.

The variance of the coefficient estimates is calculated as follows:

The variance of the OLS estimated coefficients: Var(β_cap) = σ²(X'X)⁻¹ (Eq 3) (Image by Author)

σ² is the constant variance of the error term ϵ of the model.

But when the errors are heteroskedastic and/or correlated, the above formula outputs an incorrect variance leading to errors in the calculation of standard errors, z-scores, p-values and confidence intervals of the estimated coefficients. Coefficients that are not statistically significant may get incorrectly flagged as significant. The whole procedure for statistical inference yields incorrect values.
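To make Eq (2) and Eq (3) concrete, here is a minimal NumPy sketch of the OLS closed form on simulated homoskedastic data (the data and variable names are illustrative, not from this tutorial’s data set):

import numpy as np

#Simulate a toy data set: n=100 rows, an intercept and two regressors
rng = np.random.default_rng(42)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

#OLS closed form: beta_cap = (X'X)^-1 X'y   (Eq 2)
beta_cap = np.linalg.inv(X.T @ X) @ X.T @ y

#Estimate sigma^2 from the residuals, then Var(beta_cap) = sigma^2 (X'X)^-1   (Eq 3)
resid = y - X @ beta_cap
sigma2_cap = resid @ resid / (n - X.shape[1])
var_beta_cap = sigma2_cap * np.linalg.inv(X.T @ X)
print(beta_cap)
print(np.sqrt(np.diag(var_beta_cap)))   #standard errors of the coefficient estimates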

The Generalized Least Squares estimator remedies this situation. The GLS estimator of β for the linear model y = Xβ + ϵ is given by the following formula:

The vector of coefficients estimated using generalized least squares: β_cap = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y (Eq 4) (Image by Author)

And the variance of the coefficient estimates is given by the following equation:

The variance of the GLS estimated coefficients: Var(β_cap) = σ²(X'Ω⁻¹X)⁻¹ (Eq 5) (Image by Author)

Here, Ω (capital Omega) is an [n x n] matrix which, along with the σ² scaling factor, forms the covariance matrix of errors as follows:

The covariance matrix of errors: Σ = σ²Ω (Eq 6) (Image by Author)

Here, σ² is a constant scaling factor that we extract out of the matrix such that ω_ij = ρ_ij/σ². Each diagonal element ρ_ii is the variance of the error ϵ_i corresponding to the ith row in the data set, while each off-diagonal element ρ_ij is the covariance between ϵ_i and ϵ_j.
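If Ω is known, Eq (4) and Eq (5) can be computed directly. Here is a minimal, self-contained NumPy sketch using a simulated diagonal Ω (heteroskedasticity only, no correlation between errors); the data is illustrative, not this tutorial’s:

import numpy as np

#Toy heteroskedastic data: the error variance grows with the second regressor
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
omega_diag = np.exp(X[:, 1])   #diagonal of Omega; sigma^2 = 1 by construction
y = X @ beta_true + rng.normal(scale=np.sqrt(omega_diag))

#GLS closed form: beta_cap = (X' Omega^-1 X)^-1 X' Omega^-1 y   (Eq 4)
omega_inv = np.diag(1.0 / omega_diag)
xt_oinv_x = X.T @ omega_inv @ X
beta_cap_gls = np.linalg.inv(xt_oinv_x) @ X.T @ omega_inv @ y

#Var(beta_cap) = sigma^2 (X' Omega^-1 X)^-1   (Eq 5), with sigma^2 = 1 here
print(beta_cap_gls)
print(np.sqrt(np.diag(np.linalg.inv(xt_oinv_x))))   #GLS standard errors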

To use equations (5) and (6), we must estimate σ² and Ω. There are several techniques available to do so. In this article, we’ll describe one such technique and use it to fit a GLS model on a data set drawn from the US Census Bureau.

The regression model

The following linear model of county-level poverty rates in the United States uses socioeconomic factors as regressors: the median age of the county’s inhabitants, the vacancy rate of houses for sale in the county, and the percentage of the county’s population with a college degree or higher.

A linear model for estimating county-level poverty: Percent_Households_Below_Poverty_Level = β_0 + β_1·Median_Age + β_2·Homeowner_Vacancy_Rate + β_3·Percent_Pop_25_And_Over_With_College_Or_Higher_Educ + ϵ (Image by Author)

To train this model, we’ll use data from the US Census Bureau aggregated at the county level. The data is a subset of the 2015–2019 American Community Survey (ACS) 5-Year Estimates conducted by the US Census Bureau. The following table contains the data that we’ll use:

A subset of the American Community Survey data set pulled from the US Census Bureau under their Terms Of Use

The data set used in this example can be downloaded from here. The complete ACS data set can be pulled from the US Census Bureau’s website using publicly available APIs, or directly from the Census Bureau’s Community Explorer website.

We’ll use Python and Pandas to load the ACS data file into memory, and we’ll use the Python based statsmodels package to build and fit the linear model.

Let’s start by importing the required packages and loading the data file into a Pandas DataFrame:

import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from patsy import dmatrices
from matplotlib import pyplot as plt
import numpy as np
#Load the US Census Bureau data into a Dataframe
df = pd.read_csv('us_census_bureau_acs_2015_2019_subset.csv', header=0)

Let’s construct the model’s equation in Patsy syntax. Statsmodels will automatically add the intercept of regression to the model, so we don’t have to explicitly specify it in the model’s equation:

reg_expr = 'Percent_Households_Below_Poverty_Level ~ Median_Age + Homeowner_Vacancy_Rate + Percent_Pop_25_And_Over_With_College_Or_Higher_Educ'

Build and train the model and print the training summary:

olsr_model = smf.ols(formula=reg_expr, data=df)
olsr_model_results = olsr_model.fit()
print(olsr_model_results.summary())

We see the following summary:

Training summary of the linear model (Image by Author)

We’ll ignore the value of R-squared (or adjusted R-squared of 0.191) reported by statsmodels as our interest lies in estimating the main effects (namely the coefficient estimates), their standard errors and 95% confidence intervals, and whether the coefficients are statistically significant. Let’s zoom into the relevant portion of the output:

Coefficient estimates, their standard errors, p-values and 95% confidence intervals (Image by Author)

We see that all coefficient estimates are significant at p < .001. By default, statsmodels has assumed homoskedastic errors and accordingly used Eq (3), reproduced below, to calculate the covariance matrix of coefficient estimates:

The variance of the OLS estimated coefficients: Var(β_cap) = σ²(X'X)⁻¹ (Eq 3) (Image by Author)

If the errors are not homoskedastic, then the above formula will yield misleading results.

A visual test of heteroskedasticity should tell us if the variance of the errors (as estimated by the variance of the fitted model’s residuals) is constant. We’ll plot the fitted model’s residuals (which are unbiased estimates of the model’s errors) against the fitted (i.e., predicted) values of the response variable, as follows:

#Carve out the y and X matrices
y_train, X_train = dmatrices(reg_expr, df, return_type='dataframe')
#Get the predicted values of y from the fitted model
y_cap = olsr_model_results.predict(X_train)
#Plot the model's residuals against the predicted values of y
plt.xlabel('Predicted value of Percent_Households_Below_Poverty_Level')
plt.ylabel('Residual error')
plt.scatter(y_cap, olsr_model_results.resid)
plt.show()

We see the following scatter plot (the red lines are drawn just to illustrate the heteroskedastic variance):

Plot of the fitted model’s residual errors against the estimated value of the response variable (Image by Author)

The model’s errors are clearly heteroskedastic. The use of the GLS estimator is indicated for this data set.
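Beyond the visual check, a formal test can corroborate this finding. As a quick cross-check (not part of the main tutorial flow), one can run the Breusch-Pagan test that ships with statsmodels; a small p-value supports the heteroskedasticity diagnosis:

from statsmodels.stats.diagnostic import het_breuschpagan

#Test whether the squared residuals are related to the model's regressors
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(olsr_model_results.resid, X_train)
print('Breusch-Pagan LM test p-value:', lm_pvalue)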

To use the GLS estimator, we must first estimate the covariance matrix of errors, namely the Σ (Sigma) matrix.

We’ll use the following procedure for estimating the covariance matrix of errors Σ:

  1. Chop up y_cap into intervals of size 1.
  2. For each interval, collect the residual errors from the OLS fitted model and compute their sample variance.
  3. Regress these sample variances on the centers of the intervals of y_cap. For example, the center of the interval (5, 6] is 5.5.
  4. Use this fitted regression model to estimate the sample variance of the residuals of the OLSR model.
  5. Use these estimated sample variances of residuals as the estimates of the corresponding error term’s population variances.

Let’s implement this procedure.

Create a range object for intervalizing (binning) y_cap:

#Bin edges run from the rounded min to the rounded max of y_cap; fitted values
#falling outside these edges will not land in any interval and are skipped later
interval_range = range(round(min(y_cap)), round(max(y_cap)))

Convert it to a list:

ycap_intervals=list(interval_range)

Intervalize y_cap:

intervalized_ycap=pd.cut(x=y_cap, bins=ycap_intervals)

Print it out to see what it looks like:

print(intervalized_ycap)

We see that the intervalizing has created 37 unique intervals across 3219 y values:

The intervalized fitted y values (Image by Author)

Next, we use a dictionary to collect the set of unique intervals of y_cap. The keys of the dict are the intervals, while the values will be lists of residuals corresponding to those intervals. We’ll fill in the values soon.

distinct_intervals_of_ycap = {}
for intr in list(intervalized_ycap):
    distinct_intervals_of_ycap[intr] = []

Create a DataFrame consisting of fitted (predicted) y values from the OLSR model, the interval that each fitted value lies in, and the residual from the fitted OLSR model corresponding to each fitted y value:

series_dict = {'Y_CAP': y_cap, 'INTR': intervalized_ycap, 'RESID': olsr_model_results.resid}
df_intervalized_ycap = pd.DataFrame(data=series_dict)

Now iterate through the rows of this DataFrame, and for each row, collect the residual, and using the interval as the key, add the residual into the appropriate slot in distinct_intervals_of_ycap dict:

for index, row in df_intervalized_ycap.iterrows():
    resid = row['RESID']
    interval = row['INTR']
    distinct_intervals_of_ycap[interval].append(resid)

Next, construct two lists, one entry per interval. The first will store the midpoints of the intervals of fitted y values. The second will store the sample variance of the residuals corresponding to all the fitted y values that lie within each interval.

midpoints_of_ycap_intervals = []
sample_variance_of_intervalized_residuals = []
for key in distinct_intervals_of_ycap.keys():
    if pd.isnull(key):
        continue
    resids = distinct_intervals_of_ycap[key]
    #Skip intervals with <= 1 residual
    if len(resids) > 1:
        sample_variance_of_intervalized_residuals.append(np.var(a=resids, ddof=1))
        midpoints_of_ycap_intervals.append(key.mid)

Plot the sample variances of intervalized residuals against the midpoints of the intervalized fitted y to get a sense of the relationship between them:

plt.xlabel('Midpoints of intervalized fitted y values')
plt.ylabel('Variance of intervalized residuals')
plt.scatter(x=midpoints_of_ycap_intervals, y=sample_variance_of_intervalized_residuals)
plt.show()

We see the following plot:

Variance of intervalized residuals from the fitted OLS model plotted against the midpoints of the intervalized fitted y values (Image by Author)

From looking at the plot, we suspect an exponential relationship between the variance and y_cap.

An exponential relationship between the variance of residuals and the fitted y values is commonly seen when the errors are heteroskedastic.

Let’s build a log-linear model between the intervalized variances and the intervalized fitted y. Specifically, we’ll regress log(sample_variance_of_intervalized_residuals) on
midpoints_of_ycap_intervals and a constant:

log_sample_variance_of_intervalized_residuals = np.log(sample_variance_of_intervalized_residuals)

Construct a DataFrame consisting of log_sample_variance_of_intervalized_residuals and midpoints_of_ycap_intervals, and fit the log-linear model:

df_ycap_mids_resid_vars = pd.DataFrame(data={
    'LOG_INTERVALIZED_RESID_VARIANCE': log_sample_variance_of_intervalized_residuals,
    'MIDPOINT_INTERVALIZED_YCAP': midpoints_of_ycap_intervals})
ols_resid_variance_model = smf.ols(
    'LOG_INTERVALIZED_RESID_VARIANCE ~ MIDPOINT_INTERVALIZED_YCAP',
    data=df_ycap_mids_resid_vars)
ols_resid_variance_model_results = ols_resid_variance_model.fit()

Use this model to predict the sample variances of all residuals in the original OLSR model in which we regressed Percent_Households_Below_Poverty_Level on Median_Age, Homeowner_Vacancy_Rate, Percent_Pop_25_And_Over_With_College_Or_Higher_Educ and a constant:

df_ycap_from_ols_model = pd.DataFrame(data={'MIDPOINT_INTERVALIZED_YCAP': y_cap})
log_resid_sample_variance_estimate = ols_resid_variance_model_results.predict(
    df_ycap_from_ols_model['MIDPOINT_INTERVALIZED_YCAP'])
resid_sample_variance_estimate = np.exp(log_resid_sample_variance_estimate)

Let’s visually inspect the estimated sample variance of residuals by plotting them against the fitted values from the original OLSR model:

plt.xlabel('Fitted y values')
plt.ylabel('Estimated sample variance of residuals')
plt.scatter(x=y_cap, y=resid_sample_variance_estimate)
plt.show()

We see the following plot:

Estimated sample variance of residuals plotted against the fitted y values from the OLS model (Image by Author)

The plot shows the exponential relationship, which is entirely expected since we are using a log-linear model.

We now have all the ingredients to build the Σ matrix.

We’ll begin by creating an identity matrix of size [n x n] and filling its diagonal with the estimated sample variances of the residuals. Note that a diagonal Σ encodes heteroskedasticity while assuming the errors are uncorrelated across counties:

sigma_matrix = np.identity(len(y_cap))
#resid_sample_variance_estimate was already exponentiated above, so it goes
#onto the diagonal as-is
np.fill_diagonal(sigma_matrix, resid_sample_variance_estimate)

Build and train the GLS model using this Σ matrix:

gls_model = sm.GLS(endog=y_train, exog=X_train, sigma=sigma_matrix)
gls_model_results = gls_model.fit()
print(gls_model_results.summary())

We see the following summary:

Training summary of the linear model estimated using Generalized Least Squares (Image by Author)

Let’s compare the parameter estimates from the OLS model and the GLS model:

data = {'OLS_MODEL': olsr_model_results.params,
        'GLS_MODEL': gls_model_results.params}
print(pd.DataFrame(data=data))

We see the following comparative output:

A comparison of coefficient estimates from the OLS and the GLS estimator (Image by Author)

Let’s also compare the standard errors of the parameter estimates from the OLS model and the GLS model:

data = {'OLS_MODEL': olsr_model_results.bse, 'GLS_MODEL': gls_model_results.bse}
print(pd.DataFrame(data=data))

We see the following comparative output showing what we would expect from the GLS estimated model, namely that its standard errors are much smaller than those from the OLS estimated model:

A comparison of coefficient standard errors from the OLS and the GLS estimator (Image by Author)

The GLS estimated model is seen to be more precise than the OLS estimated model for this data set.
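As an aside, because our Σ matrix is diagonal, the same fit can be obtained with statsmodels’ Weighted Least Squares estimator, with each observation weighted by the reciprocal of its estimated error variance. A minimal sketch, reusing the variables defined above (the equivalence holds only for a diagonal Σ):

#GLS with a diagonal sigma matrix is equivalent to WLS with weights = 1/variance
wls_model = sm.WLS(endog=y_train, exog=X_train, weights=1.0/resid_sample_variance_estimate)
wls_model_results = wls_model.fit()
print(wls_model_results.params)   #should match gls_model_results.params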


References, Citations and Copyrights

Data set

The American Community Survey data set used in this article can be downloaded from here. The complete ACS data set can be pulled from the US Census Bureau’s website using publicly available APIs (see Terms of Service and this link), or directly from the Census Bureau’s Community Explorer website.

Images

All images in this article are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

If you liked this article, please follow me at Sachin Date to receive tips, how-tos and programming advice on topics devoted to regression, time series analysis, and forecasting.

Written by Sachin Date

Learn statistics, one story at a time.