Multiple line plots on a chart
Image by Mediamodifier from Pixabay (Pixabay license)

An Illustrated Introduction to Systems of Regression Equations

How to fit a mesh of inter-dependent regression models using techniques such as GLS.

Sachin Date
25 min read · Dec 9, 2022


In this article, we’ll study a very interesting topic in statistical modeling, namely, how to fit a system of inter-dependent regression models. At first glance, this topic may appear a bit esoteric, but systems of interconnected regression models pop up surprisingly often in real life.

In the rest of this article, we’ll use the terms ‘equation’ and ‘model’ interchangeably.

The concept is best illustrated with an example.

The Capital Asset Pricing Model

The following regression model is commonly used in Finance for estimating the price that the market is willing to pay for a capital asset. Examples of capital assets are a stock or a bond of a company, your house, a barrel of crude oil, or a bushel of corn.

The capital asset pricing model
The capital asset pricing model (Image by Author)

Let’s look at what the above equation is telling us. r_it is the rate of return over some trailing period on asset i at time point t. An example of r_it is the return you will get at time point t after having invested $100 in shares of Ford Motor Company for the previous 3 months.

r_ft is the risk-free rate of return obtained at time point t over the same trailing period. Here, risk-free means that there is no risk whatsoever of losing your principal amount over the specific trailing period. In truth, there is no such thing in this world as a 100% risk-free investment. But in practice, at least in the United States, US treasury bills that become due for payment in 3 months (DTB3) come closest to being considered a risk-free investment. That’s because the chance of the US government running out of money in the next 3 months and being unable to pay back loans (which is what T-bills are) that are due in 3 months is essentially zero.

Thus in equation (1), the term (r_it − r_ft) is the excess, or risk adjusted, rate of return on asset i at time t over a specific trailing period. It is the extra return that one would get upon investing in asset i at time t instead of just having stashed away the money in the risk-free asset f. Notice that this excess return could in theory be negative. If Ford appreciates by only 0.25% over a 3-month period while the 3-month T-bill gives you 0.3% over the same 3-month period, your excess return for investing in Ford over those 3 months is a negative 0.05%! One way to look at this situation is that by investing in Ford instead of the 3-month T-bill, you effectively paid a premium of 0.05% for bearing Ford’s risk during that particular 3-month period.

Getting back to Eq (1), the term r_mt is the return at time t on investing in the overall market in which asset i is traded. For Ford, the market asset ‘m’ could be the S&P 500 index, or perhaps even better, the S&P 500 Automobiles Industry Index (SP500.251020) (although this particular choice of “market” may be debatable). In any case, (r_mt − r_ft) is the risk adjusted return obtained at time t on investing in the market that Ford is traded in.

For NYSE:F, we can state the CAPM as follows:

The capital asset pricing equation for Ford Motor Company’s stock on NYSE
The capital asset pricing equation for Ford Motor Company’s stock on NYSE (Image by Author)

Thus, the Capital Asset Pricing Model (CAPM) regresses the risk adjusted return over a certain trailing time period on asset i at time t, on the risk adjusted return on the corresponding market at t, over the same trailing time period. β_i is the regression coefficient for the regression variable that represents the risk adjusted return for the market.

The Capital Asset Pricing Model is a linear regression model.

There are undoubtedly several factors that influence the return on asset i that the return on the market is not able to capture. If i is a stock such as NYSE:F, there are stock specific factors such as its quarterly financial results, news about major investments announced by Ford during the quarter, and any adverse events such as large-scale vehicle recalls, all of which will impact the market price of NYSE:F.

In eq (2), the portion of the variance in the return on NYSE:F that the return on SP500.251020 is not able to capture, is the portion that leaks into the error term ϵ_(NYSE:F)t.

In general, the error term ϵ_it of the model in Eq (1) explains the variance in (r_it — r_ft) that the regression variable (r_mt — r_ft) is unable to capture.
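Before we move on, here’s a minimal sketch of what fitting Eq (1) for a single asset looks like in practice, using OLS from statsmodels on synthetic returns data. The numbers, the seed, and the ‘true’ beta of 1.2 are all made up purely for illustration:

```python
# A minimal sketch of fitting the CAPM (Eq 1) for one asset with OLS.
# All data below is synthetic and for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
N = 60                                        # 60 time points
r_f = np.full(N, 0.003)                       # risk-free rate, e.g. DTB3
r_m = 0.01 + 0.04 * rng.standard_normal(N)    # market return
r_i = r_f + 1.2 * (r_m - r_f) + 0.02 * rng.standard_normal(N)  # asset return

# Regress the asset's excess return on the market's excess return
X = sm.add_constant(r_m - r_f)                # column of 1s + market excess return
capm = sm.OLS(r_i - r_f, X).fit()
print(capm.params)                            # [alpha_i, beta_i]
```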

How does all of this relate to the topic of this article?

The CAPM as a system of regression equations

If we use the CAPM to estimate the risk adjusted returns on M assets, say all 500 stocks in the S&P 500 index, we would end up with a system of 500 regression equations as follows:

System of regression equations for the capital asset pricing model
A system of regression equations for the capital asset pricing model (Image by Author)

Notice that for each asset i=1,2,…M, the corresponding market may be different. For example, when i=NASDAQ:MSFT, the market could be the S&P 500 Information Technology Sector Index.

The reason to lay out these equations is naturally so that we can estimate the corresponding regression models. To do so, we would need to estimate all 500 regression coefficients β_1, β_2, β_3,…,β_500.

Since each equation in (3) is a linear regression model, a naive approach would be to individually fit the 500 regression models on historical stock price data.

But this estimation strategy will yield imprecise estimated coefficients, for the following reason:

Over any given time period, say 3 months, the return on a stock is often correlated with the return on other related stocks and also the return on other markets that those related stocks trade in.

Consider the following 3-equation system consisting of the CAPMs for Ford, U.S. Steel and GlobalFoundries:

A system of 3 regression equations
A system of 3 regression equations (Image by Author)

In (4a), (4b), (4c), the term r_(DTB3)t is the return on the 3-month T-bill.

Over any medium-to-long term period, the stock price of Ford is influenced by the profit that Ford makes from selling vehicles. This profit is influenced by the price at which Ford buys raw materials to make those vehicles. Steel and computer chips are two raw materials that are used in large quantities for making motor vehicles. These sources of influence are not adequately captured in the regression variable (r_(SP500.251020)t − r_(DTB3)t), namely the risk adjusted return of investing in the S&P 500 Automobiles Industry index, in the regression equation for Ford. We know that in a regression model, what the regression variables cannot explain leaks into the error term. Thus, in the model in (4a), the unexplained variance in (r_(NYSE:F)t − r_(DTB3)t) is captured within the error term ϵ_(NYSE:F)t.

Now, the prices at which companies such as the steel maker U.S. Steel (NYSE:X) and the chip maker GlobalFoundries (NASDAQ:GFS) are able to sell their steel and computer chips to companies such as Ford directly determine the profits of those steel and computer chip companies. And those profits in turn guide their stock prices and so also the risk adjusted returns on the investments in those stocks. Again, these sources of influence, namely the sales prices and the profit on steel and computer chips, are not adequately captured in the corresponding market rates of return regression variables in the models (4b) and (4c). And therefore, these influences leak into the error terms ϵ_(NYSE:X)t and ϵ_(NASDAQ:GFS)t of the respective models.

What we have inferred is that for any given time frame t such as 3 months, the error term ϵ_(NYSE:F)t of the capital asset pricing model for Ford is correlated with the error terms ϵ_(NYSE:X)t and ϵ_(NASDAQ:GFS)t of the capital asset pricing models for U.S. Steel and GlobalFoundries respectively.

So the regression equations (4a), (4b), and (4c) aren’t really independent. In fact, they are chained to one another through their correlated error terms.

The effect of correlated error terms

What we showed in the previous section is an important observation. There is information contained within these correlations which we would not be able to harvest if we estimate the three equations (4a), (4b), and (4c) as three different linear regression models.

If we fit the three linear models individually using Ordinary Least Squares (OLS), this unharvested information will leak into the residual errors of the respective fitted models, making the residual errors bigger. Bigger residual errors mean a poorer goodness-of-fit for the respective fitted model.

Secondly, the fitted models will exhibit a greater amount of variance in their estimated coefficients. Greater variance in coefficients will lead to larger standard errors and wider confidence intervals for each coefficient. These are all signs of a sub-optimally fitted model.

We have thus come upon the following result:

Ordinary Least Squares estimation is inefficient when used to individually fit the linear regression models contained in systems of regression equations in which the error terms from different models are correlated.

In such cases, the correct strategy is to fit the entire system of regression equations as a single model using a technique such as Generalized Least Squares (GLS) or Feasible Generalized Least Squares (FGLS).

Let’s formalize these observations.

A framework for a system of regression equations

Consider the following set of M linear regression models:

A system of M linear regression models
A system of M linear regression models (Image by Author)

Let’s assume that we have fed each model with N samples of data. That would mean that y_1, y_2, y_3,…,y_M are all vectors of size [N x 1].

Let’s look at what the y vector for the M-th equation looks like:

The response variable vector for the Mth equation
The response variable vector for the Mth equation (Image by Author)

In the above vector, y_tM is the t-th sample in the data set that is fed into the Mth equation in the system.

X_1, X_2, X_3, …,X_M are matrices containing observed values of regression variables. Let X_1 contain N observations each for k_1 variables including the regression intercept at column 1. Similarly, X_2, X_3, …, X_M contain N observations each for k_2, k_3,…k_M regression variables. That would make X_1 a matrix of size [N x k_1], X_2 of size [N x k_2], and so on and X_M of size [N x k_M].

Here’s what the regression matrix for the M-th equation looks like:

The regression variables matrix for the Mth equation (Image by Author)

In the above matrix, the first column is a column of 1s and it acts as the placeholder for the regression intercept β_1M of the M-th equation. The remaining columns contain regression variables. The M-th regression model contains k_M variables. Hence, X_M contains k_M columns.

Let’s look closely at the subscript-structure of each element in this matrix. For any element x_tjm in X_m, the first subscript on x_tjm refers to the data sample number, the second subscript indicates the regression variable number, and the third subscript indicates the equation number.

Observed value of the j-th regression variable in the t-th sample in the data for the m-th regression model
Observed value of the j-th regression variable in the t-th sample in the data for the m-th regression model (Image by Author)

Thus, element x_tjm in this matrix is the observed value of the j-th regression variable in the t-th sample in the data set that was fed into the m-th regression model in the system.

Even though the elements in X_M (or in general X_m) are triple-subscripted, we represent X_m as a two-dimensional matrix. The third subscript (m) just serves the purpose of reminding us of the equation number (m) in the system that the variable belongs to.

Some or all of the regression variables may be common amongst the different X matrices.

Let’s look at the coefficients.

In the system of M equations, β_1, β_2, …, β_M are the coefficient vectors of sizes [k_1 x 1], [k_2 x 1], …, [k_M x 1]. Here’s what the coefficients vector for the M-th equation looks like. The first subscript on β refers to the coefficient number, and the second subscript indicates the equation number:

The regression coefficients vector for the M-th regression model
The regression coefficients vector for the M-th regression model (Image by Author)

ϵ_1, ϵ_2, …, ϵ_M are vectors of regression errors, each of size [N x 1]. Here’s what the regression errors vector for the M-th equation looks like. The first and second subscripts on ϵ have the same meanings as for β:

The vector of regression errors for the M-th model
The vector of regression errors for the M-th model (Image by Author)

And here’s how these matrices come together to form the regression equation of the M-th model in the system:

The regression equation in matrix form for the M-th model in the system
The regression equation in matrix form for the M-th model in the system (Image by Author)
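Here’s a tiny NumPy sketch (with made-up numbers) that checks these dimensions for a single equation with N = 5 samples and k_m = 2 variables:

```python
# A toy dimensional check of y_m = X_m @ beta_m + eps_m for one equation.
import numpy as np

rng = np.random.default_rng(0)
N, k_m = 5, 2
X_m = np.column_stack([np.ones(N), rng.standard_normal(N)])  # [N x k_m]
beta_m = np.array([0.5, 1.2])                                # [k_m x 1]
eps_m = 0.1 * rng.standard_normal(N)                         # [N x 1]
y_m = X_m @ beta_m + eps_m                                   # [N x 1]
print(y_m.shape)                                             # (5,)
```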

Let’s once again circle back to the system of M equations:

A system of M linear regression models
A system of M linear regression models (Image by Author)

We have seen earlier in the context of the CAPM how, given the regression models for two different stocks p and q (e.g. NYSE:F and NASDAQ:GFS), and any given time t, the regression errors ϵ_tp and ϵ_tq from the respective models for p and q are correlated with each other. Such correlations make it impossible to efficiently estimate the two regression models individually using a technique such as OLS.

In general, even though the above set of M equations appears at first glance to be unrelated, the equations may, in reality, be intimately related to each other, and therefore they need to be solved as a single system. Such a system of inter-connected regression equations is called the Seemingly Unrelated Regression (SUR) model.

Setting the stage for the estimation of the SUR model

To estimate the SUR model, the first step is to “stack” the y, X, β, and ϵ matrices from the M equations one over the other as follows:

The Seemingly Unrelated Regression Model expressed using block matrices y, X, β, and ϵ
The Seemingly Unrelated Regression Model expressed using block matrices y, X, β, and ϵ (Image by Author)

The y matrix (vector) on the left is constructed by stacking the M number of y matrices (vectors) from the M equations. Each y matrix is of size [N x 1]. Hence the stacked y matrix is a column vector of size [MN x 1].

Similarly, the matrix of coefficients β is a stacked column matrix (vector) of size [(k_1+k_2+…+k_M) x 1].

The regression errors matrix ϵ has the same dimensions as the stacked y matrix namely a column vector of size [MN x 1].

Let’s look at the stacked X matrix. It has the following interesting structure:

The structure of the stacked X matrix (Image by Author)

The stacked X matrix illustrates a concept in linear algebra called block matrices. Notice how the individual X matrices from the M equations are arranged along the main diagonal. Each one of these matrices X_1, X_2, …, X_m, …, X_M has N rows corresponding to the N data samples fed into each equation. Hence the stacked X matrix has MN rows. Each matrix X_m has potentially a different number of columns k_m. The stacked X contains (k_1+k_2+…+k_m+…+k_M) columns. All off-diagonal positions in X are occupied by 0, which is a matrix containing only zeros. The 0 matrix lying at block-row i and block-column m in X is of size [N x k_m].

The matrix X is called a block matrix as it is made up of “blocks”. Here, a block is used to refer to the matrices X_1, X_2,…,X_M and the 0 matrices.

Similarly, matrices y, β, and ϵ are also block matrices.
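Here’s a short sketch of how this stacking could be assembled in NumPy for a synthetic M = 3, N = 5 system; scipy’s block_diag does the diagonal placement of the X_m blocks:

```python
# Assembling the stacked block matrices for M = 3 equations,
# N = 5 samples each, k_m = 2 variables each. Data is synthetic.
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)
M, N = 3, 5
Xs = [np.column_stack([np.ones(N), rng.standard_normal(N)]) for _ in range(M)]
ys = [rng.standard_normal(N) for _ in range(M)]

X = block_diag(*Xs)        # [MN x (k_1+k_2+k_3)] = [15 x 6]
y = np.concatenate(ys)     # [MN x 1] = [15 x 1]
print(X.shape, y.shape)    # (15, 6) (15,)
```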

The SUR model for a system of three regression equations

Let’s revisit our example 3-equation system:

A system of 3 regression equations
A system of 3 regression equations (Image by Author)

In the above system, there are three CAPM models, one each for the stocks of Ford, U.S. Steel, and GlobalFoundries. Thus, M = 3.

Each model contains 2 regression variables:

  1. a stock-specific intercept α, and
  2. the risk adjusted return for the corresponding market asset, such as the S&P metals index.

Thus, there are two coefficients to be estimated for each model, i.e. k_1=k_2=k_3=2. And K=(k_1+k_2+k_3) = 6 coefficients to be estimated in all.

Just for the purpose of illustration, we’ll consider a tiny subset of a made-up data set that contains the Risk Adjusted Returns (RAR) for the three stocks over five time points, i.e., each model is fed a data set of 5 samples. Thus N = 5.

Here’s what this tiny slice of the data looks like:

A 5-sample data set of risk adjusted returns of three stocks
A 5-sample data set of risk adjusted returns of three stocks (Image by Author)

Now let’s look at how the block matrices for this data will look like.

Here’s the block y matrix of size [15 x 1]:

The block y matrix
The block y matrix (Image by Author)

Following is the block matrix of coefficients:

The block β matrix
The block β matrix (Image by Author)

Here’s the block X matrix:

The block X matrix
The block X matrix (Image by Author)

Notice how the individual X_m matrices from the three equations are fitted along the main diagonal while 0 matrices are fitted in the off-diagonal places. The individual X_m matrices are each of size [5 x 2]. Similarly, all 0 matrices are of size [5 x 2] since in our sample system, each equation contains 2 regression variables. The block X matrix is of size [15 x 6].

And here’s the block matrix of error terms. It has the same size as the block y matrix, namely [15 x 1]. The first subscript on each error term refers to the sample number (1/2/3/4/5) in the respective data set while the second subscript refers to the equation number (1/2/3).

The block matrix of errors (Image by Author)

Let’s knock together all the block matrices into a single SUR model:

The SUR model for a system of 3 regression equations
The SUR model for a system of 3 regression equations (Image by Author)

We may want to verify that the above matrix equation is dimensionally correct. X is of size [15 x 6] and β is of size [6 x 1]. So Xβ is of size [15 x 1]. Add to that ϵ of size [15 x 1] and we get y of size [15 x 1].

Is SUR a classical linear model?

From the looks of it, we may be tempted to identify the “block-matrixed” SUR model illustrated above as a classical linear model and fit it using straight-up OLS. But that would be a grave mistake. And that’s because one of the fundamental assumptions of the classical linear model is that there is no correlation amongst the errors of the model, and we have already seen how the capital asset pricing equation for Ford contains an error term that is correlated with the error terms of the capital asset pricing equations of U.S. Steel and GlobalFoundries.

Since the SUR model in our example is not a classical linear model, it cannot be efficiently estimated using Ordinary Least Squares. So how do we estimate it?

A promising approach is to use Generalized Least Squares (GLS) or Feasible Generalized Least Squares (FGLS), which I have covered in detail in this article (and a tutorial on GLS in this article).

GLS (or FGLS) requires us to specify the covariance matrix of the error terms of the model. So let’s inspect the fine-grained structure of the covariance matrix of regression errors of the SUR model. We’ll use our 3-equation regression system of asset pricing models to guide us through this process.

The covariance matrix of errors of the SUR model

The covariance matrix of errors is obtained by multiplying the errors matrix ϵ with its transpose ϵ’ and dividing the result by N (the number of samples fed into each model). The transpose of ϵ is simply ϵ with its rows and columns interchanged. Here’s what the transpose of the block error matrix of the 3-equation SUR model looks like:

The transpose of ϵ
The transpose of ϵ (Image by Author)

The covariance of ϵ, denoted as Cov(ϵ), is ϵϵ’/N.

Let’s take a closer look at this matrix multiplication. The block error matrix ϵ is of size [15 x 1]. Its transpose ϵ’ is of size [1 x 15]. Hence ϵϵ’ is a square matrix of size [15 x 15]. In the general case, if ϵ is of size [MN x 1], the covariance matrix of errors is a square matrix of size [MN x MN], and it contains (MN)² covariance terms.
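Here’s a quick numerical illustration of these dimensions on a synthetic stacked error vector (the division by N follows the convention used above):

```python
# A numerical sketch of Cov(eps) = eps eps' / N for M = 3, N = 5.
import numpy as np

M, N = 3, 5
rng = np.random.default_rng(7)
eps = rng.standard_normal(M * N)      # the stacked [15 x 1] error vector
cov = np.outer(eps, eps) / N          # eps eps'/N: a [15 x 15] matrix
print(cov.shape)                      # (15, 15)
```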

Following is an important property of the covariance matrix of errors to keep in mind:

The diagonal elements of the covariance matrix contain the variance of the individual error terms, while the off-diagonal elements contain the covariance between different error terms.

Here’s what the covariance matrix of errors looks like for the 3-equation SUR model (you may want to zoom into the figure to take in the detail):

The covariance matrix of error terms of the 3-equation SUR model
The covariance matrix of error terms of the 3-equation SUR model (Image by Author)

The matrix multiplication of ϵ with its transpose has caused the three blocks in the block ϵ matrix to multiplex into 9 different blocks (shown with thick block borders).

For any ϵ_ti in the covariance matrix, the first subscript on ϵ represents the data sample number and second subscript represents the model number in the equations system.

Thus, ϵ_ti is the error term for the t-th data sample of the i-th model in the system. Ditto for ϵ_sj. For example, ϵ_12 is the error term for the first data sample of the CAPM for U.S. Steel stock, which is equation #2 in the 3-equation system.

An element (ϵ_ti)(ϵ_sj)/5 in the matrix represents the covariance between the error terms ϵ_ti and ϵ_sj.

Let’s zoom into a portion of the main diagonal of this matrix:

A zoomed in view of the portion of the diagonal of the covariance matrix
A zoomed in view of the portion of the diagonal of the covariance matrix (Image by Author)

The diagonal elements of Cov(ϵ) are ϵ²_ti/5 where t=1,2,…,5 for the 5 data samples supplied to each CAPM model, and i=1,2,3 for each of the three CAPM models in the system. We make the following observation:

The diagonal elements of the covariance matrix of errors of the SUR model are the variances of the error terms of the individual models within the equations system.

For example, ϵ²_32/5 is the variance of the error term corresponding to data sample #3 supplied to the second CAPM in the system, i.e. the model for the returns on U.S. Steel stock.

The off-diagonal elements of the covariance matrix also have an interesting covariance structure. Let’s zoom into a single block of off-diagonal elements:

A block of off-diagonal elements of the covariance matrix of errors of the SUR model
A block of off-diagonal elements of the covariance matrix of errors of the SUR model (Image by Author)

We’ll look at the blue-bordered diagonal elements of this block first. Notice how for each (ϵ_ti)(ϵ_sj), the first subscript on the two ϵs is the same. The first subscript indicates the data sample number fed into each equation. For our 3-equation stock pricing model, the first subscript indicates a specific time point at which the trailing 3-month returns of all stocks and market indices were measured. Thus, all diagonal elements of this block contain the covariances between the error terms of different models in the system for the same time point (whatever that time point might be).

Incidentally, the 15 elements along the main diagonal of the covariance matrix already satisfy this property as they have the form (ϵ_ti)(ϵ_ti).

We’ll make a quick note of this property.

The diagonal elements of all blocks in the covariance matrix of errors of the SUR model contain the covariances between the error terms of the individual models within the equations system, at the same time point.

Let’s highlight all such diagonally positioned blocks:

The covariance matrix of the blocked errors with the diagonal elements in each block highlighted.
The covariance matrix of the blocked errors with the diagonal elements in each block highlighted. (Image by Author)

Let’s once again inspect the zoomed in block we inspected earlier and note the orange bordered cell:

A block of off-diagonal elements of the covariance matrix of errors of the SUR model
A block of off-diagonal elements of the covariance matrix of errors of the SUR model (Image by Author)

This cell contains the covariance (ϵ_11)(ϵ_42). Notice how all four subscripts are different. This is a characteristic of all cells in this block, and also of all blocks which do not lie along the main diagonal of the blocked covariance matrix. These covariances are of error terms from different equations in the system and across different time points. In time series settings in which the ‘t’ and ‘s’ subscripts represent time points (as also in cross-sectional settings where they represent individuals), it’s uncommon for the error terms of different equations in a system to be correlated across two different time points (or individuals). We’ll soon see the advantages resulting from setting all such covariances to zero.

Let’s once again zoom into one of the blocks along the main diagonal of the blocked covariance matrix:

A block along the main diagonal of the covariance matrix of errors of the SUR model
A block along the main diagonal of the covariance matrix of errors of the SUR model (Image by Author)

All off-diagonal elements in this block contain covariances of the type (ϵ_ti)(ϵ_sj) where i = j = 1. That is, all these elements contain the covariance of errors from a particular equation in the system (in the above case, equation 1: the CAPM for Ford) across different time points. Depending on the modeling scenario, such auto-correlations of errors within individual equations in a system can be assumed to be zero, thereby greatly simplifying the estimation of the SUR model.

Estimation of the Seemingly Unrelated Regression model

As mentioned earlier, the Generalized Least Squares (GLS) technique (covered in depth in this article) provides a promising avenue for estimation of the SUR model.

The argument for the GLS technique goes along the following lines:

Given a linear regression model y = Xβ + ϵ, we wish to estimate the coefficient vector β.

But the estimation of β using Ordinary Least Squares (OLS) will be inefficient (imprecise) if the errors are heteroskedastic (non-constant variance) and/or correlated.

In the previous section, we have seen that the error terms of the Seemingly Unrelated Regression model form a covariance structure that is both heteroskedastic and correlated.

Therefore, we must use a different kind of estimator for estimating β. One such estimator that takes into account both heteroskedasticity and correlation between errors is the Generalized Least Squares estimator.

The GLS estimator of β is given by the following matrix equation:

The Generalized Least Squares Estimator for β
The Generalized Least Squares Estimator for β (Image by Author)

Let’s interpret this equation in the context of the SUR model.

In the above equation, β_cap is the estimated value of β. In the context of the SUR model, β is the block matrix we reviewed earlier. Here’s what it looks like for the 3-equation CAPM system:

The block β matrix
The block β matrix (Image by Author)

In general, β_cap is of size [K x 1].

In equation (5), X’ is the transpose of the block X matrix we looked at earlier. X is of size [MN x K]. Let’s reproduce it below for reference:

The structure of the stacked X matrix (Image by Author)

X’ is of size [K x MN].

y is the block matrix of dependent variables of size [MN x 1]. In our 3-equation CAPM system fed with 5 data samples per equation, it looks like this:

The block y matrix
The block y matrix (Image by Author)

The only new matrix in Equation (5) is the Ω (Omega) matrix. Ω is the covariance matrix of regression errors. As we have seen, Ω is a square matrix of size [MN x MN].

For the 3-equation SUR model fed with 5 samples each, here is Ω in its full glory:

The covariance matrix of error terms of the 3-equation SUR model
The Ω matrix for the 3-equation SUR model (Image by Author)

Equation (5) gives us the following insight:

The matrices X, X’ and y are all known quantities.

If Ω is known, the GLS estimator shown in equation (5) can be used to efficiently estimate all K coefficients in the system of equations contained within the SUR model.
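Here’s Eq (5) transcribed directly into NumPy. It’s a minimal sketch that assumes X, y, and Omega hold the stacked matrices described above:

```python
# A direct transcription of Eq (5): beta_hat = (X' inv(Omega) X)^-1 X' inv(Omega) y
import numpy as np

def gls_estimate(X, y, Omega):
    """Generalized Least Squares estimate of beta."""
    Omega_inv = np.linalg.inv(Omega)
    XtOi = X.T @ Omega_inv
    # Solve (X' inv(Omega) X) beta = X' inv(Omega) y for beta
    return np.linalg.solve(XtOi @ X, XtOi @ y)
```

Setting Omega to the identity matrix reduces gls_estimate to plain OLS, which makes for a quick sanity check.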

But now we run into a practical difficulty.

Ω contains the correlations between population-level regression errors which are inherently unobservable to the experimenter.

We will overcome this difficulty as follows:

  1. By making two key assumptions which we’ll soon describe, we will simplify the structure of Ω thereby making Ω tractable.
  2. We will fit the individual equations in the SUR model using Ordinary Least Squares and we’ll use the residual errors from those fitted models as unbiased estimates of the corresponding regression errors of the SUR model.

Making Ω tractable

Let’s make the following assumption about the fine-grained structure of the covariances of errors in Ω:

In a system of regression equations, the regression errors from individual equations are assumed to not be correlated across different time points (or across different individuals).

Recollect that for any regression error ϵ_ti in the system, the first subscript refers to the data sample number, in other words a specific time point or individual in the system, and the second subscript refers to the equation number in the system.

The assumption we are making is that:

For any ϵ_ti and ϵ_sj in Ω, (ϵ_ti)(ϵ_sj) = 0 if t != s.

The effect of this assumption is to make all such elements in Ω zero. After applying this assumption to the 3-equation CAPM system, the simplified Ω matrix looks like this (for simplicity, imagine it’s divided by 5):

The Ω matrix for the 3-equation SUR model after assuming that errors are not correlated across time points or across individuals (Image by Author)

The second assumption we’ll make is one of homoskedasticity of errors within and across equations. Specifically, we’ll assume the following:

In a system of regression equations, the regression errors for any individual equation (or pair of equations) are homoskedastic i.e. they demonstrate constant variance.

This assumption yields the following result:

For any ϵ_ti and ϵ_tj in Ω, (ϵ_ti)(ϵ_tj) = σ²_ij

The effect of this assumption is that all shaded elements in Ω attain a block-specific constant value. After applying this assumption to the 3-equation CAPM system, the simplified Ω matrix looks like this:

The Ω matrix for the 3-equation SUR model after assuming that errors are not correlated across time points or across individuals, AND, the regression errors for any individual equation (or pair of equations) are homoskedastic
The Ω matrix for the 3-equation SUR model after assuming that errors are not correlated across time points or across individuals, AND, the regression errors for any individual equation (or pair of equations) are homoskedastic (Image by Author)

Notice that in spite of these two assumptions, the errors of the SUR model aren’t necessarily homoskedastic (constant variance) or uncorrelated. A glance down the main diagonal of the above matrix reveals that σ²_11, σ²_22, and σ²_33 may not all turn out to be equal in value. Several off-diagonal elements σ²_ij (where i != j) may also be nonzero. In the face of likely heteroskedasticity and correlated errors, the OLS estimator continues to be inappropriate for estimating this model.

Having said that, the Ω matrix in its simplified form can now be stated in the following compact manner:

The simplified Ω matrix for the 3-equation system represented in block matrix form
The simplified Ω matrix for the 3-equation system represented in block matrix form (Image by Author)

In the above representation, I is a [5 x 5] Identity matrix, i.e. a matrix that has ones along its main diagonal and zeros at all other positions. I is of size [5 x 5] because we have fed each equation in our 3-equation system with 5 data samples.

We can represent Ω even more concisely by pulling out all nine σ² elements into their own [3 x 3] matrix that we’ll call Σ (capital Sigma) and performing what’s known as a Kronecker product of Σ with a [5 x 5] Identity matrix:

The covariance matrix of errors for the 3-equation system expressed as a Kronecker product
The covariance matrix of errors for the 3-equation system expressed as a Kronecker product (Image by Author)

The Kronecker product of Σ and I is computed as follows:

  1. We create a [3 x 3] placeholder grid of blocks, one block position per element of Σ.
  2. Next, we multiply each element σ²_ij in Σ with the entire [5 x 5] Identity matrix, giving us a [5 x 5] diagonal matrix having σ²_ij along its main diagonal.
  3. We place this [5 x 5] diagonal matrix at the (i, j)th block position of the placeholder grid.
  4. We repeat steps 2 and 3 for each i,j=1, 2, 3 to create a matrix of size [15 x 15].

The following figure illustrates how the Kronecker product is built block-by-block:

An illustration of how the Kronecker product Σ ⊗ I is built to yield Ω
An illustration of how the Kronecker product Σ ⊗ I is built to yield Ω (Image by Author)

In general, if a SUR model is constructed over a system of M equations, the Σ matrix is of size [M x M]. If each equation in the system is fed N data samples, the I matrix is of size [N x N]. And Σ ⊗ I yields the covariance matrix of errors Ω for the entire system, of size [MN x MN], which is exactly as desired.

The covariance matrix of errors Ω calculated as the Kronecker product Σ ⊗ I
The covariance matrix of errors Ω calculated as the Kronecker product Σ ⊗ I (Image by Author)
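In NumPy, this entire construction is a one-liner via np.kron; the Σ values below are made up purely for illustration:

```python
# Building Omega = Sigma ⊗ I for M = 3, N = 5 with NumPy's Kronecker product.
import numpy as np

Sigma = np.array([[0.9, 0.3, 0.2],
                  [0.3, 1.1, 0.4],
                  [0.2, 0.4, 0.8]])   # made-up [M x M] error (co)variances
Omega = np.kron(Sigma, np.eye(5))     # [MN x MN] = [15 x 15]
print(Omega.shape)                    # (15, 15)
```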

Let’s circle back to GLS.

Our goal is to estimate the coefficient vector β of the SUR model. We saw that we could use Generalized Least Squares to efficiently estimate β. The GLS estimator mentioned earlier in Eq (5) is reproduced below.

The Generalized Least Squares Estimator for β
The Generalized Least Squares Estimator for β (Image by Author)

To estimate β, we need to know X, X’, y, and the inverse of the covariance matrix of errors Ω.

X, X’, and y are observed quantities. We saw how we can derive a simplified version of Ω as the Kronecker product of Σ and I.

It can be shown that the inverse of Ω is simply the Kronecker product of the inverse of Σ and I. Namely, the following:

An identity involving the Kronecker product of two matrices
An identity involving the Kronecker product of two matrices (Image by Author)

Therefore, what remains to be estimated is Σ.
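Here’s a quick numerical check of this identity, reusing the made-up Σ from the previous sketch:

```python
# Numerical check of (Sigma ⊗ I)^-1 = (Sigma^-1) ⊗ I.
import numpy as np

Sigma = np.array([[0.9, 0.3, 0.2],
                  [0.3, 1.1, 0.4],
                  [0.2, 0.4, 0.8]])
I = np.eye(5)
lhs = np.linalg.inv(np.kron(Sigma, I))
rhs = np.kron(np.linalg.inv(Sigma), I)
print(np.allclose(lhs, rhs))          # True
```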

Estimation of the Σ matrix

Let’s revisit the [3 x 3] Σ matrix for the 3-equation SUR model consisting of stocks of Ford, U.S. Steel, and GlobalFoundries:

The Σ matrix for the 3-equation SUR model
The Σ matrix for the 3-equation SUR model (Image by Author)

The values along the main diagonal of Σ are the variances of the errors for the individual models in the 3-equation system. These can be easily estimated by fitting the individual models using OLS, collecting the residual errors from the 3 models, and calculating the variance of the residual errors.

For example, we would fit the CAPM for NYSE:F illustrated in Eq (4a) using OLS over the data set of 5 samples (in a real setting, the data set would undoubtedly be a lot larger!).

After fitting this model using OLS, we would calculate the residual errors of the fitted model by subtracting the values estimated by the fitted model from the observed values of the Risk Adjusted Return on Ford. We’ll get 5 such residual errors: e_11, e_21, e_31, e_41, e_51, corresponding to the five data samples. As before, the first subscript refers to the sample number and the second subscript indicates the equation number.

Next, we would calculate the variance s²_11 of these five residuals using the standard formula for sample variance as follows:

Sample variance of the residual errors from the first (fitted) model in the system
Sample variance of the residual errors from the first (fitted) model in the system (Image by Author)

Here, e_bar_1 is the mean of the 5 residual errors. The mean of the residual errors of a linear model fitted using OLS is always zero. Hence, e_bar_1 can always be assumed to be zero. The denominator contains a correction of 2 for the degrees of freedom, as this model contains two variables: the intercept, and the RAR on the respective market.

Thus, we fit each of the three models (or in general, M models) in the system using OLS, calculate the respective residual errors from the fitted models, and get their variance s²_ii using the following general formula:

The sample variance of the residuals from the i-th model in the system containing k_i regression variables
The sample variance of the residuals from the i-th model in the system containing k_i regression variables (Image by Author)

The sample variances (or covariances) from an OLS estimated linear model happen to be unbiased estimates of the corresponding population variances (or covariances). Hence, we can use s²_ii as an estimate for the corresponding diagonal element σ²_ii in Σ.

Now let’s look at the off-diagonal elements σ²_ij in Σ. These are the covariances between the regression errors from the i-th and the j-th regression models in the system, with the constraint that the covariances are calculated across the same time points.

Once again, we will appeal to the above result about sample and population covariances and use the following formula for sample covariance to estimate the corresponding population covariance σ²_ij:

The sample covariance of the residuals from the i-th and j-th models in the system containing respectively k_i and k_j regression variables
The sample covariance of the residuals from the i-th and j-th models in the system containing respectively k_i and k_j regression variables (Image by Author)

Using the above techniques, we can construct the following [3 x 3] (or in general, an M x M) S matrix that will be an unbiased estimate of the Σ matrix:

The matrix of sample variances and covariances of residual errors
The matrix of sample variances and covariances of residual errors (Image by Author)
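Here’s a sketch of this construction in NumPy on synthetic data. With real data, Xs and ys would hold each equation’s regression matrix and dependent variable, and the divisor (N − k) mirrors the degrees-of-freedom correction described above (assuming, for simplicity, that every equation has the same k variables):

```python
# Building S from per-equation OLS residuals (synthetic data).
import numpy as np

rng = np.random.default_rng(3)
M, N, k = 3, 5, 2
Xs = [np.column_stack([np.ones(N), rng.standard_normal(N)]) for _ in range(M)]
ys = [rng.standard_normal(N) for _ in range(M)]

def ols_residuals(Xm, ym):
    """Fit one equation by OLS and return its residual errors."""
    beta_hat = np.linalg.lstsq(Xm, ym, rcond=None)[0]
    return ym - Xm @ beta_hat

E = np.column_stack([ols_residuals(Xm, ym) for Xm, ym in zip(Xs, ys)])  # [N x M]
S = E.T @ E / (N - k)   # [M x M] matrix of sample (co)variances
```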

With the S matrix in place, we will proceed with the GLS (more appropriately, Feasible GLS) estimation of β as follows (The Est. (…) indicates an estimated value):

Estimation of coefficients of the Seemingly Unrelated Regression Model
Estimation of coefficients of the Seemingly Unrelated Regression Model (Image by Author)

Since the ‘cap’ in β_cap already indicates that it is the estimated value of β, strictly speaking, the Est (…) notation is not needed for β_cap.

Equation (6) is the estimator that we have been seeking for estimating the system of regression equations using the SUR model.

The beauty of this estimator is that once we calculate S as an estimate of Σ, equation (6) lets us do an essentially single-shot estimation of all regression coefficients in the system of regression equations using an estimator that is unbiased, consistent, and efficient. This estimator allows for non-constant variances a.k.a. heteroskedasticity in the errors of the system, and it takes into account the intricate relationships between the different regression equations in the system without losing its precision.
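To tie all the pieces together, here’s an end-to-end sketch of the FGLS recipe in Eq (6) on a synthetic 3-equation system (all data, dimensions, and seeds below are made up):

```python
# End-to-end FGLS sketch for Eq (6) on a synthetic 3-equation system.
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(5)
M, N, k = 3, 60, 2
Xs = [np.column_stack([np.ones(N), rng.standard_normal(N)]) for _ in range(M)]
betas = [np.array([0.1, 1.0 + 0.2 * m]) for m in range(M)]

# Draw errors that are correlated across equations at the same time point
Sigma_true = np.array([[1.0, 0.5, 0.3],
                       [0.5, 1.0, 0.4],
                       [0.3, 0.4, 1.0]])
eps = rng.standard_normal((N, M)) @ np.linalg.cholesky(Sigma_true).T * 0.05
ys = [Xs[m] @ betas[m] + eps[:, m] for m in range(M)]

# Step 1: per-equation OLS residuals -> S, the estimate of Sigma
E = np.column_stack([ys[m] - Xs[m] @ np.linalg.lstsq(Xs[m], ys[m], rcond=None)[0]
                     for m in range(M)])
S = E.T @ E / (N - k)

# Step 2: FGLS on the stacked system, with Est(Omega)^-1 = (S^-1) ⊗ I
X, y = block_diag(*Xs), np.concatenate(ys)
Omega_inv = np.kron(np.linalg.inv(S), np.eye(N))
XtOi = X.T @ Omega_inv
beta_fgls = np.linalg.solve(XtOi @ X, XtOi @ y)
print(beta_fgls.reshape(M, k))   # one (intercept, slope) row per equation
```

For real-world work, the SUR class in the Python linearmodels package implements this same estimate-Σ-then-GLS procedure, so you don’t have to hand-roll it.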

References, Citations and Copyrights

Images

All images in this article are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

If you liked this article, please follow me at Sachin Date to receive tips, how-tos and programming advice on topics devoted to regression, time series analysis, and forecasting.
