Introduction

This tutorial is about Structural Equation Models (SEM).1 SEM is a system of multi-layered multiple regressions (you can read that as “messed up regression”). To better understand SEM, we will start with a brief recall of multiple linear regression.

The following sections include a graphical representation of SEM, an introduction to the R package sem, calibration of an SEM on an example data set, and interpretation of the output.

SEM is not a trivial topic; to understand it well, one must go further than this tutorial. There are pitfalls and rules about building an SEM, such as identification issues, and there are more extensive use cases (e.g. SEM with categorical variables or missing data). I will not include them in this tutorial (yet). You can find further explanations in professor Fox’s documents.

Multiple Linear Regression (Recall)

The common representation of a multiple linear regression is as follows.

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon \]

where \(y\) is the response variable, \(x_i, i = 1 \dots n\) are explanatory variables, \(\beta_i\) are coefficients and \(\epsilon\) is the error term. For example, suppose we hypothesize that the effectiveness of the elected government (\(y\)) can be measured by the effects of GNP per capita (\(x_1\)), energy consumption per capita (\(x_2\)) and percentage of the labor force in industry (\(x_3\)). We then need the respective data for at least some countries to test our hypothesis. The following example data set is from Bollen2 and is included in the sem R package (we will use the full version in the next section). (Warning: the data, I believe, is standardized in some way, so the raw numbers might not make sense.)
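The data frame Bollen_mlr_data used below is not constructed in the text. A minimal sketch of how it could be built follows; the name Bollen_mlr_data and the choice of y4 as the response are my assumptions (the printed y values match the y4 column of the full Bollen data shown in the next section).

library(sem)    # provides the Bollen data set
data("Bollen")
# assumed construction: the fourth democracy measure (y4) as the response,
# the three industrialization measures as the covariates
Bollen_mlr_data <- data.frame(y = Bollen$y4, Bollen[, c("x1", "x2", "x3")])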

head(Bollen_mlr_data)
##          y       x1       x2       x3
## 1 0.000000 4.442651 3.637586 2.557615
## 2 0.000000 5.384495 5.062595 3.568079
## 3 9.199991 5.961005 6.255750 5.224433
## 4 9.199991 6.285998 7.567863 6.267495
## 5 6.666666 5.863631 6.818924 4.573679
## 6 6.666666 5.533389 5.135798 3.892270

The job here is to estimate the coefficients (\(\beta_i\)) of the covariates (explanatory variables, \(x_i\)) and check whether the model is valid and representative enough. Since MLR is not the focus here, I will keep it brief with the following code.

summary(lm(formula = y ~ x1 + x2 + x3, data = Bollen_mlr_data))
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = Bollen_mlr_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4833 -2.1042  0.0612  2.2209  6.5616 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.186938   3.448869  -1.504    0.137
## x1           1.658357   1.084370   1.529    0.131
## x2           0.264084   0.601136   0.439    0.662
## x3          -0.002261   0.480561  -0.005    0.996
## 
## Residual standard error: 3.015 on 71 degrees of freedom
## Multiple R-squared:  0.2224, Adjusted R-squared:  0.1896 
## F-statistic:  6.77 on 3 and 71 DF,  p-value: 0.0004429

A quick examination: the overall model p-value is fine, but the coefficient p-values are not, and the adjusted R-squared is too low. SEM is not this easy to set up and interpret.

Graphical Representation

It is good practice to lay out a graphical representation of the SEM. The MLR example is given below. It is fairly self-explanatory, but all details will be explained in the full SEM example in the next section.

Multiple Linear Regression example graphical representation.

There are distinctions between variables. If a variable is used only as an input (i.e. no arrows are directed at it), it is called an exogenous variable (in this case the \(x_i\)); otherwise it is called an endogenous variable (in this case \(y\)). If a variable cannot be directly observed (in this case the error term \(\epsilon\)), it is called a “latent variable” and is represented with an ellipse instead of a rectangle.

SEM Example

First let’s start with the sem package. If you have never installed it, install it first. Then load it with the library command.

install.packages("sem")
library(sem)

Check the full Bollen data.

head(Bollen)
##      y1       y2       y3       y4       y5       y6       y7       y8
## 1  2.50 0.000000 3.333333 0.000000 1.250000 0.000000 3.726360 3.333333
## 2  1.25 0.000000 3.333333 0.000000 6.250000 1.100000 6.666666 0.736999
## 3  7.50 8.800000 9.999998 9.199991 8.750000 8.094061 9.999998 8.211809
## 4  8.90 8.800000 9.999998 9.199991 8.907948 8.127979 9.999998 4.615086
## 5 10.00 3.333333 9.999998 6.666666 7.500000 3.333333 9.999998 6.666666
## 6  7.50 3.333333 6.666666 6.666666 6.250000 1.100000 6.666666 0.368500
##         x1       x2       x3
## 1 4.442651 3.637586 2.557615
## 2 5.384495 5.062595 3.568079
## 3 5.961005 6.255750 5.224433
## 4 6.285998 7.567863 6.267495
## 5 5.863631 6.818924 4.573679
## 6 5.533389 5.135798 3.892270

Explanations of the data and the columns are as follows (you can also see them by typing ?Bollen in the R console): “This data set includes four measures of democracy at two points in time, 1960 and 1965, and three measures of industrialization in 1960, for 75 developing countries.”

Columns y1 to y4 represent the measures of democracy in 1960 (say, \(\eta_1\)3), y5 to y8 represent the measures of democracy in 1965 (say, \(\eta_2\)4) and x1 to x3 represent the measures of industrialization in 1960 (say, \(\xi_1\)5). Democracy in 1960 and 1965 and industrialization in 1960 cannot be directly observed, so they need to be estimated; they are the latent variables of the SEM. Let’s also assume nonzero error correlations between the pairs y1-y5, y2-y4, y2-y6, y3-y7, y4-y8 and y6-y8. An example SEM graphical representation is as follows.

SEM Bollen full example graphical representation.

In addition to the above explanation, two-headed arrows represent correlations. y1-y8 and x1-x3 are endogenous (observed) variables. xi1 is an exogenous latent variable, while eta1 and eta2 are endogenous latent variables (eta1 receives an arrow from xi1). The model above claims that the latent variables affect the observed variables.

Fitting the Model in R

There are different ways to specify and fit the model. I will show only one of them, the most convenient in my opinion. We represent the relations between variables with arrows such as “->” (regression path) and “<->” (covariance) and give each arrow a parameter name (just like a \(\beta_i\)).

xi1 -> eta1, gamma11, NA
xi1 -> eta2, gamma21, NA
eta1 -> eta2, beta21, NA
eta1 -> y1, NA, 1
eta1 -> y2, lam2, NA
eta1 -> y3, lam3, NA
eta1 -> y4, lam4, NA
eta2 -> y5, NA, 1
eta2 -> y6, lam2, NA
eta2 -> y7, lam3, NA
eta2 -> y8, lam4, NA
xi1 -> x1, NA, 1
xi1 -> x2, lam6, NA
xi1 -> x3, lam7, NA
y1 <-> y5, theta15, NA
y2 <-> y4, theta24, NA
y2 <-> y6, theta26, NA
y3 <-> y7, theta37, NA
y4 <-> y8, theta48, NA
y6 <-> y8, theta68, NA

Each line consists of a path (the relation), a parameter name (NA if the parameter is fixed) and a fixed value (NA if the parameter is free to be estimated), separated by commas. Some loadings are fixed to 1 to set the scales of the latent variables (following the author’s specification). Save the above lines to a .txt file (e.g. bollen_relations.txt).
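Alternatively, if you prefer not to create a separate file, recent versions of the sem package accept the specification inline through the text argument of specifyModel. A sketch with the same lines as above (assuming a sem version that supports text=):

bollen_model <- specifyModel(text = "
xi1 -> eta1, gamma11, NA
xi1 -> eta2, gamma21, NA
eta1 -> eta2, beta21, NA
eta1 -> y1, NA, 1
eta1 -> y2, lam2, NA
eta1 -> y3, lam3, NA
eta1 -> y4, lam4, NA
eta2 -> y5, NA, 1
eta2 -> y6, lam2, NA
eta2 -> y7, lam3, NA
eta2 -> y8, lam4, NA
xi1 -> x1, NA, 1
xi1 -> x2, lam6, NA
xi1 -> x3, lam7, NA
y1 <-> y5, theta15, NA
y2 <-> y4, theta24, NA
y2 <-> y6, theta26, NA
y3 <-> y7, theta37, NA
y4 <-> y8, theta48, NA
y6 <-> y8, theta68, NA
")

Either way, the resulting bollen_model object is the same and the rest of the code below is unchanged.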

bollen_model <- specifyModel(file = "bollen_relations.txt")
bollen_sem <- sem(bollen_model, data = Bollen)
summary(bollen_sem)
## 
##  Model Chisquare =  39.64376   Df =  38 Pr(>Chisq) = 0.396585
##  AIC =  95.64376
##  BIC =  -124.4208
## 
##  Normalized Residuals
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.13800 -0.37850 -0.02113 -0.03988  0.27810  1.05100 
## 
##  R-square for Endogenous Variables
##   eta1   eta2     y1     y2     y3     y4     y5     y6     y7     y8 
## 0.2004 0.9645 0.7232 0.4755 0.5743 0.7017 0.6673 0.5697 0.6425 0.6870 
##     x1     x2     x3 
## 0.8464 0.9465 0.7606 
## 
##  Parameter Estimates
##         Estimate   Std Error  z value    Pr(>|z|)                   
## gamma11 1.47133149 0.39495897  3.7252768 1.951010e-04 eta1 <--- xi1 
## gamma21 0.60047451 0.22721817  2.6427223 8.224247e-03 eta2 <--- xi1 
## beta21  0.86504140 0.07537528 11.4764609 1.732269e-30 eta2 <--- eta1
## lam2    1.19078349 0.14020140  8.4933780 2.007161e-17 y2 <--- eta1  
## lam3    1.17454027 0.12121268  9.6899123 3.328143e-22 y3 <--- eta1  
## lam4    1.25098062 0.11757298 10.6400352 1.940560e-26 y4 <--- eta1  
## lam6    2.17965696 0.13931700 15.6453050 3.576518e-55 x2 <--- xi1   
## lam7    1.81820952 0.15290304 11.8912579 1.314124e-32 x3 <--- xi1   
## theta15 0.59041780 0.36306858  1.6261881 1.039096e-01 y5 <--> y1    
## theta24 1.45958287 0.70251366  2.0776576 3.774090e-02 y4 <--> y2    
## theta26 2.21251146 0.75241914  2.9405305 3.276507e-03 y6 <--> y2    
## theta37 0.72120105 0.62333119  1.1570110 2.472679e-01 y7 <--> y3    
## theta48 0.36770923 0.45324028  0.8112898 4.171993e-01 y8 <--> y4    
## theta68 1.39032689 0.58859218  2.3621226 1.817063e-02 y8 <--> y6    
## phi     0.45466119 0.08845700  5.1399118 2.748674e-07 xi1 <--> xi1  
## V[eta1] 3.92768936 0.88311593  4.4475354 8.686115e-06 eta1 <--> eta1
## V[eta2] 0.16668647 0.23158520  0.7197630 4.716709e-01 eta2 <--> eta2
## V[y1]   1.87971345 0.44228808  4.2499753 2.137941e-05 y1 <--> y1    
## V[y2]   7.68379277 1.39404037  5.5118868 3.550072e-08 y2 <--> y2    
## V[y3]   5.02264245 0.97585896  5.1468938 2.648351e-07 y3 <--> y3    
## V[y4]   3.26806049 0.73807250  4.4278312 9.518533e-06 y4 <--> y4    
## V[y5]   2.34430126 0.48850470  4.7989328 1.595133e-06 y5 <--> y5    
## V[y6]   5.03532995 0.93992491  5.3571620 8.453934e-08 y6 <--> y6    
## V[y7]   3.60814214 0.72394255  4.9840173 6.227752e-07 y7 <--> y7    
## V[y8]   3.35239711 0.71788064  4.6698531 3.014152e-06 y8 <--> y8    
## V[x1]   0.08248753 0.01985853  4.1537585 3.270583e-05 x1 <--> x1    
## V[x2]   0.12205511 0.07105475  1.7177615 8.584013e-02 x2 <--> x2    
## V[x3]   0.47296610 0.09196885  5.1426770 2.708510e-07 x3 <--> x3    
## 
##  Iterations =  178

The null hypothesis of the test is that the model is a good representation of the data (i.e. a good fit). Check the Model Chisquare and Pr(>Chisq) values: if the latter is above 0.05, you are ok.6 The “R-square for Endogenous Variables” part reports how much of the variance of each equation’s response variable is explained by its regression. For the parameter estimates, check the Pr(>|z|) values; if a value is below 0.05, the corresponding estimate is statistically significant. The phi and V[.] values are variance estimates: phi is the variance of the exogenous latent variable xi1, and the V[.] values are the error term variances of the respective equations.
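Two optional follow-ups may be useful at this point; a sketch, assuming a reasonably recent sem release (both the fit.indices argument of summary and the pathDiagram function are documented there):

# request additional fit indices (RMSEA, CFI, ...) beyond the chi-square test
summary(bollen_sem, fit.indices = c("GFI", "AGFI", "RMSEA", "NNFI", "CFI"))

# generate Graphviz dot code for the fitted model's path diagram;
# rendering it requires Graphviz or the DiagrammeR package
pathDiagram(bollen_sem, edge.labels = "values")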

Further Topics

Further topics in SEM include identification rules and pitfalls, SEM with categorical variables, and SEM with missing data. Some are internal to the SEM above, some are extensions; see professor Fox’s documents for details.

Conclusion

The objective of this tutorial is to give the bare minimum background to start with SEM, to let the reader easily run their own model via the sem R package, and to interpret its output. Only one approach with a single example is given here, but it is a good start. Given the very few coherent documents available online7, I hope this tutorial will help the reader understand the concepts and advance in the theory.


  1. This tutorial follows the work, examples and the R package of professor John Fox. As far as my research goes, he is the only person who puts coherent and understandable work on SEM online. Thanks a lot.

  2. Bollen, K. A. (1989) Structural Equations With Latent Variables. Wiley.

  3. eta1 in the model specification code.

  4. eta2 in the model specification code.

  5. xi1 in the model specification code.

  6. It is claimed that, with increasing sample size, the null hypothesis will be rejected even if the model is a good fit. This raises questions about the legitimacy of the Chi-square test (also known as the likelihood ratio test).

  7. Social scientists in this area should improve their documentation skills and put their work online.