Go to: CoHort Software | CoStat | CoStat Statistics  

Multiple Regression in CoStat

Multiple regression is the simultaneous linear regression of one y column of data (the dependent variable) on several x columns of data (the independent variables). The general form of the resulting equation is:

    y = b0 + b1*x1 + b2*x2 + b3*x3 + ... + bn*xn

where the b values are the coefficients; the regression finds their optimal (least squares) values.
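As a rough illustration of what "least squares" means here, the sketch below fits b0, b1 and b2 to a small made-up data set by solving the normal equations. The function name `fit_least_squares` and the data are hypothetical, and this is not CoStat's algorithm: the normal equations are the simplest approach, but they lose accuracy on ill-conditioned data, so treat this only as a toy.

```python
def fit_least_squares(x_columns, y):
    """Return [b0, b1, ..., bn] minimizing the sum of squared residuals."""
    n_rows = len(y)
    # Design matrix: a leading 1 for the constant b0, then one entry per x column.
    rows = [[1.0] + [col[i] for col in x_columns] for i in range(n_rows)]
    k = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented k x (k+1) matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))]
           for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, k):
            factor = aug[r][c] / aug[c][c]
            for j in range(c, k + 1):
                aug[r][j] -= factor * aug[c][j]
    # Back substitution.
    b = [0.0] * k
    for i in reversed(range(k)):
        partial = sum(aug[i][j] * b[j] for j in range(i + 1, k))
        b[i] = (aug[i][k] - partial) / aug[i][i]
    return b


# Made-up data generated from y = 3 + 2*x1 - 1*x2, so the fit is exact.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [3.0 + 2.0 * a - 1.0 * c for a, c in zip(x1, x2)]
coefficients = fit_least_squares([x1, x2], y)
print([round(c, 6) for c in coefficients])
```

Because the y values were generated without noise, the printed coefficients should be very close to 3, 2 and -1.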

CoStat can do a multiple regression of a full model (where all of the x columns are in the model) or one subset (where you specify a subset of the x columns).

Note that some of the x columns may have been created from other x columns with CoStat's 'Transformations' procedure. For example, you could make a column with x1^2, or a column with x1*x2. In this way, you can make a model of a "response surface" or other more complex models.
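The idea of derived columns can be sketched outside CoStat as well. The helper below (a hypothetical name, not a CoStat function) builds the x columns of a full quadratic response surface in two variables. The model stays linear in its coefficients, which is why ordinary multiple regression can still fit it.

```python
def response_surface_columns(x1, x2):
    """Return the x columns of a full quadratic model in two variables."""
    return {
        "x1": list(x1),
        "x2": list(x2),
        "x1^2": [a * a for a in x1],               # squared term
        "x2^2": [b * b for b in x2],               # squared term
        "x1*x2": [a * b for a, b in zip(x1, x2)],  # interaction term
    }


columns = response_surface_columns([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
for name, values in columns.items():
    print(name, values)
```

Each derived column would then be used as one more x column in the regression.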

Often, an experimenter has a large number of x columns and wishes to know if there is a smaller, simpler model with a subset of these x columns which adequately explains the dependent variable. For this type of multiple regression problem, see Subset Selection in Multiple Regression.

Sample Run

In the sample run, we will estimate the relationship of the employment level with several economic variables (unemployment rate, GNP, etc.). The data is from an article testing computational accuracy (Longley, 1967).  

PRINT DATA
2000-08-05 11:06:07
Using: c:\cohort6\longley.dt
  First Column: 1) GNP def
  Last Column:  7) Employment
  First Row:    1
  Last Row:     16

 GNP def     GNP    Unemployment Armed Forces  14 yrs     Time    Employment 
--------- --------- ------------ ------------ --------- --------- ---------- 
       83    234289         2356         1590    107608      1947      60323 
     88.5    259426         2325         1456    108632      1948      61122 
     88.2    258054         3682         1616    109773      1949      60171 
     89.5    284599         3351         1650    110929      1950      61187 
     96.2    328975         2099         3099    112075      1951      63221 
     98.1    346999         1932         3594    113270      1952      63639 
       99    365385         1870         3547    115094      1953      64989 
      100    363112         3578         3350    116219      1954      63761 
    101.2    397469         2904         3048    117388      1955      66019 
    104.6    419180         2822         2857    118734      1956      67857 
    108.4    442769         2936         2798    120445      1957      68169 
    110.8    444546         4681         2637    121950      1958      66513 
    112.6    482704         3813         2552    123366      1959      68655 
    114.2    502601         3931         2514    125368      1960      69564 
    115.7    518173         4806         2572    127852      1961      69331 
    116.9    554894         4007         2827    130081      1962      70551 

Longley ran this seemingly routine regression on several mainframe computers and found widely varying answers, largely because the x values are large relative to their standard error and because of strong collinearity among the x columns. CoHort's regression compares quite well: the estimated coefficients are accurate to 10 significant figures.

There is a fascinating follow-up article by Beaton et al. (1976), which points out that a greater source of inaccuracy may be the data itself: slight variations in the original data cause large variations in the results. This is an important consideration, and further investigation of the matter is encouraged before accepting the results of any regression.
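The sensitivity described above is easy to reproduce on a toy problem. The sketch below uses made-up data (not the Longley data) with two nearly collinear x columns and no constant term. Changing a single y value by 0.01, about 0.5%, moves both coefficients by roughly 0.6. The function name `fit_two_columns` is hypothetical.

```python
def fit_two_columns(x1, x2, y):
    """Least-squares b1, b2 for y = b1*x1 + b2*x2 (no constant term)."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    s1y = sum(a * v for a, v in zip(x1, y))
    s2y = sum(b * v for b, v in zip(x2, y))
    # The determinant is close to 0 when x1 and x2 are nearly collinear,
    # which magnifies any small change in the data.
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.01, 2.0, 3.01, 4.0]           # almost a copy of x1
y = [a + b for a, b in zip(x1, x2)]   # exact model: b1 = 1, b2 = 1

b1, b2 = fit_two_columns(x1, x2, y)

y_shifted = list(y)
y_shifted[0] += 0.01                  # perturb one observation slightly
b1_shifted, b2_shifted = fit_two_columns(x1, x2, y_shifted)

print("original :", round(b1, 4), round(b2, 4))
print("perturbed:", round(b1_shifted, 4), round(b2_shifted, 4))
```

The fitted values of y barely change between the two runs. It is the individual coefficients that become unstable, which is why they deserve caution when the x columns are strongly related.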

For the sample run, use File : Open to open the file called longley.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Regression : Multiple (Full Model)
  2. Keep If: (leave blank)
  3. Calculate constant: (checked)
  4. Print Residuals: (checked)
  5. Save residuals: (unchecked)
  6. Validation Method: Bootstrap
  7. Validate N Times: 100
  8. OK

CoStat prints the following results:

REGRESSION: MULTIPLE (FULL MODEL)
2002-09-26 16:06:26
Using: C:\cohort6\LONGLEY.DT
  X Columns:
       1) GNP def         3) Unemployment    5) 14 yrs      
       2) GNP             4) Armed Forces    6) Time        
  Y Column: 7) Employment
Keep If: 
Calculate Constant: true

Total number of data points = 16
Number of data points used = 16
Regression equation: 
col(7)[Employment] = -3482258.6346
  +15.0618722714*col(1)[GNP def]
  -0.0358191793*col(2)[GNP]
  -2.0202298038*col(3)[Unemployment]
  -1.0332268672*col(4)[Armed Forces]
  -0.0511041057*col(5)[14 yrs]
  +1829.15146461*col(6)[Time]
 

R^2     = 0.99547900458    AIC = 187.828836554    MSEP    = 172802.886467
adj R^2 = 0.99246500763    BIC =   199.5078489    PRESS   = 2886892.54145
PRE R^2 = 0.98482749195    MAE = 179.371521174    LOO MAE = 333.425946693

For each term in the ANOVA table below, if P<=0.05, that term was a
significant source of Y's variation.

Source                              SS       df        MS         F     P
------------------------ ------------- -------- --------- --------- ---------
Regression               184172401.944        6  30695400 330.28534 .0000 ***
col(1)[GNP def]          174397449.779        1 1.74397e8 1876.5326 .0000 ***
col(2)[GNP]              4787181.04445        1   4787181  51.51051 .0001 ***
col(3)[Unemployment]     2263971.10982        1 2263971.1 24.360538 .0008 ***
col(4)[Armed Forces]     876397.161861        1 876397.16 9.4301143 .0133 *  
col(5)[14 yrs]            348589.39965        1  348589.4 3.7508541 .0848 ns 
col(6)[Time]             1498813.44959        1 1498813.4 16.127371 .0030 ** 
Error                    836424.055506        9 92936.006
------------------------ ------------- -------- --------- --------- ---------
Total                        185008826       15
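The summary statistics and the ANOVA table above are tied together by a few standard identities, and the sketch below recomputes some of them from the printed sums of squares. The AIC form used here, n*ln(SSE/n) + 2k with k = 7 fitted coefficients, is an assumption: it reproduces the printed value, but which formula CoStat uses is not stated here. The variable names are illustrative.

```python
import math

n = 16                          # number of data points used
p = 6                           # number of x columns in the model
ss_regression = 184172401.944   # Regression SS from the ANOVA table
ss_error = 836424.055506        # Error SS from the ANOVA table
ss_total = ss_regression + ss_error

df_error = n - p - 1            # 9, as printed in the table

r2 = ss_regression / ss_total
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / df_error
f_value = (ss_regression / p) / (ss_error / df_error)
aic = n * math.log(ss_error / n) + 2 * (p + 1)   # assumed form of AIC

print("R^2     =", r2)
print("adj R^2 =", adj_r2)
print("F       =", f_value)
print("AIC     =", aic)
```

The results should agree with the printed R^2, adj R^2, Regression F and AIC to within the rounding of the inputs.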

Table of Statistics for the Regression Coefficients:

Column                       Coef.  Std Error  t(Coef=0)      P      +/-95% CL
------------------------ ---------  ---------  ---------  ---------  ---------
Intercept                 -3482259  890420.38  -3.910803  .0036 **   2014270.8
col(1)[GNP def]          15.061872  84.914926   0.177376  .8631 ns   192.09091
col(2)[GNP]              -0.035819   0.033491  -1.069516  .3127 ns   0.0757619
col(3)[Unemployment]      -2.02023  0.4883997  -4.136427  .0025 **   1.1048368
col(4)[Armed Forces]     -1.033227  0.2142742  -4.821985  .0009 ***  0.4847218
col(5)[14 yrs]           -0.051104  0.2260732  -0.226051  .8262 ns   0.5114131
col(6)[Time]             1829.1515   455.4785  4.0158898  .0030 **   1030.3639

Degrees of freedom for two-tailed t tests = 9
If P<=0.05, the coefficient is significantly different from 0.
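The t(Coef=0) and +/-95% CL columns follow directly from each coefficient and its standard error, as the sketch below shows for three rows of the table. The critical value 2.2621572 is the two-tailed 95% point of Student's t with 9 degrees of freedom, taken from standard tables and hard-coded here as an assumption, because the Python standard library has no t quantile function.

```python
T_CRIT_95_DF9 = 2.2621572   # two-tailed 95% critical t for 9 df (from tables)

# (coefficient, standard error) copied from the printed output
rows = {
    "Unemployment": (-2.0202298038, 0.4883997),
    "Armed Forces": (-1.0332268672, 0.2142742),
    "Time": (1829.15146461, 455.4785),
}

results = {}
for name, (coef, std_error) in rows.items():
    t_value = coef / std_error                # tests whether coef differs from 0
    half_width = T_CRIT_95_DF9 * std_error    # the "+/-95% CL" column
    results[name] = (t_value, half_width)
    print(f"{name:14s} t = {t_value:9.4f}   95% CL = {coef:.4f} +/- {half_width:.4f}")
```

A coefficient whose confidence interval excludes 0 (that is, whose half-width is smaller than the absolute value of the coefficient) is the same one flagged as significant by its P value.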

Residuals:

      Row     Y observed     Y expected       Residual
---------  -------------  -------------  -------------
        1          60323  60055.6599702  267.340029759
        2          61122  61216.0139424  -94.013942399
        3          60171  60124.7128322  46.2871677573
        4          61187  61597.1146219  -410.11462193
        5          63221  62911.2854092   309.71459076
        6          63639  63888.3112153  -249.31121533
        7          64989  65153.0489564   -164.0489564
        8          63761  63774.1803569  -13.180356867
        9          66019  66004.6952274  14.3047726001
       10          67857  67401.6059054  455.394094552
       11          68169  68186.2689271  -17.268927115
       12          66513  66552.0550425  -39.055042523
       13          68655  68810.5499736  -155.54997359
       14          69564   69649.671308  -85.671308042
       15          69331   68989.068486   341.93151396
       16          70551  70757.7578252  -206.75782519


Validation Method: Bootstrap
Validate N Times:  100
Leave-Group-Out PRESS   = 7966592.18225
Leave-Group-Out PRE R^2 = 0.9801802
Leave-Group-Out MAE     = 564.585568568

(The validation method randomly assigns rows of data to validation groups,
so the Leave-Group-Out statistics printed above will vary.
You can reduce the variability by increasing 'Validate N Times'.)

Group  Leave-Group-Out Validation Equations
-----  ----------------------------------------------------------------------
    1  10958996.4318 +246.007245432*col(1)[GNP def] +0.1405383348*col(2)[GNP] +0.8217102887*col(3)[Unemployment] -1.2388561368*col(4)[Armed Forces] +2.63894001237*col(5)[14 yrs] -5770.9069564*col(6)[Time]
    2  -3913158.8845 +138.451642639*col(1)[GNP def] -0.0747379129*col(2)[GNP] -2.7571974387*col(3)[Unemployment] -1.060357447*col(4)[Armed Forces] +0.21777855892*col(5)[14 yrs] +2036.04532194*col(6)[Time]
    3  -2393179.0155 +15.0310564322*col(1)[GNP def] -0.0261600929*col(2)[GNP] -1.898797727*col(3)[Unemployment] -0.800027798*col(4)[Armed Forces] +0.2010804385*col(5)[14 yrs] +1254.37302405*col(6)[Time]
...
  100  -4161802.887 +4.28677342928*col(1)[GNP def] -0.0522384443*col(2)[GNP] -2.6041771478*col(3)[Unemployment] -0.9534097692*col(4)[Armed Forces] -0.0074760369*col(5)[14 yrs] +2178.89210068*col(6)[Time]

(The validation method randomly assigns rows of data to validation groups,
so the LGO Validation Equations will vary. Since these equations are
generated during the individual validation runs, increasing
'Validate N Times' will not decrease the variability of the coefficients.)
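The output above does not spell out how CoStat forms its validation groups, so the following is only a generic sketch of bootstrap validation, not CoStat's procedure. Each run draws rows with replacement, fits the model to the drawn rows, and measures the prediction error on the rows left out. For brevity it uses a one-x model with made-up data. The names `fit_line` and `bootstrap_validate` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bootstrap_validate(xs, ys, n_times, seed=0):
    """Return (mean squared error, mean absolute error) on left-out rows."""
    rng = random.Random(seed)   # fixed seed so the sketch is repeatable
    n = len(xs)
    squared_errors = []
    absolute_errors = []
    for _ in range(n_times):
        picked = [rng.randrange(n) for _ in range(n)]   # rows drawn with replacement
        picked_set = set(picked)
        left_out = [i for i in range(n) if i not in picked_set]
        if not left_out or len({xs[i] for i in picked}) < 2:
            continue   # nothing to validate on, or the line cannot be estimated
        b0, b1 = fit_line([xs[i] for i in picked], [ys[i] for i in picked])
        for i in left_out:
            error = ys[i] - (b0 + b1 * xs[i])
            squared_errors.append(error * error)
            absolute_errors.append(abs(error))
    count = len(absolute_errors)
    return sum(squared_errors) / count, sum(absolute_errors) / count


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2, 17.8, 20.1]
mse, mae = bootstrap_validate(xs, ys, n_times=100)
print("out-of-sample MSE:", mse)
print("out-of-sample MAE:", mae)
```

With a different seed the random draws differ, so the statistics change from run to run. Averaging over more runs steadies the overall statistics but not the coefficients of any single run, which matches the notes in the output above.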

 

