
Subset Selection in Multiple Regression
in CoStat

CoStat includes several methods for finding the best models which are subsets of a full multiple regression model. (This is also known as "attribute selection", "feature selection", and "variable selection".)

Summary: For subset selection in multiple regression with more than 40 X variables (when All Subsets starts to become too slow), the Simons 2 procedure does a dramatically better job of finding the best subset models than any other approximate subset selection procedure available anywhere. This procedure is only available in CoPlot Pro.

Details

Problems Associated With Selecting Subsets:

Problem #1 - The number of possible subsets can be huge.
The number of possible subsets grows very quickly with the number of X columns. If there are k X columns, there are 2^k - 1 possible models. Usually, your goal is to find a model with a small subset of X's which provides a good fit to the data. By limiting the maximum number of X's in the models (Max N X's In Model), you can limit the search to fewer possible models, but the number can still be huge (when Max N X's In Model = m is much smaller than k, it is roughly k^m / m!). Ideally, you would use the All Subsets method to check all of the possible subsets with 1 to Max N X's In Model X's. But when you ask for models with, for example, 9 or more X's selected from a file with 40 or more X's, the computer time for checking all subsets (at least about 3.7x10^8 models) becomes prohibitive (days, weeks, years, 100's of years, ...). Fortunately, there are alternatives to the All Subsets method: various approximate methods (Simons 2, Simons 1, Replace 1, Forward, and other methods which are not in CoStat) for efficiently finding good models. All of the approximate methods are faster than All Subsets, but none is guaranteed to find the best models.
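
The counts discussed above are easy to check directly. The following is a small Python sketch (not part of CoStat; the function name n_subsets is made up for this illustration) that counts the non-empty subsets of k X columns which use at most a given number of X's. It also shows that k^m is only a loose upper bound on that count.

```python
from math import comb

def n_subsets(k: int, max_x: int) -> int:
    """Number of non-empty subsets of k X columns that use at most max_x columns."""
    return sum(comb(k, i) for i in range(1, max_x + 1))

k = 40
print("all non-empty subsets of 40 X's:", 2**k - 1)
print("subsets with 1 to 9 X's        :", n_subsets(k, 9))
print("loose upper bound k^m = 40^9   :", k**9)
print("39 X's, models with 1 to 7 X's :", n_subsets(39, 7))
```

The last line corresponds to the size of the search in a problem with 39 X columns and Max N X's In Model set to 7.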

Two of the approximate methods, Simons 2 and Simons 1 (which use algorithms developed by Robert Simons of CoHort Software and are only available in CoHort's software products), are dramatically better than Replace 1 (which is essentially the same as SAS's best method, MaxR), Forward Selection (which is also in SAS), and other approximate methods, both commonly available (Backward Elimination and Stepwise) and not commonly available (Garrote, Ridge, Forward Stagewise, LARS, Lasso, and even Miller's Replace 2). Simons 2 and Simons 1 are much better at finding the best models and they do so in a reasonable amount of time.

We did a Monte Carlo test of the subset selection methods in CoStat. In the test, 1000 random data files were generated. Each had 100 rows, 39 X columns, and 1 Y column. Each of the selection methods was asked to find the best subsets with 1, 2, 3, 4, 5, 6, and 7 X's. The results from each method were compared to the results from All Subsets (which by definition finds the best available subsets). Below is a tally of how many times each method failed to find all 7 of the subsets ranked #1, and how often each failed to find all 140 of the subsets ranked #1 through #20. (The times were measured on a 200 MHz AMD K6 computer.)

 Method                     Time (sec)  N Failures Rank#1  N Failures in Top 20
 -------------------------  ----------  -----------------  --------------------
 Forward                            78  663                 1000
 Replace 1 (like SAS MaxR)         134  322                 1000 
 Replace 2                         292   81                  651
 Simons 1                          578    9                  333
 Simons 2                         1923    1                   23
 All Subsets                     43451    0 by definition      0 by definition

Results: For the test of finding the best models (Rank #1), the Replace 1 procedure (which is the equivalent of MaxR, the best approximate procedure available in SAS) failed on 322 of the 1,000 data files. In contrast, the new Simons 2 procedure, the best approximate procedure available in CoStat, failed on only 1 of the 1,000 data files.
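
To make the idea of a "failure" concrete, here is a much smaller version of such a test as a Python sketch. It is not CoStat's code and it does not implement the Simons or Replace methods. It only compares a textbook greedy Forward selection with an exhaustive search on random data, using ordinary least squares solved through the normal equations. All function names, and the problem sizes (20 files, 30 rows, 8 X columns, subsets of 3 X's), are assumptions chosen so that the example runs in a moment.

```python
import itertools
import random

def sse(rows, y, cols):
    """Residual sum of squares for an OLS fit of y on an intercept plus cols."""
    A = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(A[0])
    # Augmented normal equations [A'A | A'y].
    M = [[sum(a[i] * a[j] for a in A) for j in range(p)]
         + [sum(a[i] * yi for a, yi in zip(A, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(p):
        piv = max(range(i, p), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(p):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [mr - f * mi for mr, mi in zip(M[r], M[i])]
    b = [M[i][p] / M[i][i] for i in range(p)]
    return sum((yi - sum(bj * aj for bj, aj in zip(b, a))) ** 2
               for a, yi in zip(A, y))

def all_subsets(rows, y, k, size):
    """Exhaustive search: the best subset of exactly `size` columns."""
    return min(itertools.combinations(range(k), size),
               key=lambda cols: sse(rows, y, cols))

def forward(rows, y, k, size):
    """Forward selection: greedily add the column that lowers the SSE most."""
    chosen = []
    while len(chosen) < size:
        rest = [c for c in range(k) if c not in chosen]
        chosen.append(min(rest, key=lambda c: sse(rows, y, chosen + [c])))
    return tuple(sorted(chosen))

def count_forward_failures(n_files, n_rows, k, size, seed=0):
    """Number of random data files where Forward misses the best subset."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_files):
        rows = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n_rows)]
        y = [rng.gauss(0, 1) for _ in range(n_rows)]
        if forward(rows, y, k, size) != all_subsets(rows, y, k, size):
            failures += 1
    return failures

print("Forward failures:", count_forward_failures(n_files=20, n_rows=30, k=8, size=3))
```

Forward can never do better than the exhaustive search, because the subset it returns is one of the subsets the exhaustive search also evaluates.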

Recommendations:

The extra computer time needed for the better selection methods will often yield better results.

Availability:

Problem #2 - Often, there are several good models.
Because there are so many possible subsets when the number of X columns is large, there are often several almost equally good models to choose from. Sometimes, if you were to remove a few data points, a different model would be ranked #1.

Recommendations:

Problem #3 - The best X columns may be no better than random values.
A common situation is for the data file to have many X columns (40, 100, 400, 1000, ...) and relatively few rows of data (roughly, fewer than twice as many rows as there are columns). In this situation, the probability that a good subset of X's will exist (by chance) is very high. In fact, if you add some X columns with random numbers to a data file with relatively few rows, you will see that they are sometimes included in the "best" models.
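
The effect is easy to reproduce in the simplest case, models with a single X. The Python sketch below (an illustration only; the function names and the choice of 20 rows are assumptions) generates a random Y column and a number of random X columns which are unrelated to Y, and reports the R^2 of the best 1-X model. As the number of random X columns grows, the best R^2 found by chance tends to grow too.

```python
import random

def r_squared(x, y):
    """R^2 of the simple linear regression of y on x (squared correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def best_chance_r2(n_rows, n_cols, rng):
    """Best 1-X R^2 among n_cols random X columns that are unrelated to Y."""
    y = [rng.gauss(0, 1) for _ in range(n_rows)]
    return max(r_squared([rng.gauss(0, 1) for _ in range(n_rows)], y)
               for _ in range(n_cols))

rng = random.Random(1)
for n_cols in (1, 40, 400):
    print(n_cols, "random X columns: best R^2 =",
          round(best_chance_r2(20, n_cols, rng), 3))
```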

Clearly, the more rows of data you have the better. But in any case, you need to do some sort of validation tests to ensure that the "best" models are unlikely to have occurred by chance and that they remain the best models with other samples of data.

A commonly proposed solution to these problems is to divide the data into 2 parts (for example, use 1/2 of the rows of data for subset selection and 1/2 for cross validation and for estimation of the regression coefficients). But unless you have lots of rows of data, you are then ignoring valuable information in each part of the process and the result is less than optimal. A simulation study by Roecker (1991) showed that it is better to use all of the data for both parts. In the end, as Miller (2002, page 191) says, "the bias is present in the complete data set; cross-validation cannot remove it."

CoStat has options to do validation tests on the N Best subset models that it finds.

Recommendations:

Problem #4 - The regression statistics and regression coefficients are biased.
All of the statistics generated by this procedure (R^2, etc.) and the regression coefficients in the models are biased. The true R^2 values (and the other statistics) probably aren't as high as they appear here. Similarly, the absolute values of the true regression coefficients are probably somewhat smaller. This occurs because the statistical tests were designed for use on one or a few pre-specified models, but this procedure uses them to compare millions of models. Also, we are using the same data values for subset selection and for generating the statistics and the coefficients. This is actually one of the causes of Problems #2 and #3 (above). Techniques to calculate unbiased statistics and regression coefficients have not yet been developed.

Recommendations:

Validation
CoStat takes a slightly non-standard approach to validation. We recommend using the best selection method possible (All Subsets, Simons 2, or Simons 1) to identify the nBest models. The validation procedure then repeatedly chooses a subset of the data and generates the validation statistics (LGO PreSS, LGO Pre R^2, and/or LGO MAE) for each of the nBest models (which have already been selected).

The standard validation procedure recommended by many authors and other statistics programs uses a poor selection method (such as Forward, Replace 1, or Stepwise) and then repeats the subset selection procedure for each validation replication. But these poor selection methods may never find the best models! We argue that it is better to use the best selection method possible to find the best models and then just use validation to compare the models which have already been identified. (Miller, 2002, uses the standard validation approach throughout his book, but he mentions the alternative that CoStat uses on page 189.)
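
The approach described above (select the models once, then only re-test those models on resampled data) can be sketched in Python as follows. This is an illustration under stated assumptions, not CoStat's implementation: it uses bootstrap resampling, it ranks the pre-selected models of one subset size by their residual sum of squares on each resample, and it tallies how often each model comes out on top. The names sse and rank1_tally and the demonstration data are made up for this example.

```python
import random

def sse(rows, y, cols):
    """Residual sum of squares for an OLS fit of y on an intercept plus cols."""
    A = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(A[0])
    # Augmented normal equations [A'A | A'y].
    M = [[sum(a[i] * a[j] for a in A) for j in range(p)]
         + [sum(a[i] * yi for a, yi in zip(A, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(p):
        piv = max(range(i, p), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(p):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [mr - f * mi for mr, mi in zip(M[r], M[i])]
    b = [M[i][p] / M[i][i] for i in range(p)]
    return sum((yi - sum(bj * aj for bj, aj in zip(b, a))) ** 2
               for a, yi in zip(A, y))

def rank1_tally(rows, y, models, n_reps, rng):
    """Count how often each pre-selected model has the lowest SSE on a resample."""
    wins = {m: 0 for m in models}
    n = len(rows)
    for _ in range(n_reps):
        pick = [rng.randrange(n) for _ in range(n)]   # rows drawn with replacement
        rs = [rows[i] for i in pick]
        ys = [y[i] for i in pick]
        best = min(models, key=lambda m: sse(rs, ys, m))
        wins[best] += 1
    return wins

# Demonstration data: Y depends on X columns 0 and 1, but not on column 2.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(40)]
y = [2 * r[0] - r[1] + 0.5 * rng.gauss(0, 1) for r in rows]
models = [(0, 1), (0, 2), (1, 2)]   # the models which were already selected
wins = rank1_tally(rows, y, models, 100, rng)
print(wins)
```

Note that no subset selection happens inside the validation loop. The list of models is fixed before validation starts, and only models with the same number of X's are compared with each other.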

References - Alan J. Miller's Subset Selection in Regression (Second Edition) (Chapman & Hall/CRC, 2002) is an excellent book which covers all aspects of subset selection.

Data Format - As with Regression : Multiple (Full Model), there must be three or more columns of data in the data file. The initial columns must be the X columns. The final column must be the Y column.

Options in the Statistics : Regression : Multiple (Subset Selection) dialog box

Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the expression evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See "Using Equations", "the A button", and "the f() button".
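
The effect of the example expression can be mimicked outside CoStat with a few lines of Python. This is only a sketch of the row-filtering idea: the helper functions col and keep_if and the three data rows are made up for the illustration, and CoStat's own expression language is not Python.

```python
def col(row, n):
    """Value in column n of a row, numbering columns from 1 as in the example."""
    return row[n - 1]

def keep_if(rows, predicate):
    """Return only the rows for which the boolean expression is true."""
    return [row for row in rows if predicate(row)]

rows = [
    [60, 1, 2],   # kept: 60 > 50 and 1 < 2
    [40, 1, 2],   # ignored: 40 is not > 50
    [70, 5, 3],   # ignored: 5 is not < 3
]
kept = keep_if(rows, lambda r: (col(r, 1) > 50) and (col(r, 2) < col(r, 3)))
print(kept)
```
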
Method:
Usually, you should use the "All Subsets" method, because it is the only method guaranteed to find the best possible models. It actually tests all of the possible subset models. For large problems, this procedure may take a long time and you should consider letting a problem run overnight.

If All Subsets is too slow (for example, some large problems might take days, weeks, or years), use:

Simons 2 takes longer to run than Simons 1, which takes longer than Replace 1, but the Simons methods are much more likely to find the best models. (See Problem #1 above.) (Forward is sometimes useful when you want to do a very quick pilot test before running All Subsets or one of the Simons methods.)
Max N X's In Model:
is the maximum number of X's in the models. For example, your data file might have 20 X columns, but you might want to restrict the search to subsets which have a maximum of 7 X's. For all of the Methods (and especially All Subsets), the procedure takes much longer with larger values of Max N X's In Model.
N Best:
is the number of top models of each subset size which will be saved and printed. Increases in N Best make the procedure only a little bit slower.

If you use one of the validation methods and N Best is less than 20, CoStat will automatically increase N Best to 20. Validation only makes sense if it compares several models.

Print X's As:
The X's in the subset models can be printed as Names, Numbers, or Number:[Names].
Print:
You can choose which information you want to have printed for each model. For all of the formulas:

Here is a description of each of the statistics. (For statistics for which there are variations in how they are calculated, we have chosen the variation supported by SAS.)

Validation Method:
Validation gives you a measure of robustness of the different models with different subsets of the data. The validation process retests the selected models with randomly picked subsets of the data. (See the discussion of validation above, and Problems #2 and #3 above.)

Bootstrap - For each validation replication, the Bootstrap method randomly picks rows from the original data file, with replacement. For each validation replication, it picks the same number of rows as were in the original data file.
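
As an illustration of the resampling step only (a Python sketch, not CoStat's code; the function name bootstrap_sample is an assumption), one bootstrap replicate can be drawn like this. Because rows are drawn with replacement, some rows typically appear more than once and others not at all.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random from rows, with replacement."""
    n = len(rows)
    return [rows[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)
rows = list(range(1, 17))            # stand-ins for 16 rows of data
replicate = bootstrap_sample(rows, rng)
print(sorted(replicate))
print("distinct rows used:", len(set(replicate)), "of", len(rows))
```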

Leave X% Out - There are several Leave X% Out validation methods. Each repeatedly reruns the regression with X% of the rows of data removed. Breiman and Spector (1992) did a simulation study which found Leave 20% Out to be a good choice.

The Leave X% Out validation replicates are done in balanced groups of replicates (for example, Leave 20% Out works with groups of 5 replicates), so that each row of data is left out of exactly 1 of the replicates in each group of replicates. The use of balanced groups (as opposed to repeatedly randomly choosing which X% to omit) leads to more stable results with fewer validation replications.
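
One way to build such a balanced group is to shuffle the row numbers once and deal them out into as many parts as there are replicates in the group. The Python sketch below does this for Leave 20% Out (5 replicates per group). It is an assumption about how balancing could be done, not a description of CoStat's internals. In particular, how CoStat splits the rows when the row count is not a multiple of 5 is not stated here; the sketch simply lets the parts differ in size by at most one row.

```python
import random

def balanced_group(n_rows, n_replicates, rng):
    """Split row indices into n_replicates left-out sets; each row is in exactly one."""
    order = list(range(n_rows))
    rng.shuffle(order)
    return [sorted(order[i::n_replicates]) for i in range(n_replicates)]

rng = random.Random(0)
for left_out in balanced_group(16, 5, rng):
    kept = [i for i in range(16) if i not in left_out]
    print("leave out", left_out, "and fit on the other", len(kept), "rows")
```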

We strongly recommend using a validation method. We recommend the Bootstrap method or Leave 20% Out.

Validate N Times:
For the Bootstrap validation method, this is the number of times the procedure resamples the data. For the Leave X% Out validation methods, this is the number of times that the results will be validated by selecting and testing a balanced group of subsets of the data.

The validation process randomly picks which rows will be used, so using larger values of Validate N Times (we recommend 100) helps stabilize the validation statistics.

Print:
You can choose which validation information you want to have printed:

The Sample Run

The data for the sample run is the Longley Data, which has 6 X columns and 1 Y column. With so few X columns, even Method: All Subsets runs very quickly.

For the sample run, use File : Open to open the file called longley.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Regression : Multiple (Subset Selection).
  2. Keep If:
  3. Method: All Subsets
  4. Max N X's In Model: 6
  5. N Best: 20
  6. Print X's As: Numbers
  7. Print:
  8. Validation Method: Bootstrap
  9. Validate N Times: 100
  10. Print:
  11. OK
REGRESSION : MULTIPLE : SUBSET SELECTION
2002-10-14 18:42:31
Using: C:\cohort6\LONGLEY.DT
  X Columns:
       1) GNP def         3) Unemployment    5) 14 yrs      
       2) GNP             4) Armed Forces    6) Time        
  Y Column: 7) Employment

Method: All Subsets
Max N X's In Model: 6
Keep If: 
Validation Method: Bootstrap
Validate N Times: 100

Total number of data rows = 16
Number of data rows used = 16

The best models:
                                                                           
n Xs Rank   X Columns                                                      
---- ----   ---------------------------------------------------------------
   1    1   2
   1    2   6
   1    3   1
   1    4   5
   1    5   3
   1    6   4
                                                                           
n Xs Rank   X Columns                                                      
---- ----   ---------------------------------------------------------------
   2    1   3, 6
   2    2   2, 3
   2    3   2, 5
   2    4   2, 6
   2    5   3, 5
...
                                                                           
n Xs Rank   X Columns                                                      
---- ----   ---------------------------------------------------------------
   3    1   3, 4, 6
   3    2   2, 3, 4
   3    3   2, 4, 5
   3    4   1, 3, 6
   3    5   2, 3, 6
...
                                                                           
n Xs Rank   X Columns                                                      
---- ----   ---------------------------------------------------------------
   4    1   2, 3, 4, 6
   4    2   3, 4, 5, 6
   4    3   1, 3, 4, 6
   4    4   2, 3, 4, 5
   4    5   1, 2, 4, 5
...
                                                                           
n Xs Rank   X Columns                                                      
---- ----   ---------------------------------------------------------------
   5    1   2, 3, 4, 5, 6
   5    2   1, 2, 3, 4, 6
   5    3   1, 3, 4, 5, 6
   5    4   1, 2, 3, 4, 5
   5    5   1, 2, 4, 5, 6
   5    6   1, 2, 3, 5, 6
                                                                           
n Xs Rank   X Columns                                                      
---- ----   ---------------------------------------------------------------
   6    1   1, 2, 3, 4, 5, 6
                                                                           
The statistics for the best models:
                                                              
n_Xs Rank         R^2         Cp  LOO_Press  Rank#1  LGO_Press
---- ----   ---------  ---------  ---------  ------  ---------
   1    1   0.9673738  52.949425  7589201.1      97  3096552.4
   1    2   0.9434809  100.51322   13016160       3  5214933.2
   1    3   0.9426439  102.17939   13500813       0  5173452.8
   1    4   0.9223501  142.57869   18421169       0  7827711.3
   1    5   0.2525043  1476.0486  1.69617e8       0   69522036
   1    6   0.2091301  1562.3943  1.88314e8       0   84857519
                                                              
n_Xs Rank         R^2         Cp  LOO_Press  Rank#1  LGO_Press
---- ----   ---------  ---------  ---------  ------  ---------
   2    1   0.9823137  25.208364    4679317      58  2120576.2
   2    2   0.9806546  28.511069  5076800.9      15    2239282
   2    3   0.9790585  31.688487  5674302.5       8  2452798.7
   2    4   0.9734556  42.842209  6981785.1       3    3053396
   2    5   0.9688932  51.924638  8876891.5       9  4088699.4
...
                                                              
n_Xs Rank         R^2         Cp  LOO_Press  Rank#1  LGO_Press
---- ----   ---------  ---------  ---------  ------  ---------
   3    1   0.992847   6.2394837  2132127.4      87  1072037.2
   3    2   0.9850996  21.662472  4386976.5       2  2151847.5
   3    3   0.9835103  24.826233  4941030.8       4  2274526.7
   3    4   0.9828873  26.066385  5149873.3       1  2629974.1
   3    5   0.9824913  26.854818  5741368.7       3  2962273.2
...
                                                              
n_Xs Rank         R^2         Cp  LOO_Press  Rank#1  LGO_Press
---- ----   ---------  ---------  ---------  ------  ---------
   4    1   0.9953587  3.2394804  1998041.1      60  1113358.9
   4    2   0.994672   4.6064343  2216518.9      27  1748728.6
   4    3   0.992854   8.2256744  2540766.1      10  1427904.9
   4    4   0.9872082  19.464804  5184195.3       0  3717538.5
   4    5   0.9863071  21.258566  4951193.5       0  2793420.3
...
                                                              
n_Xs Rank         R^2         Cp  LOO_Press  Rank#1  LGO_Press
---- ----   ---------  ---------  ---------  ------  ---------
   5    1   0.9954632  5.0314623  2561221.4      42  2573200.5
   5    2   0.9954533  5.0510991  2344384.2      34  1363439.1
   5    3   0.9949044  6.1438652  2581127.2      22  3728562.8
   5    4   0.9873777  21.127371  5852906.9       1  5618362.2
   5    5   0.9868841  22.110031  5832091.7       1  8692038.2
   5    6   0.983799   28.251542  7676622.8       0  5942862.2
                                                              
n_Xs Rank         R^2         Cp  LOO_Press  Rank#1  LGO_Press
---- ----   ---------  ---------  ---------  ------  ---------
   6    1   0.995479           7  2886892.5     100  4877001.5
                                                              
(The validation method randomly assigns rows of data to validation groups,
so the Rank#1 and LGO statistics printed above will vary.
You can reduce the variability by increasing 'Validate N Times'.)
                                                              
                                                                           
n Xs Rank   Equations With Column Numbers                                    
---- ----   -----------------------------------------------------------------
   1    1   col(7) = 51843.5897819 +0.03475229435*col(2)
   1    2   col(7) = -1335105.2441 +716.511764706*col(6)
   1    3   col(7) = 33189.1733796 +315.966086377*col(1)
   1    4   col(7) = 8380.67418338 +0.48487809832*col(5)
   1    5   col(7) = 59286.3553982 +1.88852315637*col(3)
   1    6   col(7) = 59301.2646582 +2.30780841269*col(4)
                                                                           
n Xs Rank   Equations With Column Numbers                                    
---- ----   -----------------------------------------------------------------
   2    1   col(7) = -1587138.9078 -0.9955303213*col(3) +847.088742485*col(6)
   2    2   col(7) = 52382.1670501 +0.03784032702*col(2) -0.5435743321*col(3)
   2    3   col(7) = 88938.7983051 +0.06317243566*col(2) -0.4097429223*col(5)
   2    4   col(7) = 1198708.11085 +0.06299295723*col(2) -592.38341363*col(6)
   2    5   col(7) = -135.32549508 -1.1151492576*col(3) +0.5877277691*col(5)
...
                                                                           
n Xs Rank   Equations With Column Numbers                                    
---- ----   -----------------------------------------------------------------
   3    1   col(7) = -1797221.1122 -1.4696711189*col(3) -0.7722814913*col(
   3    2   col(7) = 53306.4611883 +0.04078799732*col(2) -0.7968165793*col(
   3    3   col(7) = 109470.955483 +0.07992714636*col(2) -0.4978793189*col(
   3    4   col(7) = -1879252.7716 -64.775010369*col(1) -1.0519056696*col(
   3    5   col(7) = -1198891.4309 +0.00899056576*col(2) -0.8904086536*col(
...

n Xs Rank   Equations With Column Numbers                                    
---- ----   -----------------------------------------------------------------
   4    1   col(7) = -3598729.3743 -0.0401904697*col(2) -2.0883907318*col(
   4    2   col(7) = -2446174.695 -1.5004764434*col(3) -0.9343638696*col(
   4    3   col(7) = -1828915.7377 -7.2827116254*col(1) -1.4734185066*col(
   4    4   col(7) = 82613.0992041 +0.06210170815*col(2) -0.5198036017*col(
   4    5   col(7) = 120323.684241 -136.32592586*col(1) +0.09659841249*col(
...
                                                                           
n Xs Rank   Equations With Column Numbers                                    
---- ----   -----------------------------------------------------------------
   5    1   col(7) = -3449891.5997 -0.0319613069*col(2) -1.9721499421*col(
   5    2   col(7) = -3564921.8744 +27.7148784578*col(1) -0.042127114*col(
   5    3   col(7) = -2705054.5008 -43.916959962*col(1) -1.5262904441*col(
   5    4   col(7) = 92461.3078244 -48.462828184*col(1) +0.07200384932*col(
   5    5   col(7) = -403186.16429 -179.87874985*col(1) +0.09517876035*col(
   5    6   col(7) = -1121975.8255 -127.76330578*col(1) +0.03985731002*col(
                                                                           
n Xs Rank   Equations With Column Numbers                                    
---- ----   -----------------------------------------------------------------
   6    1   col(7) = -3482258.6346 +15.0618722714*col(1) -0.0358191793*col(
                                                                           
Coefficients Table                                                      
                                                                           
Variability of Biased Regression Coefficients:
When the same data is used to search for the best subset models and to
estimate the regression coefficients, the estimated regression coefficients
are biased. A measure of the bias is Mean #1 - Mean All: the mean of
the values of a coefficient when it is in validation replicates where
that model is ranked #1 and the mean of the values of that coefficient for
all validation replicates. Usually, the absolute values of the biased 
coefficients are too large. The difference is larger when the Leave X% Out
percentage is higher. The unique statistical properties of the Bootstrap
method make its results the best measure of the bias.
                                                                           
(The validation method randomly assigns rows of data to validation groups,
so the following results will vary.
You can reduce the variability by increasing 'Validate N Times'.)
                                                                           
         Biased       Mean   Std.Dev.   Mean for   Std.Dev.  Mean #1 -
   X      Coef.    When #1    When #1   All Reps   All Reps   Mean All
----  ---------  ---------  ---------  ---------  ---------  ---------
 Int   51843.59                                                      
   2  0.0347523  0.0352033   0.001667  0.0351513  0.0016724  5.2062e-5
                                                                           
         Biased       Mean   Std.Dev.   Mean for   Std.Dev.  Mean #1 -
   X      Coef.    When #1    When #1   All Reps   All Reps   Mean All
----  ---------  ---------  ---------  ---------  ---------  ---------
 Int   -1587139                                                      
   3   -0.99553  -1.019034  0.2133845  -1.024537     0.2057  0.0055035
   6  847.08874  852.93932  37.350316   854.7123  36.720343   -1.77298
                                                                           
         Biased       Mean   Std.Dev.   Mean for   Std.Dev.  Mean #1 -
   X      Coef.    When #1    When #1   All Reps   All Reps   Mean All
----  ---------  ---------  ---------  ---------  ---------  ---------
 Int   -1797221                                                      
   3  -1.469671  -1.496287  0.1371075  -1.516463  0.1779194   0.020176
   4  -0.772281  -0.802561  0.1441071  -0.798892  0.1911141  -0.003669
   6   956.3798  965.36394  38.794978  966.61598  41.199368  -1.252044
                                                                           
         Biased       Mean   Std.Dev.   Mean for   Std.Dev.  Mean #1 -
   X      Coef.    When #1    When #1   All Reps   All Reps   Mean All
----  ---------  ---------  ---------  ---------  ---------  ---------
 Int   -3598729                                                      
   2   -0.04019  -0.053328  0.0250326  -0.042098  0.0268491   -0.01123
   3  -2.088391  -2.288518  0.3259705  -2.113154  0.4043208  -0.175364
   4  -1.014639   -1.08571  0.1863755  -1.016285  0.2048972  -0.069425
   6  1887.4095  2197.3399  568.26124   1933.823  616.52441  263.51689
                                                                           
         Biased       Mean   Std.Dev.   Mean for   Std.Dev.  Mean #1 -
   X      Coef.    When #1    When #1   All Reps   All Reps   Mean All
----  ---------  ---------  ---------  ---------  ---------  ---------
 Int   -3449892                                                      
   2  -0.031961  -0.056069   0.056141  -0.043624  0.0454442  -0.012445
   3   -1.97215  -2.319053  0.6524732  -2.135074  0.6091017  -0.183979
   4  -1.019969  -1.037111  0.2179187  -1.046027  0.2589043  0.0089161
   5  -0.077537  0.0731863  0.2800716  -0.028957  0.3865007  0.1021435
   6  1814.1014  2154.0816  1009.3157  2013.4852  861.44673  140.59641
                                                                           
         Biased       Mean   Std.Dev.   Mean for   Std.Dev.  Mean #1 -
   X      Coef.    When #1    When #1   All Reps   All Reps   Mean All
----  ---------  ---------  ---------  ---------  ---------  ---------
 Int   -3482259                                                      
   1  15.061872  25.654696  109.00544  25.654696  109.00544          0
   2  -0.035819  -0.049325  0.0580191  -0.049325  0.0580191          0
   3   -2.02023  -2.229122  0.7783844  -2.229122  0.7783844          0
   4  -1.033227  -1.111287  0.3492111  -1.111287  0.3492111          0
   5  -0.051104   -0.02682   0.569697   -0.02682   0.569697          0
   6  1829.1515  2082.7935  994.94113  2082.7935  994.94113          0

The model with n Xs=4 and Rank=1 (the X columns are 2, 3, 4, 6) looks to be a good model. It has the lowest Cp and LOO_Press values of any of the models. Among the models shown, only the best model with 3 X's (the X columns are 3, 4, 6) has a lower LGO_Press value in this run. And it was ranked #1 among models with 4 X's more than twice as often as the next best model with 4 X's.

 


All material Copyright © 1996-2006 CoHort Software. All rights reserved.