CoStat includes several methods for finding the best models which are subsets of a full multiple regression model. (This is also known as "attribute selection", "feature selection", and "variable selection".)
Summary: For subset selection in multiple regression with more than 40 X variables (when All Subsets starts to become too slow), the Simons 2 procedure does a dramatically better job of finding the best subset models than any other approximate subset selection procedure available anywhere. This procedure is only available in CoPlot Pro.
Problems Associated With Selecting Subsets:
Problem #1 - The number of possible subsets can be huge.
The number of possible subsets
grows very quickly with the number of X columns.
If there are k X columns, there will be 2^k -1 possible models. Usually,
your goal is to find a model with a small subset of X's
which provide a good fit of the data.
By limiting the maximum number of X's in the models (Max N X's In Model),
you can limit the search to fewer possible models, but the number
can still be huge (it is bounded by k^Max N X's In Model).
Ideally, you would use the All Subsets method
to check all of the possible subsets with 1 to Max N X's In Model X's.
But, when you ask for models with, for example, 9 or more X's selected
from a file with 40 or more X's,
the computer time for checking all subsets
(by that bound, up to about 2.6x10^14 models) becomes prohibitive
(days, weeks, years, 100's of years, ...).
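The counts discussed above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example and are not part of CoStat. It computes 2^k - 1, the exact number of subsets with 1 to Max N X's In Model X's, and the k^Max N X's In Model figure, which is an upper bound on that exact number.

```python
from math import comb

def count_all_subsets(k: int) -> int:
    """Number of non-empty subsets of k X columns: 2^k - 1."""
    return 2**k - 1

def count_limited_subsets(k: int, max_n_xs: int) -> int:
    """Number of subsets that contain 1 to max_n_xs of the k X columns."""
    return sum(comb(k, i) for i in range(1, max_n_xs + 1))

if __name__ == "__main__":
    k, max_n_xs = 40, 9
    print("all subsets (2^k - 1):  ", count_all_subsets(k))
    print("subsets with 1..9 X's:  ", count_limited_subsets(k, max_n_xs))
    print("upper bound (k^9):      ", k**max_n_xs)
```

The exact count is smaller than the k^Max N X's In Model bound, but both grow very quickly as k or Max N X's In Model increases.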
Fortunately, there are alternatives to the All Subsets
method: various approximate methods (Simons 2, Simons 1,
Replace 1, Forward, and other methods which are not in CoStat) for
efficiently finding good models.
All of the approximate methods are faster than All Subsets,
but none is guaranteed to find the best models.
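To make the difference between an exhaustive search and an approximate one concrete, here is a minimal Python sketch that compares All Subsets with a textbook greedy Forward selection on random data. It is not CoStat's implementation (the Simons algorithms are not described here and are not reproduced); the function names, the data, and the simple normal-equations solver are assumptions made for illustration.

```python
import random
from itertools import combinations

def sse(X, y, cols):
    """Residual sum of squares of a least-squares fit (with intercept) on cols."""
    n, p = len(y), len(cols) + 1
    A = [[1.0] + [X[i][c] for c in cols] for i in range(n)]
    # Augmented normal equations [A'A | A'y], solved by Gauss-Jordan elimination.
    M = [[sum(A[i][r] * A[i][s] for i in range(n)) for s in range(p)]
         + [sum(A[i][r] * y[i] for i in range(n))] for r in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                for s in range(c, p + 1):
                    M[r][s] -= f * M[c][s]
    beta = [M[r][p] / M[r][r] for r in range(p)]
    return sum((y[i] - sum(b * a for b, a in zip(beta, A[i]))) ** 2 for i in range(n))

def all_subsets(X, y, max_n_xs):
    """Best subset of each size, found by checking every combination."""
    k = len(X[0])
    return {m: min(combinations(range(k), m), key=lambda cols: sse(X, y, cols))
            for m in range(1, max_n_xs + 1)}

def forward(X, y, max_n_xs):
    """Greedy forward selection: add the single best column at each step."""
    k = len(X[0])
    chosen, best = [], {}
    for m in range(1, max_n_xs + 1):
        rest = [c for c in range(k) if c not in chosen]
        chosen.append(min(rest, key=lambda c: sse(X, y, chosen + [c])))
        best[m] = tuple(sorted(chosen))
    return best

if __name__ == "__main__":
    rng = random.Random(1)
    X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(30)]
    y = [row[0] + row[1] - row[2] + rng.gauss(0, 1) for row in X]
    exact, greedy = all_subsets(X, y, 3), forward(X, y, 3)
    for m in exact:
        print(m, exact[m], greedy[m])
```

All Subsets can never do worse than Forward for a given number of X's, because Forward only ever examines a small fraction of the subsets that All Subsets checks.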
Two of the approximate methods, Simons 2 and Simons 1 (which use algorithms developed by Robert Simons of CoHort Software and are only available in CoHort's software products), are dramatically better than Replace 1 (which is essentially the same as SAS's best method, MaxR), Forward Selection (which is also in SAS), and other commonly available (Backward Elimination and Stepwise) and not commonly available (Garrote, Ridge, Forward Stagewise, LARS, Lasso, and even Miller's Replace 2) approximate methods. Simons 2 and Simons 1 are much better at finding the best models and they do so in a reasonable amount of time.
We did a Monte Carlo test of the subset selection methods in CoStat. In the test, 1000 random data files were generated. Each had 100 rows, 39 X columns, and 1 Y column. Each of the selection methods was asked to find the best subsets with 1, 2, 3, 4, 5, 6, and 7 X's. The results from each method were compared to the results from All Subsets (which by definition finds the best available subsets).
Below is a tally of how many times each method failed to find all 7 of the subsets ranked #1, and how often each failed to find all 140 of the subsets ranked #1 through #20. (The times are on a 200 MHz AMD K6 computer.)
Method                     Time (sec)  N Failures Rank#1  N Failures in Top 20
-------------------------  ----------  -----------------  --------------------
Forward                            78                663                  1000
Replace 1 (like SAS MaxR)         134                322                  1000
Replace 2                         292                 81                   651
Simons 1                          578                  9                   333
Simons 2                         1923                  1                    23
All Subsets                     43451    0 by definition       0 by definition
Results: For the test of finding the best models (Rank#1), the Replace 1 procedure (which is the equivalent of MaxR, the best approximate procedure available in SAS) failed with 322 out of the 1,000 data files. In contrast, the new Simons 2 procedure, the best approximate procedure available in CoStat, failed with only 1 out of the 1,000 data files.
Recommendations:
Availability:
Problem #2 - Often, there are several good models.
Because there are so many possible subsets when the number
of X columns is large, there are often several almost equally good
models to choose from. Sometimes, if you were to remove
a few data points, a different model would be ranked #1.
Recommendations:
Problem #3 - The best X columns may be no better than random values.
A common situation is for the data file to have many X columns
(40, 100, 400, 1000, ...) and relatively few rows of data
(roughly, fewer than twice as many rows as there are columns).
In this situation, the probability that a good subset of X's will exist
(by chance) is very high. In fact, if you add some X columns with
random numbers to a data file with relatively few rows,
you will see that they are sometimes included in the "best" models.
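The effect is easy to reproduce. The following Python sketch (an illustration only; the function names and the choice of 20 rows are assumptions, not anything from CoStat) generates a Y column and many X columns of pure random noise and reports the best single-X R^2 found. The more random X columns there are, the better the best of them appears to fit.

```python
import random

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (the squared correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def best_chance_r2(n_rows, n_cols, rng):
    """Best single-X R^2 when Y and every X column are pure random noise."""
    y = [rng.gauss(0, 1) for _ in range(n_rows)]
    return max(r_squared([rng.gauss(0, 1) for _ in range(n_rows)], y)
               for _ in range(n_cols))

if __name__ == "__main__":
    for n_cols in (1, 10, 100, 1000):
        # Same seed each time, so the larger searches include the smaller ones.
        print(n_cols, round(best_chance_r2(20, n_cols, random.Random(0)), 3))
```

Because every column here is noise, any apparent fit is due to chance alone, which is why validation of the "best" models matters.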
Clearly, the more rows of data you have the better. But in any case, you need to do some sort of validation tests to ensure that the "best" models are unlikely to have occurred by chance and that they remain the best models with other samples of data.
A commonly proposed solution to these problems is to divide the data into 2 parts (for example, use 1/2 of the rows of data for subset selection and 1/2 for cross validation and for estimation of the regression coefficients). But unless you have lots of rows of data, you are then ignoring valuable information in each part of the process and the result is less than optimal. A simulation study by Roecker (1991) showed that it is better to use all of the data for both parts. In the end, as Miller (2002, page 191) says, "the bias is present in the complete data set; cross-validation cannot remove it."
CoStat has options to do validation tests on the N Best subset models that it finds.
Recommendations:
Problem #4 - The regression statistics and regression coefficients are biased.
All of the statistics generated by this procedure (R^2, etc.) and
the regression coefficients in the models are biased.
The true R^2 values (and the other statistics) probably aren't
as high as they appear here.
Similarly, the absolute values of the true regression
coefficients are probably somewhat smaller.
This occurs because the statistical tests were designed for use on one or a few
pre-specified models,
but this procedure uses them to compare millions of models.
Also, we are using the same data values for subset selection
and for generating the statistics and the coefficients.
This is actually one of the causes of Problems #2 and #3 (above).
Techniques to calculate unbiased statistics and regression coefficients
have not yet been developed.
Recommendations:
Validation
CoStat takes a slightly non-standard approach to validation.
We recommend using the best selection method possible
(All Subsets, Simons 2, or Simons 1)
to identify the nBest models. The validation procedure then repeatedly
chooses a subset of the
data and generates the validation statistics (LGO PreSS, LGO Pre R^2,
and/or LGO MAE) for each of the nBest models
(which have already been selected).
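The idea of validating a fixed list of already-selected models can be sketched in Python as follows. This is a simplified illustration under several assumptions: each candidate model has a single X column (so a closed-form line fit suffices), the left-out rows are chosen by plain random sampling rather than in balanced groups, and the function names are hypothetical. It is not CoStat's code.

```python
import random

def fit_line(x, y):
    """Intercept and slope of a simple least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def lgo_press(rows, x_col, y_col, left_out):
    """Fit on the kept rows; sum the squared prediction errors on the left-out rows."""
    kept = [r for i, r in enumerate(rows) if i not in left_out]
    b0, b1 = fit_line([r[x_col] for r in kept], [r[y_col] for r in kept])
    return sum((rows[i][y_col] - (b0 + b1 * rows[i][x_col])) ** 2 for i in left_out)

def validate(rows, candidate_x_cols, y_col, n_times, rng, fraction=0.2):
    """Mean left-out PRESS per candidate; the candidate list is never re-selected."""
    n_out = max(1, round(fraction * len(rows)))
    totals = {c: 0.0 for c in candidate_x_cols}
    for _ in range(n_times):
        left_out = set(rng.sample(range(len(rows)), n_out))
        for c in candidate_x_cols:  # every candidate sees the same split
            totals[c] += lgo_press(rows, c, y_col, left_out)
    return {c: t / n_times for c, t in totals.items()}

if __name__ == "__main__":
    rng = random.Random(0)
    rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
    rows = [[a, b, 2 + 3 * a + rng.gauss(0, 0.5)] for a, b in rows]
    print(validate(rows, [0, 1], 2, 100, rng))
```

The key point is that subset selection happens once, before the loop; the validation replicates only compare the candidates that were already found.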
The standard validation procedure recommended by many authors and other statistics programs uses a poor selection method (such as Forward, Replace 1, or Stepwise) and then repeats the subset selection procedure for each validation replication. But these poor selection methods may never find the best models! We argue that it is better to use the best selection method possible to find the best models and then just use validation to compare the models which have already been identified. (Miller, 2002, uses the standard validation approach throughout his book, but he mentions the alternative that CoStat uses on page 189).
References - Alan J. Miller's Subset Selection in Regression (Second Edition) (Chapman & Hall/CRC, 2002) is an excellent book which covers all aspects of subset selection.
Data Format - As with Regression : Multiple (Full Model), there must be three or more columns of data in the data file. The initial columns must be the X columns. The final column must be the Y column.
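For readers preparing data outside CoStat, the layout described above (X columns first, Y column last, at least three columns) can be expressed as a small Python helper. The function name is hypothetical and this is only a sketch of the layout, not a CoStat file reader.

```python
def split_xy(rows):
    """Split rows into X (all columns but the last) and y (the last column)."""
    if not rows or len(rows[0]) < 3:
        raise ValueError("need three or more columns: the X columns, then the Y column")
    X = [list(row[:-1]) for row in rows]
    y = [row[-1] for row in rows]
    return X, y

if __name__ == "__main__":
    X, y = split_xy([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    print(X, y)
```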
Options in the Statistics : Regression : Multiple (Subset Selection) dialog box
If All Subsets is too slow (for example, some large problems might take days, weeks, or years), use:
If you use one of the validation methods and N Best is less than 20, CoStat will automatically increase N Best to 20. Validation only makes sense if it compares several models.
Here is a description of each of the statistics. (For statistics for which there are variations in how they are calculated, we have chosen the variation supported by SAS.)
Bootstrap - For each validation replication, the Bootstrap method randomly picks rows from the original data file, with replacement. For each validation replication, it picks the same number of rows as were in the original data file.
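As a minimal illustration of the row-picking step (a sketch with a hypothetical function name, not CoStat's code), one bootstrap replicate can be drawn in Python like this:

```python
import random

def bootstrap_rows(n_rows, rng):
    """Row indices for one bootstrap replicate: n_rows draws, with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

if __name__ == "__main__":
    rng = random.Random(0)
    # 16 rows, as in the Longley data; some rows repeat and some are omitted.
    print(sorted(bootstrap_rows(16, rng)))
```

Because the draws are made with replacement, a typical replicate contains some rows more than once and leaves other rows out.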
Leave X% Out - There are several Leave X% Out validation methods. Each repeatedly reruns the regression with X% of the rows of data removed. Breiman and Spector (1992) did a simulation study which found Leave 20% Out to be a good choice.
The Leave X% Out validation replicates are done in balanced groups of replicates (for example, Leave 20% Out works with groups of 5 replicates), so that each row of data is left out of exactly 1 of the replicates in each group of replicates. The use of balanced groups (as opposed to repeatedly randomly choosing which X% to omit) leads to more stable results with fewer validation replications.
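One way to build such a balanced group is to shuffle the row numbers once and deal them out into folds. The Python sketch below shows this for Leave 20% Out (5 replicates per group). The function name and the dealing scheme are assumptions for illustration; the exact scheme CoStat uses is not documented here.

```python
import random

def balanced_group(n_rows, n_replicates, rng):
    """One group of replicates: each row is left out of exactly one replicate."""
    order = list(range(n_rows))
    rng.shuffle(order)
    # Deal the shuffled rows out like cards, one left-out set per replicate.
    return [order[f::n_replicates] for f in range(n_replicates)]

if __name__ == "__main__":
    rng = random.Random(0)
    for left_out in balanced_group(16, 5, rng):  # Leave 20% Out: 5 replicates
        print(sorted(left_out))
```

When the number of rows is not a multiple of the number of replicates, the left-out sets produced this way differ in size by at most one row.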
We strongly recommend using a validation method. We recommend the Bootstrap method or Leave 20% Out.
The validation process randomly picks which rows will be used, so using larger values of Validate N Times (we recommend 100) helps stabilize the validation statistics.
The Sample Run
The data for the sample run is the Longley Data, which has 6 X columns and 1 Y column. With so few X columns, even Method: All Subsets runs very quickly.
For the sample run, use File : Open to open the file called longley.dt in the cohort directory. Then:
REGRESSION : MULTIPLE : SUBSET SELECTION
2002-10-14 18:42:31
Using: C:\cohort6\LONGLEY.DT
X Columns:
1) GNP def 3) Unemployment 5) 14 yrs
2) GNP 4) Armed Forces 6) Time
Y Column: 7) Employment
Method: All Subsets
Max N X's In Model: 6
Keep If:
Validation Method: Bootstrap
Validate N Times: 100
Total number of data rows = 16
Number of data rows used = 16
The best models:
n Xs Rank X Columns
---- ---- ---------------------------------------------------------------
1 1 2
1 2 6
1 3 1
1 4 5
1 5 3
1 6 4
n Xs Rank X Columns
---- ---- ---------------------------------------------------------------
2 1 3, 6
2 2 2, 3
2 3 2, 5
2 4 2, 6
2 5 3, 5
...
n Xs Rank X Columns
---- ---- ---------------------------------------------------------------
3 1 3, 4, 6
3 2 2, 3, 4
3 3 2, 4, 5
3 4 1, 3, 6
3 5 2, 3, 6
...
n Xs Rank X Columns
---- ---- ---------------------------------------------------------------
4 1 2, 3, 4, 6
4 2 3, 4, 5, 6
4 3 1, 3, 4, 6
4 4 2, 3, 4, 5
4 5 1, 2, 4, 5
...
n Xs Rank X Columns
---- ---- ---------------------------------------------------------------
5 1 2, 3, 4, 5, 6
5 2 1, 2, 3, 4, 6
5 3 1, 3, 4, 5, 6
5 4 1, 2, 3, 4, 5
5 5 1, 2, 4, 5, 6
5 6 1, 2, 3, 5, 6
n Xs Rank X Columns
---- ---- ---------------------------------------------------------------
6 1 1, 2, 3, 4, 5, 6
The statistics for the best models:
n_Xs Rank R^2 Cp LOO_Press Rank#1 LGO_Press
---- ---- --------- --------- --------- ------ ---------
1 1 0.9673738 52.949425 7589201.1 97 3096552.4
1 2 0.9434809 100.51322 13016160 3 5214933.2
1 3 0.9426439 102.17939 13500813 0 5173452.8
1 4 0.9223501 142.57869 18421169 0 7827711.3
1 5 0.2525043 1476.0486 1.69617e8 0 69522036
1 6 0.2091301 1562.3943 1.88314e8 0 84857519
n_Xs Rank R^2 Cp LOO_Press Rank#1 LGO_Press
---- ---- --------- --------- --------- ------ ---------
2 1 0.9823137 25.208364 4679317 58 2120576.2
2 2 0.9806546 28.511069 5076800.9 15 2239282
2 3 0.9790585 31.688487 5674302.5 8 2452798.7
2 4 0.9734556 42.842209 6981785.1 3 3053396
2 5 0.9688932 51.924638 8876891.5 9 4088699.4
...
n_Xs Rank R^2 Cp LOO_Press Rank#1 LGO_Press
---- ---- --------- --------- --------- ------ ---------
3 1 0.992847 6.2394837 2132127.4 87 1072037.2
3 2 0.9850996 21.662472 4386976.5 2 2151847.5
3 3 0.9835103 24.826233 4941030.8 4 2274526.7
3 4 0.9828873 26.066385 5149873.3 1 2629974.1
3 5 0.9824913 26.854818 5741368.7 3 2962273.2
...
n_Xs Rank R^2 Cp LOO_Press Rank#1 LGO_Press
---- ---- --------- --------- --------- ------ ---------
4 1 0.9953587 3.2394804 1998041.1 60 1113358.9
4 2 0.994672 4.6064343 2216518.9 27 1748728.6
4 3 0.992854 8.2256744 2540766.1 10 1427904.9
4 4 0.9872082 19.464804 5184195.3 0 3717538.5
4 5 0.9863071 21.258566 4951193.5 0 2793420.3
...
n_Xs Rank R^2 Cp LOO_Press Rank#1 LGO_Press
---- ---- --------- --------- --------- ------ ---------
5 1 0.9954632 5.0314623 2561221.4 42 2573200.5
5 2 0.9954533 5.0510991 2344384.2 34 1363439.1
5 3 0.9949044 6.1438652 2581127.2 22 3728562.8
5 4 0.9873777 21.127371 5852906.9 1 5618362.2
5 5 0.9868841 22.110031 5832091.7 1 8692038.2
5 6 0.983799 28.251542 7676622.8 0 5942862.2
n_Xs Rank R^2 Cp LOO_Press Rank#1 LGO_Press
---- ---- --------- --------- --------- ------ ---------
6 1 0.995479 7 2886892.5 100 4877001.5
(The validation method randomly assigns rows of data to validation groups,
so the Rank#1 and LGO statistics printed above will vary.
You can reduce the variability by increasing 'Validate N Times'.)
n Xs Rank Equations With Column Numbers
---- ---- -----------------------------------------------------------------
1 1 col(7) = 51843.5897819 +0.03475229435*col(2)
1 2 col(7) = -1335105.2441 +716.511764706*col(6)
1 3 col(7) = 33189.1733796 +315.966086377*col(1)
1 4 col(7) = 8380.67418338 +0.48487809832*col(5)
1 5 col(7) = 59286.3553982 +1.88852315637*col(3)
1 6 col(7) = 59301.2646582 +2.30780841269*col(4)
n Xs Rank Equations With Column Numbers
---- ---- -----------------------------------------------------------------
2 1 col(7) = -1587138.9078 -0.9955303213*col(3) +847.088742485*col(6)
2 2 col(7) = 52382.1670501 +0.03784032702*col(2) -0.5435743321*col(3)
2 3 col(7) = 88938.7983051 +0.06317243566*col(2) -0.4097429223*col(5)
2 4 col(7) = 1198708.11085 +0.06299295723*col(2) -592.38341363*col(6)
2 5 col(7) = -135.32549508 -1.1151492576*col(3) +0.5877277691*col(5)
...
n Xs Rank Equations With Column Numbers
---- ---- -----------------------------------------------------------------
3 1 col(7) = -1797221.1122 -1.4696711189*col(3) -0.7722814913*col(
3 2 col(7) = 53306.4611883 +0.04078799732*col(2) -0.7968165793*col(
3 3 col(7) = 109470.955483 +0.07992714636*col(2) -0.4978793189*col(
3 4 col(7) = -1879252.7716 -64.775010369*col(1) -1.0519056696*col(
3 5 col(7) = -1198891.4309 +0.00899056576*col(2) -0.8904086536*col(
...
n Xs Rank Equations With Column Numbers
---- ---- -----------------------------------------------------------------
4 1 col(7) = -3598729.3743 -0.0401904697*col(2) -2.0883907318*col(
4 2 col(7) = -2446174.695 -1.5004764434*col(3) -0.9343638696*col(
4 3 col(7) = -1828915.7377 -7.2827116254*col(1) -1.4734185066*col(
4 4 col(7) = 82613.0992041 +0.06210170815*col(2) -0.5198036017*col(
4 5 col(7) = 120323.684241 -136.32592586*col(1) +0.09659841249*col(
...
n Xs Rank Equations With Column Numbers
---- ---- -----------------------------------------------------------------
5 1 col(7) = -3449891.5997 -0.0319613069*col(2) -1.9721499421*col(
5 2 col(7) = -3564921.8744 +27.7148784578*col(1) -0.042127114*col(
5 3 col(7) = -2705054.5008 -43.916959962*col(1) -1.5262904441*col(
5 4 col(7) = 92461.3078244 -48.462828184*col(1) +0.07200384932*col(
5 5 col(7) = -403186.16429 -179.87874985*col(1) +0.09517876035*col(
5 6 col(7) = -1121975.8255 -127.76330578*col(1) +0.03985731002*col(
n Xs Rank Equations With Column Numbers
---- ---- -----------------------------------------------------------------
6 1 col(7) = -3482258.6346 +15.0618722714*col(1) -0.0358191793*col(
Coefficients Table
Variability of Biased Regression Coefficients:
When the same data is used to search for the best subset models and to
estimate the regression coefficients, the estimated regression coefficients
are biased. A measure of the bias is Mean #1 - Mean All: the mean of
the values of a coefficient when it is in validation replicates where
that model is ranked #1 and the mean of the values of that coefficient for
all validation replicates. Usually, the absolute values of the biased
coefficients are too large. The difference is larger when the Leave X% Out
percentage is higher. The unique statistical properties of the Bootstrap
method make its results the best measure of the bias.
(The validation method randomly assigns rows of data to validation groups,
so the following results will vary.
You can reduce the variability by increasing 'Validate N Times'.)
Biased Mean Std.Dev. Mean for Std.Dev. Mean #1 -
X Coef. When #1 When #1 All Reps All Reps Mean All
---- --------- --------- --------- --------- --------- ---------
Int 51843.59
2 0.0347523 0.0352033 0.001667 0.0351513 0.0016724 5.2062e-5
Biased Mean Std.Dev. Mean for Std.Dev. Mean #1 -
X Coef. When #1 When #1 All Reps All Reps Mean All
---- --------- --------- --------- --------- --------- ---------
Int -1587139
3 -0.99553 -1.019034 0.2133845 -1.024537 0.2057 0.0055035
6 847.08874 852.93932 37.350316 854.7123 36.720343 -1.77298
Biased Mean Std.Dev. Mean for Std.Dev. Mean #1 -
X Coef. When #1 When #1 All Reps All Reps Mean All
---- --------- --------- --------- --------- --------- ---------
Int -1797221
3 -1.469671 -1.496287 0.1371075 -1.516463 0.1779194 0.020176
4 -0.772281 -0.802561 0.1441071 -0.798892 0.1911141 -0.003669
6 956.3798 965.36394 38.794978 966.61598 41.199368 -1.252044
Biased Mean Std.Dev. Mean for Std.Dev. Mean #1 -
X Coef. When #1 When #1 All Reps All Reps Mean All
---- --------- --------- --------- --------- --------- ---------
Int -3598729
2 -0.04019 -0.053328 0.0250326 -0.042098 0.0268491 -0.01123
3 -2.088391 -2.288518 0.3259705 -2.113154 0.4043208 -0.175364
4 -1.014639 -1.08571 0.1863755 -1.016285 0.2048972 -0.069425
6 1887.4095 2197.3399 568.26124 1933.823 616.52441 263.51689
Biased Mean Std.Dev. Mean for Std.Dev. Mean #1 -
X Coef. When #1 When #1 All Reps All Reps Mean All
---- --------- --------- --------- --------- --------- ---------
Int -3449892
2 -0.031961 -0.056069 0.056141 -0.043624 0.0454442 -0.012445
3 -1.97215 -2.319053 0.6524732 -2.135074 0.6091017 -0.183979
4 -1.019969 -1.037111 0.2179187 -1.046027 0.2589043 0.0089161
5 -0.077537 0.0731863 0.2800716 -0.028957 0.3865007 0.1021435
6 1814.1014 2154.0816 1009.3157 2013.4852 861.44673 140.59641
Biased Mean Std.Dev. Mean for Std.Dev. Mean #1 -
X Coef. When #1 When #1 All Reps All Reps Mean All
---- --------- --------- --------- --------- --------- ---------
Int -3482259
1 15.061872 25.654696 109.00544 25.654696 109.00544 0
2 -0.035819 -0.049325 0.0580191 -0.049325 0.0580191 0
3 -2.02023 -2.229122 0.7783844 -2.229122 0.7783844 0
4 -1.033227 -1.111287 0.3492111 -1.111287 0.3492111 0
5 -0.051104 -0.02682 0.569697 -0.02682 0.569697 0
6 1829.1515 2082.7935 994.94113 2082.7935 994.94113 0
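The model with n Xs=4 and Rank=1 (the X columns are 2, 3, 4, 6) looks to be a good model. In the run shown above, it has the lowest Cp and LOO_Press values of any of the models, and its LGO_Press value is second only to that of the model with n Xs=3 and Rank=1 (1113358.9 versus 1072037.2). It was also ranked #1 among models with 4 X's more than twice as often as the next best model with 4 X's (60 versus 27 of the 100 validation replicates).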
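The comparison of the Rank=1 models can be repeated mechanically. The Python sketch below copies the Rank=1 rows from the statistics tables in the sample run and reports which number of X's has the lowest value of each statistic. Only the Rank=1 rows are included, and the helper name is made up for this example.

```python
# Rank=1 rows from the sample run: n_Xs -> (Cp, LOO_Press, Rank#1 count, LGO_Press)
rank1 = {
    1: (52.949425, 7589201.1, 97, 3096552.4),
    2: (25.208364, 4679317.0, 58, 2120576.2),
    3: (6.2394837, 2132127.4, 87, 1072037.2),
    4: (3.2394804, 1998041.1, 60, 1113358.9),
    5: (5.0314623, 2561221.4, 42, 2573200.5),
    6: (7.0, 2886892.5, 100, 4877001.5),
}
columns = {"Cp": 0, "LOO_Press": 1, "LGO_Press": 3}

def lowest(stat):
    """n_Xs of the Rank=1 model with the smallest value of the given statistic."""
    return min(rank1, key=lambda n_xs: rank1[n_xs][columns[stat]])

if __name__ == "__main__":
    for stat in columns:
        print(stat, "is lowest for n_Xs =", lowest(stat))
```

Because the LGO statistics depend on the random validation groups, the LGO_Press comparison may come out differently on a rerun.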