CoHort Software |
Nonlinear Regression in CoStat
(Fitting of Any Equation by
Nelder and Mead's Iterative Simplex Method)
Introduction - Linear regressions are regressions in which the
unknowns are coefficients of the terms of the equations, for example, a
polynomial regression like y=a + b*x + c*x^2. In this case, a, b,
and c are multiplied by the known quantities 1, x, and x^2, to
calculate y. With nonlinear regressions the unknowns are not always
coefficients of the terms of the equation, for example, an exponential
equation like y=e^(a*x).
CoStat's Nonlinear Regression procedure
lets you type in an equation for y
(for example, y=e^(a*x)). It then solves
for the unknowns in the equation.
If you are familiar with linear
regressions (like polynomial regressions) but unfamiliar with
nonlinear regressions, be prepared for a shock. The approach to
finding a solution is entirely different. While linear regressions
have a definite solution which can be directly arrived at, there is
no direct method to solve nonlinear regressions. They must be solved
iteratively (repeated intelligent guesses) until you get to what
appears to be the best answer. And there is no way to determine
if that answer is indeed the best possible answer.
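To make the contrast concrete, here is a minimal Python sketch of the direct route for the linear case: the polynomial y=a + b*x + c*x^2 is solved in one pass through the normal equations, with no guessing. This is only an illustration, not CoStat's code; the function name fit_quadratic and the data are made up for the example.

```python
def fit_quadratic(xs, ys):
    """Direct least-squares fit of y = a + b*x + c*x^2 (normal equations)."""
    # Each row holds the known quantities 1, x, x^2 that multiply a, b, c.
    rows = [[1.0, x, x * x] for x in xs]
    # Augmented normal equations: (X^T X | X^T y).
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * y for r, y in zip(rows, ys))]
         for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [p - factor * q for p, q in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

# Made-up data generated from a=1, b=2, c=3.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
a, b, c = fit_quadratic(xs, ys)
print(a, b, c)
```

No comparable one-pass recipe exists for an equation like y=e^(a*x), which is why the rest of this page is about iterative guessing.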
Fortunately, there are several good
algorithms for making each successive guess. The algorithm used here
(the simplex procedure as originally described by Nelder and Mead,
1965) was chosen because it is widely used, does not require
derivatives of the equation (which are sometimes difficult or
impossible to get), is fairly quick, and is very reliable. See
Press et al., 1986, for an
overview and a comparison of different algorithms.
How does the procedure work? In any regression, you are
seeking to minimize the deviations between the observed y values and
the expected y values (the values of the equation for specific values
of the unknowns).
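As a rough illustration, the quantity being minimized can be written as a small Python function. The sketch assumes the deviations are combined as a sum of squares (the Error SS reported in the sample run) and uses the example equation y=e^(a*x); the name sum_of_squares and the data are made up.

```python
import math

def sum_of_squares(unknowns, xs, ys):
    """Sum of squared deviations between observed and expected y values."""
    a = unknowns[0]
    return sum((y - math.exp(a * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data generated from a = 0.5.
xs = [0.0, 1.0, 2.0]
ys = [math.exp(0.5 * x) for x in xs]

print(sum_of_squares([0.5], xs, ys))  # the generating value: no deviation
print(sum_of_squares([0.6], xs, ys))  # a worse guess: larger sum of squares
```

Every guess at the unknowns gives one value of this function, and the regression is the search for the guess that makes it smallest.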
Any regression is analogous to searching for the lowest point of
ground in a given state (for example, California). (For the record,
the lowest spot is in Death Valley, at 282 feet below sea
level.) In this example, there are 2 unknowns: longitude and
latitude. The simplex method requires that you make an initial guess
at the answer (initial values for the unknowns).
The simplex method will then make n additional nearby guesses
(one for each unknown, based on the initial guess and on the simplex size).
The simplex size determines the distance from
the initial guess to the n nearby guesses.
In this example, we have 3 points (the initial guess and 2 nearby
guesses). This triangle (in our example) is the
"simplex" - the simplest possible shape in the n-dimensional world in
which the simplex is moving around.
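The construction of the starting simplex might look like the following Python sketch. It assumes each nearby guess is the initial guess with the simplex size added to one unknown; the exact offsets CoStat uses are not stated here, and the function name initial_simplex is made up.

```python
def initial_simplex(initial_guess, simplex_size):
    """Return n+1 points: the initial guess plus one nearby guess per unknown."""
    simplex = [list(initial_guess)]
    for i in range(len(initial_guess)):
        point = list(initial_guess)
        point[i] += simplex_size  # assumption: step along one unknown at a time
        simplex.append(point)
    return simplex

# Two unknowns, so the simplex is a triangle.
print(initial_simplex([1.0, 1.0], 1.0))
# [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
```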
The procedure starts by determining the elevation at each of these
3 points. The triangle then tries to flip itself
in the direction of the lower points, sort of like an amoeba.
The simplex only commits to a move if it results in an improvement.
One of the nice
features of the Nelder and Mead variation of the simplex method is
that it allows the simplex to grow and shrink as necessary.
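The flipping, growing, and shrinking can be sketched in Python as below. This is a compact textbook version of the Nelder and Mead method with the usual coefficients (reflect 1, expand 2, contract 0.5, shrink 0.5), not CoStat's implementation; the stopping rule, the function names, and the bowl-shaped test function are assumptions made for the example.

```python
def nelder_mead(f, start, size=1.0, tol=1e-12, max_iter=2000):
    """Minimize f by the textbook Nelder-Mead simplex method (sketch)."""
    n = len(start)
    # Initial simplex: the starting guess plus one offset point per unknown.
    pts = [list(start)] + [
        [v + (size if i == j else 0.0) for j, v in enumerate(start)]
        for i in range(n)
    ]
    for _ in range(max_iter):
        pts.sort(key=f)                      # best point first, worst last
        best, worst = pts[0], pts[-1]
        if abs(f(worst) - f(best)) <= tol:   # assumed stopping rule
            break
        # Centroid of every point except the worst one.
        centroid = [sum(p[j] for p in pts[:-1]) / n for j in range(n)]

        def along(t):
            # Point on the line from the worst point through the centroid.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        if f(reflected) < f(best):
            expanded = along(2.0)            # grow in a promising direction
            pts[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(pts[-2]):
            pts[-1] = reflected              # plain flip
        else:
            contracted = along(-0.5)         # pull the worst point inward
            if f(contracted) < f(worst):
                pts[-1] = contracted
            else:
                # Shrink the whole simplex toward the best point.
                pts = [best] + [[(b + v) / 2.0 for b, v in zip(best, p)]
                                for p in pts[1:]]
    pts.sort(key=f)
    return pts[0]

def bowl(p):
    # Made-up "terrain" whose lowest point is at (1, -2).
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

lowest = nelder_mead(bowl, [0.0, 0.0])
print(lowest)
```

In this version a reflected or contracted point is only accepted when it beats a point already in the simplex; the final shrink step is the one move that is taken without that check.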
This analogy highlights some of the perils of doing nonlinear regression:
- Sensitivity to bad initial guesses. A bad initial guess can put
you on the wrong side of the Sierras (a huge mountain range). The
simplex will not find its way over to the other side of the Sierras
or start making real progress toward finding the lowest point in the
state. The lesson: a bad initial guess can doom you to failure.
- Going beyond reasonable boundaries. In the example, the simplex
can crawl over the edge of the state border. In a real regression,
this occurs when the values of the unknowns go out of the range of
what you consider to be legitimate values. The procedure does not
let you set limits, but you may see the unknowns heading toward
infinity or 0. You may also see n (the number of rows of data used)
decreasing; this indicates that the equation can't be evaluated for
some rows of data, usually because of numeric overflows (for example,
where u1*col(1) generates numbers greater than 650). If
this occurs, try using different initial values for the equation.
- Local minima. The simplex can be suckered into thinking that
some gopher hole or puddle is the lowest spot in California. This is
more likely if the simplex size is set way too small. When the regression has
3, 4, 5, or more unknowns, and/or if the data set is very large, it
often becomes less likely that the simplex will find the true global
minimum. CoStat reduces the risk of this problem by automatically restarting
the algorithm (see "Restarts" below).
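The shrinking n mentioned under "Going beyond reasonable boundaries" can be imitated with a short Python sketch that counts the rows for which e^(u1*x) can still be evaluated. Note that Python's math.exp overflows at an exponent of roughly 709, not at the 650 quoted above for CoStat; the function name usable_rows and the data are made up.

```python
import math

def usable_rows(u1, xs):
    """Count the rows for which e^(u1*x) can be evaluated without overflow."""
    count = 0
    for x in xs:
        try:
            math.exp(u1 * x)
            count += 1
        except OverflowError:
            pass  # this row drops out of the regression
    return count

xs = [1.0, 10.0, 100.0, 1000.0]  # made-up column of data
print(usable_rows(0.5, xs))  # 4: every row can be evaluated
print(usable_rows(2.0, xs))  # 3: e^2000 overflows, so one row is lost
```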
Restarts - After the procedure finds what it believes to be
an answer, it restarts itself at that
point with a reinitialized simplex.
If that point was indeed the best in the area, the procedure
will stop there. But sometimes the procedure can find better
answers. The procedure will continue to restart itself until the
new result is not significantly better than the old result (a relative
change in the Sum of Squares (observed - expected) of < 10^-9).
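The restart loop can be sketched in Python as follows. To keep the example short, a deliberately crude stand-in search (crude_search, which is NOT the simplex method and gives up after a few steps) plays the role of one run of the procedure. The relative change is assumed to be (old - new) / old; all names and the test function are made up.

```python
def crude_search(f, start, step=1.0, max_steps=5):
    """Stand-in minimizer that quits early (not the simplex method)."""
    u = start
    for _ in range(max_steps):
        candidate = min([u - step, u + step], key=f)
        if f(candidate) < f(u):
            u = candidate
        else:
            step /= 2.0
    return u, f(u)

def minimize_with_restarts(f, start, rel_tol=1e-9, max_restarts=100):
    """Rerun the search from its own answer until the gain is negligible."""
    point, ss = crude_search(f, start)
    restarts = 0
    while restarts < max_restarts:
        # Restart at the current answer with a reinitialized step size.
        new_point, new_ss = crude_search(f, point)
        restarts += 1
        relative_change = (ss - new_ss) / ss if ss != 0 else 0.0
        if new_ss < ss:
            point, ss = new_point, new_ss
        if relative_change < rel_tol:
            break  # the new result is not significantly better
    return point, ss, restarts

def f(u):
    # Made-up function with its minimum value of 1 at u = 3.
    return (u - 3.0) ** 2 + 1.0

point, ss, restarts = minimize_with_restarts(f, -10.0)
print(point, ss, restarts)  # 3.0 1.0 3
```

Here the first run stalls at u = -5; the restarts carry it on to u = 3, and the last restart confirms that nothing better is nearby.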
Regression in the CoStat Manual
CoStat's manual has:
- An introduction to regression.
- A description of the calculation methods that are used by the program.
- 9 complete sample runs.
The sample runs show how to do 9 different types of regressions
and illustrate some
related topics (for example, how to use multiple regression
to fit a response surface). Sample run #9
has a detailed description of nonlinear regression and
solutions for problems that can occur.
Here is just the part of sample run #9
that actually shows how to use the nonlinear regression procedure:
The Sample Run
The data and the regression equation are the same for this sample
run and the previous
sample run (page 400).
In this case, the results are
identical. For the sample run, use File : Open to open the file called
in the cohort directory. Then:
- From the menu bar, choose: Statistics : Regression : Nonlinear
- Equation: e^(u1+u2*col(1))
- Y column: 11) 2 e^(a+b*x)
- n Unknowns: 2
- Initial u1: 1
- Initial u2: 1
- Simplex Size: 1
- Keep If:
- Print Residuals: (checked)
- Save Residuals: (don't)
Y Column: 11) 2 e^(a+b*x)
n Unknowns: 2
Initial u1: 1
Initial u2: 1
Simplex Size: 1
Total number of data points = 11
Number of data points after 'Keep If' used: 11
Number of data points used = 11
Degrees of Freedom: 9
Success at iteration #274.
R^2 = 1
(R^2 for nonlinear regressions may be odd or inappropriate.)
Error (residual) SS = 1.3898993e-17
2 e^(a+b*x) = e^(u1+u2*col(1))
u1 = 0.3
u2 = 3
Or: 2 e^(a+b*x) = e^(0.3+3*X)
Or: y = e^(0.3+3*x)
Row        Y observed      Y expected      Residual
---------  --------------  --------------  --------------
1          4.12924942e-7   4.12924942e-7    1.1646703e-21
2          8.29381916e-6   8.29381916e-6    2.3716923e-20
3          1.66585811e-4   1.66585811e-4    5.1499603e-19
4          0.00334596546   0.00334596546    1.7347235e-18
5          0.06720551274   0.06720551274    5.5511151e-17
6          1.34985880758   1.34985880758    6.6613381e-16
7          27.1126389207   27.1126389207    3.5527137e-15
8          544.571910126   544.571910126   -2.273737e-13
9          10938.0192082   10938.0192082    5.4569682e-12
10         219695.988672   219695.988672    1.4551915e-10
11         4412711.89235   4412711.89235    3.7252903e-9
The very, very small Error (residual) SS
indicates that this regression is essentially perfect (within
the computer's limits of precision).
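The printed results can be cross-checked with a few lines of Python. The x values are not shown in the printout; the sketch assumes col(1) holds the integers -5 through 5, which is inferred from the Y expected column (row 6 equals e^0.3, so x=0 there) and is not stated in the sample run.

```python
import math

u1, u2 = 0.3, 3.0        # fitted unknowns reported in the printout
xs = list(range(-5, 6))  # ASSUMPTION: col(1) is -5, -4, ..., 5 (inferred)

# Expected y for each row from the fitted equation y = e^(u1 + u2*x).
expected = [math.exp(u1 + u2 * x) for x in xs]

# Degrees of freedom = number of data points - number of unknowns.
degrees_of_freedom = len(xs) - 2

for row, y in enumerate(expected, start=1):
    print(row, y)
print("Degrees of Freedom:", degrees_of_freedom)
```

Under that assumption the computed values agree with the Y expected column to the printed precision, and 11 data points with 2 unknowns give the 9 degrees of freedom shown.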