Analysis of Frequency Data deals with data that has been tabulated; that is, with counts of the number of sampled items that fall into different categories. The categories can be based on 1 criterion ("1 way", for example, sex), 2 criteria ("2 way", for example, sex and race), or 3 criteria ("3 way", for example, sex, race, and religion). For 2 way and 3 way tabulations, the process is often called cross tabulation. The process of tabulation is also called binning, since it is analogous to sorting or categorizing items and putting them into bins.
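As a rough illustration of 1 way and 2 way tabulation outside of CoStat, the following Python sketch counts a small made-up sample. The items and category labels are hypothetical and are not taken from any CoStat data file.

```python
from collections import Counter

# Hypothetical sample: one (sex, race) pair per sampled item.
items = [
    ("F", "A"), ("M", "B"), ("F", "B"),
    ("F", "A"), ("M", "A"), ("M", "B"),
]

# 1 way tabulation: count items by a single criterion (sex).
one_way = Counter(sex for sex, _ in items)

# 2 way tabulation (cross tabulation): count items by both criteria.
two_way = Counter(items)

for key, count in sorted(two_way.items()):
    print(key, count)
```

Each distinct key plays the role of a bin, and the count stored under it is the frequency for that bin.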
This type of frequency analysis is quite different from an FFT (Fast Fourier Transform), which finds the component frequencies (as in cycles per second) in a time series.
There are several procedures in CoStat related to frequency data:
CoStat's manual has:
Sample Run 1 - 1 Way, Not-Yet-Tabulated Data, Normal Distribution
In this example, the raw, untabulated data is from the wheat experiment. In the wheat experiment (page 233), three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured. The goal is to visualize the distribution of plant heights and compare this distribution to a normal distribution. The analysis will indicate if the distribution of heights is significantly different from the normal distribution.
For this sample run, the values of one column, Height, need to be tabulated. Open the wheat.dt data file and specify:
CROSS TABULATION
2000-08-03 12:19:18
Using: c:\cohort6\wheat.dt
n Way: 1
Keep If:
n Data Points = 48
Column        Numeric   Lower Limit   Class Width   New Name      n Classes
------------- --------- ------------- ------------- ------------- ---------
4) Height     true      60            10            Height Classe 10
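The settings above (lower limit 60, class width 10, 10 classes) describe a simple binning rule. The Python sketch below shows one way such a rule could work. It assumes each class includes its lower limit and excludes its upper limit, and that values outside all classes are ignored; whether CoStat handles boundaries this way is not stated here. The function name tabulate and the heights are hypothetical, not the values in wheat.dt.

```python
import math

def tabulate(values, lower_limit, class_width, n_classes):
    """Count how many values fall into each class [limit, limit + width)."""
    counts = [0] * n_classes
    for v in values:
        index = math.floor((v - lower_limit) / class_width)
        if 0 <= index < n_classes:  # assumption: out-of-range values are ignored
            counts[index] += 1
    limits = [lower_limit + i * class_width for i in range(n_classes)]
    return list(zip(limits, counts))

# Hypothetical heights, NOT the values in wheat.dt.
heights = [62.5, 68.0, 71.2, 95.0, 99.9, 100.0, 154.3]
table = tabulate(heights, lower_limit=60, class_width=10, n_classes=10)

for limit, count in table:
    print(limit, count)
```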
The procedure then calculates descriptive statistics for the population and asks which distribution to use when calculating expected frequencies: normal, Poisson, or binomial. (The Poisson and binomial distributions are options only when the class width is 1 and the lowest limit is -0.5.)
Many kinds of data are expected to be normally distributed, and the significance tests for many statistics (for example, the product moment correlation coefficient) assume that the population is normally distributed. In this example, we will test the fidelity of the height distribution to normality by looking at the skewness and kurtosis of the distribution. The theoretical normal distribution (based on the mean and standard deviation) appears as a straight line on this graph. The Poisson and binomial distributions are discussed in the next 2 sample runs.
The procedure can use the observed descriptive statistics to calculate the expected values (an intrinsic hypothesis) or you can enter other values to be used when calculating the expected values (an extrinsic hypothesis). The distinction between testing an intrinsic or extrinsic hypothesis is important because they are tested with slightly different goodness of fit tests (see Sokal and Rohlf, 1981 or 1995, for more information).
The normal distribution uses estimates of 2 parameters from the population (the mean and the standard deviation) when calculating the expected frequencies.
Differences from Descriptive statistics - If you start an analysis with Statistics : Frequency Analysis : 1 Way, Calculate Expected with already tabulated data (and not with raw data and Statistics : Frequency Analysis : Cross Tabulation) the mean and standard deviation calculated here will be based on the tabulated data and will differ somewhat from the mean and standard deviation as calculated in Statistics : Descriptive. The statistics calculated on tabulated data assume that all items in a given bin have a value equal to the bin's lower limit plus 1/2 the class width. So if you have the raw data and want to know the mean and standard deviation, use the statistics calculated in Statistics : Descriptive, since they are more accurate.
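The midpoint assumption can be checked against this sample run. The Python sketch below takes the lower limits and observed counts from the results table of this sample run, places every item at its bin's midpoint, and computes the mean, variance, and standard deviation. The variable names are illustrative only. Under these assumptions the results should agree with the Mean, Variance, and Standard deviation printed in the results.

```python
import math

lower_limits = [60, 70, 80, 90, 100, 110, 120, 130, 140, 150]
observed = [6, 5, 5, 15, 2, 5, 4, 2, 1, 3]
class_width = 10

# Each item is assumed to sit at its bin's lower limit plus 1/2 the class width.
midpoints = [lo + class_width / 2 for lo in lower_limits]

n = sum(observed)
mean = sum(f * x for f, x in zip(observed, midpoints)) / n
ss = sum(f * (x - mean) ** 2 for f, x in zip(observed, midpoints))
variance = ss / (n - 1)  # sample variance (n - 1 in the denominator)
sd = math.sqrt(variance)

print(n, mean, sd, variance)
```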
Continuing with the sample run, we will choose to calculate expected values based on the normal distribution, using the mean and standard deviation calculated from the data. On the Frequency 1 Expected dialog:
The results are:
1 WAY FREQUENCY ANALYSIS - Calculate Expected Values
2000-08-03 12:21:33
Using: c:\cohort6\wheat.dt
Lower Limit Column: 6) Height Classes
Observed Column: 7) Observed
Distribution: Normal
Mean: 99.5833333333
Standard Deviation: 24.92186371
n Data Points = 48
n Classes = 10
Descriptive Statistics (for the tabulated data)
Testing skewness=0 and kurtosis=0 tests if the numbers have a
normal distribution.
(Poisson distributed data should have significant positive skewness.)
(Binomially distributed data may or may not have significant skewness.)
If the probability that skewness equals 0 ('P(g1=0)') is <=0.05,
the distribution is probably not normally distributed.
If the probability that kurtosis equals 0 ('P(g2=0)') is <=0.05,
the distribution is probably not normally distributed.
Descriptive Statistics fit a normal distribution to the data:
Mean is the arithmetic mean (or 'average') of the values.
Standard Deviation is a measure of the dispersion of the distribution.
Variance is the square of the standard deviation.
Skewness is a measure of the symmetry of the distribution.
Kurtosis is a measure of the peakedness of the distribution.
If skewness or kurtosis is significantly greater or less than 0 (P<=0.05),
it indicates that the population is probably not normally distributed.
n data points = 48
Min = 65.0
Max = 155.0
Mean = 99.5833333333
Standard deviation = 24.92186371
Variance = 621.09929078
Skewness = 0.62821922472 Standard Error = 0.3431493092
Two-tailed test of hypothesis that skewness = 0 (df = infinity) :
P = .0672 ns
Kurtosis = -0.1752294896 Standard Error = 0.67439742269
Two-tailed test of hypothesis that kurtosis = 0 (df = infinity) :
P = .7950 ns
Height Cl Observed  Percent   Expected  Deviation
--------- --------- --------- --------- -------------
60        6         12.500    5.6450522 0.35494776137
70        5         10.417    4.7227305 0.27726948384
80        5         10.417    6.4461811 -1.4461810931
90        15        31.250    7.5061757 7.49382431075
100       2         4.167     7.4566562 -5.45665624
110       5         10.417    6.31944   -1.319439972
120       4         8.333     4.5689842 -0.5689841573
130       2         4.167     2.8181393 -0.8181393174
140       1         2.083     1.4828591 -0.4828590617
150       3         6.250     1.0337817 1.96621828556
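The skewness and kurtosis figures in the results above can be approximated outside of CoStat. The Python sketch below uses the sample (bias-corrected) formulas for g1 and g2 and their standard errors, and a two-tailed normal probability for the "df = infinity" tests. That CoStat uses exactly these formulas is an assumption, although with the tabulated data of this sample run they appear to reproduce the printed values to within rounding. The helper names sum_dev and two_tailed_p are illustrative only.

```python
import math

midpoints = [65, 75, 85, 95, 105, 115, 125, 135, 145, 155]
observed = [6, 5, 5, 15, 2, 5, 4, 2, 1, 3]

n = sum(observed)
mean = sum(f * x for f, x in zip(observed, midpoints)) / n

def sum_dev(power):
    """Sum of (x - mean)**power over all tabulated items."""
    return sum(f * (x - mean) ** power for f, x in zip(observed, midpoints))

s2 = sum_dev(2) / (n - 1)  # sample variance

# Sample skewness (g1) and kurtosis (g2); assumed to match CoStat's definitions.
g1 = n * sum_dev(3) / ((n - 1) * (n - 2) * s2 ** 1.5)
g2 = (n * (n + 1) * sum_dev(4) / ((n - 1) * (n - 2) * (n - 3))
      - 3 * sum_dev(2) ** 2 / ((n - 2) * (n - 3))) / s2 ** 2

# Standard errors depend only on n.
se_g1 = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_g2 = math.sqrt(24 * n * (n - 1) ** 2
                  / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))

def two_tailed_p(z):
    """Two-tailed probability from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

p_g1 = two_tailed_p(g1 / se_g1)
p_g2 = two_tailed_p(g2 / se_g2)

print(g1, se_g1, p_g1)
print(g2, se_g2, p_g2)
```

Both probabilities are greater than 0.05, which matches the "ns" marks in the results.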
Pooling - When expected frequencies for the normal and binomial distributions are calculated, the expected frequencies in the left and right tails (beyond the lowest and highest classes) are added to the expected frequencies of the lowest and highest classes, respectively. The methods for calculating the expected frequencies can be found in Sokal and Rohlf (1981 or 1995).
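The following Python sketch illustrates this calculation for the normal distribution. It assumes the expected frequency of a class is n times the normal probability between the class limits, with the lowest class extended to minus infinity and the highest class extended to plus infinity. The function names are illustrative and this is not CoStat's own code; with the mean and standard deviation from this sample run, the values should be close to the Expected column in the results.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of the normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def expected_normal(lower_limits, class_width, n, mean, sd):
    """Expected frequency per class, with both tails added to the end classes."""
    last = len(lower_limits) - 1
    expected = []
    for i, lo in enumerate(lower_limits):
        # The lowest class absorbs the left tail; the highest absorbs the right tail.
        p_lo = 0.0 if i == 0 else normal_cdf(lo, mean, sd)
        p_hi = 1.0 if i == last else normal_cdf(lo + class_width, mean, sd)
        expected.append(n * (p_hi - p_lo))
    return expected

lower_limits = list(range(60, 160, 10))
expected = expected_normal(lower_limits, 10, 48, 99.5833333333, 24.92186371)

for lo, e in zip(lower_limits, expected):
    print(lo, round(e, 4))
```

Because the tails are included, the expected frequencies sum to n, which is why the goodness of fit results report n Expected = 48.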
The final stage of the sample run sets up the goodness of fit tests. On the Statistics : Frequency Analysis : 1 Way Tests dialog, choose:
The results are:
1 WAY FREQUENCY ANALYSIS - Goodness-Of-Fit Tests
2000-08-03 12:23:34
Using: c:\cohort6\wheat.dt
Observed Column: 7) Observed
Expected Column: 8) Expected
n Intrinsic (parameters estimated from the data): 2
n Observed = 48
n Expected = 48
n Classes Before Pooling = 10
n Classes After Pooling = 6
These tests test the goodness-of-fit of the observed and expected values.
If P<=0.05, the expected distribution is probably not a good fit of the
data.
Kolmogorov-Smirnov Test
(not recommended for discrete data; recommended for continuous data)
D obs = 0.13916375964
n = 48
Since n<=100, see Table Y in Rohlf & Sokal (1995) for critical
values for an intrinsic hypothesis.
Likelihood Ratio Test
(ok for discrete data; ok for continuous data)
G = 12.0082419926
df (nClasses-nIntrinsic-1) = 3
P = .0074 **
Likelihood Ratio Test with Williams' Correction
(recommended for discrete data; ok for continuous data)
G (corrected) = 11.5407353521
df (nClasses-nIntrinsic-1) = 3
P = .0091 **
Chi-Square Test
(ok for discrete data; ok for continuous data)
X2 = 12.0297034449
df (nClasses-nIntrinsic-1) = 3
P = .0073 **
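All three tests that report a probability (both likelihood ratio tests and the chi-square test, P < 0.01) indicate that this is not a normally distributed population, which is not surprising since the sample comes from a very heterogeneous source (three varieties grown at four locations).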
The test statistics are calculated as follows (from Sokal and Rohlf, 1981 or 1995):
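For the Kolmogorov-Smirnov test: $D = d_{max} / n$
where $d_{max}$ is the largest absolute difference between the cumulative observed and cumulative expected frequencies, and $n$ is the total number of data points.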
If the number of rows of data is less than 100, critical values of D for extrinsic hypotheses can be found in Table 32 (Rohlf and Sokal, 1981) (but not Table X in Rohlf and Sokal, 1995, which is a slightly different table). For intrinsic hypotheses, see Table 33 (Rohlf and Sokal, 1981) (but not Table Y in Rohlf and Sokal, 1995, which is a slightly different table). Or, see other books of statistical tables. If the total number of tabulated data points is greater than 99, the critical values of D are calculated by the procedure from the following equation:
$D_\alpha = \sqrt{-\ln(\alpha/2) / (2n)}$
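As a check on the Kolmogorov-Smirnov calculation, the Python sketch below computes D from the observed and expected columns of this sample run, and also evaluates the large-sample critical value. It reads dmax as the largest absolute difference between the cumulative observed and cumulative expected frequencies, which is an interpretation rather than something stated here; with that reading, D should reproduce the reported D obs of about 0.1392. The function name ks_critical_value is illustrative only.

```python
import math
from itertools import accumulate

observed = [6, 5, 5, 15, 2, 5, 4, 2, 1, 3]
expected = [5.6450522, 4.7227305, 6.4461811, 7.5061757, 7.4566562,
            6.31944, 4.5689842, 2.8181393, 1.4828591, 1.0337817]

n = sum(observed)

# Largest absolute gap between the two cumulative frequency curves.
d_max = max(abs(o - e)
            for o, e in zip(accumulate(observed), accumulate(expected)))
D = d_max / n

def ks_critical_value(alpha, n):
    """Large-sample critical value of D (used in the text when n > 99)."""
    return math.sqrt(-math.log(alpha / 2) / (2 * n))

print(D)
print(ks_critical_value(0.05, 200))
```

Note that the sample run has only 48 data points, so its D must be compared with tabulated critical values; the equation applies only to larger samples.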
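For the likelihood ratio test: $G = 2 \sum_i f_i \ln(f_i / \hat{f}_i)$
For the Chi-square test: $X^2 = \sum_i (f_i^2 / \hat{f}_i) - n$
where $f_i$ is the observed frequency and $\hat{f}_i$ is the expected frequency of class $i$, and $n$ is the total number of data points.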
The test statistics G and X2 can be compared with
tabulated values of the Chi-square distribution. The degrees of
freedom equals the number of classes (after pooling) minus the number
of parameters estimated from the population to calculate the expected
frequencies (in this case 2, that is, the mean and the standard
deviation) minus 1. In this sample run, df = 6-2-1 = 3.
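The Python sketch below applies these formulas to the sample run. The grouping of the 10 classes into 6 pooled classes (60 | 70+80 | 90 | 100 | 110+120 | 130+140+150) is an assumption: the results do not list the pooled classes, but this grouping appears to reproduce the reported G and X2. The helper chi2_sf_df3 is a closed-form upper-tail probability that is valid only for 3 degrees of freedom; all names are illustrative.

```python
import math

observed_raw = [6, 5, 5, 15, 2, 5, 4, 2, 1, 3]
expected_raw = [5.6450522, 4.7227305, 6.4461811, 7.5061757, 7.4566562,
                6.31944, 4.5689842, 2.8181393, 1.4828591, 1.0337817]

# Assumed pooling (indices of the original classes in each pooled class).
groups = [[0], [1, 2], [3], [4], [5, 6], [7, 8, 9]]
observed = [sum(observed_raw[i] for i in g) for g in groups]
expected = [sum(expected_raw[i] for i in g) for g in groups]

n = sum(observed)
n_intrinsic = 2  # mean and standard deviation were estimated from the data

G = 2 * sum(f * math.log(f / fhat) for f, fhat in zip(observed, expected))
X2 = sum(f * f / fhat for f, fhat in zip(observed, expected)) - n
df = len(observed) - n_intrinsic - 1

def chi2_sf_df3(x):
    """Upper-tail probability of the chi-square distribution with 3 df."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))

p_G = chi2_sf_df3(G)
p_X2 = chi2_sf_df3(X2)

print(G, X2, df, p_G, p_X2)
```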
Williams' Correction for the Likelihood Ratio test
(for intrinsic and extrinsic hypotheses) is used because it
leads to a closer approximation of a chi-square
distribution. See
Sokal and Rohlf, Section 17.2.
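As a rough check, the corrected G in this sample run is consistent with dividing G by q = 1 + (a^2 - 1)/(6 n df), where a is the number of classes after pooling. That CoStat uses exactly this form of the correction is an inference from the printed numbers, not something stated here; the variable names in this Python sketch are illustrative.

```python
n = 48              # number of data points
a = 6               # number of classes after pooling
df = 3              # degrees of freedom (a - nIntrinsic - 1)
G = 12.0082419926   # uncorrected G from the results

# Assumed form of Williams' correction factor.
q = 1 + (a ** 2 - 1) / (6 * n * df)
G_corrected = G / q

print(q, G_corrected)
```

Since q is always greater than 1, the corrected G is always slightly smaller than the uncorrected G, which makes the test slightly more conservative.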
Yates' Correction for Continuity -
Unlike earlier versions of CoStat, the new CoStat does
not do Yates' Correction for Continuity. It is now
thought to result in excessively conservative tests and
is not recommended. (See
Sokal and Rohlf, 1995, pg. 703.)
If there are no expected values, the goodness of fit tests will be skipped.