
Analysis of Frequency Data in CoStat
(Including Cross Tabulation, Calculating Expected Values,
Goodness-Of-Fit Tests, Tests of Independence, Chi-Square,
Likelihood Ratio, Log-Linear, and Fisher's Exact Tests)

Analysis of Frequency Data deals with data that has been tabulated; that is, with counts of the sampled items that fall into different categories. The categories can be based on 1 criterion ("1 way", for example, sex), 2 criteria ("2 way", for example, sex and race), or 3 criteria ("3 way", for example, sex, race, and religion). For 2 way and 3 way tabulations, the process is often called cross tabulation. The process of tabulation is also called binning, since it is analogous to sorting or categorizing items and putting them into bins.

This type of frequency analysis is quite different from an FFT, which finds the component frequencies (as in Cycles Per Second) in a time series.

There are several procedures in CoStat related to frequency data:

  1. Cross Tabulation - Tabulate the data if not tabulated already.
  2. 1 way, Calculate Expected Values - For 1 way tabulations, calculate the expected values based on the normal, binomial, or Poisson distributions.
    • Print a table of observed and expected frequencies and descriptive statistics.
  3. Analysis -
    • 1 Way Tests
      • Calculate descriptive statistics: the mean, standard deviation, skewness, and kurtosis of the data.
      • Perform Kolmogorov-Smirnov test of goodness-of-fit (before pooling).
      • Test goodness-of-fit (how closely the expected values match the observed values) with the Chi-Square and Likelihood Ratio tests.
    • 2 Way Tests
      • Print margin totals.
      • Test the independence (the lack of interaction) of the two factors with the Chi-Square test, the Likelihood Ratio test, and Fisher's Exact Test.
    • 3 Way Tests
      • Print margin totals.
      • Test the independence (the lack of interaction) of the three factors with Log-Linear models.


Analysis of Frequency Data in the CoStat Manual

CoStat's manual has:

  • An introduction to analysis of frequency data.
  • A description of the calculation methods that are used by the program.
  • 7 complete sample runs.
The sample runs show how to do 7 different types of analysis of frequency data. Here is sample run #1:

Sample Run 1 - 1 Way, Not-Yet-Tabulated Data, Normal Distribution

In this example, the raw, untabulated data is from the wheat experiment. In the wheat experiment (page 233), three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured. The goal is to visualize the distribution of plant heights and compare this distribution to a normal distribution. The analysis will indicate if the distribution of heights is significantly different from the normal distribution.

For this sample run, the values of one column, Height, need to be tabulated. Open the wheat.dt data file and specify:

  1. From the menu bar, choose: Statistics : Frequency Analysis : Cross Tabulation
  2. Keep If:
  3. Column 1: 4)Height (This automatically sets: Numeric (checked), Lower Limit=60, Class width=10, New Name=Height Classes).
  4. Insert Results At: (the end)
  5. Frequency Name: Observed
  6. Print Frequencies: (not checked) (they'll be printed later)
  7. OK
The printed results are:
CROSS TABULATION
2000-08-03 12:19:18
Using: c:\cohort6\wheat.dt
n Way: 1
Keep If: 

n Data Points = 48

Column        Numeric   Lower Limit   Class Width   New Name      n Classes
------------- --------- ------------- ------------- ------------- ---------
4) Height          true            60            10 Height Classe        10
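
The tabulation step can be sketched outside CoStat as follows. This is a minimal Python illustration, not CoStat's implementation: the function name `tabulate` and the sample heights are hypothetical, and the sketch assumes that each class covers lower limit <= value < lower limit + class width.

```python
from collections import Counter

def tabulate(values, lower_limit, class_width):
    """Count how many values fall in each class (bin).

    Assumption: a class covers lower <= value < lower + class_width.
    Returns a dict mapping each class's lower limit to its count.
    """
    counts = Counter()
    for v in values:
        index = int((v - lower_limit) // class_width)  # which class v falls in
        counts[lower_limit + index * class_width] += 1
    return dict(sorted(counts.items()))

# Hypothetical heights for illustration only (not the wheat.dt data).
heights = [62.5, 68.0, 71.2, 79.9, 80.0, 95.4, 99.9, 104.0]
print(tabulate(heights, lower_limit=60, class_width=10))
```

With Lower Limit=60 and Class width=10, as in the dialog above, a height of 79.9 is counted in the class labelled 70 and a height of 80.0 in the class labelled 80.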

The procedure then calculates descriptive statistics for the population and asks you which distribution to use when calculating expected frequencies: normal, Poisson, or binomial. (The Poisson and binomial distributions are only options when the class width is 1 and the lowest limit is -0.5.)

For most data, the expected distribution is the normal distribution. The significance tests for many statistics (for example, the product moment correlation coefficient) assume that the population is normally distributed. In this example, we will test the fidelity of the height distribution to normality by looking at the skewness and kurtosis of the distribution. The theoretical normal distribution (based on the mean and standard deviation) appears as a straight line on this graph. The Poisson and binomial distributions are discussed in the next 2 sample runs.

The procedure can use the observed descriptive statistics to calculate the expected values (an intrinsic hypothesis) or you can enter other values to be used when calculating the expected values (an extrinsic hypothesis). The distinction between testing an intrinsic or extrinsic hypothesis is important because they are tested with slightly different goodness of fit tests (see Sokal and Rohlf, 1981 or 1995, for more information).

The normal distribution uses estimates of 2 parameters from the population (the mean and the standard deviation) when calculating the expected frequencies.

Differences from Descriptive statistics - If you start an analysis with Statistics : Frequency Analysis : 1 Way, Calculate Expected with already tabulated data (and not with raw data and Statistics : Frequency Analysis : Cross Tabulation) the mean and standard deviation calculated here will be based on the tabulated data and will differ somewhat from the mean and standard deviation as calculated in Statistics : Descriptive. The statistics calculated on tabulated data assume that all items in a given bin have a value equal to the bin's lower limit plus 1/2 the class width. So if you have the raw data and want to know the mean and standard deviation, use the statistics calculated in Statistics : Descriptive, since they are more accurate.
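
The midpoint rule described above can be checked with a short Python sketch. It is an illustration rather than CoStat's code: the function name `tabulated_mean_variance` is hypothetical, and the class limits and counts are the Height Classes and Observed columns printed later in this sample run.

```python
import math

# Class lower limits and observed counts from this sample run's frequency table.
lower_limits = [60, 70, 80, 90, 100, 110, 120, 130, 140, 150]
observed = [6, 5, 5, 15, 2, 5, 4, 2, 1, 3]
class_width = 10

def tabulated_mean_variance(lower_limits, counts, class_width):
    """Mean and sample variance, treating every item as its bin midpoint."""
    midpoints = [lo + class_width / 2 for lo in lower_limits]
    n = sum(counts)
    mean = sum(f * m for f, m in zip(counts, midpoints)) / n
    ss = sum(f * (m - mean) ** 2 for f, m in zip(counts, midpoints))
    return mean, ss / (n - 1)

mean, variance = tabulated_mean_variance(lower_limits, observed, class_width)
print(mean, math.sqrt(variance), variance)
```

The result agrees with the Mean, Standard deviation, and Variance that the procedure prints for the tabulated data.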

Continuing with the sample run, we will choose to calculate expected values based on the normal distribution, using the mean and standard deviation calculated from the data. On the Frequency 1 Expected dialog:

  1. Lower Limit: 6) Height Classes
  2. Observed: 7) Observed
  3. Distribution: Normal
  4. Mean: (use default)
  5. Standard Deviation: (use default)
  6. Save Expected: (checked)
  7. OK

The results are:

1 WAY FREQUENCY ANALYSIS - Calculate Expected Values
2000-08-03 12:21:33
Using: c:\cohort6\wheat.dt
  Lower Limit Column: 6) Height Classes
  Observed Column: 7) Observed
Distribution: Normal
  Mean: 99.5833333333
  Standard Deviation: 24.92186371

n Data Points = 48
n Classes = 10

Descriptive Statistics (for the tabulated data)
  Testing skewness=0 and kurtosis=0 tests if the numbers have a
    normal distribution.
    (Poisson distributed data should have significant positive skewness.)
    (Binomially distributed data may or may not have significant skewness.)
  If the probability that skewness equals 0 ('P(g1=0)') is <=0.05,
    the distribution is probably not normally distributed.
  If the probability that kurtosis equals 0 ('P(g2=0)') is <=0.05,
    the distribution is probably not normally distributed.

Descriptive Statistics fit a normal distribution to the data:
Mean is the arithmetic mean (or 'average') of the values.
Standard Deviation is a measure of the dispersion of the distribution.
Variance is the square of the standard deviation.
Skewness is a measure of the symmetry of the distribution.
Kurtosis is a measure of the peakedness of the distribution.
If skewness or kurtosis is significantly greater or less than 0 (P<=0.05),
  it indicates that the population is probably not normally distributed.

n data points = 48
Min = 65.0
Max = 155.0
Mean = 99.5833333333
Standard deviation = 24.92186371
Variance = 621.09929078
Skewness = 0.62821922472  Standard Error = 0.3431493092
  Two-tailed test of hypothesis that skewness = 0 (df = infinity) :
    P =  .0672 ns 
Kurtosis = -0.1752294896  Standard Error = 0.67439742269
  Two-tailed test of hypothesis that kurtosis = 0 (df = infinity) :
    P =  .7950 ns 

Height Cl  Observed   Percent  Expected     Deviation
--------- --------- --------- --------- -------------
       60         6    12.500 5.6450522 0.35494776137
       70         5    10.417 4.7227305 0.27726948384
       80         5    10.417 6.4461811 -1.4461810931
       90        15    31.250 7.5061757 7.49382431075
      100         2     4.167 7.4566562   -5.45665624
      110         5    10.417   6.31944  -1.319439972
      120         4     8.333 4.5689842 -0.5689841573
      130         2     4.167 2.8181393 -0.8181393174
      140         1     2.083 1.4828591 -0.4828590617
      150         3     6.250 1.0337817 1.96621828556
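
The standard errors and P values printed with the skewness and kurtosis can be approximated with the sketch below. The standard-error formulas are the usual large-sample expressions for g1 and g2 and are an assumption here, since this page does not state them; they do reproduce the printed standard errors for n = 48. The function names are hypothetical, and the P values agree with the printed ones only to about three decimal places.

```python
import math

def skewness_se(n):
    """Standard error of the sample skewness g1 for n observations."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def kurtosis_se(n):
    """Standard error of the sample kurtosis g2 for n observations."""
    return math.sqrt(24 * n * (n - 1) ** 2 /
                     ((n - 3) * (n - 2) * (n + 3) * (n + 5)))

def two_tailed_normal_p(statistic, standard_error):
    """Two-tailed P for statistic/SE against a standard normal (df = infinity)."""
    z = abs(statistic / standard_error)
    return math.erfc(z / math.sqrt(2))

n = 48
g1, g2 = 0.62821922472, -0.1752294896  # skewness and kurtosis printed above
print(skewness_se(n), two_tailed_normal_p(g1, skewness_se(n)))
print(kurtosis_se(n), two_tailed_normal_p(g2, kurtosis_se(n)))
```

Neither P value is at or below 0.05, which matches the "ns" flags in the printed results.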

Pooling - When expected frequencies for the normal and binomial distributions are calculated, the areas under the left and right tails of the distribution are added to the expected frequencies of the lowest and highest classes, respectively. The methods for calculating the expected frequencies can be found in Sokal and Rohlf (1981 or 1995).
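
The treatment of the tails can be illustrated with a short Python sketch. It is not CoStat's implementation: the function names are hypothetical, and it assumes that each expected frequency is n times the normal probability of the class interval, with the lowest class extended to minus infinity and the highest class to plus infinity.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * math.erfc(-(x - mean) / (sd * math.sqrt(2)))

def expected_normal(lower_limits, class_width, n, mean, sd):
    """Expected count per class; the outer classes absorb the two tails."""
    expected = []
    last = len(lower_limits) - 1
    for i, lo in enumerate(lower_limits):
        p_low = 0.0 if i == 0 else normal_cdf(lo, mean, sd)
        p_high = 1.0 if i == last else normal_cdf(lo + class_width, mean, sd)
        expected.append(n * (p_high - p_low))
    return expected

lower_limits = list(range(60, 160, 10))
expected = expected_normal(lower_limits, 10, 48, 99.5833333333, 24.92186371)
for lo, e in zip(lower_limits, expected):
    print(lo, round(e, 4))
```

Because the tails are folded into the outer classes, the expected frequencies sum to n (48 here), the same total as the observed frequencies.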

The final stage of the sample run sets up the goodness of fit tests. On the Statistics : Frequency Analysis : 1 Way Tests dialog, choose:

  1. Observed: 7) Observed
  2. Expected: 8) Expected
  3. n Intrinsic: 2 (In this case, two parameters which were calculated from the data, mean and standard deviation, were used to compute the expected values.)
  4. OK

The results are:

1 WAY FREQUENCY ANALYSIS - Goodness-Of-Fit Tests
2000-08-03 12:23:34
Using: c:\cohort6\wheat.dt
  Observed Column: 7) Observed
  Expected Column: 8) Expected
n Intrinsic (parameters estimated from the data): 2

n Observed = 48
n Expected = 48
n Classes Before Pooling = 10
n Classes After Pooling = 6

These tests test the goodness-of-fit of the observed and expected values.
If P<=0.05, the expected distribution is probably not a good fit of the
  data.

Kolmogorov-Smirnov Test
  (not recommended for discrete data; recommended for continuous data)

  D obs = 0.13916375964
  n = 48
  Since n<=100, see Table Y in Rohlf & Sokal (1995) for critical
    values for an intrinsic hypothesis.

Likelihood Ratio Test
  (ok for discrete data; ok for continuous data)

  G = 12.0082419926
  df (nClasses-nIntrinsic-1) = 3
  P = .0074 ** 

Likelihood Ratio Test with Williams' Correction
  (recommended for discrete data; ok for continuous data)

  G (corrected) = 11.5407353521
  df (nClasses-nIntrinsic-1) = 3
  P = .0091 ** 

Chi-Square Test
  (ok for discrete data; ok for continuous data)

  X2 = 12.0297034449
  df (nClasses-nIntrinsic-1) = 3
  P = .0073 ** 

All of these tests confirm that this is not a normally distributed population, which is not surprising since it has a very heterogeneous source.

The test statistics are calculated as follows (from Sokal and Rohlf, 1981 or 1995):

For the Kolmogorov-Smirnov test:   $D = d_{\max} / n$

where:

  • $d$ = the difference between expected and observed cumulative frequencies
  • $d_{\max}$ = the largest absolute difference
  • $n$ = the total number of observations (48 in this sample run), not the number of classes

If the number of rows of data is less than 100, critical values of D for extrinsic hypotheses can be found in Table 32 (Rohlf and Sokal, 1981) (but not Table X in Rohlf and Sokal, 1995, which is a slightly different table). For intrinsic hypotheses, see Table 33 (Rohlf and Sokal, 1981) (but not Table Y in Rohlf and Sokal, 1995, which is a slightly different table). Or, see other books of statistical tables. If the total number of tabulated data points is greater than 99, the procedure calculates the critical values of D from the following equation:

$D_\alpha = \sqrt{-\ln(\alpha/2) / (2n)}$

  • where $\alpha$ = the significance level
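
As a check on the Kolmogorov-Smirnov calculation, the sketch below recomputes D from the unpooled Observed and Expected columns of the sample run. Taking n as the total count of 48 reproduces the printed D obs. The function names are hypothetical, and the n = 200 passed to the critical-value formula is an arbitrary value chosen only to show the large-sample equation.

```python
import math
from itertools import accumulate

# Observed and Expected columns from the sample run (before pooling).
observed = [6, 5, 5, 15, 2, 5, 4, 2, 1, 3]
expected = [5.6450522, 4.7227305, 6.4461811, 7.5061757, 7.4566562,
            6.31944, 4.5689842, 2.8181393, 1.4828591, 1.0337817]

def ks_d(observed, expected):
    """D = dmax / n, with n the total number of observations."""
    n = sum(observed)
    d_max = max(abs(o - e) for o, e in
                zip(accumulate(observed), accumulate(expected)))
    return d_max / n

def ks_critical(alpha, n):
    """Large-sample critical value: sqrt(-ln(alpha/2) / (2n))."""
    return math.sqrt(-math.log(alpha / 2) / (2 * n))

print(ks_d(observed, expected))   # compare with the printed D obs
print(ks_critical(0.05, 200))     # hypothetical n, for illustration only
```

The largest cumulative difference occurs at the class with lower limit 90, where 31 items have been observed but only about 24.3 are expected.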

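For the likelihood ratio test: $G = 2 \sum_i f_i \ln(f_i / \hat{f}_i)$

For the Chi-square test: $X^2 = \sum_i (f_i^2 / \hat{f}_i) - n$

  • where $f_i$ is the observed frequency and $\hat{f}_i$ is the expected frequency of class $i$, and $n$ is the total number of observations.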

The test statistics G and X2 can be compared with tabulated values of the Chi-square distribution. The degrees of freedom equal the number of classes (after pooling), minus the number of parameters estimated from the population to calculate the expected frequencies (in this case 2, that is, the mean and the standard deviation), minus 1. In this sample run, df = 6-2-1 = 3.

Williams' Correction for the Likelihood Ratio test (for intrinsic and extrinsic hypotheses) is used because it leads to a closer approximation of a chi-square distribution. See Sokal and Rohlf, Section 17.2.
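
The goodness-of-fit statistics of the sample run can be reproduced with the Python sketch below. Two points are assumptions rather than statements from this page: the grouping of the 10 classes into 6 pooled classes is inferred because it reproduces the printed statistics, and the Williams' correction factor is taken as q = 1 + (a^2 - 1)/(6 n df), with a the number of pooled classes, which matches the printed corrected G. The helper `chi2_sf` is a hypothetical name for a standard-library-only chi-square tail probability.

```python
import math

# Pooled classes: 60 | 70-80 | 90 | 100 | 110-120 | 130-150.
# Assumption: this grouping is inferred because it reproduces the printed
# statistics; the pooling rule itself is not described in this section.
observed = [6, 5 + 5, 15, 2, 5 + 4, 2 + 1 + 3]
expected = [5.6450522, 4.7227305 + 6.4461811, 7.5061757, 7.4566562,
            6.31944 + 4.5689842, 2.8181393 + 1.4828591 + 1.0337817]
n_intrinsic = 2  # mean and standard deviation were estimated from the data

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (integer df)."""
    h = x / 2
    if df % 2 == 0:
        q, a = math.exp(-h), 1.0
    else:
        q, a = math.erfc(math.sqrt(h)), 0.5
    while a < df / 2:
        q += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1
    return q

n = sum(observed)
classes = len(observed)
df = classes - n_intrinsic - 1
g = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))
q = 1 + (classes ** 2 - 1) / (6 * n * df)  # assumed Williams' correction factor
g_adj = g / q
x2 = sum(o * o / e for o, e in zip(observed, expected)) - n

for name, stat in (("G", g), ("G (corrected)", g_adj), ("X2", x2)):
    print(name, round(stat, 4), "df", df, "P", round(chi2_sf(stat, df), 4))
```

Most of G and X2 comes from two classes: the class with lower limit 90 (15 observed, about 7.5 expected) and the class with lower limit 100 (2 observed, about 7.5 expected).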
       
Yates' Correction for Continuity - Unlike earlier versions of CoStat, the new CoStat does not do Yates' Correction for Continuity. It is now thought to result in excessively conservative tests and is not recommended. (See Sokal and Rohlf, 1995, pg. 703.)

If there are no expected values, the goodness of fit tests will be skipped.

 

