Compares data sets using the Kolmogorov-Smirnov test


This routine reads in a data array and performs a two sided Kolmogorov-Smirnov test on the vectorised data. It does this in two ways:
If only one dataset is to be tested, the data array is divided into subsamples. First it compares subsample 1 with subsample 2; if they are thought to be from the same sample, they are concatenated. This enlarged sample is then compared with subsample 3, and so on, concatenating if consistent, until no more subsamples remain.
If more than one dataset is specified, the datasets are compared to the reference dataset in turn. If the probability that the two are from the same sample is greater than the specified confidence level, the datasets are concatenated, and the next sample is tested against this enlarged reference dataset.

The probability and the maximum separation of the cumulative distribution functions are written for each comparison (at the normal reporting level). The mean value of the consistent data and its error are also reported. In all cases the consistent data can be output to a new dataset. The statistics and probabilities are written to results parameters.
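
The compare-and-concatenate procedure described above can be sketched in Python. This is a minimal illustration, not the routine's own code: the function names are hypothetical, and the probability uses the common asymptotic series for the two-sided Kolmogorov-Smirnov distribution, which is an assumption about how the probability is evaluated.

```python
import math


def ks_statistic(a, b):
    """Maximum separation between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        # Step both CDFs past every value equal to x so ties are handled.
        while i < na and a[i] <= x:
            i += 1
        while j < nb and b[j] <= x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d


def ks_probability(d, na, nb):
    """Asymptotic probability that two samples share a parent distribution.

    Assumption: the standard alternating series with an effective sample
    size correction; the actual routine may use a different formula.
    """
    ne = na * nb / (na + nb)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        return 1.0  # series converges poorly here; the value is ~1
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-10:
            break
    return min(max(total, 0.0), 1.0)


def merge_consistent(reference, comparisons, limit=0.05):
    """Compare each sample with the growing reference; merge if consistent."""
    kept = list(reference)
    results = []
    for sample in comparisons:
        d = ks_statistic(kept, sample)
        p = ks_probability(d, len(kept), len(sample))
        consistent = p > limit
        results.append((d, p, consistent))
        if consistent:
            kept.extend(sample)  # enlarged reference for the next test
    return kept, results


reference = [float(v) for v in range(10)]
comparisons = [
    [float(v) for v in range(10)],        # same values as the reference
    [100.0 + v for v in range(10)],       # clearly offset sample
]
kept, results = merge_consistent(reference, comparisons)
```

In this sketch the first comparison sample is merged into the reference and the offset sample is rejected, so `kept` holds twenty values.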


kstest in out [limit]


COMP = LITERAL (Read)
The name of the NDF array component to be tested for consistency: "Data", "Error", "Quality" or "Variance" (where "Error" is the alternative to "Variance" and causes the square root of the variance values to be taken before performing the comparisons). If "Quality" is specified, then the quality values are treated as numerical values (in the range 0 to 255). ["Data"]
LIMIT = _REAL (Read)
Confidence level at which samples are thought to be consistent. This must lie in the range 0 to 1. [0.05]
IN = LITERAL (Read)
The names of the NDFs to be tested. If just one dataset is supplied, it is divided into subsamples, which are compared (see Parameter NSAMPLE). When more than one dataset is provided, the first becomes the reference dataset to which all the remainder are compared.

It may be a list of NDF names or indirection specifications separated by commas. If a list is supplied on the command line, the list must be enclosed in double quotes. NDF names may include the regular expressions ("*", "?", "[a-z]" etc.). Indirection may occur through text files (nested up to seven deep). The indirection character is "^". If extra prompt lines are required, append the continuation character "-" to the end of the line. Comments in the indirection file begin with the character "#".

NSAMPLE = _INTEGER (Read)
The number of subsamples into which to divide the reference dataset. This parameter is only requested when a single NDF is to be analysed, i.e. when only one dataset name is supplied via Parameter IN. The allowed range is 2 to 20. [3]
OUT = NDF (Write)
Output one-dimensional NDF to which the consistent data are written. A null value (!), the suggested default, prevents creation of this output dataset.
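
The comma-separated list and "^" indirection syntax described for the input dataset names can be illustrated with a small Python sketch. It is only an approximation of the behaviour described: `expand_names` is a hypothetical helper, wildcard patterns are returned unexpanded, and the continuation character used at prompts is not handled.

```python
import os
import tempfile


def expand_names(spec, base_dir=".", depth=0, max_depth=7):
    """Expand a comma-separated list, following "^file" indirection.

    Assumptions: indirection files hold one or more comma-separated
    entries per line, "#" starts a comment, and nesting deeper than
    max_depth is treated as an error.
    """
    if depth > max_depth:
        raise ValueError("indirection nested too deeply")
    names = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("^"):
            path = os.path.join(base_dir, item[1:])
            with open(path) as handle:
                for line in handle:
                    # Discard comments, then expand whatever remains.
                    line = line.split("#", 1)[0].strip()
                    if line:
                        names.extend(
                            expand_names(line, base_dir, depth + 1, max_depth)
                        )
        else:
            names.append(item)  # plain name or unexpanded wildcard
    return names


# Demonstration with a throwaway indirection file (contents are made up).
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "96lc.lis"), "w") as handle:
    handle.write("lc1, lc2  # two light curves\n# a comment line\nlc3\n")
demo_names = expand_names("ref,^96lc.lis,obs*", demo_dir)
```

Here `demo_names` lists the reference name first, then the entries read from the text file, then the trailing pattern.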

Results Parameters

DIST() = _REAL (Write)
Maximum separation found in the cumulative distributions for each comparison subsample. Note that it excludes the reference dataset.
ERRMEAN = _DOUBLE (Write)
Error in the mean value of the consistent data.
FILES() = LITERAL (Write)
The names of the datasets intercompared. The first is the reference dataset.
MEAN = _DOUBLE (Write)
Mean value of the consistent data.
NKEPT = _INTEGER (Write)
Number of consistent data.
PROB() = _REAL (Write)
Probability that each comparison subsample is drawn from the same sample. Note that this excludes the reference sample.
SIGMA = _DOUBLE (Write)
Standard deviation of the consistent data.
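
As an illustration of the summary statistics listed above (number, mean, standard deviation and error in the mean of the consistent data), a short Python sketch follows. The function name and dictionary keys are hypothetical, and it assumes the sample standard deviation and an error in the mean of sigma divided by the square root of the count; the routine's exact definitions may differ.

```python
import math
import statistics


def summarise(values):
    """Return count, mean, standard deviation and error in the mean."""
    n = len(values)
    mean = statistics.fmean(values)
    # Assumption: sample (n - 1) standard deviation.
    sigma = statistics.stdev(values) if n > 1 else 0.0
    return {
        "count": n,
        "mean": mean,
        "sigma": sigma,
        "error_in_mean": sigma / math.sqrt(n),
    }


stats = summarise([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```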


kstest arlac accept
This tests the NDF called arlac for self-consistency at the 95% confidence level using three subsamples. No output dataset is created.

The following applies to all the examples. If the reference dataset and a comparison subsample are consistent, the two merge to form an expanded reference dataset, which is then used for the next comparison. Details of the comparisons are presented.

kstest arlac arlac_filt 0.10 nsample=10
As above, except that data are retained if they exceed the 90% probability level, the comparisons are made with ten subsamples, and the consistent data are written to the one-dimensional NDF called arlac_filt.
kstest in="ref,obs*" comp=v out=master
This compares the variance in the NDF called ref with that in a series of other NDFs whose names begin "obs". The variances consistent with the reference dataset are written to the data array in the NDF called master. To be consistent, they must be the same at 95% probability.
kstest "ref,^96lc.lis,obs*" master comp=v
As in the previous example, except that the comparison files include those listed in the text file 96lc.lis.


Implementation Status: