PSPP: EXAMINE

EXAMINE
        VARIABLES= var1 [var2] … [varN]
           [BY factor1 [BY subfactor1]
             [ factor2 [BY subfactor2]]
             …
             [ factor3 [BY subfactor3]]
            ]
        /STATISTICS={DESCRIPTIVES, EXTREME[(n)], ALL, NONE}
        /PLOT={BOXPLOT, NPPLOT, HISTOGRAM, SPREADLEVEL[(t)], ALL, NONE}
        /CINTERVAL p
        /COMPARE={GROUPS,VARIABLES}
        /ID=identity_variable
        /{TOTAL,NOTOTAL}
        /PERCENTILE=[percentiles]={HAVERAGE, WAVERAGE, ROUND, AEMPIRICAL, EMPIRICAL }
        /MISSING={LISTWISE, PAIRWISE} [{EXCLUDE, INCLUDE}]
                 [{NOREPORT,REPORT}]

The EXAMINE command is used to perform exploratory data analysis. In particular, it is useful for testing how closely a distribution follows a normal distribution, and for finding outliers and extreme values.

The VARIABLES subcommand is mandatory. It specifies the dependent variables and optionally variables to use as factors for the analysis. Variables listed before the first BY keyword (if any) are the dependent variables. The dependent variables may optionally be followed by a list of factors which tell PSPP how to break down the analysis for each dependent variable.

The list of factors, if present, must be preceded by a single BY keyword. The format for each factor is

factorvar [BY subfactorvar].

Each unique combination of the values of factorvar and subfactorvar divides the dataset into cells. Statistics will be calculated for each cell and for the entire dataset (unless NOTOTAL is given).
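The way factors partition a dataset into cells can be sketched outside PSPP. The following Python fragment is only an illustration, not PSPP's implementation; the case data and the names `cases` and `split_into_cells` are hypothetical.

```python
from collections import defaultdict

# Hypothetical cases: (gender, culture, score1). Values are made up.
cases = [
    ("m", "a", 10.0),
    ("m", "b", 12.0),
    ("f", "a", 11.0),
    ("f", "a", 13.0),
    ("m", "a", 14.0),
]

def split_into_cells(rows):
    """Group the dependent values by each unique (factor, subfactor) pair."""
    cells = defaultdict(list)
    for factor, subfactor, value in rows:
        cells[(factor, subfactor)].append(value)
    return dict(cells)

cells = split_into_cells(cases)

# One statistic per cell, plus the same statistic for the entire dataset
# (the part that NOTOTAL would suppress).
cell_means = {key: sum(vals) / len(vals) for key, vals in cells.items()}
total_mean = sum(value for _, _, value in cases) / len(cases)
```

Here three cells result, because only three distinct (gender, culture) combinations occur in the data.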

The STATISTICS subcommand specifies which statistics to show. DESCRIPTIVES will produce a table showing some parametric and non-parametric statistics. EXTREME produces a table showing the extremities of each cell. A number in parentheses, n, determines how many upper and lower extremities to show. The default number is 5.
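As a rough illustration of what an extremities table contains, the sketch below picks the n smallest and n largest values of a cell in Python. The function name `extremes` and the sample data are hypothetical, and how PSPP handles ties or cells with fewer than 2n cases is not covered here.

```python
import heapq

def extremes(values, n=5):
    """Return the n smallest and the n largest values (n defaults to 5)."""
    lowest = heapq.nsmallest(n, values)   # ascending order
    highest = heapq.nlargest(n, values)   # descending order
    return lowest, highest

# Hypothetical heights for one cell, in centimetres.
heights = [150, 182, 167, 199, 171, 158, 176]
lowest, highest = extremes(heights, n=3)
```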

The subcommands TOTAL and NOTOTAL are mutually exclusive. If TOTAL appears, then statistics will be produced for the entire dataset as well as for each cell. If NOTOTAL appears, then statistics will be produced only for the cells (unless no factor variables have been given). These subcommands have no effect if there have been no factor variables specified.

The PLOT subcommand specifies which plots, if any, are to be produced. Available plots are HISTOGRAM, NPPLOT, BOXPLOT and SPREADLEVEL. The first three can be used to visualise how closely each cell conforms to a normal distribution, whilst the spread vs. level plot can be useful to visualise how the variance differs between factors. Boxplots will also show you the outliers and extreme values. ⁴

The SPREADLEVEL plot displays the interquartile range versus the median. It takes an optional parameter t, which specifies how the data should be transformed prior to plotting. The given value t is a power to which the data is raised. For example, if t is given as 2, then the data will be squared. Zero, however, is a special value. If t is 0 or is omitted, then the data will be transformed by taking its natural logarithm instead of raising it to the power of t.
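The transformation rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than PSPP's code: the function names are hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ from the method PSPP uses.

```python
import math
import statistics

def transform(x, t=0):
    """Raise x to the power t, or take the natural logarithm when t is 0."""
    return math.log(x) if t == 0 else x ** t

def level_and_spread(values, t=0):
    """Return (median, interquartile range) of the transformed values."""
    transformed = [transform(v, t) for v in values]
    level = statistics.median(transformed)
    # Assumption: default quartile method of the Python standard library.
    q1, _, q3 = statistics.quantiles(transformed, n=4)
    return level, q3 - q1

# With t = 2 the data are squared before the median and IQR are taken.
level, spread = level_and_spread([1, 2, 3, 4, 5], t=2)
```

Each cell would contribute one such (level, spread) point to the plot.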

The COMPARE subcommand is only relevant if producing boxplots, and it is only useful if there is more than one dependent variable and at least one factor. If /COMPARE=GROUPS is specified, then one plot per dependent variable is produced, each of which contains boxplots for all the cells. If /COMPARE=VARIABLES is specified, then one plot per cell is produced, each containing one boxplot per dependent variable. If the /COMPARE subcommand is omitted, then PSPP behaves as if /COMPARE=GROUPS were given.

The ID subcommand is relevant only if /PLOT=BOXPLOT or /STATISTICS=EXTREME has been given. If given, it should provide the name of a variable which is to be used to label extreme values and outliers. Numeric or string variables are permissible. If the ID subcommand is not given, then the case number will be used for labelling.

The CINTERVAL subcommand specifies the confidence interval to use in the calculation of the descriptive statistics (those requested with /STATISTICS=DESCRIPTIVES). The default is 95%.

The PERCENTILES subcommand specifies which percentiles are to be calculated, and which algorithm to use for calculating them. The default is to calculate the 5, 10, 25, 50, 75, 90, 95 percentiles using the HAVERAGE algorithm.
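For orientation, the sketch below shows one common definition of a weighted-average percentile in Python: the value at position (n + 1)p in the sorted data, interpolating linearly between neighbours. It is an assumption that this matches what HAVERAGE computes for unweighted data; PSPP's actual implementation, in particular its handling of case weights, may differ. The function name `haverage` is hypothetical.

```python
import math

def haverage(values, p):
    """Percentile for 0 < p < 1 by weighted average at position (n + 1) * p."""
    data = sorted(values)
    n = len(data)
    position = (n + 1) * p
    k = math.floor(position)   # 1-based rank of the lower neighbour
    g = position - k           # fractional part used as the weight
    if k < 1:
        return data[0]         # position falls before the first value
    if k >= n:
        return data[-1]        # position falls at or after the last value
    return (1 - g) * data[k - 1] + g * data[k]

# The default percentiles listed above, applied to a small sample.
default_percentiles = [5, 10, 25, 50, 75, 90, 95]
results = {q: haverage([1, 2, 3, 4, 5], q / 100) for q in default_percentiles}
```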

The TOTAL and NOTOTAL subcommands are mutually exclusive. If TOTAL is given and factors have been specified in the VARIABLES subcommand, then statistics for the unfactored dependent variables are produced in addition to the factored variables. If there are no factors specified then TOTAL and NOTOTAL have no effect.

The following example will generate descriptive statistics and histograms for two variables, score1 and score2. Two factors are given, namely gender and gender BY culture. Therefore, the descriptives and histograms will be generated for each distinct value of gender and for each distinct combination of the values of gender and culture. Since the NOTOTAL keyword is given, statistics and histograms for score1 and score2 covering the whole dataset are not produced.

EXAMINE score1 score2 BY 
        gender
        gender BY culture
        /STATISTICS = DESCRIPTIVES
        /PLOT = HISTOGRAM
        /NOTOTAL.

Here is a second example showing how the EXAMINE command can be used to find extremities.

EXAMINE height weight BY 
        gender
        /STATISTICS = EXTREME (3)
        /PLOT = BOXPLOT
        /COMPARE = GROUPS
        /ID = name.

In this example, we look at the height and weight of a sample of individuals and how they differ between male and female. A table showing the 3 largest and the 3 smallest values of height and weight for each gender, and for the whole dataset, will be shown. Boxplots will also be produced. Because /COMPARE = GROUPS was given, boxplots for male and female will be shown in the same graphic, allowing us to easily see the difference between the genders. Since the variable name was specified on the ID subcommand, this will be used to label the extreme values.

Warning! If many dependent variables are specified, or if factor variables are specified for which there are many distinct values, then EXAMINE will produce a very large quantity of output.

Footnotes

(4)

HISTOGRAM uses Sturges’ rule to determine the number of bins, as approximately 1 + log2(n), where n is the number of samples. Note that FREQUENCIES uses a different algorithm to find the bin size.
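As a quick worked illustration of Sturges’ rule, the Python sketch below computes a bin count from a sample size. The rounding step (rounding the logarithm up to a whole number) is an assumption, since the text only says "approximately"; the function name `sturges_bins` is hypothetical.

```python
import math

def sturges_bins(n):
    """Number of histogram bins for n samples: 1 + log2(n), rounded up."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Assumption: the logarithm is rounded up to get a whole number of bins.
    return 1 + math.ceil(math.log2(n))

bins_for_100 = sturges_bins(100)   # log2(100) is about 6.64
```

The bin count grows slowly with the sample size: going from 100 to 1000 samples adds only a few bins.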
