Understanding statistics using computer demonstrations.
Dunn, Peter K.
This paper discusses programs that clarify some statistical
ideas that are often discussed yet poorly understood by students. The
programs adopt the approach of demonstrating what is happening,
rather than using the computer to do the work for students (which
can hide the understanding). The programs demonstrate normal
probability plots, overfitting of models, and generalized linear
models. Although the implementation is in Matlab, any suitable
language could be used.
**********
The use of technology in teaching mathematics and statistics has
reached the point where it is hard to imagine teaching these subjects
without computers. The University of Southern
Queensland (USQ) has chosen to standardise on using Matlab in its
mathematics and statistics courses. First year students learn algebra and calculus by incorporating Matlab as a numerical and graphical tool;
they learn the basic language and graphical commands. Matlab is then
used in many subsequent courses, including most statistics courses.
Matlab is a computational environment for technical and numerical
computing. It has a strong pedigree in robust, stable and efficient
numerical algorithms, and has excellent graphical capabilities. The
user interface is typically the command line, but Matlab provides
facilities for implementing graphical user interfaces, as used for the
programs discussed in this paper.
Cretchley, Harman, Ellerton and Fogarty (1999, 2000) analysed the
influences on attitudes and learning, finding almost all students
responded positively to Matlab for ease of computation and graphing.
Using the software as a tool was found to strongly influence the learning
strategies adopted and the students' confidence in mathematics.
In addition, "students with high computer-mathematics interaction
feel that computers enhance mathematics learning by providing many
examples, enable the user to focus on major ideas by reducing mechanical
toil ... and find computers helpful in linking algebraic and geometric
ideas" (Galbraith, Haines, and Pemberton, 1999, p. 220).
Matlab is used in numerous ways at USQ. Often it is used to
do the work for the student, saving time on fiddly or error-prone
calculations, or allowing students to tackle more complex problems
that would be impossible "by hand." In higher-level courses, some
Matlab programming is expected.
The programs discussed in this paper, however, use the computer to
show what is happening, rather than to do a particular task. For
example,
It is just about impossible to illustrate Newton's method adequately
with freehand sketches using a piece of chalk. A computer printout
on a transparency, or even an animation using a PC, can be a great
help. Likewise Simpson's rule, the Runge-Kutta method, and other
numerical techniques can be best illustrated with a computer and an
attractive display. (Krantz, 1998, p. 81)
Dunn and Harman (2000) argue that Matlab can be incorporated into
learning in three main ways: Firstly, as a numerical and graphical tool.
In this role, the computer enables faster illustration of the
applications and better visualisation of complex ideas (such as three
dimensional graphical representations). Typically, this approach uses
command-line instructions in Matlab to develop the ideas, and no
programming skills are required; see, for example, Cretchley et al.
(1999, 2000).
Secondly, Matlab can be incorporated into the material through
appropriate programs that students then use to facilitate the learning
of ideas and concepts. In this approach, the programs are provided to
the student for their use or possible modification. Penny and Lindfield
(1995) and Borse (1997), for example, have adopted this practice; Colgan
(2000) discusses a course built around this approach. A basic knowledge
of programming is helpful or necessary in this approach.
Thirdly, Matlab software can be used to show the procedure or
concept in a more sophisticated way. This is done by using larger,
stand-alone and robust programs with well-designed graphical user
interfaces. No programming skills are required in this approach. Of
course, these three approaches can effectively complement one another.
In this paper, the emphasis is given to the third approach. A
number of statistical and mathematical tools are presented in Matlab,
though the language itself is not important. Examples from other areas
of statistics can be found in Dunn (1999). Two tools for use in
regression, a tool for use in teaching generalized linear models, and
details on how to obtain and use the programs are presented in the
sections that follow.
TOOLS FOR USE IN REGRESSION
Regression is an important tool for any student of statistics to
study and, consequently, it is an essential component of most
university statistics programs. Two issues often discussed in relation
to multiple regression are normal probability plots (also known as
rankit plots, Q-Q plots, probability plots, quantile-comparison plots,
and normal plots), and overfitting. This paper discusses two Matlab
functions that demonstrate important points about both issues. Although
Matlab is not primarily a statistics package (it does, however, have a
comprehensive Statistics Toolbox), these features make it suitable for
use in statistics classes at certain times. We certainly believe that
our programs based on Matlab enhance the student's experience
within statistics.
It has been argued that the impact of technology "should be
especially strong in statistics" (Bratton, 1999, p. 666) since
technology plays such an important role in the actual practice of
statistics. Bratton (1999) then argues that there are three ways in
which proper implementation of technology into statistics can happen: By
making some topics unnecessary; by introducing new topics; and by
teaching some methods better. In this third category, he especially
makes mention of randomness and probability concepts. One of the
programs we discuss herein addresses this issue. West and Ogden (1998)
discuss a number of Java applets providing statistical demonstrations
over the World Wide Web, though none cover the same issues as
those presented here. The rationale, however, is similar to ours.
Saunders (1986) discusses a number of programs used at the Open
University, some of which cover similar ground, but whose implementation
is far more complicated.
In the next section, the normal probability plot is discussed
together with the associated Matlab program. Afterwards, the Matlab
function concerning overfitting is discussed.
The Normal Probability Plot
One of the fundamental assumptions in standard linear regression is
that of normal (Gaussian) errors. The assumption of normality is really
only necessary for hypothesis testing in the model; for example, a
t-test on one of the parameters, or an F-test of submodels.
To assess the normality of the residuals, it is often tempting to
plot a histogram of the residuals and judge normality from it.
However, normality is difficult to assess from a histogram:
Are the tails too heavy? Too light? Fox (1991, p. 44) maintains that a
normal probability plot is useful for detecting departures from
normality in the tails, which is where problems often occur and where
they are often more serious. He also suggests that a weakness of the
plots is that they do not show the overall shape, which is best done
using a histogram or a stem-and-leaf plot. Our program supplies both a
normal probability plot and a histogram. Marasinghe, Meeker, Cook, and
Shin (1996) also include boxplots and the theoretical probability
density function of the distribution.
In normal probability plots, the sample order statistics are
plotted against the expected values of normal order statistics (or
rankits). (Some authors prefer to use the axes in the reverse sense.) If
the sample is from a normal distribution, the plot is expected to be
linear. For a non-normal sample, the plot would be non-linear. Such
plots are quite common in the literature. In fact, Marasinghe et al.
(1996) discuss a useful Lisp-Stat program for demonstrating (among other
things) the behaviour of the normal probability plot and histograms for
different distributions.
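To make the construction concrete, a minimal hand-rolled version can be
written in base Matlab (a sketch only, not the author's nppdemo code;
Blom's plotting positions are one common convention, and other choices
exist):

    % Normal probability plot by hand, using only base Matlab
    n  = 100;
    y  = randn(n, 1);                     % sample to be assessed
    ys = sort(y);                         % sample order statistics
    pp = ((1:n)' - 0.375) ./ (n + 0.25);  % Blom's plotting positions
    z  = sqrt(2) * erfinv(2*pp - 1);      % approximate normal rankits
    plot(z, ys, 'o')                      % near-linear for normal data
    hold on
    plot(z, mean(y) + std(y)*z, '-')      % line for the fitted normal
    hold off
    xlabel('Normal quantiles'), ylabel('Ordered sample')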
Many books, such as Jennrich (1995), provide a number of normal
probability plots generated from actual normal distributions so
students can see the types of plots produced. Jennrich (1995) also
provides numerous normal probability plots of random numbers from
the double exponential and Cauchy distributions to indicate some
typical data features that can be detected by such plots.
Cook and Weisberg (1994) include a program enabling readers to gain
experience in judging whether normal probability plots are linear. It
gives users a choice of plotting normal, uniform, chi-square or
t distributions. Some authors, for example Weisberg (1985), make only
cursory mention of such plots.
The program discussed here allows the user to quickly and easily
generate as many plots as necessary to gain experience in recognising
characteristics of the plots, and to explicitly generate plots from data
that exhibit certain characteristics.
There are four main objectives in using this tool:
* To gain experience in recognising the typical characteristics of
normal probability plots;
* To compare and evaluate the characteristics of normal probability
plots for normal and non-normal distributions;
* To understand the difficulties in identifying normality in small
samples which are actually drawn from normal distributions;
* To demonstrate the inadequacies of simply relying on a histogram
to detect non-normality in a data set.
The Program
To help students understand the behaviour of the normal probability
plot for normal and non-normal samples, the Matlab program nppdemo has
been developed. The initial screen is shown in Figure 1. There are three
regions: The left panel is where the normal probability plot is shown.
The middle panel is where a histogram of the data is shown, and the
controls are in the right panel.
The control panel is divided into three parts. The top section
allows the user to select one of five (mutually exclusive) types of
distributions to be plotted. The middle section allows the user to set
parameters of the distribution, including the number of points to be
randomly generated. The lower section allows the user to plot the data
or to quit the program.
The top section gives the user five choices of random data to plot.
So that the program can operate without the Statistics Toolbox (an
extra-cost addition to Matlab), the random data are generated from
the two random number generators supplied with the basic Matlab program:
randn for normal random variables, and rand for continuous uniform
random variables.
The left panel displays the normal probability plot. To assess the
normality of the errors, one has to determine if the points are
approximately on a straight line. To aid the user in making this
assessment, the normal probability plot is plotted with a straight line
corresponding to the normal distribution with the mean and variance
estimated from the data.
In the middle panel, a histogram is plotted. The histogram is
plotted with its bars horizontal, rather than vertical as is more
common. This is done so that the data in both graphs are plotted
against the same (left) vertical axis. It also means the histograms
mimic the stem-and-leaf plots used by Fox (1991).
The first option in the top section of the right panel generates
random normal values for a given mean and variance. This option should
produce a normal probability plot that is approximately linear.
[FIGURE 1 OMITTED]
The second option generates right (positively) skewed data from an
exponential distribution. (The variance cannot be supplied for an
exponential distribution; it is equal to the mean.) These random numbers
are easily generated from uniform random numbers (Devroye, 1986, sect.
II.2.1) and are expected to produce a normal probability plot that is
concave upward. Figure 2 shows the screen for 100 random points from
this distribution, with the corresponding histogram and normal
probability plot.
The third option generates left (negatively) skewed data. This is
generated from exponential random numbers also, and made left skewed by
subtracting from the maximum value in the sample. The normal probability
plot is expected to be concave down.
[FIGURE 2 OMITTED]
The fourth option generates random numbers from a distribution with
long (heavy) tails. Random numbers from a t-distribution with two
degrees of freedom are generated since they can be written as a function
of uniform random numbers (Devroye, 1986, sect. IX.3.3). Since the
variance for such a distribution is undefined, the variance cannot be
entered in this case. A distribution with long tails is expected to
produce a normal probability plot somewhat like a reflected
'S' shape.
The fifth option generates random numbers with short (light) tails.
The numbers are generated from a continuous uniform distribution and are
expected to produce a normal probability plot somewhat like an
'S' shape.
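The five data types can plausibly be generated from rand and randn along
the following lines (a sketch assuming the inversion methods of Devroye
(1986); the exact code in nppdemo may differ, and mu, v and n stand for
the user-set mean, variance and sample size):

    mu = 2; v = 1; n = 100;                      % illustrative settings
    y1 = mu + sqrt(v) * randn(n, 1);             % 1: normal(mu, v)
    y2 = -mu * log(rand(n, 1));                  % 2: exponential, mean mu
                                                 %    (inversion; mu > 0)
    y3 = max(y2) - y2;                           % 3: left skew: reflected
                                                 %    exponential
    u  = rand(n, 1);
    y4 = (2*u - 1) ./ sqrt(2*u .* (1 - u));      % 4: t, 2 df (inversion)
    y5 = mu + sqrt(12*v) * (rand(n, 1) - 0.5);   % 5: uniform, mean mu,
                                                 %    variance v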
The middle section allows the user to set the mean and (where
possible) the variance of the generated data, as well as the number of
points generated. The mean can be set between -100 and 100, and the
variance can be set to a positive value less than 100. The number of
random points generated can range between 5 and 1000.
The lower section contains just two buttons. Every time that Plot
is pressed, a new random data set is generated according to the set
parameters, and the normal probability plot and histogram updated. Quit
simply closes the program.
Using the Program
There are a number of ways to use this program in the classroom in
line with the objectives listed earlier. The most obvious is to
demonstrate typical plots for residuals with certain characteristics.
For example, if the model produces residuals skewed to the right, the
typical normal probability plot can be shown by selecting the
appropriate options, and pressing Plot. Students found this more
engaging than simply looking at pictures in textbooks.
To show the forms of such plots, it is recommended that the number
of points generated be quite large, say over 200, and possibly set to
the largest value permitted in the program, which is 1000. This ensures
that the pattern is very clearly displayed. Then, having seen the
typical patterns to watch for, smaller samples can be generated to see
how easy or difficult it is to spot these patterns for smaller data
sets.
The program can also be used to display how difficult it is to
detect non-normality, especially for small data sets. This is probably
best demonstrated by choosing a small sample size, say five or ten, and
generating plots from a normal distribution. Since pressing Plot
generates a new data set each time, it is quick and easy to generate a
number of such plots, and it should not take too many attempts to
produce a histogram or normal probability plot that doesn't look
normal even though it is known to be generated from a normal
distribution. Students found this enlightening and, in some respects,
challenging to their limited experience. Students seemed to be very
interested in the program, and were keen to experiment. In particular,
they were very interested to see the shapes of the normal probability
plots for small sample sizes, especially from known normal
distributions. There was much discussion about how difficult it can
sometimes be to spot deviations from normality unless the sample size is
"large." There was much discussion over what constituted a
"large" sample. In later laboratory sessions using real data
sets, students often commented on the lessons learnt from the
demonstration, especially when the data sets were small in size.
The program can also be used to show how difficult it is to
identify a normal distribution only using a histogram. Many data sets
(especially small ones) can have a normal probability plot that looks
reasonably linear, but whose histogram looks non-normal.
The Overfitting Program
One dilemma that arises in multiple regression is balancing the
competing criteria of having enough regressors for an adequate model,
and having so many regressors that the model produces "artificial
accuracy," or overfitting. The program overfit demonstrates this
situation.
Many criteria can be used to select the "optimal" number
of regressors, including the adjusted coefficient of determination
(adjusted R^2) as reported by many software packages, Mallows' Cp,
and PRESS-related statistics. The adjusted R^2 penalizes the ordinary
coefficient of determination when regressors are added to the model.
Introducing a new, non-significant regressor will very slightly
increase R^2, but will reduce the value of the adjusted R^2.
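For reference, the penalty takes the standard form (with n the number of
observations and p the number of fitted coefficients, including the
intercept):

    adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p)

Adding a regressor increases p, so R^2 must rise by enough to offset the
lost degree of freedom before the adjusted R^2 improves.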
Mallows' Cp is calculated on the basis that a good
model is unbiased. This statistic, then, estimates the bias in the model,
which is then used as a criterion for selecting a good model (Weisberg,
1985, sect. 8.8).
The PRESS (PRediction Error Sum of Squares) residuals use a
different criterion again. They are called deleted residuals by Jennrich
(1995), and Atkinson (1985, chap. 3) discusses similar residuals which
he calls jack-knife or cross-validatory residuals. In the case of PRESS
residuals, each observation i is excluded from the analysis in turn, and
the model fitted to the remaining n-1 data points. This model is then
used to predict the value for the missing observation i, and the
residual in predicting the missing observation is the PRESS residual for
that point. (Note that alternative computational formulae exist.) The
PRESS residuals can be used to calculate statistics for assessing the
quality of models. The PRESS statistic is generally given as the sum of
the squared PRESS residuals (also called Allen's PRESS by Jennrich
(1995) after its originator). The sum of the absolute PRESS residuals
can also be used. The model with the smallest PRESS statistic is chosen
as the "optimal" model.
There are three major objectives for using this tool:
* To understand that adding more terms to a regression model (and
hence overfitting) does not necessarily produce a better model;
* To understand that obtaining a well-fitting model and obtaining
an accurate predicting model (small residuals) are not necessarily the
same goal;
* To realize that the criteria used for selecting an
'optimal' regression model do not necessarily agree.
The Program
To help students understand the problems of overfitting, the Matlab
program overfit fits regression polynomials to a set of data. The
initial screen is shown in Figure 3. There are two regions on the
screen: The main figure, and controls on the right side; the main figure
is where the plots are shown. The program plots data points and then
fits polynomials of increasing order to these data, so that the effects
of overfitting can be observed.
The panel on the right contains three sections: The upper section
allows the user to specify the source of the data points used. The
middle section contains information about the data points (and it
depends on the source of the data selected in the region above). The
lower section contains the main controls.
The top section of the right panel allows the user to specify the
source of the data as either a file or randomly generated by Matlab. The
data files should be text files, containing two columns of space
delimited numbers; the first column contains x values and the second y.
Using computer-generated data allows the student or instructor to
see the effects of overfitting, while being able to load other data means
that the effects can be seen on data relevant or known to the student.
The middle region of the right panel depends on the data source. If
the data are sourced from a data file, the middle region gives the name
of the data file used, and contains a button allowing the user to use a
graphical interface to load the appropriate data file.
If the data are generated by Matlab, there are two options to be
set. The bottom menu sets the type of regression line from which points
are generated. The user can choose between a straight line, a quadratic or a cubic. The program will generate eleven (non-evenly spaced) points
from the specified polynomial, and add an amount of Gaussian
"noise" with the error variance specified in the top region of
this section. The value of the error variance can be set to a positive
value less than 1.
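A plausible version of this generation step is sketched below (the
polynomial coefficients here are hypothetical, not taken from the
program):

    v = 0.2;                         % user-set error variance (0 < v < 1)
    x = sort(rand(11, 1));           % eleven non-evenly spaced x values
    y = 1 + 2*x - 1.5*x.^2 ...       % a hypothetical quadratic trend
          + sqrt(v) * randn(11, 1);  % additive Gaussian "noise"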
In the lower section, pressing the Plot Data button will plot (in
blue) the data as specified in the top section. Two buttons under the
Plot Data button have arrows, indicating fitting models of higher order
(upward pointing arrow) or lower order (downward pointing arrow) than
the model currently fitted (indicated in the edit window between the
arrows). These arrow buttons plot polynomials from a first order
(linear) up to an eighth order polynomial. These fitted polynomials are
generally shown in red. The only exception is if the data is generated
from Matlab, when the polynomial of the order specified (linear,
quadratic or cubic) is shown in black. At any time, the user can enter
any particular polynomial order (between one and eight) into the edit
window between the arrows to get a plot of that particular order
polynomial fitted to the data.
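The core fit-and-redraw step behind the arrow buttons can be sketched
with base Matlab's polyfit and polyval (illustrative; the colours follow
the description above):

    order = 3;                            % value in the edit window (1-8)
    b  = polyfit(x, y, order);            % least-squares polynomial fit
    xx = linspace(min(x), max(x), 200);   % fine grid for a smooth curve
    plot(x, y, 'bo', xx, polyval(b, xx), 'r-')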
[FIGURE 3 OMITTED]
Using the Program
The program can easily demonstrate the folly of overfitting as
listed in the objectives. Almost always (for generated data), the eighth
order polynomial is very accurate in the sense of having small
residuals, but the fitted curve shows it has not captured the trend of
the data very well at all. See Figure 4 for an example of fitting an
eighth order polynomial to data which is actually from a linear model
with error variance of 0.2. The students reacted very favourably to
this, even to the extent of making comments and drawing in their notes
some of the plots that resulted from fitting the higher order
polynomials. They seemed to understand the dangers in overfitting a
model quite well after seeing the demonstration. In addition, it became
clear to the students that producing small residuals and finding a good
model were two different goals, a new realization for many students.
After fitting the regression models, the Matlab command window
gives some statistics for each model: the order of the polynomial
(with the specified model flagged by stars if the data are generated
using Matlab); the sum of the squared errors (SSE); the ordinary
coefficient of determination R^2; the adjusted R^2; Mallows' Cp;
an unbiased estimate of the error variance (which can be compared to
the specified error variance for generated data); and the PRESS
statistic. This enables the user to see how the different criteria
perform in each situation, and which model might be chosen by each.
Importantly, the criteria do not always agree on the "optimal" model.
For data loaded from a file, the full model is considered to be an
eighth order polynomial for the calculation of Cp. The output for one
such simulation is given below.
Order    SSE     R^2    adj R^2    s^2    sum(PRESS^2)     Cp     p
  1     14.72   0.140    0.045    1.636      19.986      60.989   2
 *2*     1.95   0.886    0.858    0.244       2.482       3.000   3
  3      1.93   0.887    0.839    0.275       2.851       3.922   4
  4      1.92   0.888    0.813    0.320       4.034       4.908   5
  5      1.30   0.924    0.848    0.261       2.862       4.343   6
  6      0.87   0.949    0.873    0.218       2.019       4.787   7
  7      0.87   0.949    0.831    0.289       3.191       6.336   8
  8      0.27   0.984    0.922    0.134       2.217       7.276   9
In this case, a quadratic model has been specified (as can be seen
by the stars flagging the second order polynomial). The SSE continues to
fall as the order of the polynomial increases, as we would expect.
Likewise, the value of R^2 continues to get closer to one. The adjusted
R^2, however, seems to imply a model of sixth or eighth order.
(Plotting the sixth and eighth order polynomials shows that these
polynomials do not capture the trend of the data well at all.) The error
variance estimate is close to the specified 0.2 for the quadratic model,
but is actually smallest for some of the higher order polynomials. The
PRESS statistic implies the second, sixth or eighth order polynomials
are most useful. Mallows' Cp indicates that a quadratic model should
indeed be fitted (note that, for Matlab-generated data, Cp = p for the
specified model by the definition of Cp). This demonstrates how
difficult it is to decide on the optimal order of a polynomial for
fitting to the data. Student comments suggest they gained an
appreciation of the need for different criteria, the different goals of
each, and the difficulties in choosing an "optimal" model.
[FIGURE 4 OMITTED]
TOOLS FOR UNDERSTANDING GENERALIZED LINEAR MODELS
Classical linear regression is a standard and popular statistical
tool, but the need to simultaneously satisfy its assumptions (normality
of the data, constant variance, and independent errors) renders it
inadequate in many cases. However, generalized linear models, introduced
by Nelder and Wedderburn (1972), provide a wide class of statistical
models that may be of use. Indeed, classical linear regression is a
special case of generalized linear models.
Background
Generalized linear models were first introduced as a method of
analyzing a broad class of statistical models based on exponential
dispersion models (EDMs); that is, distributions having the probability
density function (or mass function) in the form
f(y; \theta, \phi) = a(y, \phi) exp{[y\theta - \kappa(\theta)] / \phi}

where a(y, \phi) and \kappa(\theta) are known functions, and the mean
is \mu = \kappa'(\theta). For these models, Var[Y] =
\phi \kappa''(\theta), where V(\mu) = \kappa''(\theta) is called the
variance function, and \phi is a positive constant of proportionality
known as the dispersion parameter. The linear predictor is generally
referred to as \eta, where \eta = X\beta; X is the design matrix and
\beta the vector of coefficients to be estimated. A link function g(.)
is chosen so that \mu = g^{-1}(\eta), where g(.) is any monotonic
differentiable function. While the link function can be chosen
independently of the distribution, the canonical link function deserves
special mention because of its common usage; it is the link function
for which \eta = \theta (and is therefore dependent upon the
distribution). A wide variety of
distributions are EDMs: continuous distributions include the normal,
inverse Gaussian and gamma distributions; discrete distributions include
the Poisson and binomial distributions. The theory behind such models
has been well established by Nelder and Wedderburn (1972) and McCullagh
and Nelder (1989). The special case of linear regression uses a normal
distribution with the identity link function.
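As a concrete instance (standard theory, not specific to this paper):
the Poisson distribution with mean \mu has probability mass function

    f(y; \mu) = e^{-\mu} \mu^y / y! = (1/y!) exp{y log \mu - \mu},

which is of EDM form with \theta = log \mu, \kappa(\theta) = e^\theta
and \phi = 1. Hence \mu = \kappa'(\theta) = e^\theta, the variance
function is V(\mu) = \kappa''(\theta) = \mu, and the canonical link
(the choice giving \eta = \theta) is the logarithm.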
The algorithm for fitting such models is reasonably
straightforward but iterative, and requires a few tricks to use well
(for example, avoiding taking logarithms of zero). The reader is
referred to McCullagh and Nelder (1989).
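To indicate the flavour of the iteration, here is a minimal iteratively
reweighted least squares sketch for a Poisson model with the log link
(an illustration only, not the glmshow code; it assumes a design matrix
X, n-by-p with an intercept column, and a response vector y):

    beta = X \ log(y + 0.5);           % crude start; the 0.5 avoids
                                       % taking logarithms of zero
    for iter = 1:25
        eta = X * beta;                % linear predictor
        mu  = exp(eta);                % inverse of the log link
        z   = eta + (y - mu) ./ mu;    % working response: g'(mu) = 1/mu
        W   = diag(mu);                % weights 1/(V(mu) g'(mu)^2) = mu
        betaNew = (X' * W * X) \ (X' * W * z);
        if max(abs(betaNew - beta)) < 1e-8
            beta = betaNew; break      % converged
        end
        beta = betaNew;
    end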
There are five main objectives of the program:
* To understand the roles of the link function and the chosen
distribution, and to realize they are separate roles;
* To understand the effect of the mean-variance relationship of an
EDM;
* To recognize that the fitted values \mu = E[Y] actually define
the mean of the distribution for given x;
* To observe that a correct choice of distribution and link
function can model certain types of data (such as strictly positive
data) better than standard regression models;
* To understand the role of the dispersion parameter in generalized
linear models.
The Program
The demonstration tool, called glmshow, shows simple generalized
linear model structures and allows the components of this structure to
be interactively altered. The model structure can be based on a normal,
gamma or Poisson distribution, with a choice of three link functions:
the identity, logarithm and reciprocal link functions. The value of the
dispersion parameter can also be specified (except for the Poisson
distribution, for which it is fixed at 1).
The initial screen is shown in Figure 5. In this screen, the
options that can be adjusted are shown at the bottom. The distribution
and link function can be changed by clicking on the current settings and
making a selection from a drop-down menu. The dispersion parameter can
be changed by entering a new value in the edit box.
Above the options is a slider. Moving this slider (with the mouse
or arrow keys on the keyboard) changes the x-value on the graph, and the
other dependent details are then updated also. The Close button closes
the window.
So that the demonstration might be used in a classroom setting, the
lines are thicker than standard. In addition, the font sizes are larger
than standard and grow with the size of the window as it is maximized or
enlarged.
Using the Program
The demonstration can be used in various ways as an aid for
explaining many of the new concepts students are exposed to with
generalized linear models, as listed in the objectives.
The starting point should be in familiar territory: the standard
regression model, based on the normal distribution with the identity
link function. Selecting these options (they are also the default settings)
shows the screen in Figure 5. Moving the slider changes the x-value and
dependent quantities. The predicted mean value is indicated by a red dot
at this x-value. It is seen that a (normal) distribution of responses is
predicted, and that the variance of these responses is constant for
various values of x. Indeed, by specifying large values of x (and hence
small values of y), the normal distribution admits negative values,
which may not always be appropriate. This is shown in the demonstration
by part of the normal distribution moving off the screen. Altering the
dispersion parameter changes the variance of the normal distribution.
[FIGURE 5 OMITTED]
The effect of the mean-variance relationship for the gamma and
Poisson distributions can be shown by choosing one of these
distributions and moving the slider. The gamma distribution is shown as
an appropriately scaled probability density function (see Figure 6), and
the discrete Poisson distribution as an appropriately scaled probability
histogram (Figure 7). It becomes clear that the variance of the
predicted distribution changes with x. In particular, as x gets larger
(and hence y gets smaller) the response variable can be seen to never
take on values below y=0. This shows the dependence of the variance on
the mean for distributions other than the normal and is an opportunity
for the instructor to talk about the choice of distribution when
negative values of y are inappropriate.
[FIGURE 6 OMITTED]
For any distribution, the link function can be changed
independently. This is useful for discussing the separate roles of the
distribution and link function. Similarly, the effect of changing the
dispersion parameter can be shown for various distributions, link
functions and x-values.
Students were very interested in this program; it became apparent
that students had a greater understanding of the framework of
generalized linear models after using the program. In particular,
student comments indicated a clarification of the idea that the
predicted values \mu were means of a distribution of responses for
given x.
USING AND OBTAINING THE PROGRAMS
It is almost obvious that "new technology should be layered
atop a traditional foundation" (Krantz, 1998, p. 28). The programs
discussed in this paper are learning aids, and cannot possibly be
effective without the support of additional adequate teaching. Some
notes on using the programs are presented here. In all cases, the
students gain more by seeing the program initially demonstrated by an
instructor, and then having time to interact with the programs
themselves. It may also be useful to have guided questions for the
students when using the programs to evaluate understanding, even to the
extent of using the programs in assessment.
The first program (normal probability plots) can be presented
during a discussion on regression diagnostics. Students should be able
to gain a reasonable amount of experience with the plots in under 15
minutes. It may be useful to have the students write a short summary of
their conclusions (in an assignment question, for example). Example
questions include writing short answers about (i) the patterns observed
in the plots for different shape distributions, (ii) what happens when
sample sizes are small (say less than 10) even when they are from a
normal distribution; and (iii) general guidelines for detecting
non-normal residuals.
The second program (overfitting regression models) can be
introduced during a discussion of the criteria for finding an
"optimal" model. Since real data can be imported and used, the
time spent using the demonstration can vary; simply using generated
data, students should have a good understanding of the concepts within
10 minutes. The program can be used in assessment with a real
(two-dimensional) data set where students are asked to select a model
for their data, justifying their choice. A suggested data set is data
set 384 from Hand, Daly, Lunn, McConway, and Ostrowski (1994), for which
a second order polynomial seems appropriate. The impact of influential
observations can also be observed; a suggested data set is the two assay
experiments on hormones from Carroll and Ruppert (1988).
The third program (generalized linear models) should be introduced
soon after the components of the generalized linear model have been
explained. There is often confusion about the different roles of the
link function and distribution. Students can manipulate the various
components of the generalized linear model and see how the components
can be independently chosen, yet interact to find a suitable model. In
addition, the concept that the fitted values define the mean of a
distribution for given x is often hard to explain and illustrate (even
though this concept also belongs in the standard regression context);
the program not only illustrates this but allows the value of x to vary,
which serves to consolidate the concept.
[FIGURE 7 OMITTED]
The functions discussed in this paper require Matlab version 5 or 6
(for any available platform), but the Statistics Toolbox is not
necessary. Because the functions use a graphical user interface,
students and instructors do not need to learn Matlab to use the
functions. Each function requires a single (text) file, and all
functions are available from the author's Web pages at
http://www.sci.usq.edu.au/staff/dunn/matlabprograms.html.
The programs must be copied to a directory or folder in the Matlab
path. (The current path can be displayed by typing path at the Matlab
prompt.) In general, the path can be changed by typing pathtool at the
Matlab prompt, but it is wise to also consult the local documentation,
as there are sometimes platform- and version-specific details to be
aware of. The programs are copyright of the author and can be used
freely, but cannot be exploited commercially.
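For example, one way to install and run a program (the folder name here
is only an assumption; substitute wherever the files were saved):

    addpath('C:\matlabdemos')   % add the folder to the search path
    path                        % confirm the folder now appears
    nppdemo                     % start the normal probability plot demo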
Acknowledgements
The author would like to thank the editor and reviewers, whose
comments made substantial improvements to the article, and improved the
usefulness of the programs in the classroom.
Reprinted from JCMST 22(3) because some characters in equations
were inadvertently omitted. We apologize for this error.
References
Atkinson, A.C. (1985). Plots, transformations and regression: An
introduction to graphical methods of diagnostic regression analysis.
Oxford: Clarendon Press.
Borse, G.J. (1997). Numerical methods with Matlab: A resource for
scientists and engineers. Boston: PWS Publishing.
Bratton, G.N. (1999). The role of technology in introductory
statistics classes. The Mathematics Teacher, 92(8), 666-669.
Carroll, R.J., & Ruppert, D. (1988). Transformation and
weighting in regression. New York: Chapman and Hall.
Colgan, L. (2000). MATLAB in first-year engineering mathematics.
International Journal of Mathematical Education in Science and
Technology, 31(1), 15-26.
Cook, R.D., & Weisberg, S. (1994). An introduction to
regression graphics. New York: Wiley.
Cretchley, P., Harman, C., Ellerton, N., & Fogarty, G. (1999).
Computation, exploration, visualisation: Reaction to Matlab in
first-year mathematics. In W. Spunde, R. Hubbard & P. Cretchley
(Eds.), Proceedings of Delta '99: Australian Symposium on Modern
Undergraduate Mathematics (pp. 81-86). Toowoomba: Delta '99
Committee.
Cretchley, P., Harman, C., Ellerton, N., & Fogarty, G. (2000).
Matlab in early undergraduate mathematics: an investigation into the
effects of scientific software on learning. Mathematics Education
Research Journal, 12, 219-233.
Devroye, L. (1986). Non-uniform random variate generation. New
York: Springer-Verlag.
Dunn, P.K. (1999). Three tools for interactively visualizing some
distribution theory concepts. The American Statistician, 53(2), 137-139.
Dunn, P.K., & Harman, C. (2000). Calculus demonstrations using
Matlab. International Journal of Mathematical Education in Science and
Technology, 33(4), 584-595.
Fox, J. (1991). Regression diagnostics: An introduction. Sage
University Paper Series on Quantitative Applications in the Social
Sciences, 07-079, Newbury Park: Sage.
Galbraith, P., Haines, C., & Pemberton, M. (1999). A tale of two
cities: When mathematics, computers and students meet. In J.M. Truran
& K.M. Truran (Eds.), Making the difference (pp. 215-222). Sydney:
Mathematics Education Research Group of Australasia Inc.
Hand, D.J., Daly, F., Lunn, A.D., McConway, K.J., & Ostrowski,
E. (Eds.). (1994). A handbook of small data sets. London: Chapman and
Hall.
Jennrich, R.I. (1995). An introduction to computational statistics.
Englewood Cliffs, N. J.: Prentice-Hall.
Krantz, S.G. (1998). How to teach mathematics. Providence R.I.:
American Mathematical Society.
Penny, J., & Lindfield, G. (1995). Numerical methods using
Matlab. New York: Ellis Horwood.
Marasinghe, M.G., Meeker, W.Q., Cook, D., & Shin, T.-S. (1996).
Using graphics and simulation to teach statistical concepts. The
American Statistician, 50(4), 342-351.
McCullagh, P. and Nelder, J.A. (1989). Generalized linear models
(2nd ed.). London: Chapman and Hall.
Nelder, J.A., & Wedderburn, R.W.M. (1972). Generalized linear
models. Journal of the Royal Statistical. Society Series A, 135(3),
370-384.
Saunders, D.J. (1986). Computer graphics and animations for
teaching probability and statistics. International Journal of
Mathematical Education in Science and Technology, 17(5), 561-568.
Weisberg, S. (1985). Applied linear regression. New York: Wiley.
West, R.W., & Ogden, R.T. (1998). Interactive demonstrations
for statistics on the World Wide Web. Journal of Statistics Education,
6(3). Retreived April 24, 2003 from JSE archive:
http://www.amstat.org/publications/jse/jse_archive.html.
PETER K. DUNN
University of Southern Queensland
Australia
dunn@usq.edu.au