Assessing growth in young children: a comparison of raw, age-equivalent, and standard scores using the Peabody Picture Vocabulary Test.
Sullivan, Jeremy R.; Winter, Suzanne M.; Sass, Daniel A.; Svenkerud, Nicole
Many tests provide users with several different types of scores to
facilitate interpretation and description of students' performance.
Common examples include raw scores, age- and grade-equivalent scores,
and standard scores. However, when used within the context of assessing
growth among young children, these scores should not be interchangeable
because they provide different information. To examine how raw,
age-equivalent, and standard scores function when assessing growth among
young children, this article uses scores on the Peabody Picture
Vocabulary Test-Third Edition to compare the use of these scores for the
purpose of measuring growth in receptive vocabulary skills among a
sample of 259 low-income, predominantly Hispanic preschoolers age 3 to 5
years. Results suggest a notable floor effect in the distribution of
age-equivalent scores that was not observed in the raw score or standard
score distributions. This floor effect may significantly affect the
results of correlational data analyses conducted with these scores. In
light of these findings and combined with a trend in the literature in
which researchers often do not provide a clear rationale for choosing
which test scores to use in statistical analyses, the authors offer
suggestions for researchers when using test scores as dependent
variables.
Keywords: assessment, psychological tests, measurement, young
children
Measuring growth or change among young children is a common goal of
educational and psychological assessment, especially within the context
of measuring academic progress over time or intervention studies
attempting to investigate treatment effects using pre-post designs.
Within this context, constructs of interest include cognitive
functioning, early academic skills, motor functioning, language, social
skills, and adaptive behaviors (McConnell, Priest, Davis, & McEvoy,
2002; Spector, 1999). Psychological and educational tests designed for
these purposes typically provide the user with several different types
of scores to use to facilitate interpretation and description of
students' performance, and which scores are most appropriate to
interpret will depend on the purpose of the assessment. For example, raw
scores and percentage-correct scores can be used to describe the
student's current level of mastery, whereas norm-referenced scores,
such as age- and grade-equivalent scores, standard scores, and
percentiles, can be used to describe the student's performance
relative to her same-age peers. Given this variety of scores from which
to choose, test users may wonder which score is the best for reliably
and validly measuring change over time. Further, intervention studies
using standardized tests to assess growth sometimes neglect to report
which test scores were used as measures of the dependent variables,
which may influence the results of statistical analyses and therefore
influence the conclusions reached. The purpose of this article is to
briefly define raw, age-equivalent, and standard scores; review some of
the psychometric limitations associated with these different scores; and
empirically compare these scores for the purpose of assessing growth
among young children in particular, using scores on the Peabody Picture
Vocabulary Test-Third Edition (PPVT-III; Dunn & Dunn, 1997) from a
sample of preschoolers. The PPVT-III is an ideal measure for this
purpose because it is widely used for clinical and research purposes and
provides all three scores under investigation. Through our review and
empirical findings, we hope to demonstrate that these scores should not
be used interchangeably, because they provide different pieces of
information about the student, and that the psychometric limitations of
some of these scores suggest the need for cautious interpretation.
REVIEW OF DEFINITIONS AND LIMITATIONS OF TEST SCORES
Raw Scores
Within the context of cognitive and achievement assessment, raw
scores are typically obtained by simply counting the number of test
items answered correctly by the student (Angoff, 1984). Some tests
(e.g., processing speed subtests) employ more complex scoring procedures
to obtain raw scores, such as subtracting the number of errors from the
number of correct responses. Further, on tests of affective constructs,
such as emotional and behavioral functioning, raw scores are not
determined by item correctness because there are no "correct"
or "incorrect" responses to these items. Rather, raw scores
are determined by adding together the student's (or parent's,
or teacher's) responses to items employing a Likert-type scale.
For the purposes of criterion-referenced assessment, in which test
users are interested in the individual student's mastery without
regard to comparisons with other children, raw scores may be sufficient
in describing the student's performance. For the purposes of
norm-referenced assessment, however, raw scores by themselves are often
less informative (Urbina, 2004). Instead, they must be converted into a
norm-based score to describe performance relative to other children the
student's age. Further, raw scores for different tests (even
different tests of the same construct) cannot be compared with one
another, because the same raw score can mean different things for
different tests based on factors such as the number of items, type of
items, minimum and maximum scores, item difficulty, time limits, and
process for calculating raw scores. At the same time, raw scores (such
as those obtained via curriculum-based assessment) may be more sensitive
than norm-based scores to smaller changes in psychological or
educational functioning over time, and thus may be especially useful in
measuring growth within individual students (Riccio, Sullivan, &
Cohen, 2010).
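To make these scoring procedures concrete, the brief Python sketch below illustrates the two most common raw-score rules described above: counting correct responses and summing Likert-type ratings. The items and responses are hypothetical and are not drawn from any particular instrument.

    # Illustrative only: hypothetical items; raw score = count of correct responses,
    # or the sum of Likert-type ratings when there are no "correct" answers.
    def raw_score_correctness(responses, answer_key):
        """Raw score as the number of items answered correctly."""
        return sum(1 for resp, key in zip(responses, answer_key) if resp == key)

    def raw_score_likert(ratings):
        """Raw score as the sum of Likert-type ratings."""
        return sum(ratings)

    print(raw_score_correctness(["a", "c", "b", "d", "a"], ["a", "c", "d", "d", "b"]))  # 3
    print(raw_score_likert([2, 3, 1, 4]))  # 10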
Age-Equivalent Scores
Although the review that follows will include discussion of
age-equivalent (AE) and grade-equivalent (GE) scores (due to their
conceptual similarities and shared limitations), the study itself
employed only AE scores, because the PPVT-III does not provide GEs. The
use of AE and GE scores for clinical and educational decision-making has
a long history, particularly in the identification of students with
learning disabilities (Hishinuma & Tadaki, 1997; Reynolds, 1981),
diagnosing speech problems and specific language impairment (Lawrence,
1992; McCauley & Demetras, 1990; Plante, 1998), and measuring the
development of adaptive behaviors over time among children with
developmental disabilities (Chadwick, Cuddy, Kusel, & Taylor, 2005).
In spite of their widespread use, AE and GE scores carry a number of
limitations that reduce their clinical utility and constrain the
interpretations and decisions that should be made on the basis of these
scores. These limitations have been well articulated in the literature
(Angoff, 1984; Bracken, 1988; Pearson Assessments, 2010; Reynolds,
1981), and will be reviewed only briefly here.
AE scores can be defined as "the chronological ages for which
the given test performances are average" (Angoff, 1984, p. 20).
Angoff (1984) described the process used to develop AE scores: children
in the norm group are divided into subgroups based on age (e.g., using
intervals of 3, 6, or 12 months, as is often done with norm-referenced
tests of cognitive ability); either the mean or median test score is
identified for each age-defined subgroup; and this score (either the
mean or median) then becomes the AE score for each age-defined subgroup.
Thus, if a raw score of 30 converts to an AE score of 8-9, this means
that the raw score of 30 was the average score for the group of children
age 8 years 9 months in the norm sample.
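The logic of this conversion can be illustrated with a short Python sketch that builds an AE lookup from a hypothetical norm sample. The data, age intervals, and matching rule below are assumptions for illustration only; they do not reproduce the PPVT-III norming procedure.

    # Hypothetical norm sample: the AE for a raw score is the age group whose
    # median raw score most closely matches it (cf. Angoff, 1984).
    import pandas as pd

    norms = pd.DataFrame({
        "age_months": [36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58],
        "raw":        [10, 11, 13, 14, 16, 17, 19, 20, 22, 23, 25, 26],
    })

    # Divide the norm group into 3-month age intervals; take each interval's median raw score
    norms["age_group"] = (norms["age_months"] // 3) * 3
    ae_table = norms.groupby("age_group")["raw"].median()

    def age_equivalent(raw_score):
        """Age group (in months) whose median raw score is closest to raw_score."""
        return int((ae_table - raw_score).abs().idxmin())

    print(age_equivalent(18))  # prints 45 with these hypothetical data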
Similar to AE scores, GE scores are defined as "the grades for
which these test performances are average" (Angoff, 1984, p. 22).
Thus, GEs are obtained in the same way as AE scores, only grade is used
as the basis for creating subgroups instead of age, with the score
representing the mean or median score for each grade level (Angoff,
1984). For example, if a raw score of 30 converts to a GE score of 3.4,
this means that the raw score of 30 was the average score for the group
of children in grade 3.4 (i.e., 3rd grade, fourth month).
In sum, then, AEs and GEs represent the mean or median raw score
obtained by a particular age group or grade level. The limitations
associated with this type of score are many. First, interpretation of
AEs and GEs depends on the unique distribution and variance of scores
around the mean at each age/grade (Bishop, 1997; Schulz &
Nicewander, 1997), and on the correlation between age/grade and test
performance (Angoff, 1984); these distributions and correlations change
with age/grade, even among different age/grade subgroups on the same
test. For example, reading skills and language skills do not develop at
a stable or constant rate; rather, they typically develop rapidly during
young childhood but then level off into adolescence and adulthood, so
the score distributions will vary with different ages and grade levels
(Bishop, 1997; Bracken, 1988; Reynolds, 1981). The implication of this
issue is that, similar to raw scores, AEs and GEs are not equivalent or
comparable across different tests, so the same AE or GE score may
indicate a substantial deficit on a test of reading comprehension but
not on another test of a different construct. Thus, a GE score of 10.3
on a test of mathematics achievement and a GE score of 10.3 on a test of
reading achievement cannot be directly compared, due to grade-related
differences in the development of these constructs and differences in
score distributions for these constructs (Angoff, 1984). To be able to
accurately interpret an AE or GE score, the test user must have access
to information about the sample and relationships among variables. For
the practitioner, this requirement makes AEs and GEs cumbersome. Another
limitation specifically in the interpretation of GEs is that curricula
and instruction are not constant across different schools and school
districts (e.g., different states emphasize different knowledge and
skills at different grade levels), so these scores do not represent
uniform measures of achievement in reading, writing, mathematics, and so
on (Urbina, 2004).
In light of these limitations of AE and GE scores, why do
researchers and clinicians continue to use them? One of the advantages
of using AEs and GEs is that they are simple, intuitive, and give the
appearance of being easily understood by parents and teachers. For this
reason, these scores are frequently used by clinicians to facilitate
score interpretation (Angoff, 1984; Hishinuma & Tadaki, 1997;
McCauley & Swisher, 1984b). For example, AEs and GEs can be used to
compare several different students on the same test to identify those
students who are functioning at the highest and lowest levels.
Alternatively, AEs and GEs can be used to identify strengths and
weaknesses across multiple subject areas or constructs for an individual
student (although skills in different constructs develop at different
rates; Payne, 1997; Thorndike, 2005). These types of scores also are
used to describe performance on constructs that change rapidly during
childhood or adolescence as a result of normal developmental processes
and learning (McCauley & Swisher, 1984a). However, this notion of
simplicity and ease of interpretation is countered by the argument that
these scores are too easily misinterpreted or overinterpreted (Lawrence,
1992). Thus, the appearance of intuitiveness and simplicity may in fact
represent the most significant danger in using AEs and GEs, as they are
not as simple for nonexperts to interpret as they may appear. For
example, if a student in the 4th grade obtained a GE score of 5.4 on a
test of reading achievement, we would not recommend that she suddenly be
placed in the 5th grade on the basis of this score, nor would we presume
that she possesses the reading skills of a 5th-grader. It would simply
mean that the student performed better than other 4th-graders on the
test, and her raw score was the average score for 5th-graders (Urbina,
2004). From a curricular standpoint, it would not make sense to conclude
that this student would perform well with 5th-grade tasks or subject
matter, because the student has not yet been taught this subject matter
(Angoff, 1984). What would be more informative would be to look at her
standard score to see how well she performed when compared to her
same-age peers (i.e., where did she fall in the distribution of other
4th-graders, or other 9-year-olds?) to identify norm-referenced
strengths and weaknesses.
Standard Scores
Standard scores are used to express the distance of the
student's score from the normative mean, in terms of standard
deviation units (Urbina, 2004). The most commonly used standard score
within the context of cognitive and achievement assessment is the
deviation IQ score, in which raw scores are converted to a score with a
mean of 100 and a standard deviation of 15. Thus, if a raw score
converts to a standard score of 100, the student is performing right at
the mean; if the raw score converts to a standard score of 85 (or 115),
the student is performing one standard deviation below the mean (or one
standard deviation above the mean for a score of 115).
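As an illustration of this conversion, the sketch below rescales a raw score into deviation-score units (M = 100, SD = 15) and reports the corresponding percentile rank under a normal distribution. The norm-group mean and standard deviation are hypothetical; operational tests use age-specific norm tables rather than a single mean and SD.

    from scipy.stats import norm

    def deviation_score(raw, norm_mean, norm_sd, mean=100, sd=15):
        z = (raw - norm_mean) / norm_sd      # distance from the norm mean in SD units
        return mean + sd * z

    def percentile_rank(standard_score, mean=100, sd=15):
        return 100 * norm.cdf((standard_score - mean) / sd)

    ss = deviation_score(raw=30, norm_mean=25, norm_sd=5)  # hypothetical norm group
    print(round(ss))                    # 115: one SD above the mean
    print(round(percentile_rank(ss)))   # roughly the 84th percentile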
An advantage of using standard scores is that they are comparable
across different tests. That is, a deviation IQ score of 115 can always
be interpreted as one standard deviation above the mean (or the 84th
percentile rank), a score of 130 can always be interpreted as two
standard deviations above the mean (or the 98th percentile rank), and so
on, for any norm-referenced test that uses this measurement index
(assuming the test produces scores that approximate the normal
distribution). If a student scores 130 on a math achievement test and
scores 130 on a reading comprehension test, then we can say that the
student is in the same relative position on both tests. Note that we
could not make this interpretation using simple raw scores for reasons
discussed above. Further, standard scores are more appropriate scores to
use in diagnostic decision-making than AEs and GEs, because standard
scores do take into account the distribution of scores around the mean.
Standard scores also provide the advantage of describing a range of
normal performance, allowing test users to picture students'
performance or behavior along a continuum from deficient to advanced
functioning, with a wide range of "typical" functioning in
between (Lawrence, 1992). In contrast, students scoring either below or
above their group-defined AE or GE are simply seen as deficient or
advanced, respectively, even though most will score either below or
above the average score (Bishop, 1997). Finally, standard scores can be
manipulated statistically (i.e., added, subtracted, or averaged) because
they are on an interval scale of measurement, whereas AEs and GEs cannot
be manipulated mathematically because they are on an ordinal scale
(Plante, 1998; Schulz & Nicewander, 1997). Due to their superior
psychometric properties and norm-referenced interpretability, standard
scores, rather than AEs and GEs, are considered more appropriate for
high-stakes decisions, such as diagnosis, placement, and need for
intervention (Bracken, 1988).
Despite their advantages over raw scores and AE scores, the major
limitation associated with using standard scores for measuring growth
over time is that actual increases in performance as measured by raw
scores may be masked by the use of a score that is norm referenced
(Fletcher et al., 1991; Lindsey & Brouwers, 1999). Thus, if a child
maintains her standard score of 100 across three different data points,
this does not mean that the child's development has stagnated;
rather, her raw score is increasing at the same rate as her same-age
peers' raw scores. Therefore, from a norm-referenced perspective,
her position along the normative distribution has not changed. Her raw
score, however, has increased, indicating individual growth in the skill
being measured. Among children with disabilities, or who are otherwise
at-risk, it may be especially important to assess change via raw scores,
because these children may be developing at a slower rate than their
peers who do not have disabilities. In this case, progress will be
detected if raw scores are used but not if standard scores are used.
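A small numeric illustration (all values hypothetical) makes this point concrete: a child whose raw score grows at the same rate as the norm-group mean keeps a constant standard score, whereas a child who gains more slowly shows raw-score growth alongside a declining standard score.

    def deviation_score(raw, norm_mean, norm_sd):
        return 100 + 15 * (raw - norm_mean) / norm_sd

    norm_means, norm_sd = [20, 26, 32], 5          # hypothetical norms at three ages
    typical_child = [20, 26, 32]                   # grows at the same rate as peers
    slower_child = [20, 23, 26]                    # raw scores rise, but more slowly

    for raw_t, raw_s, m in zip(typical_child, slower_child, norm_means):
        print(deviation_score(raw_t, m, norm_sd), deviation_score(raw_s, m, norm_sd))
    # typical child: 100.0, 100.0, 100.0 (flat standard scores despite raw growth)
    # slower child:  100.0,  91.0,  82.0 (raw gains, falling norm-referenced position)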
CONTENT ANALYSIS
To provide some context for this study, it is informative to
explore how PPVT scores have been used in peer-reviewed published
journal articles. We conducted a content analysis using PsycInfo to
identify all peer-reviewed journal articles published from 2000 to 2010
(articles extracted February 15, 2010) that used any version of the PPVT
with child samples. This search revealed a total of 123 articles, with
most published in 2009 (n = 17), 2001 (n = 16), 2008 (n = 13), and 2003
(n = 13). Two researchers reviewed each of the 123 identified articles
independently to assess inter-rater agreement in terms of which type of
score was used in each study. The initial rate of agreement was 95.1%
(i.e., raters agreed on 117 out of 123 articles); a third researcher
reviewed the remaining six articles to resolve discrepancies among
raters. Results revealed that the majority of researchers used standard
scores (n = 67, 54.5%), raw scores (n = 16, 13.0%), or some combination
of multiple scores, such as standard and raw scores (n = 6, 4.9%). A
much smaller number of articles used age equivalents (n = 3, 2.4%) or
were classified as "other" (n = 2, 1.6%). One study included
in the "other" category used stanine scores and the other
combined scores from the PPVT and an expressive vocabulary test to
create a total vocabulary score. The remaining articles (n = 29, 23.6%)
did not indicate or define the type of scores employed in the study. Of
these 29 articles, the type of scores used could be deduced from the
researchers' statistical results in 14 studies, and standard scores
were used in all 14 of these studies. The type of score used could not
be determined for the remaining 15 studies. This is concerning, because
without knowing the type of score it is unclear how one should interpret
the data.
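The agreement rate and score-type percentages reported above follow directly from these counts, as the short sketch below shows (counts taken from the text; the computation is simple arithmetic).

    n_articles, n_agreed = 123, 117
    print(round(100 * n_agreed / n_articles, 1))       # 95.1% initial inter-rater agreement

    score_counts = {"standard": 67, "raw": 16, "combination": 6,
                    "age equivalent": 3, "other": 2, "not reported": 29}
    for score_type, n in score_counts.items():
        print(score_type, round(100 * n / n_articles, 1))  # e.g., standard: 54.5%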
Two researchers also reviewed each of the 123 identified articles
independently to assess interrater reliability in terms of whether the
authors of each study provided a rationale or justification for their
score selection in the description of their methods. The initial rate of
agreement was 95.1% (i.e., raters agreed on 117 out of 123 articles); a
third researcher reviewed the remaining six articles to resolve
discrepancies among raters. Of those authors who clearly reported the
type of score used in their analyses (n = 94, 76.4%), only 14 (14.9%)
provided a rationale or justification for their score selection in the
description of their methods. These justifications were most prevalent
for raw scores (n = 9, 64.3%), followed by standard scores (n = 5,
35.7%). To illustrate, sample rationales for raw scores include:
"In preparing the data for analysis, raw scores from the PPVT-R
were used instead of the customary Standard Score Equivalent because the
former have not had age factored out; thus, differences based on
straightforward ability may be more apparent" (Cunningham &
Graham, 2000, p. 40) and "Because age differences were the focus of
the current investigation, the children's raw scores were used in
all analyses" (Wolfe & Bell, 2007, p. 440). A sample
justification for standard scores includes: "To obtain a
standardized estimate of children's verbal intelligence, we
utilized standard scores rather than raw scores in all analyses"
(Lewis, Dozier, Ackerman, & Sepulveda-Kozakowski, 2007, p. 1419).
Several interesting trends can be seen in these results. First, the
standard score is the most commonly used score in research studies using
the PPVT. Second, almost one fourth of studies using the PPVT did not
clearly describe the score used in statistical analyses. Although the
type of score used could be deduced in about one half of these studies,
research consumers should not be left to make these judgments. Third,
approximately 85% of studies for which the score used was clearly
identifiable did not include a rationale for why that score was used.
This suggests that researchers may not be thinking about the
implications of using certain scores and how these decisions could
influence their conclusions. At the very least, researchers are not
clearly articulating these decisional processes to the consumers of
their results.
RESEARCH QUESTIONS
In light of the various types of scores being utilized in research
and practice, this study posed the following research questions: (1) Do
raw, AE, and standard scores produce comparable distributional
characteristics with young children? and (2) How does the type of score
used influence commonly employed statistical analyses when assessing
young children? These questions have important implications for
interpreting statistical results and may inform how scores are used and
reported in research with young children. Along with the content review,
these questions provide a nice illustration of the importance of
considering the type of scores utilized in research studies.
METHOD
Participants
Data were obtained from a larger intervention study that collected data
from four Head Start centers in a high-poverty neighborhood of a large
metropolitan city in south Texas. Nearly two thirds of the families
reported annual incomes below $20,000, and less than 5% of parents
reported having earned a degree from a 4-year institution. From the four
matched Head Start centers, 259 children (51.7% female) age 3 years to 5
years 8 months were sampled, with an average age of 4.11 years (SD =
.62). Participants were predominantly of Mexican American origin (95%),
and English was the preferred home language for 67% of families. The
larger study assessed the impact of an intervention designed to increase
healthy eating and physical activity (n = 122 in the intervention group)
relative to a control group (n = 137).
Instrument
The PPVT-III is a standardized measure of receptive vocabulary for
use with individuals from age 2 years 6 months to 90 years. Examinees are required to
point to one of four pictures that best represents the meaning of a
verbally presented stimulus word. In addition to its use within the
context of clinical assessment, the PPVT-III frequently has been used as
a measure of vocabulary and receptive language in research studies with
young children. Scores on the PPVT-III possess strong internal
consistency and test-retest reliability, with coefficients consistently
larger than .90 for the age group under study (Campbell, 1998; Dunn
& Dunn, 1997; Williams & Wang, 1997). To meet the goals of the
present study, we employed the commonly used standard scores (M = 100,
SD = 15), AE scores (expressed as the age in years and months for which
a particular raw score is the median score), and raw scores (total
number of correct responses) as measures of children's receptive
vocabulary. AE scores on the PPVT-III range from 1 year 9 months to 22
years, a range intended to capture the period in which receptive
vocabulary is most likely to increase at a relatively consistent rate
(Williams & Wang, 1997).
Procedure
Scores on the PPVT-III were initially collected as part of an
evaluation of a 12-week psychoeducational intervention promoting young
children's physical health, early academic skills, and school
readiness (the Healthy & Ready to Learn program), in which the
PPVT-III was used as a measure of receptive vocabulary. As part of the
intervention study, data were collected at pre- and posttreatment (12
weeks apart) by researchers trained on the PPVT-III standardized
administration and scoring procedures. Researchers were required to
practice administration and scoring to ensure their competency prior to
data collection.
RESULTS
Prior to statistical analyses, it is critical to assess score accuracy
and distributional characteristics. As indicated by the histograms (see
Figures 1 and 2), the raw, standard, and AE distributions differed
considerably for the pretest and posttest data. These results
demonstrate the noticeably larger skew of the AE scores due to floor
effects at the lowest possible AE score on the PPVT-III, 1.75 (i.e., 1
year 9 months). Notice that these floor effects are unique to AE scores,
as participants with a wide variety of raw and standard scores received
the same AE score. Unfortunately, these data suggest that despite
having, at times, vastly different raw (M = 13.91, SD = 4.93, minimum
[min.] = 3, maximum [max.] = 23) and standard (M = 65.58, SD = 8.88,
min. = 46, max. = 82) scores, these participants (n = 91) all had the
same AE score of 1.75 at pretest. The same trend was revealed at
posttest for the participants (n = 43) with floor-level AE scores, who
again varied on the raw (M = 17.16, SD = 4.35, min. = 9, max. = 23) and
standard (M = 65.47, SD = 8.41, min. = 45, max. = 81) scores.
Collectively, these results suggest that although raw and standard
scores display considerable variability, the AE scores remain constant
at 1.75; this presents a potential problem for AE scores from a
distributional and accuracy perspective, especially at the lower age
ranges.
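For readers who wish to run comparable checks on their own data, the sketch below computes skewness for each score type and summarizes the raw and standard scores of children at the AE floor. The data file and column names are hypothetical, not the variable names from this study.

    import pandas as pd
    from scipy.stats import skew

    df = pd.read_csv("ppvt_pretest.csv")       # hypothetical file with one row per child

    AE_FLOOR = 1.75                            # lowest possible PPVT-III AE score
    for col in ["raw", "standard", "ae"]:      # assumed column names
        print(col, "skew =", round(skew(df[col].dropna()), 2))

    floor_cases = df[df["ae"] == AE_FLOOR]     # children receiving the floor AE score
    print("n at AE floor:", len(floor_cases))
    print(floor_cases[["raw", "standard"]].describe())  # spread of scores at the AE floor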
[FIGURE 1 OMITTED]
Not surprisingly, these data characteristics help explain the
differences in correlations between the AE scores and the other scores
across age groups at pretest, estimated using linear and quadratic
models (see Table 1). As expected,
the younger sample produced the smallest correlations between these
scores due to the floor effect, thus suggesting these scores are not
completely interchangeable when used with young children. These results
were replicated at posttest, but to a lesser degree due to the smaller
number of floor effect cases. Figure 3 shows the correlation between AE
and standard scores at pretest and posttest, and illustrates the floor
effects observed with the AE scores.
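The Table 1 correlations could be reproduced with an approach like the sketch below, which computes the Pearson correlation for the linear association and the multiple correlation from a quadratic polynomial fit within each age group. The data file, column names, and choice of predictor are assumptions for illustration.

    import numpy as np
    import pandas as pd

    def quadratic_r(x, y):
        """Multiple correlation from regressing y on x and x squared."""
        y_hat = np.polyval(np.polyfit(x, y, deg=2), x)
        return np.corrcoef(y, y_hat)[0, 1]

    df = pd.read_csv("ppvt_pretest.csv")                  # hypothetical file
    for age, grp in df.groupby("age_years"):              # assumed column names
        linear_r = grp["ae"].corr(grp["standard"])        # linear (Pearson) correlation
        quad_r = quadratic_r(grp["ae"].to_numpy(), grp["standard"].to_numpy())
        print(age, round(linear_r, 3), round(quad_r, 3))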
[FIGURE 2 OMITTED]
To better understand how the distributional characteristics of
scores influence statistical analyses conducted with these scores, data
were analyzed using three multilevel models, (1) with children's
growth (i.e., time) at Level 1 and the child-level variables (i.e.,
treatment) at Level 2. These models were used to examine
participants' growth on the PPVT-III scores from pre- to posttest
and to assess whether the type of PPVT-III scores utilized moderated the
perceived treatment effect. For each outcome, two models were fit
sequentially to explore the impact of the treatment variable. Model 1,
an unconditional linear growth model, was fit to examine the amount of
dependency in the outcome variables and to establish baseline statistics
for participants' starting points (i.e., intercepts, or initial status
at pretest) and their growth between the two time points (i.e., slopes).
Model 2 evaluated whether treatment status significantly predicted
student growth on each outcome variable.
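A sketch of this two-model sequence, fit with statsmodels on a hypothetical long-format dataset (columns "child_id", "time" coded 0/1, "treatment" coded 0/1, and "score"), is shown below; it illustrates the general approach rather than the authors' exact specification or software.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("ppvt_long.csv")   # hypothetical long-format file (two rows per child)

    # Model 1: unconditional linear growth (random intercept per child, fixed effect of time)
    m1 = smf.mixedlm("score ~ time", data=df, groups=df["child_id"]).fit()

    # Model 2: adds treatment and the time-by-treatment interaction (difference in growth)
    m2 = smf.mixedlm("score ~ time * treatment", data=df, groups=df["child_id"]).fit()

    print(m1.summary())
    print("time x treatment (beta11):", m2.params["time:treatment"])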
Results from each PPVT variable are presented in Table 2. The average
intercept (β00) and slope (β10) across participants were of less
interest, as they estimate the average initial status (or pretest score)
across groups and the overall average amount of change from pretest to
posttest, respectively. However, it is worth noting that no group
differences emerged at pretest (β01). The parameter estimates of primary
interest (β11) tested whether the treatment group experienced
significantly more growth (or change) than the control group from
pretest to posttest. These parameter estimates are also accompanied by
an effect size computed using the equations provided by Feingold (2009).
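One commonly used form of Feingold's (2009) effect size for growth models divides the model-implied group difference in change by the raw-metric standard deviation at baseline; the sketch below assumes time is coded 0/1 (so the time-by-treatment coefficient is the pre-post difference in change) and uses hypothetical values.

    def feingold_es(beta11, duration, sd_baseline):
        """Model-based group difference in change over `duration`, in baseline-SD units."""
        return (beta11 * duration) / sd_baseline

    print(round(feingold_es(beta11=4.0, duration=1.0, sd_baseline=10.0), 2))  # 0.4 (hypothetical)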
[FIGURE 3 OMITTED]
The results revealed relatively consistent findings across the three
PPVT-III variables from a statistical significance standpoint (see β11
in Table 2) for our sample, although the parameter estimates differ
considerably due to differences in measurement scale. Collectively,
these results indicated no differences at pretest, and growth rates were
relatively consistent across the three PPVT-III variables. This was
further supported by the relatively consistent effect sizes (see
ES_β11), which contradicts the notion that raw scores are more
appropriate for measuring change because they are not adjusted for age.
It was interesting that growth rates were not more biased for AE
scores, given the large number of participants scoring at the floor at
pretest. However, these results cannot be generalized to all
data-sets, nor can it be inferred that the score used does not influence
the result, given the differences in how these scores are derived. In
fact, it might be advantageous for researchers to test the sensitivity
of the type of score utilized, perhaps by analyzing data with different
scores to assess the degree to which results are influenced by choice of
score. At the very least, researchers should justify the type of score
employed and consider the implications of using that type of score.
DISCUSSION
Perhaps the most significant finding from the data analysis is the
notable floor effect with AE scores, which was not observed in either
the raw or standard score distributions (see Figures 1 and 2). This
pattern is significant because it suggests that AEs lack the precision
of raw and standard scores in that many children with different raw and
standard scores obtained the exact same AE score. Thus, using AEs may
mask true differences in ability level that are seen with raw and
standard scores, which makes it difficult to make distinctions among
children at the low range of ability. This finding holds important
implications for both practitioners and researchers and is especially
salient within the context of assessing young children. For many
constructs (e.g., reading, language skills), floor effects are most
likely to occur at the younger end of the age spectrum due to less
variability in scores at young ages (Bracken, 1988; Catts, Petscher,
Schatschneider, Bridges, & Mendoza, 2009) and among children at
lower ability levels, such as those with developmental disabilities
(e.g., Dickson, Wang, Lombard, & Dube, 2006). To illustrate, Dickson
et al. (2006) used PPVT-III AE scores to assess the receptive language
skills of children, adolescents, and young adults with developmental
disabilities. These scores demonstrated a marked floor effect, in which
almost half of the sample obtained the lowest possible AE score, either
by earning the lowest possible score or by failing to establish a basal
score. Similar floor effects were observed with adolescents with Down
syndrome on the Stanford-Binet, Fourth Edition (SB-IV) (Couzens,
Cuskelly, & Jobling, 2004). Thus, our findings may generalize to
other tests of ability and academic achievement and other samples of
young children and children and adolescents at the lower end of the
score distribution.
Although not well demonstrated with our example, the AE floor
effect is especially relevant within the context of assessing growth
over time, because growth in raw or standard scores may not be large
enough to be seen with AE scores. In other words, the group of children
scoring at the floor at pretest may again score at the floor upon
posttest even though their raw scores and standard scores have increased
from pretest to posttest, because the AE scale is not sensitive enough
to detect this change. The use of AEs may mask true pre-post changes in
ability levels, which also violates the assumption of normality in most
cases.
Also of note is the impact of floor effects on any statistical
analyses or comparisons based on the distribution of AE scores. For
example, predictive validity analyses between test scores and some later
outcome may be compromised by floor effects and skewed distributions,
thus reducing correlation coefficients due to a lack of differentiation
among children who score low on the test (see Catts et al., 2009).
Similar difficulties will likely be observed with other correlational
analyses, such as test-retest reliability and convergent validity
analyses (e.g., correlations with scores on other measures of cognitive
functioning and achievement), due to limited variability of the AE score
distribution.
With that said, a surprising finding was that growth rates were not
more underestimated for AE scores, despite the large number of
participants scoring at the floor. This finding may be due to unique
characteristics of our study, such as the sample and treatment used,
length of the interval between pre- and posttest, and, perhaps most
importantly, relatively small overall treatment effects. Thus, we cannot
assume that this finding will generalize to other studies, as AE scores
may have been more biased under different conditions. For example, it is
feasible that larger treatment effects may be more attenuated by floor
effects, or that samples with higher functioning participants may be
less influenced by floor effects, or that floor effects would have more
influence when comparing children with a diagnosis to children without a
diagnosis. Thus, the impact on estimated effect sizes may be either
increased or reduced based on these factors. More research will be
necessary to examine the influences of these factors on the utility of
AE scores for measuring growth.
Aside from the notable floor effect, results also suggest that raw
scores, AEs, and standard scores should not be used interchangeably or
interpreted as alternative expressions of one another, because they
clearly provide different information by measuring children's
performance in different ways. Indeed, one score may indicate
performance slightly below average while another score suggests serious
weaknesses. This is consistent with other studies that found low and/or
widely variant correlations between AEs/GEs and standard scores (e.g.,
Hishinuma & Tadaki, 1997; Plante, 1998). For example, Hishinuma and
Tadaki (1997) demonstrated that among some of the subtests on the
Wechsler Individual Achievement Test (WIAT), students in lower grade
levels could obtain a standard score only slightly below the mean of
100, but a GE score significantly lower than their actual grade
placement. Similarly, Couzens et al. (2004) demonstrated that among a
sample of children with Down syndrome, AE scores on subtests of the
SB-IV increased over time, while children's standard scores (i.e.,
IQ scores) on the same measure decreased over time. Thus, interpreting
results based on AE scores would suggest that these children were making
progress over time, but interpreting results based on standard scores
would suggest progressively wider discrepancies between these children
and their same-age peers. Similarly, Gabriels, Ivers, Hill, Agnew, and
McNeill (2007) found different patterns of change in adaptive behaviors
(assessed with the Vineland Adaptive Behavior Scales) over time among a
sample of children with autism spectrum disorders, depending on whether
raw or standard scores were used.
Children in the high and low cognitive ability groups showed
significant decreases in standard scores over time, but when raw scores
were used, children in the high cognitive ability group showed an
increase in adaptive behaviors over time while children in the low
cognitive ability group stayed the same. Thus, the conclusions reached
depended on which scores were used in the analyses.
Which, then, is the best score to use to measure growth among young
children? The susceptibility of AE scores to floor effects (in addition
to the other limitations of these scores discussed previously) makes
them the least useful. The choice between raw and standard scores
depends on what we want to know. If we want to assess criterion-referenced
change, raw scores are most appropriate. The use of raw scores is
especially important when assessing change among groups of young
children who may be at risk for atypical levels or rates of development,
including English language learners, children with disabilities, and
children of low socioeconomic status (Vagh, Pan, &
Mancilla-Martinez, 2009), because true change in performance may be
masked when using standard scores that compare these children to their
typically developing peers. On the other hand, if we want to assess
growth from a norm-referenced perspective, then standard scores are most
appropriate. Hammer, Lawrence, and Miccio (2008) advocated the use of
raw and standard scores when assessing growth among young children, as
each score informs us in different ways: change in individual
children's knowledge or abilities (raw scores) and change in
knowledge or abilities in comparison to other children's knowledge
or abilities (standard scores).
In light of (1) the results of our content analysis suggesting many
researchers do not clearly describe which test scores they use in data
analyses, (2) our statistical analyses showing how floor effects may
attenuate correlation coefficients, and (3) previous studies
illustrating that conclusions are strongly influenced by which scores
researchers choose to use, we urge researchers to report which scores
they use in their analyses and to provide a sound rationale for this
decision. This rationale should be based on factors such as whether
criterion-referenced or norm-referenced interpretation is more
appropriate given the research question, the nature of the sample (e.g.,
age distribution, clinical or at-risk sample vs. "normal"
sample), and the distribution of scores (e.g., presence of floor
effects). Recall that our content analysis of studies using PPVT scores
revealed that all of the studies that did not clearly indicate which
scores were used, and for which we were able to deduce which scores were
used, employed standard scores. This suggests that standard scores may
be the presumptive "default" score. But researchers must
consider the type of score when evaluating growth and how that decision
influences the interpretation of their data.
DOI: 10.1080/02568543.2014.883453
ACKNOWLEDGMENT
A preliminary version of this article was presented at the annual
meeting of the National Association of School Psychologists, February
2011, San Francisco, California.
REFERENCES
Angoff, W. H. (1984). Scales, norms, and equivalent scores.
Princeton, NJ: Educational Testing Service.
Bishop, D. V. M. (1997). Uncommon understanding: Development and
disorders of language comprehension in children. Hove, UK: Psychology
Press.
Bracken, B. A. (1988). Ten psychometric reasons why similar tests
produce dissimilar results. Journal of School Psychology, 26, 155-166.
doi: 10.1016/0022-4405(88)90017-9
Campbell, J. (1998). Test review: Peabody Picture Vocabulary
Test-Third Edition. Journal of Psychoeducational Assessment, 16,
334-338. doi:10.1177/073428299801600405
Catts, H. W., Petscher, Y., Schatschneider, C., Bridges, M. S.,
& Mendoza, K. (2009). Floor effects associated with universal
screening and their impact on the early identification of reading
disabilities. Journal of Learning Disabilities, 42, 163-176. doi:
10.1177/0022219408326219
Chadwick, O., Cuddy, M., Kusel, Y., & Taylor, E. (2005).
Handicaps and the development of skills between childhood and early
adolescence in young people with severe intellectual disabilities.
Journal of Intellectual Disability Research, 49, 877-888. doi:
10.1111/j.1365-2788.2005.00716.x
Couzens, D., Cuskelly, M., & Jobling, A. (2004). The Stanford
Binet Fourth Edition and its use with individuals with Down syndrome:
Cautions for clinicians. International Journal of Disability,
Development and Education, 51, 39-56. doi: 10.1080/1034912042000182193
Cunningham, T. H., & Graham, C. R. (2000). Increasing native
English vocabulary recognition through Spanish immersion: Cognate
transfer from foreign to first language. Journal of Educational
Psychology, 92, 37-49. doi: 10.1037/0022-0663.92.1.37
Dickson, C. A., Wang, S.S., Lombard, K. M., & Dube, W. V.
(2006). Overselective stimulus control in residential school students
with intellectual disabilities. Research in Developmental Disabilities,
27, 618-631.
Dunn, L. M., & Dunn, L. M. (1997). Peabody Picture Vocabulary
Test-Third Edition. Circle Pines, MN: American Guidance Service.
Feingold, A. (2009). Effect sizes for growth-modeling analysis for
controlled clinical trials in the same metric as for classical analysis.
Psychological Methods, 14, 43-53. doi:10.1037/a0014699
Fletcher, J. M., Francis, D. J., Pequegnat, W., Raudenbush, S. W.,
Bornstein, M. H., Schmitt, F., . . . Stover, E. (1991). Neurobehavioral
outcomes in diseases of childhood: Individual change models for
pediatric human immunodeficiency viruses. American Psychologist, 46,
1267-1277. doi:10.1037/0003-066X.46.12.1267
Gabriels, R. L., Ivers, B. J., Hill, D. E., Agnew, J. A., &
McNeill, J. (2007). Stability of adaptive behaviors in middle-school
children with autism spectrum disorders. Research in Autism Spectrum
Disorders, 1, 291-303.
Hammer, C. S., Lawrence, F. R., & Miccio, A. W. (2008).
Exposure to English before and after entry into Head Start: Bilingual
children's receptive language growth in Spanish and English.
International Journal of Bilingual Education and Bilingualism, 11,
30-56.
Hishinuma, E. S., & Tadaki, S. (1997). The problem with grade
and age equivalents: WIAT as a case in point. Journal of
Psychoeducational Assessment, 15, 214-225.
doi:10.1177/073428299701500303
Lawrence, C. W. (1992). Assessing the use of age-equivalent scores
in clinical management. Language, Speech, and Hearing Services in
Schools, 23, 6-8.
Lewis, E. E., Dozier, M., Ackerman, J., & Sepulveda-Kozakowski,
S. (2007). The effect of placement instability on adopted
children's inhibitory control abilities and oppositional behavior.
Developmental Psychology, 43, 1415-1427. doi:
10.1037/0012-1649.43.6.1415
Lindsey, J. C., & Brouwers, P. (1999). Intrapolation and
extrapolation of age-equivalent scores for the Bayley II: A comparison
of two methods of estimation. Clinical Neuropharmacology, 22, 44-53.
McCauley, R. J., & Demetras, M. J. (1990). The identification
of language impairment in the selection of specifically
language-impaired subjects. Journal of Speech & Hearing Disorders,
55, 468-475.
McCauley, R. J., & Swisher, L. (1984a). Psychometric review of
language and articulation tests for preschool children. Journal of
Speech and Hearing Disorders, 49, 34-42.
McCauley, R. J., & Swisher, L. (1984b). Use and misuse of
norm-referenced tests in clinical assessment: A hypothetical case.
Journal of Speech and Hearing Disorders, 49, 338-348.
McConnell, S. R., Priest, J. S., Davis, S. D., & McEvoy, M. A.
(2002). Best practices in measuring growth and development for preschool
children. In A. Thomas & J. Grimes (Eds.), Best practices in school
psychology-IV (pp. 1231-1246). Bethesda, MD: National Association of
School Psychologists.
Payne, D. A. (1997). Applied educational assessment. Belmont, CA:
Wadsworth.
Pearson Assessments. (2010). Interpretation problems of age and
grade equivalents. Retrieved from http://www.pearsonassessments.com/pai/ca/RelatedInfo/InterpretationAgeGradeEquivalents.htm
Plante, E. (1998). Criteria for SLI: The Stark and Tallal legacy
and beyond. Journal of Speech, Language & Hearing Research, 41,
951-957.
Reynolds, C. R. (1981). The fallacy of "two years below grade
level for age" as a diagnostic criterion for reading disorders.
Journal of School Psychology, 19, 350-358.
Riccio, C. A., Sullivan, J. R., & Cohen, M. J. (2010).
Neuropsychological assessment and intervention for childhood and
adolescent disorders. Hoboken, NJ: Wiley.
Schulz, E. M., & Nicewander, W. A. (1997). Grade equivalent and
IRT representations of growth. Journal of Educational Measurement, 34,
315-331.
Spector, J. E. (1999). Precision of age norms in tests used to
assess preschool children. Psychology in the Schools, 36, 459-471.
doi:10.1002/(SICI)1520-6807
Thorndike, R. M. (2005). Measurement and evaluation in psychology
and education (7th ed.). Upper Saddle River, NJ: Pearson.
Urbina, S. (2004). Essentials of psychological testing. Hoboken,
NJ: Wiley.
Vagh, S. B., Pan, B. A., & Mancilla-Martinez, J. (2009).
Measuring growth in bilingual and monolingual children's English
productive vocabulary development: The utility of combining parent and
teacher report. Child Development, 80, 1545-1563.
Williams, K. T., & Wang, J. (1997). Technical references to the
Peabody Picture Vocabulary Test-Third Edition (PPVT-III). Circle Pines,
MN: American Guidance Service.
Wolfe, C. D., & Bell, M. A. (2007). Sources of variability in
working memory in early childhood: A consideration of age, temperament,
language, and brain electrical activity. Cognitive Development, 22,
431-455. doi:10.1016/j.cogdev.2007.08.007
Jeremy R. Sullivan, Suzanne M. Winter, Daniel A. Sass, and Nicole
Svenkerud
University of Texas at San Antonio, San Antonio, Texas
Submitted June 8, 2012; accepted July 24, 2012.
Address correspondence to Jeremy R. Sullivan, Department of
Educational Psychology, University of Texas at San Antonio, 501 West
Cesar E. Chavez Boulevard, San Antonio, TX 78207-4415. E-mail:
Jeremy.sullivan@utsa.edu
NOTE
(1.) Data were also analyzed using a 2 x 2 (Time x Treatment) ANOVA
and an analysis of covariance with pretest as the covariate. Not
surprisingly, given the ICCs, the results were nearly identical to those
of the multilevel models.
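A sketch of these supplementary analyses on a hypothetical wide-format dataset (columns "pre", "post", and "treatment" coded 0/1) appears below; it is illustrative only and does not reproduce the authors' analysis code.

    import pandas as pd
    import statsmodels.formula.api as smf

    wide = pd.read_csv("ppvt_wide.csv")                 # hypothetical file

    # ANCOVA: posttest regressed on treatment, controlling for pretest
    ancova = smf.ols("post ~ treatment + pre", data=wide).fit()
    print(ancova.params["treatment"], ancova.pvalues["treatment"])

    # The Time x Treatment interaction in a 2 x 2 mixed ANOVA is equivalent to
    # testing the group difference in gain scores (post - pre)
    wide["gain"] = wide["post"] - wide["pre"]
    gain_model = smf.ols("gain ~ treatment", data=wide).fit()
    print(gain_model.params["treatment"], gain_model.pvalues["treatment"])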
TABLE 1
Linear and Quadratic Correlations Between Peabody Picture Vocabulary
Test-Third Edition Scores at Pretest for Each Age Level

                                          Linear            Quadratic
                                        AE      Raw        AE      Raw
3-year-olds (n_T = 108, n_FE = 60, 56%)
  Raw                                  .901     --        .963     --
  Standard                             .808    .955       .829    .968
4-year-olds (n_T = 127, n_FE = 29, 23%)
  Raw                                  .981     --        .989     --
  Standard                             .916    .958       .925    .964
5-year-olds (n_T = 24, n_FE = 2, 8%)
  Raw                                  .996     --        .996     --
  Standard                             .981    .988       .983    .990

Note. n_T and n_FE represent the total sample size for each analysis and
the number of participants with floor-effect scores, respectively; the
percentage of participants in each age group with floor effects is also
reported.
TABLE 2
Parameter Estimates and Effect Sizes for Each of the Three MLM Analyses

                          β00        β10       β01      β11      ES_β11   ICC
PPVT (Raw score)         30.90 **   9.58 **   -0.23    3.98 **    0.24    0.09
PPVT (Standard score)    79.73 **   3.38 **   -0.33    2.97 *     0.22    0.00
PPVT (Age equivalent)     2.69 **   0.58 **   -0.06    0.31 **    0.28    0.03

Note. ES = effect size; ICC = intraclass correlation; MLM = multilevel
model; PPVT = Peabody Picture Vocabulary Test-Third Edition. β00, β10,
β01, and β11 are the estimated intercept, time effect, treatment effect,
and time-by-treatment effect, respectively. The time-by-treatment effect
(β11) is of primary interest, as it tests whether the treatment and
control conditions differ in growth over time. ES_β11 represents the
overall effect size associated with β11. ICC measures the percent of
explainable variability in growth rates due to the treatment effect.

* p statistically significant at .05.
** p statistically significant at .0125 (.05/4).