Heteroscedasticity and grouped data regression.
Jackson, John D.
I. Introduction
Of the numerous problems arising in attempts to empirically model
an economic phenomenon using micro level survey data, perhaps the least
attention has been given to those attributable to a "grouped
dependent variable." In this case, the dependent variable is
categorical in nature, having known category boundaries, and of interval
strength. For many surveys, data on individual incomes, home values,
length of time in residence, etc. fall into this class of measures, and
attempts to econometrically estimate models explaining their behavior
encounter more difficulties than is commonly realized. When confronted
with grouped data on the dependent variable, analysts typically
assign each observation in a particular category the midpoint value of
that category, perhaps after the end points of that category have been
transformed (e.g., to conform to a Pareto, lognormal, or some other
distribution), and then estimate the parameters of the model by ordinary
least squares (OLS) regression. Alternatively, the analyst might attempt
to account for the categorical nature of the dependent variable by
employing a maximum likelihood estimation technique such as probit or
logit. Unfortunately, both of these approaches result in unsatisfactory
parameter estimates. OLS on the category midpoints produces inconsistent
estimates.(1) Even the qualitative dependent variable maximum likelihood
techniques such as n-chotomous probit produce inefficient estimates,
since they ignore the information provided by the known values of the
category boundaries.
About a decade ago, Mark Stewart [13] developed a maximum
likelihood model that would allow consistent and asymptotically
efficient parameter estimation when the dependent variable is grouped.
His procedure has come to be known as "grouped data
regression." Mute testimony as to the lack of attention paid to
this important work is given by the dominance, until very recently, of
the two inappropriate techniques (above) in the relevant empirical
literature.(2)
Due to the recent rapid rise in popularity of the grouped data
regression model, however, some of its limitations should be examined.
One limitation that has not been considered is the fact that the model
is strictly applicable only if the theoretical disturbance term of the
underlying model is homoscedastic. This drawback is particularly
disturbing since Yatchew and Griliches [14] have shown that, for maximum
likelihood models having likelihood functions very similar to that of
grouped data regression, heteroscedasticity leads not only to inefficient
estimates (as in OLS) but also results in biased and inconsistent
estimates. The object of this paper is to develop a method by which the
grouped data regression model can be extended to the heteroscedastic
case.
In what follows, we begin by outlining Stewart's homoscedastic
model as a useful point of departure. We then suggest a method by which
multiplicative heteroscedasticity can be incorporated into
Stewart's estimation structure. Finally, we illustrate our
suggested procedures with an application to housing demand. We conclude
with a brief summary.
II. Grouped Data Regression
Suppose we posit the following behavioral model

$y^* = X\beta + \epsilon \qquad (1)$

where $y^*$ is an $(n \times 1)$ vector of implicit observations on the
dependent variable, $X$ is an $(n \times k)$ matrix of observations on the
$k$ independent variables in the model, $\beta$ is a $(k \times 1)$ vector
of unknown coefficients to be estimated, and $\epsilon$ is an
$(n \times 1)$ vector of stochastic disturbances, each element
$\epsilon_i$ of which is assumed i.i.d. $N(0, \sigma^2)$. We say that
$y^*$ is a vector of "implicit observations" because in this conceptual
framework, $y^*$ is not directly observable. If it were observable, then
each (cardinally measurable) $y_i^*$ would be independently normally
distributed with mean $x_i\beta$ and constant variance $\sigma^2$, as
implied by our assumptions on $\epsilon$.
Rather, all we are able to observe is $y_i$, the category (with known end
point values) within which $y_i^*$ falls. More precisely, if the real
number line were partitioned into $J$ mutually exclusive and exhaustive
categories with boundaries $A_j$ $(j = 0, \ldots, J)$, then we observe
$y_i = j$ if

$A_{j-1} < y_i^* < A_j. \qquad (2)$
It is important to emphasize that the observed $y_i$ are only of ordinal
strength, but that the category boundaries $\{A_j\}$ are known cardinal
numbers. Our problem within this framework is to obtain consistent and
asymptotically efficient estimates of the unknown parameters, $\beta$ and
$\sigma^2$, of the model. One approach to obtaining such estimates is the
method of maximum likelihood.
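The observation mechanism in (2) is easy to mimic with simulated data. The following sketch is our own illustration, not part of the paper; all variable names and numerical values are assumptions chosen for concreteness. It draws a latent $y^*$ and records only the category into which each observation falls:

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent model y* = X beta + eps, eps ~ N(0, sigma^2); the analyst
# never sees y*, only the category j with A_{j-1} < y* < A_j.
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, 1.5])
sigma = 1.0
y_star = X @ beta + rng.normal(scale=sigma, size=n)  # unobserved

# Known cardinal boundaries: A_0 = -inf, interior cut points, A_J = +inf
interior = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_obs = np.digitize(y_star, interior)  # observed category index, 0..5
```

Note that `y_obs` is only ordinal, while the cut points themselves are known cardinal numbers, which is exactly the information structure the grouped data regression model exploits.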
Based on the assumptions above, the probability that $y_i = j$, i.e., the
probability that $y_i^*$ falls in the $j$th category, is given by

$P(y_i = j) = P(A_{j-1} < y_i^* < A_j)$
$\qquad = P\{(A_{j-1} - x_i\beta)/\sigma < (y_i^* - x_i\beta)/\sigma < (A_j - x_i\beta)/\sigma\}$
$\qquad = F[(A_j - x_i\beta)/\sigma] - F[(A_{j-1} - x_i\beta)/\sigma] \qquad (3)$

where $F(\cdot)$ is the standard normal cumulative distribution function
evaluated at $(\cdot)$. For an independent random sample of $n$
observations, the likelihood function is the product of these
probabilities taken across the $J$ categories and over the $n$
observations, i.e.,
$L = \prod_{i=1}^{n} \prod_{j=1}^{J} \{F[(A_j - x_i\beta)/\sigma] - F[(A_{j-1} - x_i\beta)/\sigma]\}^{\delta_{ij}} \qquad (4)$

where $\delta_{ij} = 1$ if the $i$th observation falls in the $j$th
category; $\delta_{ij} = 0$, otherwise. Therefore, the log likelihood
function is

$\ln L = \sum_{i=1}^{n} \sum_{j=1}^{J} \delta_{ij} \ln\{F[(A_j - x_i\beta)/\sigma] - F[(A_{j-1} - x_i\beta)/\sigma]\}. \qquad (5)$
Partially differentiating equation (5) with respect to the unknown
parameters $(\beta, \sigma)$ and setting the derivatives equal to zero
yields $k + 1$ nonlinear equations, which can be solved by iterative
techniques (e.g., Davidon-Fletcher-Powell) to find consistent and
asymptotically efficient estimates of the $\beta_m$ $(m = 1, \ldots, k)$
and $\sigma$. Asymptotic standard errors of these estimates can be read
from the diagonal of the negative inverse of the Hessian matrix of
equation (5).(3)
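As a concrete sketch of this estimator, the log likelihood in (5) can be maximized numerically. The code below is our own illustration (function names, data, and tuning choices are assumptions, not the paper's); it uses scipy's BFGS routine rather than Davidon-Fletcher-Powell, and reparameterizes $\sigma$ as $\exp(\cdot)$ to keep it positive:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def grouped_nll(params, X, cat, bounds):
    """Negative log likelihood of equation (5). cat[i] is the observed
    category index; bounds holds the known boundaries A_0 < ... < A_J
    (the outer boundaries may be +/- infinity)."""
    k = X.shape[1]
    beta, sigma = params[:k], np.exp(params[k])  # sigma = exp(.) > 0
    mu = X @ beta
    p = (norm.cdf((bounds[cat + 1] - mu) / sigma)
         - norm.cdf((bounds[cat] - mu) / sigma))
    return -np.sum(np.log(np.clip(p, 1e-300, None)))

# Illustration on simulated grouped data (true beta = (1.0, 0.5), sigma = 1).
rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y_star = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
bounds = np.array([-np.inf, -1.0, 0.0, 1.0, 2.0, np.inf])
cat = np.digitize(y_star, bounds[1:-1])

res = minimize(grouped_nll, x0=np.zeros(X.shape[1] + 1),
               args=(X, cat, bounds), method="BFGS")
beta_hat, sigma_hat = res.x[:2], np.exp(res.x[2])
```

Because the boundaries enter in levels, the non-normalized $\beta$ and $\sigma$ are recovered directly, in contrast to the probit model discussed next.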
Readers familiar with the n-chotomous probit model of McKelvey and
Zavoina [11] will note that equations (4) and (5) are exactly the
likelihood and log likelihood functions, respectively, of their probit
model. This probably accounts for the recent tendency to use probit
analysis when the dependent variable is grouped. Nevertheless, the two
models are distinctly different. In the probit model, no information is
available on the scale of the underlying dependent variable.
Consequently, the variance $\sigma^2$ is not estimable; all that can be
estimated are the normalized coefficients $(\beta/\sigma)$ and the
unknown normalized category boundaries $(A_j/\sigma)$. In the grouped
data regression model, on the other hand, the known values of the
category boundaries provide information on the scale of $y^*$. This
permits the direct estimation of the non-normalized coefficients and
also allows the estimation of $\sigma^2$ [5, 739]. This distinction
should make it clear why applying probit to a grouped data regression
problem will produce inefficient estimates: the probit model ignores
information on the category boundaries.
III. Heteroscedasticity and Grouped Data Regression
Important statistical problems arise for grouped data regression if
the disturbance vector of equation (1) does not satisfy its null mean,
scalar covariance matrix assumptions. The problems arising from a
non-null mean vector are not exclusive to grouped data regression; it is
well known that, even in OLS regression, omitted variables, errors in
measurement, etc. lead to inconsistent estimates. But in OLS,
non-spherical disturbances produce unbiased and consistent estimates;
the only problem is efficiency. In grouped data regression,
non-spherical disturbances - at least in the form of heteroscedasticity
- will also produce inconsistent estimates.
Evidence on this result comes from two sources. Based on the
similarity of the likelihood function of equation (4) to that of n
-chotomous probit, one could adopt the Yatchew and Griliches technique
of taking a Taylor series expansion of the "plimmed" first
order conditions to compute the magnitude of the (non-vanishing)
asymptotic bias in the coefficient estimates. Alternatively, one could
note [5, 738] that the grouped data regression model is a limiting case
of a doubly censored tobit model (where all observations are censored).
Even for the singly censored case, Hurd [7] and others have shown that
the resulting tobit estimates are inconsistent in the presence of
heteroscedasticity. In our opinion, the most straightforward way to see
the inconsistency of traditional grouped data regression estimators in
the case of heteroscedastic disturbances is that maximizing equation (4)
when the variance is not constant amounts simply to maximizing the wrong
likelihood function.
The appropriate likelihood function when the disturbance variance varies
from one observation to the next is

$L = \prod_{i=1}^{n} \prod_{j=1}^{J} \{F[(A_j - x_i\beta)/\sigma_i] - F[(A_{j-1} - x_i\beta)/\sigma_i]\}^{\delta_{ij}} \qquad (6)$

where $\sigma_i \neq \sigma$ (a constant) for all $i$. Clearly, equation
(6) is not the same as equation (4). However, if the heteroscedasticity
can be characterized by the rather general relationship known as
"multiplicative heteroscedasticity," then a procedure is available to
reduce equation (6) to the equivalent of equation (4), where all of the
results pertaining to (4) also apply.
The assumption of multiplicative heteroscedasticity amounts to assuming

$\sigma_i^2 = \sigma^2 \exp(w_i\alpha) \qquad (7)$

where $\sigma^2$ is now just a constant of proportionality (not the true
variance), $\alpha$ is an $(L \times 1)$ vector of unknown coefficients,
and $w_i$ is the $i$th row of a weighting matrix $W$ ($n \times L$, where
$W$ may be a subset of $X$). This form is fairly general: if
$\alpha = 0$, then (7) reduces to the homoscedastic case; if
$\alpha_l = 0$ $(l = 1, \ldots, L;\ l \neq q)$, $\alpha_q = 1$, and
$w_q = \ln x_q^2$, then the implicit form is the well-known assumption of
Aitken from generalized least squares theory (i.e.,
$\sigma_i^2 = \sigma^2 x_{qi}^2$). It is probably worth noting that
whether or not $\sigma^2$ is explicitly included in equation (7) is
purely a matter of notation, and not substance. In other words, if we
assume

$\sigma_i^2 = \exp(w_i\alpha)$

and we specify $W$ such that its first column is a vector of ones, then
$\sigma^2$ is estimated by $\exp(\alpha_1)$. Implicitly, then, the
version of (7) above assumes that $W\alpha$ contains no constant term.
Substituting equation (7) into (6), the likelihood function can be
written as

$L = \prod_{i=1}^{n} \prod_{j=1}^{J} \{F[(A_j - x_i\beta)/(\sigma \exp(w_i\alpha)^{1/2})] - F[(A_{j-1} - x_i\beta)/(\sigma \exp(w_i\alpha)^{1/2})]\}^{\delta_{ij}} \qquad (8)$

or alternatively as

$L = \prod_{i=1}^{n} \prod_{j=1}^{J} \{F[(\tilde{A}_{ij} - \tilde{x}_i\beta)/\sigma] - F[(\tilde{A}_{i,j-1} - \tilde{x}_i\beta)/\sigma]\}^{\delta_{ij}} \qquad (9)$

where the tildes indicate that, for each observation, all variables and
category boundaries have been weighted by a multiplicative factor of
$1/\exp(w_i\alpha)^{1/2}$.
It should be evident that the implied transformation in equation (8) (or
(9)) eliminates the heteroscedasticity problem. Defining
$\tilde{\epsilon}_i = \epsilon_i/\exp(w_i\alpha)^{1/2}$, we have

$\mathrm{Var}(\tilde{\epsilon}_i) = \mathrm{Var}(\epsilon_i)/\exp(w_i\alpha) = \sigma_i^2/\exp(w_i\alpha) = \sigma^2 \qquad (11)$

where the third equality in equation (11) is based on (7). Maximizing
(9) is therefore equivalent to maximizing (4) where the data have been
transformed according to the weighting procedure suggested above. Hence
maximum likelihood estimates based on (9) have the same properties as
those based on (4), viz. consistency and asymptotic efficiency.
An obvious problem with the straightforward maximization of equation (9)
is that the weighting factor, $\exp(w_i\alpha)$, involves the unknown
parameters $\alpha$. Even if $\alpha$ were known, however, the required
transformation cannot be accomplished within the context of commonly
available "canned programs." This is because the category boundaries
must be weighted along with the independent variables in order for the
probability that $y_i^*$ falls in category $j$ to be unaffected by the
transformation, but canned programs typically treat these boundaries as
prescribed constants (see, e.g., LIMDEP 5.1). This limitation, coupled
with the fact that $\alpha$ is not known, necessitates applying maximum
likelihood estimation directly to equation (8), i.e., jointly estimating
$\beta$, $\sigma$, and $\alpha$ by maximum likelihood.(4)
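Direct maximization of (8) is straightforward to code by hand, which is one way around the canned-program limitation just described. The sketch below is our own illustration (all names and simulated values are assumptions); it uses the version of (7) in which $W$'s first column is a vector of ones, so $\exp(\alpha_1)$ absorbs the constant of proportionality, and it jointly estimates $\beta$ and $\alpha$:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def het_grouped_nll(params, X, W, cat, bounds):
    """Negative log likelihood of equation (8) under multiplicative
    heteroscedasticity, sigma_i^2 = exp(w_i alpha). W's first column
    is ones, so the scale constant is absorbed into alpha_1."""
    k = X.shape[1]
    beta, alpha = params[:k], params[k:]
    sigma_i = np.exp(0.5 * (W @ alpha))  # observation-specific std. dev.
    mu = X @ beta
    p = (norm.cdf((bounds[cat + 1] - mu) / sigma_i)
         - norm.cdf((bounds[cat] - mu) / sigma_i))
    return -np.sum(np.log(np.clip(p, 1e-300, None)))

# Illustration: variance rises with a regressor, as in the application below.
rng = np.random.default_rng(2)
n = 3000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
W = X                                   # here W is a subset of X
alpha_true = np.array([0.0, 0.8])
eps = rng.normal(size=n) * np.exp(0.5 * (W @ alpha_true))
y_star = X @ np.array([1.0, 0.5]) + eps
bounds = np.array([-np.inf, -1.0, 0.0, 1.0, 2.0, np.inf])
cat = np.digitize(y_star, bounds[1:-1])

res = minimize(het_grouped_nll, x0=np.zeros(4),
               args=(X, W, cat, bounds), method="BFGS")
beta_hat, alpha_hat = res.x[:2], res.x[2:]
```

The key point is that the category boundaries are divided by the same $\exp(w_i\alpha)^{1/2}$ factor as the index inside the normal CDF, so the cell probabilities are those of the transformed model.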
We turn now to an example, the sole purpose of which is to
illustrate the procedures we have proposed in this section.
IV. An Illustrative Example: Housing Expenditure Estimation
The fields of urban economics and state and local finance have, for
a couple of decades, been concerned with modelling hedonic price
functions, property tax capitalization rates, housing demand, etc. Each
of these areas of inquiry, at some level, requires the estimation of a
housing expenditure function. Typically the house purchaser is viewed as
buying a bundle of housing services, determined by various structural
and neighborhood characteristics of the dwelling, where the amount
purchased depends, in part, on the permanent income of the individual.
It is not uncommon to see permanent income proxied by measured income
and several characteristics of the individual (e.g., family size, age
and education level of the purchaser, etc.).
Two characteristics of these studies make housing expenditure
estimation an interesting application for the problem at hand. First,
there is a long history of heteroscedasticity in the general area of
expenditure modelling, but the problem may be even more acute in housing
expenditure models. There is a substantial body of evidence suggesting
that housing markets are segmented by house quality [4]. To the extent
that quality is positively related to income, there is a theoretical
reason for suspecting heteroscedastic disturbances in housing
expenditure models. This potential for heteroscedasticity is often
acknowledged but seldom addressed directly.(5) Second, some sources of
housing expenditure data provide only grouped information on that
variable - the 1980 census is such a data set. Thus housing expenditure
estimation provides the two characteristics needed to apply our
procedure: a grouped dependent variable and a potentially
heteroscedastic disturbance.
In Table I, we present estimates of a housing expenditure model similar
to one found in Long and Caudill [10]. The sample consists of 7107
observations drawn from the 1980 census tapes. The dependent variable,
housing expenditure, consists of twenty-four relatively narrow
categories, ranging in width from $2500 to $10,000. Since this is a
national sample, rather than one drawn from a particular metropolitan
area, we proxy structural and neighborhood characteristics with a set of
regional dummies (NE = 1 if the residence is in the northeast, NC = 1 if
the residence is in the north central region, W = 1 if the residence is
in the west). FAMSIZ is the number of persons in the household; EDUC and
AGE are the education level and age, respectively, of the household
head; FAMINC is family income.
Table I. Housing Expenditure Estimates(*)

Variable         MODEL 1      MODEL 2
Constant        -14170.2     -11016.1
                 (-5.42)      (-4.53)
NE               3900.31       4394.3
                  (3.89)       (4.91)
NC               1270.95      1231.47
                  (1.30)       (1.38)
W                28313.7      26322.6
                 (31.55)      (32.41)
FAMSIZ           -557.36      -259.36
                 (-2.00)      (-0.94)
EDUC              2781.2       2550.6
                 (26.74)      (27.32)
AGE               124.19       132.64
                  (4.35)       (5.37)
FAMINC           1034.19       975.26
                 (55.30)      (30.02)
$\sigma$         28363.9         -
$\alpha_1$          -           19.62
                             (1862.0)
$\alpha_2$          -           0.029
                              (65.58)
LLF             -21247.4     -20832.9

(*) Asymptotic t-ratios are in parentheses.
The first column of Table I displays results for the homoscedastic
version of the model. Asymptotic t statistics are in parentheses; the
estimated standard error of the model is 28363.9; the logarithm of the
likelihood function is -21247.4. In general, the results conform with
what casual empiricism would suggest. Higher income, older, more
educated, and smaller families spend more on housing. Also, families in
the northeast and west spend more on housing than do families in the
south. Except for the NC variable, all relationships are statistically
significant at the $\alpha = .05$ level.
The second column of Table I presents our results after correcting for
heteroscedasticity. Based on the theoretical result discussed earlier,
we felt fairly confident with a rather parsimonious specification of the
weighting vector. Specifically, we assumed

$\sigma_i^2 = \exp(\alpha_1 + \alpha_2 \mathrm{FAMINC}_i). \qquad (12)$
Perhaps the most important result to be seen in the second column is
that correcting for heteroscedasticity was necessary. The likelihood
ratio statistic for testing the significance of $\alpha_2$, i.e., taking
the homoscedastic model as the restricted model, is $\chi^2(1) = 828.2$,
indicating statistical significance at any reasonable level. This
inference is bolstered by an asymptotic t ratio of 65.58 for $\alpha_2$.
Clearly the disturbance variance varies systematically with income.
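For readers wishing to reproduce the test, the statistic is simply twice the difference of the two log likelihoods reported in Table I. A minimal sketch follows; note that the rounded table values give a number slightly above the 828.2 reported in the text, which presumably was computed from unrounded log likelihoods:

```python
from scipy.stats import chi2

llf_restricted = -21247.4    # homoscedastic model (Table I, Model 1)
llf_unrestricted = -20832.9  # heteroscedastic model (Table I, Model 2)

lr = 2.0 * (llf_unrestricted - llf_restricted)  # ~829 from rounded values
p_value = chi2.sf(lr, df=1)  # one restriction: alpha_2 = 0
```

With a statistic this large, the p-value is effectively zero, so the homoscedasticity restriction is rejected at any conventional level.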
In this application, however, the practical import of correcting for
heteroscedasticity is not terribly striking. There are no sign changes
in the coefficient estimates as a result of the correction.
With the exception of the NE coefficient, the homoscedastic model
appears to overstate the magnitude of the coefficients. But this
overstatement, expressed as a percent of the homoscedastic coefficients,
is quite small in all cases. Thus the biases encountered in not
correcting our housing expenditure model for heteroscedasticity do not
appear to be sizeable.
We suspect that this inference is primarily a result of the fact
that the dependent variable in our illustration has a large number of
small categories and is not endemic to the procedure itself. In
discussing their n-chotomous probit model, McKelvey and Zavoina [11]
suggest that their probit results more closely approximate their least
squares counterparts as the number of categories increases. It seems
reasonable to extend this conjecture to grouped data regression. Since
correcting OLS estimates for heteroscedasticity does not have much
effect on the estimated coefficients, the same may be true for grouped
data regression if the number of categories of the dependent variable is
large. We would expect to see more striking coefficient differences than
those in Table I if the dependent variable had fewer categories.
V. Summary
The focus of this paper has been to provide a method of correcting
for heteroscedasticity in grouped data regression. We began by outlining
Stewart's grouped data regression model and noting that its
estimates would be inconsistent if the model's disturbances were
heteroscedastic. We then proposed a model, incorporating the rather
general assumption of multiplicative heteroscedasticity, which produces
consistent and asymptotically efficient estimates in the grouped data
case. Finally, we applied this model to the problem of estimating
housing expenditure as an illustrative example. While our results
clearly indicated the need to correct for heteroscedasticity, the
implicit biases encountered in not correcting did not appear to be
sizeable. We conjecture that this result is due to the fact that our
measure of housing expenditure involves twenty-four categories. In a
typical application involving only six or seven categories [1], we would
expect to see a much more dramatic effect of our correction on the
estimated coefficients. All in all, with the increasing popularity of
grouped data regression and in view of the problems arising from
heteroscedastic disturbances within its context, we believe that the
procedures we suggest in this paper have a lot to recommend them.
References

[1.] Ault, Richard, John D. Jackson, and Richard Saba. "The Effects of Long Term Rent Control on Tenant Mobility." Department of Economics, Auburn University, working paper, 1992.
[2.] Boehm, Thomas P. and Richard A. Hofler, "A Frontier Approach to Measuring the Effect of Market Discrimination: A Housing Illustration." Southern Economic Journal, October 1987, 301-15.
[3.] David, J. M. and Legg, W. E., "An Application of Multivariate Probit Analysis to the Demand for Housing: A Contribution to the Improvement of the Predictive Performance of Demand Theory, Preliminary Results." American Statistical Association Proceedings of the Business and Economics Statistics Section, August 1975, 295-300.
[4.] deLeeuw, Frank and Raymond J. Struyk. The Web of Urban Housing. Washington, D.C.: The Urban Institute, 1975.
[5.] Greene, William H. Econometric Analysis. New York: Macmillan, 1990.
[6.] Gujarati, Damodar N. Basic Econometrics. New York: McGraw-Hill, 1988.
[7.] Hurd, Michael, "Estimation in Truncated Samples Where There is Heteroscedasticity." Journal of Econometrics, 1979, 247-58.
[8.] Ihlanfeldt, Keith R. and John D. Jackson, "Systematic Assessment Error and Intrajurisdiction Property Tax Capitalization." Southern Economic Journal, October 1982, 417-27.
[9.] Jones, Ethel B. and John D. Jackson. "College Grades and Labor Market Rewards." Journal of Human Resources, Spring 1990, 253-66.
[10.] Long, James E. and Steven Caudill, "Racial Differences in Home Ownership and Housing Wealth, 1970-86." Economic Inquiry, January 1992, 83-100.
[11.] McKelvey, Richard D. and William Zavoina, "A Statistical Model for the Analysis of Ordinal Level Dependent Variables." Journal of Mathematical Sociology, Summer 1975, 113-20.
[12.] Silberman, Jonathan I. and Talley, Wayne K., "N-Chotomous Dependent Variables: An Application to Regulatory Decision Making." American Statistical Association Proceedings of the Business and Economic Statistics Section, August 1974, 573-76.
[13.] Stewart, Mark B., "On Least Squares Estimation when the Dependent Variable is Grouped." Review of Economic Studies, 1983, 737-53.
[14.] Yatchew, Adonis, and Zvi Griliches, "Specification Error in Probit Models." Review of Economics and Statistics, 1984, 134-39.
[1.] Stewart [13] shows this result and even calculates the magnitude of the asymptotic bias in the multivariate case.
[2.] Some examples of the numerous studies making this error within a regression context include Jones and Jackson [9]; within a tobit context, Boehm and Hofler [2]; and within a probit context, Silberman and Talley [12] and David and Legg [3].
[3.] Stewart provides both first and second order conditions for maximizing equation (5). In addition, he suggests a procedure employing the EM algorithm which iterates between estimates of the conditional mean of $y_i^*$ and consequent OLS regression estimates. The problem with this procedure is that the standard errors from the last (convergent) OLS iteration may not be appropriate; obtaining correct standard errors may require inverting the Hessian of (5) in any case. Thus we concentrate on direct maximization of (5) rather than Stewart's EM version.
[4.] A computer program, written in SAS MATRIX, which allows the user to correct for heteroscedasticity in grouped data regression according to our suggested procedure is available from the authors upon request.
[5.] Typically, if anything is done, it is the suggestion that the functional form to be estimated (e.g., semi-log, double-log, etc.) should help to correct for this problem. In defense of these authors, this attitude is not nearly so cavalier as it might first appear. In many instances OLS is an appropriate technique for estimating housing expenditure. For such cases, the log transformation may be a perfectly acceptable ad hoc method of controlling for heteroscedasticity [6]. Furthermore, many studies require only unbiasedness and consistency of housing expenditure parameter estimates [8], so that no correction for heteroscedasticity is called for.