Heteroscedasticity and grouped data regression.
Jackson, John D.
I. Introduction
Of the numerous problems arising in attempts to empirically model
an economic phenomenon using micro level survey data, perhaps the least
attention has been given to those attributable to a "grouped
dependent variable." In this case, the dependent variable is
categorical in nature, having known category boundaries, and of interval
strength. For many surveys, data on individual incomes, home values,
length of time in residence, etc. fall into this class of measures, and
attempts to econometrically estimate models explaining their behavior
encounter more difficulties than is commonly realized. When confronted
with grouped data on the dependent variable, analysts typically
assign each observation in a particular category the midpoint value of
that category, perhaps after the end points of that category have been
transformed (e.g., to conform to a Pareto, lognormal, or some other
distribution), and then estimate the parameters of the model by ordinary
least squares (OLS) regression. Alternatively, the analyst might attempt
to account for the categorical nature of the dependent variable by
employing a maximum likelihood estimation technique such as probit or
logit. Unfortunately, both of these approaches result in unsatisfactory
parameter estimates. OLS on the category midpoints produces inconsistent
estimates.(1) Even the qualitative dependent variable maximum likelihood
techniques such as n-chotomous probit produce inefficient estimates,
since they ignore the information provided by the known values of the
category boundaries.
About a decade ago, Mark Stewart [13] developed a maximum
likelihood model that would allow consistent and asymptotically
efficient parameter estimation when the dependent variable is grouped.
His procedure has come to be known as "grouped data
regression." Mute testimony as to the lack of attention paid to
this important work is given by the dominance, until very recently, of
the two inappropriate techniques (above) in the relevant empirical
literature.(2)
Due to the recent rapid rise in popularity of the grouped data
regression model, however, some of its limitations should be examined.
One limitation that has not been considered is the fact that the model
is strictly applicable only if the theoretical disturbance term of the
underlying model is homoscedastic. This drawback is particularly
disturbing since Yatchew and Griliches [14] have shown that, for maximum
likelihood models having likelihood functions very similar to that of
grouped data regression, heteroscedasticity leads not only to inefficient
estimates (as in OLS) but also results in biased and inconsistent
estimates. The object of this paper is to develop a method by which the
grouped data regression model can be extended to the heteroscedastic
case.
In what follows, we begin by outlining Stewart's homoscedastic
model as a useful point of departure. We then suggest a method by which
multiplicative heteroscedasticity can be incorporated into
Stewart's estimation structure. Finally, we illustrate our
suggested procedures with an application to housing demand. We conclude
with a brief summary.
II. Grouped Data Regression
Suppose we posit the following behavioral model

$y^* = X\beta + \epsilon \qquad (1)$

where $y^*$ is an $(n \times 1)$ vector of implicit observations on the
dependent variable, $X$ is an $(n \times k)$ matrix of observations on the
$k$ independent variables in the model, $\beta$ is a $(k \times 1)$ vector
of unknown coefficients to be estimated, and $\epsilon$ is an
$(n \times 1)$ vector of stochastic disturbances, each element
$\epsilon_i$ of which is assumed i.i.d. $N(0, \sigma^2)$. We say that
$y^*$ is a vector of "implicit observations" because in this conceptual
framework, $y^*$ is not directly observable. If it were observable, then
each (cardinally measurable) $y_i^*$ would be independently normally
distributed with mean $x_i\beta$ and constant variance $\sigma^2$, as
implied by our assumptions on $\epsilon$.
Rather, all we are able to observe is $y_i$, the category (with known end
point values) within which $y_i^*$ falls. More precisely, if the real
number line were partitioned into $J$ mutually exclusive and exhaustive
categories with boundaries $A_j$ $(j = 0, \ldots, J)$, then we observe
$y_i = j$ if

$A_{j-1} < y_i^* < A_j. \qquad (2)$
It is important to emphasize that the observed $y_i$ are only of ordinal
strength, but that the category boundaries $\{A_j\}$ are known cardinal
numbers. Our problem within this framework is to obtain consistent and
asymptotically efficient estimates of the unknown parameters, $\beta$ and
$\sigma^2$, of the model. One approach to obtaining such estimates is the
method of maximum likelihood.
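The observation mechanism in (2) is easy to mimic with simulated data. The following sketch is our own illustration, not part of the paper; all variable names and numerical values are assumptions chosen for concreteness. It draws a latent $y^*$ and records only the category into which each observation falls:

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent model y* = X beta + eps, eps ~ N(0, sigma^2); the analyst
# never sees y*, only the category j with A_{j-1} < y* < A_j.
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, 1.5])
sigma = 1.0
y_star = X @ beta + rng.normal(scale=sigma, size=n)  # unobserved

# Known cardinal boundaries: A_0 = -inf, interior cut points, A_J = +inf
interior = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_obs = np.digitize(y_star, interior)  # observed category index, 0..5
```

Note that `y_obs` is only ordinal, while the cut points themselves are known cardinal numbers, which is exactly the information structure the grouped data regression model exploits.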
Based on the assumptions above, the probability that $y_i = j$, i.e., the
probability that $y_i^*$ falls in the $j$th category, is given by

$P(y_i = j) = P(A_{j-1} < y_i^* < A_j)$
$\qquad = P\{(A_{j-1} - x_i\beta)/\sigma < (y_i^* - x_i\beta)/\sigma < (A_j - x_i\beta)/\sigma\}$
$\qquad = F[(A_j - x_i\beta)/\sigma] - F[(A_{j-1} - x_i\beta)/\sigma] \qquad (3)$

where $F(\cdot)$ is the standard normal cumulative distribution function
evaluated at $(\cdot)$. For an independent random sample of $n$
observations, the likelihood function is the product of these
probabilities taken across the $J$ categories and over the $n$
observations, i.e.,
$L = \prod_{i=1}^{n} \prod_{j=1}^{J} \{F[(A_j - x_i\beta)/\sigma] - F[(A_{j-1} - x_i\beta)/\sigma]\}^{\delta_{ij}} \qquad (4)$

where $\delta_{ij} = 1$ if the $i$th observation falls in the $j$th
category; $\delta_{ij} = 0$, otherwise. Therefore, the log likelihood
function is

$\ln L = \sum_{i=1}^{n} \sum_{j=1}^{J} \delta_{ij} \ln\{F[(A_j - x_i\beta)/\sigma] - F[(A_{j-1} - x_i\beta)/\sigma]\}. \qquad (5)$
Partially differentiating equation (5) with respect to the unknown
parameters $(\beta, \sigma)$ and setting the derivatives equal to zero
yields $k + 1$ nonlinear equations, which can be solved by iterative
techniques (e.g., Davidon-Fletcher-Powell) to find consistent and
asymptotically efficient estimates of the $\beta_m$ $(m = 1, \ldots, k)$
and $\sigma$. Asymptotic standard errors of these estimates can be read
from the diagonal of the negative inverse of the Hessian matrix of
equation (5).(3)
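As a concrete sketch of this estimator, the log likelihood in (5) can be maximized numerically. The code below is our own illustration (function names, data, and tuning choices are assumptions, not the paper's); it uses scipy's BFGS routine rather than Davidon-Fletcher-Powell, and reparameterizes $\sigma$ as $\exp(\cdot)$ to keep it positive:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def grouped_nll(params, X, cat, bounds):
    """Negative log likelihood of equation (5). cat[i] is the observed
    category index; bounds holds the known boundaries A_0 < ... < A_J
    (the outer boundaries may be +/- infinity)."""
    k = X.shape[1]
    beta, sigma = params[:k], np.exp(params[k])  # sigma = exp(.) > 0
    mu = X @ beta
    p = (norm.cdf((bounds[cat + 1] - mu) / sigma)
         - norm.cdf((bounds[cat] - mu) / sigma))
    return -np.sum(np.log(np.clip(p, 1e-300, None)))

# Illustration on simulated grouped data (true beta = (1.0, 0.5), sigma = 1).
rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y_star = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
bounds = np.array([-np.inf, -1.0, 0.0, 1.0, 2.0, np.inf])
cat = np.digitize(y_star, bounds[1:-1])

res = minimize(grouped_nll, x0=np.zeros(X.shape[1] + 1),
               args=(X, cat, bounds), method="BFGS")
beta_hat, sigma_hat = res.x[:2], np.exp(res.x[2])
```

Because the boundaries enter in levels, the non-normalized $\beta$ and $\sigma$ are recovered directly, in contrast to the probit model discussed next.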
Readers familiar with the n-chotomous probit model of McKelvey and
Zavoina [11] will note that equations (4) and (5) are exactly the
likelihood and log likelihood functions, respectively, of their probit
model. This probably accounts for the recent tendency to use probit
analysis when the dependent variable is grouped. Nevertheless, the two
models are distinctly different. In the probit model, no information is
available on the scale of the underlying dependent variable.
Consequently, the variance $\sigma^2$ is not estimable; all that can be
estimated are the normalized coefficients $(\beta/\sigma)$ and the
unknown normalized category boundaries $(A_j/\sigma)$. In the grouped
data regression model, on the other hand, the known values of the
category boundaries provide information on the scale of $y^*$. This
permits the direct estimation of the non-normalized coefficients and
also allows the estimation of $\sigma^2$ [5, 739]. This distinction
should make it clear why applying probit to a grouped data regression
problem will produce inefficient estimates: the probit model ignores
information on the category boundaries.
III. Heteroscedasticity and Grouped Data Regression
Important statistical problems arise for grouped data regression if
the disturbance vector of equation (1) does not satisfy its null mean,
scalar covariance matrix assumptions. The problems arising from a
non-null mean vector are not exclusive to grouped data regression; it is
well known that, even in OLS regression, omitted variables, errors in
measurement, etc. lead to inconsistent estimates. But in OLS,
non-spherical disturbances produce unbiased and consistent estimates;
the only problem is efficiency. In grouped data regression,
non-spherical disturbances - at least in the form of heteroscedasticity
- will also produce inconsistent estimates.
Evidence on this result comes from two sources. Based on the
similarity of the likelihood function of equation (4) to that of n
-chotomous probit, one could adopt the Yatchew and Griliches technique
of taking a Taylor series expansion of the "plimmed" first
order conditions to compute the magnitude of the (non-vanishing)
asymptotic bias in the coefficient estimates. Alternatively, one could
note [5, 738] that the grouped data regression model is a limiting case
of a doubly censored tobit model (where all observations are censored).
Even for the singly censored case, Hurd [7] and others have shown that
the resulting tobit estimates are inconsistent in the presence of
heteroscedasticity. In our opinion, the most straightforward way to see
the inconsistency of traditional grouped data regression estimators in
the case of heteroscedastic disturbances is that maximizing equation (4)
when the variance is not constant amounts simply to maximizing the wrong
likelihood function.
The appropriate likelihood function when the disturbance variance varies
from one observation to the next is

$L = \prod_{i=1}^{n} \prod_{j=1}^{J} \{F[(A_j - x_i\beta)/\sigma_i] - F[(A_{j-1} - x_i\beta)/\sigma_i]\}^{\delta_{ij}} \qquad (6)$

where $\sigma_i \neq \sigma$ (a constant) for all $i$. Clearly, equation
(6) is not the same as equation (4). However, if the heteroscedasticity
can be characterized by the rather general relationship known as
"multiplicative heteroscedasticity," then a procedure is available to
reduce equation (6) to the equivalent of equation (4), where all of the
results pertaining to (4) also apply.
The assumption of multiplicative heteroscedasticity amounts to assuming

$\sigma_i^2 = \sigma^2 \exp(w_i\alpha) \qquad (7)$

where $\sigma^2$ is now just a constant of proportionality (not the true
variance), $\alpha$ is an $(L \times 1)$ vector of unknown coefficients,
and $w_i$ is the $i$th row of a weighting matrix $W$ ($n \times L$, where
$W$ may be a subset of $X$). This form is fairly general: if
$\alpha = 0$, then (7) reduces to the homoscedastic case; if
$\alpha_l = 0$ $(l = 1, \ldots, L;\ l \neq q)$, $\alpha_q = 1$, and
$w_q = \ln x_q^2$, then the implicit form is the well-known assumption of
Aitken from generalized least squares theory (i.e.,
$\sigma_i^2 = \sigma^2 x_{qi}^2$). It is probably worth noting that
whether or not $\sigma^2$ is explicitly included in equation (7) is
purely a matter of notation, and not substance. In other words, if we
assume

$\sigma_i^2 = \exp(w_i\alpha)$

and we specify $W$ such that its first column is a vector of ones, then
$\sigma^2$ is estimated by $\exp(\alpha_1)$. Implicitly, then, the
version of (7) above assumes that $W\alpha$ contains no constant term.
Substituting equation (7) into (6), the likelihood function can be
written as

$L = \prod_{i=1}^{n} \prod_{j=1}^{J} \{F[(A_j - x_i\beta)/(\sigma \exp(w_i\alpha)^{1/2})] - F[(A_{j-1} - x_i\beta)/(\sigma \exp(w_i\alpha)^{1/2})]\}^{\delta_{ij}} \qquad (8)$

or alternatively as

$L = \prod_{i=1}^{n} \prod_{j=1}^{J} \{F[(\tilde{A}_{ij} - \tilde{x}_i\beta)/\sigma] - F[(\tilde{A}_{i,j-1} - \tilde{x}_i\beta)/\sigma]\}^{\delta_{ij}} \qquad (9)$

where the tildes indicate that, for each observation, all variables and
category boundaries have been weighted by a multiplicative factor of
$1/\exp(w_i\alpha)^{1/2}$.
It should be evident that the implied transformation in equation (8) (or
(9)) eliminates the heteroscedasticity problem. Defining
$\tilde{\epsilon}_i = \epsilon_i/\exp(w_i\alpha)^{1/2}$, we have

$\mathrm{Var}(\tilde{\epsilon}_i) = \mathrm{Var}(\epsilon_i)/\exp(w_i\alpha) = \sigma_i^2/\exp(w_i\alpha) = \sigma^2 \qquad (11)$

where the third equality in equation (11) is based on (7). Maximizing
(9) is therefore equivalent to maximizing (4) where the data have been
transformed according to the weighting procedure suggested above. Hence
maximum likelihood estimates based on (9) have the same properties as
those based on (4), viz. consistency and asymptotic efficiency.
An obvious problem with the straightforward maximization of equation (9)
is that the weighting factor, $\exp(w_i\alpha)$, involves the unknown
parameters $\alpha$. Even if $\alpha$ were known, however, the required
transformation cannot be accomplished within the context of commonly
available "canned programs." This is because the category boundaries
must be weighted along with the independent variables in order for the
probability that $y_i^*$ falls in category $j$ to be unaffected by the
transformation, but canned programs typically treat these boundaries as
prescribed constants (see, e.g., LIMDEP 5.1). This limitation, coupled
with the fact that $\alpha$ is not known, necessitates applying maximum
likelihood estimation directly to equation (8), i.e., jointly estimating
$\beta$, $\sigma$, and $\alpha$ by maximum likelihood.(4)
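Direct maximization of (8) is straightforward to code by hand, which is one way around the canned-program limitation just described. The sketch below is our own illustration (all names and simulated values are assumptions); it uses the version of (7) in which $W$'s first column is a vector of ones, so $\exp(\alpha_1)$ absorbs the constant of proportionality, and it jointly estimates $\beta$ and $\alpha$:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def het_grouped_nll(params, X, W, cat, bounds):
    """Negative log likelihood of equation (8) under multiplicative
    heteroscedasticity, sigma_i^2 = exp(w_i alpha). W's first column
    is ones, so the scale constant is absorbed into alpha_1."""
    k = X.shape[1]
    beta, alpha = params[:k], params[k:]
    sigma_i = np.exp(0.5 * (W @ alpha))  # observation-specific std. dev.
    mu = X @ beta
    p = (norm.cdf((bounds[cat + 1] - mu) / sigma_i)
         - norm.cdf((bounds[cat] - mu) / sigma_i))
    return -np.sum(np.log(np.clip(p, 1e-300, None)))

# Illustration: variance rises with a regressor, as in the application below.
rng = np.random.default_rng(2)
n = 3000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
W = X                                   # here W is a subset of X
alpha_true = np.array([0.0, 0.8])
eps = rng.normal(size=n) * np.exp(0.5 * (W @ alpha_true))
y_star = X @ np.array([1.0, 0.5]) + eps
bounds = np.array([-np.inf, -1.0, 0.0, 1.0, 2.0, np.inf])
cat = np.digitize(y_star, bounds[1:-1])

res = minimize(het_grouped_nll, x0=np.zeros(4),
               args=(X, W, cat, bounds), method="BFGS")
beta_hat, alpha_hat = res.x[:2], res.x[2:]
```

The key point is that the category boundaries are divided by the same $\exp(w_i\alpha)^{1/2}$ factor as the index inside the normal CDF, so the cell probabilities are those of the transformed model.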
We turn now to an example, the sole purpose of which is to
illustrate the procedures we have proposed in this section.
IV. An Illustrative Example: Housing Expenditure Estimation
The fields of urban economics and state and local finance have, for
a couple of decades, been concerned with modelling hedonic price
functions, property tax capitalization rates, housing demand, etc. Each
of these areas of inquiry, at some level, requires the estimation of a
housing expenditure function. Typically the house purchaser is viewed as
buying a bundle of housing services, determined by various structural
and neighborhood characteristics of the dwelling, where the amount
purchased depends, in part, on the permanent income of the individual.
It is not uncommon to see permanent income proxied by measured income
and several characteristics of the individual (e.g., family size, age
and education level of the purchaser, etc.).
Two characteristics of these studies make housing expenditure
estimation an interesting application for the problem at hand. First,
there is a long history of heteroscedasticity in the general area of
expenditure modelling, but the problem may be even more acute in housing
expenditure models. There is a substantial body of evidence suggesting
that housing markets are segmented by house quality [4]. To the extent
that quality is positively related to income, there is a theoretical
reason for suspecting heteroscedastic disturbances in housing
expenditure models. This potential for heteroscedasticity is often
acknowledged but seldom addressed directly.(5) Second, some sources of
housing expenditure data provide only grouped information on that
variable - the 1980 census is such a data set. Thus housing expenditure
estimation provides the two characteristics needed to apply our
procedure: a grouped dependent variable and a potentially
heteroscedastic disturbance.
In Table I, we present estimates of a housing expenditure model similar
to one found in Long and Caudill [10]. The sample consists of 7107
observations drawn from the 1980 census tapes. The dependent variable,
housing expenditure, consists of twenty-four relatively narrow
categories, ranging in width from $2500 to $10,000. Since this is a
national sample, rather than one drawn from a particular metropolitan
area, we proxy structural and neighborhood characteristics with a set of
regional dummies (NE = 1 if the residence is in the northeast, NC = 1 if
the residence is in the north central region, W = 1 if the residence is
in the west). FAMSIZ is the number of persons in the household; EDUC and
AGE are the education level and age, respectively, of the household
head; FAMINC is family income.
Table I. Housing Expenditure Estimates(*)

Variable         MODEL 1      MODEL 2
Constant        -14170.2     -11016.1
                 (-5.42)      (-4.53)
NE               3900.31       4394.3
                  (3.89)       (4.91)
NC               1270.95      1231.47
                  (1.30)       (1.38)
W                28313.7      26322.6
                 (31.55)      (32.41)
FAMSIZ           -557.36      -259.36
                 (-2.00)      (-0.94)
EDUC              2781.2       2550.6
                 (26.74)      (27.32)
AGE               124.19       132.64
                  (4.35)       (5.37)
FAMINC           1034.19       975.26
                 (55.30)      (30.02)
$\sigma$         28363.9         -
$\alpha_1$          -           19.62
                             (1862.0)
$\alpha_2$          -           0.029
                              (65.58)
LLF             -21247.4     -20832.9

(*) Asymptotic t-ratios are in parentheses.
The first column of Table I displays results for the homoscedastic
version of the model. Asymptotic t statistics are in parentheses; the
estimated standard error of the model is 28363.9; the logarithm of the
likelihood function is -21247.4. In general, the results conform with
what casual empiricism would suggest. Higher income, older, more
educated, and smaller families spend more on housing. Also, families in
the northeast and west spend more on housing than do families in the
south. Except for the NC variable, all relationships are statistically
significant at the $\alpha = .05$ level.
The second column of Table I presents our results after correcting for
heteroscedasticity. Based on the theoretical result discussed earlier,
we felt fairly confident with a rather parsimonious specification of the
weighting vector. Specifically, we assumed

$\sigma_i^2 = \exp(\alpha_1 + \alpha_2 \mathrm{FAMINC}_i). \qquad (12)$
Perhaps the most important result to be seen in the second column is
that correcting for heteroscedasticity was necessary. The likelihood
ratio statistic for testing the significance of $\alpha_2$, i.e., taking
the homoscedastic model as the restricted model, is $\chi^2(1) = 828.2$,
indicating statistical significance at any reasonable level. This
inference is bolstered by an asymptotic t ratio of 65.58 for $\alpha_2$.
Clearly the disturbance variance varies systematically with income.
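For readers wishing to reproduce the test, the statistic is simply twice the difference of the two log likelihoods reported in Table I. A minimal sketch follows; note that the rounded table values give a number slightly above the 828.2 reported in the text, which presumably was computed from unrounded log likelihoods:

```python
from scipy.stats import chi2

llf_restricted = -21247.4    # homoscedastic model (Table I, Model 1)
llf_unrestricted = -20832.9  # heteroscedastic model (Table I, Model 2)

lr = 2.0 * (llf_unrestricted - llf_restricted)  # ~829 from rounded values
p_value = chi2.sf(lr, df=1)  # one restriction: alpha_2 = 0
```

With a statistic this large, the p-value is effectively zero, so the homoscedasticity restriction is rejected at any conventional level.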
In this application, however, the practical import of correcting for
heteroscedasticity is not terribly striking. There are no sign changes
in the coefficient estimates as a result of the correction.
With the exception of the NE coefficient, the homoscedastic model
appears to overstate the magnitude of the coefficients. But this
overstatement, expressed as a percent of the homoscedastic coefficients,
is quite small in all cases. Thus the biases encountered in not
correcting our housing expenditure model for heteroscedasticity do not
appear to be sizeable.
We suspect that this inference is primarily a result of the fact
that the dependent variable in our illustration has a large number of
small categories and is not endemic to the procedure itself. In
discussing their n-chotomous probit model, McKelvey and Zavoina [11]
suggest that their probit results more closely approximate their least
squares counterparts as the number of categories increases. It seems
reasonable to extend this conjecture to grouped data regression. Since
correcting OLS estimates for heteroscedasticity does not have much
effect on the estimated coefficients, the same may be true for grouped
data regression if the number of categories of the dependent variable is
large. We would expect to see more striking coefficient differences than
those in Table I if the dependent variable had fewer categories.
V. Summary
The focus of this paper has been to provide a method of correcting
for heteroscedasticity in grouped data regression. We began by outlining
Stewart's grouped data regression model and noting that its
estimates would be inconsistent if the model's disturbances were
heteroscedastic. We then proposed a model, incorporating the rather
general assumption of multiplicative heteroscedasticity, which produces
consistent and asymptotically efficient estimates in the grouped data
case. Finally, we applied this model to the problem of estimating
housing expenditure as an illustrative example. While our results
clearly indicated the need to correct for heteroscedasticity, the
implicit biases encountered in not correcting did not appear to be
sizeable. We conjecture that this result is due to the fact that our
measure of housing expenditure involves twenty-four categories. In a
typical application involving only six or seven categories [1], we would
expect to see a much more dramatic effect of our correction on the
estimated coefficients. All in all, with the increasing popularity of
grouped data regression and in view of the problems arising from
heteroscedastic disturbances within its context, we believe that the
procedures we suggest in this paper have a lot to recommend them.
References

[1.] Ault, Richard, John D. Jackson, and Richard Saba. "The Effects of Long Term Rent Control on Tenant Mobility." Department of Economics, Auburn University, working paper, 1992.
[2.] Boehm, Thomas P. and Richard A. Hofler, "A Frontier Approach to Measuring the Effect of Market Discrimination: A Housing Illustration." Southern Economic Journal, October 1987, 301-15.
[3.] David, J. M. and Legg, W. E., "An Application of Multivariate Probit Analysis to the Demand for Housing: A Contribution to the Improvement of the Predictive Performance of Demand Theory, Preliminary Results." American Statistical Association Proceedings of the Business and Economics Statistics Section, August 1975, 295-300.
[4.] deLeeuw, Frank and Raymond J. Struyk. The Web of Urban Housing. Washington, D.C.: The Urban Institute, 1975.
[5.] Greene, William H. Econometric Analysis. New York: Macmillan, 1990.
[6.] Gujarati, Damodar N. Basic Econometrics. New York: McGraw-Hill, 1988.
[7.] Hurd, Michael, "Estimation in Truncated Samples Where There is Heteroscedasticity." Journal of Econometrics, 1979, 247-58.
[8.] Ihlanfeldt, Keith R. and John D. Jackson, "Systematic Assessment Error and Intrajurisdiction Property Tax Capitalization." Southern Economic Journal, October 1982, 417-27.
[9.] Jones, Ethel B. and John D. Jackson. "College Grades and Labor Market Rewards." Journal of Human Resources, Spring 1990, 253-66.
[10.] Long, James E. and Steven Caudill, "Racial Differences in Home Ownership and Housing Wealth, 1970-86." Economic Inquiry, January 1992, 83-100.
[11.] McKelvey, Richard D. and William Zavoina, "A Statistical Model for the Analysis of Ordinal Level Dependent Variables." Journal of Mathematical Sociology, Summer 1975, 113-20.
[12.] Silberman, Jonathan I. and Talley, Wayne K., "N-Chotomous Dependent Variables: An Application to Regulatory Decision Making." American Statistical Association Proceedings of the Business and Economic Statistics Section, August 1974, 573-76.
[13.] Stewart, Mark B., "On Least Squares Estimation when the Dependent Variable is Grouped." Review of Economic Studies, 1983, 737-53.
[14.] Yatchew, Adonis, and Zvi Griliches, "Specification Error in Probit Models." Review of Economics and Statistics, 1984, 134-39.
[1.] Stewart [13] shows this result and even calculates the magnitude of the asymptotic bias in the multivariate case.
[2.] Some examples of the numerous studies making this error within a regression context include Jones and Jackson [9]; within a tobit context, Boehm and Hofler [2]; and within a probit context, Silberman and Talley [12] and David and Legg [3].
[3.] Stewart provides both first and second order conditions for maximizing equation (5). In addition, he suggests a procedure employing the EM algorithm which iterates between estimates of the conditional mean of $y_i^*$ and consequent OLS regression estimates. The problem with this procedure is that the standard errors from the last (convergent) OLS iteration may not be appropriate; obtaining correct standard errors may require inverting the Hessian of (5) in any case. Thus we concentrate on direct maximization of (5) rather than Stewart's EM version.
[4.] A computer program, written in SAS MATRIX, which allows the user to correct for heteroscedasticity in grouped data regression according to our suggested procedure is available from the authors upon request.
[5.] Typically, if anything is done, it is the suggestion that the functional form to be estimated (e.g., semi-log, double-log, etc.) should help to correct for this problem. In defense of these authors, this attitude is not nearly so cavalier as it might first appear. In many instances OLS is an appropriate technique for estimating housing expenditure. For such cases, the log transformation may be a perfectly acceptable ad hoc method of controlling for heteroscedasticity [6]. Furthermore, many studies require only unbiasedness and consistency of housing expenditure parameter estimates [8], so that no correction for heteroscedasticity is called for.