Multinomial logit model of occupational choice: a latent variable approach.
Shabbir, Tayyeb
INTRODUCTION
Economists and other social scientists have had a long-standing
interest in studying the different aspects of an individual's
occupational choice. An important issue in this regard is an econometric analysis of the determinants of occupational choice. A rather well-known
example of such a work is Schmidt and Strauss (1975) which uses a
maximum likelihood procedure to estimate a multinomial logit model (MNL)
where occupational choice is determined by an individual's
education, experience, race and sex. Regarding the above genre of models
(as, in fact, in the parallel and closely related literature on earnings
functions), one is often interested in ascertaining the unbiased
marginal effect of education on the dependent variable. Not only do
these estimates allow tests of the human capital theory against
alternative hypotheses, they also have important public policy
implications, particularly for developing countries, which are
typically contemplating expansion of their educational sectors. However, these
estimates may become biased in the event that a relevant regressor is
left out of the specification. Particularly difficult problems arise
when such an excluded variable is a latent (or unobserved) one.
Using the Schmidt-Strauss multinomial logit model of occupational
choice as its starting point, the present paper:
(a) Motivates inclusion of a latent variable as one of the relevant
regressors.
(b) Shows how the omission of such a latent variable may bias the
maximum likelihood estimates of the coefficients of included variables,
and notes the conditions under which the direction of such a bias can be
ascertained.
(c) Describes a methodology that would provide unbiased coefficient
estimates given that certain conditions are met. The proposed procedure
requires data on siblings.
In the last ten to fifteen years, a problem similar to the one
described above has been considered in the context of the human capital
type of (log-linear) earnings functions. One aspect of this literature
dealt with the question of how the OLS coefficient estimates of the
various included variables would become biased if a relevant latent
variable were omitted; see Taubman (1977), Griliches (1979), Behrman et
al. (1980) and Shabbir (1987). However, compared to the treatment of the
omitted variable problem for the (log-) linear regression model, the
issue has not been analysed much for the class of discrete probability
models. (1) Lee (1980) is one of the few studies dealing with
the question of the omitted variable bias in MNL models. (2) Though
Lee's results are not explicitly derived for the case of the
omitted variable being a latent one, the present paper is able to extend
his analysis to this case as well.
The rest of the paper is organised as follows. Section 1 outlines the
essential features of the Schmidt-Strauss type of MNL model of
occupational choice, while Section 2 briefly presents the maximum
likelihood estimation procedure for this model. Section 3 considers the
implications of omitting a relevant latent or unobserved variable in the
specification for the above model and Section 4 presents a procedure to
handle this problem. This procedure requires data on siblings. Finally,
Section 5 contains some concluding remarks and comments.
1. MULTINOMIAL LOGIT MODEL OF OCCUPATIONAL CHOICE
Consider a model of occupational choice where individuals choose
one amongst the $L + 1$ alternatives facing them, with $L > 1$. Their behaviour can
be represented in terms of an ordered polychotomous response variable
$y = 0, 1, 2, \ldots, L$ which has $(L + 1)$ mutually exclusive and exhaustive
categories, where the occupational alternative 0 has the lowest rank and
$L$ has the highest one. Let $X$ be a $(j \times 1)$ vector of an individual's
characteristics (such as schooling, job experience, family background,
etc.) which affect the occupational choice and let $\alpha_i$ be a
$(1 \times j)$ vector of coefficients attached to $X$. Then, a Multinomial Logit
Model (MNL) of occupational choice can be specified as follows:
$$P(y = i \mid X) = \frac{\exp(\alpha_i X)}{1 + \sum_{k=1}^{L} \exp(\alpha_k X)} \qquad (1.1)$$
$$P(y = 0 \mid X) = \frac{1}{1 + \sum_{k=1}^{L} \exp(\alpha_k X)} \qquad (1.2)$$
where $i = 1, 2, \ldots, L$.
In the above context, we would further require that the
probabilities add up to unity, i.e.
$$P(y = 0 \mid X) + \sum_{i=1}^{L} P(y = i \mid X) = 1.$$
The MNL model given by (1.1) and (1.2) can be equivalently written
in the so-called "odds" form as follows:
$$\ln \frac{P(y = i \mid X)}{P(y = 0 \mid X)} = \alpha_i X, \qquad i = 1, \ldots, L \qquad (2)$$
where occupational category 0 is being used as the
'numeraire' category and ln represents natural logarithm.
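As a minimal sketch of how (1.1), (1.2) and the odds form (2) fit together, the following Python fragment computes the selection probabilities for a single individual. The function name `mnl_probabilities` and the coefficient and characteristic values are hypothetical, chosen only for illustration; they are not estimates from any data set.

```python
import math

def mnl_probabilities(alphas, x):
    """Return [P(y=0|X), P(y=1|X), ..., P(y=L|X)] as in (1.1)-(1.2).

    alphas: list of L coefficient vectors, one per category i = 1..L
    x: list of the individual's characteristics (same length as each vector)
    """
    # Linear index alpha_i X for each non-base category
    scores = [sum(a * v for a, v in zip(alpha_i, x)) for alpha_i in alphas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    # Category 0 is the numeraire, so its numerator is 1
    return [1.0 / denom] + [math.exp(s) / denom for s in scores]

# Hypothetical values: an intercept term and years of schooling
alphas = [[0.5, -0.2], [-1.0, 0.3]]
x = [1.0, 12.0]
probs = mnl_probabilities(alphas, x)

# Log-odds of category 1 against the numeraire category 0
log_odds_1 = math.log(probs[1] / probs[0])
```

Under these assumed values the probabilities add up to one, and the log-odds of each category against category 0 equals the corresponding linear index, as the odds form (2) requires.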
2. INDIVIDUAL LEVEL ESTIMATES OF THE MNL MODEL OF OCCUPATIONAL
CHOICE
The estimation of the parameters of the Multinomial Logit Model (2)
[with selection probabilities given by (1.1) and (1.2)] can be carried
out by using maximum likelihood procedures. (3) Such estimates will be
consistent and asymptotically efficient provided that the MNL model is
correctly specified. (4) These estimates can then be used to calculate
the appropriate selection probabilities.
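The maximum likelihood step can be illustrated with a small sketch. It is not the procedure of Schmidt and Strauss (1975); it is a plain gradient-ascent maximiser written for illustration, applied to simulated data with a scalar regressor. The function names, the "true" coefficients, the sample size and the step size are all assumptions made here.

```python
import math
import random

def probs(params, x):
    """Selection probabilities [P_0, ..., P_L] for a scalar x.

    params[i-1] = (alpha_i0, alpha_i1) for category i = 1..L; category 0 is the base.
    """
    scores = [a0 + a1 * x for a0, a1 in params]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    return [1.0 / denom] + [math.exp(s) / denom for s in scores]

def log_likelihood(params, data):
    """Sum of log P(y_t | x_t) over observations (x_t, y_t)."""
    return sum(math.log(probs(params, x)[y]) for x, y in data)

def fit_mnl(data, n_categories, steps=500, rate=0.5):
    """Maximise the log-likelihood by gradient ascent (illustrative only)."""
    params = [[0.0, 0.0] for _ in range(n_categories - 1)]
    n = len(data)
    for _ in range(steps):
        grad = [[0.0, 0.0] for _ in params]
        for x, y in data:
            p = probs(params, x)
            for i in range(1, n_categories):
                resid = (1.0 if y == i else 0.0) - p[i]  # score contribution
                grad[i - 1][0] += resid
                grad[i - 1][1] += resid * x
        for i in range(len(params)):
            params[i][0] += rate * grad[i][0] / n
            params[i][1] += rate * grad[i][1] / n
    return params

# Simulated data under hypothetical "true" coefficients (not from the paper)
rng = random.Random(0)
true_params = [(0.2, 1.5), (-0.3, -1.5)]
data = []
for _ in range(400):
    x = rng.gauss(0.0, 1.0)
    p = probs(true_params, x)
    u, y, cum = rng.random(), 0, p[0]
    while u > cum and y < len(p) - 1:  # inverse-CDF draw of the category
        y += 1
        cum += p[y]
    data.append((x, y))

estimates = fit_mnl(data, n_categories=3)
```

In applied work a Newton-type optimiser from a statistical package would normally replace the hand-written loop; the sketch only makes explicit that the score for category i is the sum of (indicator of y = i minus the fitted probability) times the regressors.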
3. THE OMITTED VARIABLE BIAS IN THE MULTINOMIAL LOGIT MODEL
The maximum likelihood estimates of the $\alpha$ coefficients from
(2) may be biased if relevant variables have been left out of the
specification. Such omitted variables can be measured or unmeasured,
i.e. latent. In this paper, however, we focus on the latent variable
case.
Reconsider the MNL model described in (2) with the simplifying
assumption that $X$ consists only of a scalar, $x$. Then the $i$th equation
from (2) is given as below:
$$\ln \frac{P(y = i \mid x)}{P(y = 0 \mid x)} = \alpha_{i0} + \alpha_{i1} x, \qquad i = 1, 2, \ldots, L \qquad (3)$$
Now consider the possibility of (3) being misspecified because it
excludes a variable z which represents family background that may be
defined as 'everything that siblings born and raised in a given
family share together'. The variable z is often treated as a latent
variable since direct measures of it are not readily available.
Incidentally, other studies of the determinants of earnings and other
measures of the socioeconomic achievement of individuals have shown
latent variables similar to our z to be important influences on the
regressand [for instance, see Taubman (1977) and Behrman and Wolfe
(1984)]. In any event, exclusion of z would entail that (3) would
represent a misspecified model while the true MNL model would be given
as follows:
$$\ln \frac{P^{*}(y = i \mid x, z)}{P^{*}(y = 0 \mid x, z)} = \alpha_{i0} + \alpha_{i1} x + \beta_i z, \qquad i = 1, 2, \ldots, L \qquad (4)$$
or in an equivalent form:
$$P^{*}(y = i \mid x, z) = \frac{\exp(\alpha_{i0} + \alpha_{i1} x + \beta_i z)}{1 + \sum_{k=1}^{L} \exp(\alpha_{k0} + \alpha_{k1} x + \beta_k z)} \qquad (4.1)$$
and
$$P^{*}(y = 0 \mid x, z) = \frac{1}{1 + \sum_{k=1}^{L} \exp(\alpha_{k0} + \alpha_{k1} x + \beta_k z)} \qquad (4.2)$$
where $P^{*}$ represents the correctly specified logistic probability function. Also, note that since $z$ is latent, we can define
its (arbitrary) units such that its coefficient is unity. We also assume
that $z$ is continuous.
If the misspecified MNL model (3) is estimated rather than the true
model (4), then $\hat{\alpha}_i$, the maximum likelihood estimates of
$\alpha_i$, may be biased $(i = 1, \ldots, L)$. In particular, we will
concentrate on $\hat{\alpha}_{i1}$, i.e. the estimated 'slope'
coefficient. In order to investigate the bias question more closely, let
us consider the following definitions:
(i) Definition: "Unconditional independence of z and x."
If $E(zx) = 0$, $z$ and $x$ are called unconditionally independent.
Unconditional independence implies $r_1 = 0$, where $r_1$ is the
coefficient estimate of $x$ in the (auxiliary) regression of $z$ on $x$, i.e.,
$z = r_0 + r_1 x$.
(ii) Definition: "Conditional independence of z and x."
If $E(zx \mid y) = 0$, $z$ and $x$ are called conditionally independent.
Conditional independence implies $\delta_1 = 0$, where
$\delta_1$ is the coefficient estimate of $x$ in the auxiliary
regression of $z$ on $x$ and $y$, i.e., $z = \delta_0 + \delta_1 x + \delta_2 y$.
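The difference between the two auxiliary regressions can be sketched numerically. Since z is latent in practice, such regressions can only be run on simulated or constructed data; the toy values below are hypothetical and are built so that z depends on both x and y exactly. The helper `ols` is an assumption of this sketch (ordinary least squares via the normal equations), not part of the paper.

```python
def ols(columns, target):
    """OLS coefficients (intercept first) via normal equations and Gaussian elimination."""
    n = len(target)
    rows = [[1.0] + [col[t] for col in columns] for t in range(n)]
    k = len(rows[0])
    # Normal equations A b = c, with A = X'X and c = X'z
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * target[t] for t, r in enumerate(rows)) for i in range(k)]
    # Forward elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for j in range(col, k):
                A[r][j] -= f * A[col][j]
            c[r] -= f * c[col]
    # Back substitution
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b

# Hypothetical toy data: schooling x, occupational rank y, constructed z
x = [8.0, 10.0, 12.0, 14.0, 16.0, 12.0]
y = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
z = [1.0 + 0.5 * xi + 0.25 * yi for xi, yi in zip(x, y)]

r = ols([x], z)         # definition (i):  z = r_0 + r_1 x
delta = ols([x, y], z)  # definition (ii): z = delta_0 + delta_1 x + delta_2 y
```

In this constructed example the coefficient on x differs between the two regressions because y is itself correlated with x, which is why the unconditional and conditional notions of independence need not coincide.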
Proposition 1: Sufficient Condition for Unbiasedness
Conditional independence of $z$ and $x$, as defined in (ii) above, is
a sufficient condition for $\hat{\alpha}_{i1}$ to be unbiased.
Proposition 2: Necessary and Sufficient Condition for Unbiasedness
When the omitted relevant variable, $z$, conditional on $y$ and $x$, is
normally distributed, the sufficient condition, i.e., conditional
independence of $z$ and $x$, is also a necessary one for $\hat{\alpha}_{i1}$ to be
unbiased.
Proposition 3: Direction of the Bias in $\hat{\alpha}_{i1}$
The asymptotic bias in $\hat{\alpha}_{i1}$ is given by:
$$\mathrm{plim}\,(\hat{\alpha}_{i1} - \alpha_{i1}) = \delta_1 \beta_i$$
where $\beta_i$ is the coefficient (in the $i$th equation) of
the omitted (latent) variable $z$ and $\delta_1$ is the association
of $z$ and $x$, conditional on $y$ (see definition (ii) above).
The direction of the bias in $\hat{\alpha}_{i1}$ can be determined if we
make the assumption that, conditional on $y$ and $x$, $z$ is normally
distributed. Then, if the latent explanatory variable $z$ is omitted from
the true MNL model as given in (4), the maximum likelihood estimate,
$\hat{\alpha}_{i1}$, of the coefficient of the included explanatory variable $x$ will be:
(a) Unbiased if and only if either $\beta_i = 0$ or,
conditional on $y$, $z$ is independent of $x$;
(b) Biased upward if either $\beta_i > 0$ and $\delta_1 > 0$, or
$\beta_i < 0$ and $\delta_1 < 0$; and
(c) Biased downward if either $\beta_i > 0$ and $\delta_1 < 0$, or
$\beta_i < 0$ and $\delta_1 > 0$.
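The three cases of Proposition 3 amount to reading off the sign of the product of the two coefficients. A minimal sketch, assuming the normality condition stated above holds and using a hypothetical helper name `bias_direction`:

```python
def bias_direction(beta_i, delta_1):
    """Classify the sign of the asymptotic bias delta_1 * beta_i (Proposition 3)."""
    bias = delta_1 * beta_i
    if bias > 0:
        return "upward"    # case (b): both coefficients share the same sign
    if bias < 0:
        return "downward"  # case (c): the coefficients have opposite signs
    return "none"          # case (a): beta_i = 0 or delta_1 = 0

# Hypothetical coefficient values, for illustration only
example = bias_direction(beta_i=1.0, delta_1=0.3)
```

Both arguments are population quantities that are not observed when z is latent, so the function is useful only for reasoning about scenarios, not for estimation.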
4. 'WITHIN-FAMILY DEVIATION FORM' ESTIMATES OF THE MNL
MODEL: SIBLING DATA TO THE RESCUE
As noted above, the maximum likelihood estimates, $\hat{\alpha}_{i1}$,
will be biased if the (misspecified) MNL Model (3) is estimated. (5)
This bias arises since a relevant variable, $z$, is omitted from the
specification and, conditional on $y$, it is associated with $x$.
However, it is still possible to get unbiased estimates of $\alpha_{i1}$ if we
estimate the following 'within-family deviation' version of
the misspecified model given in (3):
$$\ln \frac{P^{w}_{i}}{P^{w}_{m}} = \alpha_{i0,m} + \alpha_{i1,m} \Delta x, \qquad i, m = 0, 1, \ldots, L; \; i > m \qquad (5)$$
where the superscript $w$ refers to 'within-family' (as
against the individual level variables described earlier) and the
subscript $m$ refers to the values of the relevant variables for the
'numeraire' or 'reference' sibling, defined here as
the one whose occupation has the lowest rank amongst his/her siblings or
'within' that particular family. Then, for each individual,
$\Delta x = (x - x_m)$ represents the deviation of his/her $x$ value
from the corresponding value, $x_m$, where the latter is the value
of $x$ for the 'numeraire' sibling. Note that whereas the
'numeraire' occupational category was fixed (and set at 0) for
the individual level version of the MNL Model (3), in the above
'deviation-form' version (5), the numeraire category,
$P^{w}_{m}$, could vary across families.
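The construction of the deviation-form data can be sketched as follows. The record layout, the function name `within_family_deviations` and the sample values are hypothetical. Two further assumptions are made here that the text does not settle: ties for the lowest rank are broken by taking the first sibling listed, and siblings in the same occupational category as the numeraire sibling are dropped, since (5) is defined only for i > m.

```python
from collections import defaultdict

def within_family_deviations(records):
    """Build (i, m, dx) observations for the deviation-form model (5).

    records: list of dicts with keys 'family', 'x' (e.g. schooling) and
    'y' (occupational rank, 0 = lowest).
    """
    families = defaultdict(list)
    for rec in records:
        families[rec["family"]].append(rec)

    observations = []
    for family, siblings in families.items():
        # Numeraire sibling: lowest occupational rank within the family
        ref = min(siblings, key=lambda s: s["y"])
        for sib in siblings:
            if sib["y"] > ref["y"]:  # keep only i > m, as in (5)
                observations.append({
                    "family": family,
                    "i": sib["y"],
                    "m": ref["y"],
                    "dx": sib["x"] - ref["x"],
                })
    return observations

# Hypothetical sibling data
records = [
    {"family": "A", "x": 10.0, "y": 0},
    {"family": "A", "x": 14.0, "y": 2},
    {"family": "B", "x": 12.0, "y": 1},
    {"family": "B", "x": 16.0, "y": 3},
    {"family": "B", "x": 13.0, "y": 1},
]
deviations = within_family_deviations(records)
```

Note that the numeraire category m differs between the two hypothetical families (0 in family A, 1 in family B), which is the feature that distinguishes (5) from the individual level model (3).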
5. COMMENTS/CONCLUDING REMARKS
(a) Comments on the 'Deviation' Form vs. Individual Level Estimation Method
The following two comments are pertinent to the above discussion of
the two estimation methods, i.e. Deviation Form and Individual Level.
Comment 1: The most significant point about the MNL Model (5) is
that the maximum likelihood estimates of $\alpha_{i1}$ (i.e.,
$\hat{\alpha}_{i1,m}$) would now be (asymptotically) unbiased since $z$ is
assumed to be identical across all siblings in a given family, which
implies $\Delta z = (z - z_m) = 0$. Thus, there would be no omitted
variable in (5) and hence no omitted variable bias to contaminate $\hat{\alpha}_{i1}$.
The above point can be further elaborated by considering the
following three models together which have, in fact, been already
introduced separately:
$$\ln \frac{P^{*}_{i}}{P^{*}_{0}} = \alpha_{i0,0} + \alpha_{i1,0}\, x + \alpha_{i2,0}\, z, \qquad i = 1, \ldots, L \qquad (6)$$
$$\ln \frac{P_{i}}{P_{0}} = \alpha_{i0,0} + \alpha_{i1,0}\, x, \qquad i = 1, \ldots, L \qquad (7)$$
$$\ln \frac{P^{w}_{i}}{P^{w}_{m}} = \alpha_{i0,m} + \alpha_{i1,m}\, \Delta x, \qquad i, m = 0, 1, \ldots, L; \; i > m \qquad (8)$$
Note that (6) is nothing but the correctly specified MNL model
already introduced as (4), while (7) is the (misspecified) model based on
individual level data, introduced as (3), and finally, (8) is the
deviation form model introduced as (5). Regarding
the subscript notation for the parameters in (6)-(8), note that for
$\alpha_{is,t}$, $i$ = occupational category chosen, $s$ = sequence
number of the regressor in the $i$th equation and $t$ = the
'numeraire' occupational category used in the $i$th equation.
Further, note the following about the models (6)-(8).
$P^{*}_{i}$ = True probability of an individual choosing the
occupational category $i$, since it is estimated from the true Model (6).
$P_{i}$ = (Incorrect) probability of an individual choosing the
occupational category $i$ as estimated from the Model (7) above.
$P^{w}_{i}$ = Probability of an individual choosing the
occupational category $i$ when the 'deviation-form' version (8)
is estimated. Note that under the assumption $\Delta z = 0$,
$P^{w}_{i} = P^{*}_{i}$.
Let us now compare individual level estimates from (7) to the
'deviation-form' equation set (8) when $m = 0$ and $i = 1, \ldots, L$.
These latter estimates would be unbiased, and comparing them to the corresponding ones
obtained from the individual level estimates using (7) would give us a
(point) estimate of the extent of the bias due to the omission of $z$.
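The comparison just described is a simple difference of slope estimates, category by category. A minimal sketch, in which the function name `estimated_bias` and all numerical values are hypothetical placeholders rather than results from any data:

```python
def estimated_bias(individual_level, within_family):
    """Point estimate of the omitted-variable bias for each category i = 1..L.

    individual_level: slope estimates from the individual level model (7)
    within_family:    slope estimates from the deviation-form model (8), m = 0
    """
    return [a_ind - a_w for a_ind, a_w in zip(individual_level, within_family)]

# Hypothetical slope estimates for L = 2 categories
slopes_individual = [0.9, 1.4]
slopes_within = [0.6, 1.0]
bias = estimated_bias(slopes_individual, slopes_within)
```

A positive entry would indicate that the individual level estimate overstates the effect of x for that category, under the maintained assumption that z is purely familial.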
Thus, given our assumptions about $z$ and its relationship to $y$ and
$x$, the methodology of 'within-family deviation form' maximum
likelihood estimation of the MNL model of occupational choice gives us
asymptotically unbiased estimates of $\alpha_{i1}$, where $i = 1, \ldots, L$.
Comment 2: Existing Related Literature: The above methodology of
employing 'within-family deviations' to flush out the latent
omitted variable has also been used in some of the earlier studies.
However, such studies mostly dealt with models where the regressand, y,
was a continuous variable such as (log) earnings or years of schooling
[for instance, see Taubman (1977), Behrman and Wolfe (1984) and Shabbir
(1989)]. There are relatively few studies where y is discrete.
One notable exception is the discussion by Chamberlain (1984) of a
binary logit model due to Rasch (1960) and a multinomial logit model
based on McFadden (1974). The 'within-family' deviation
methodology in the above noted studies is based on taking differences
between pairs of siblings.
(b) Concluding Remarks/Empirical Research in Progress
We have outlined the possibility of a bias in the coefficient
estimate for the included explanatory variable(s) if a latent variable
that is equally shared by siblings in a family is omitted from the
specification of the MNL model of occupational choice. (6) However, under
certain assumptions regarding the latent variable z, in particular, that
it is purely familial, (7) we can estimate a 'within-family
deviation' version of the MNL model where the maximum likelihood
estimates would not be biased. The next step would be to conduct an
empirical estimation of this model using data on siblings.
Comments on "Multinomial Logit Model of Occupational Choice: A
Latent Variable Approach"
We all know that in a standard linear regression model, the
omission of relevant explanatory variables introduces bias in regression
coefficient estimates. Furthermore, orthogonality conditions are readily
available (between omitted and included regressors) under which biases
disappear. In general, the directions of bias can also be related to the
directions of correlations between omitted and included regressors. If
there is no correlation (the orthogonality condition) between these two
groups of variables, the bias is equal to zero.
Dr Shabbir's paper explores these issues in the context of a
multinomial logit model--an interesting model with vast practical
applications and containing substantial complications, vis-a-vis the
standard linear regression model, which prevent a straightforward
translation of the results I have enumerated in the preceding paragraph.
Thus, Dr Shabbir's analytical results regarding bias and directions
of bias in estimated "slope" coefficients on a multinomial
logit model when relevant variables are inadvertently excluded indeed
represent a significant contribution to the literature on econometric
methodology.
There is also practical importance in Dr Shabbir's results.
The use of sibling data, which he proposes in the paper, provides an
operational procedure for modifying the estimator to erase bias. In a
related study reported in Mariano, Reyes and Lim (1989) dealing with
measurement errors in qualitative response models, corrections for bias
in estimated slope coefficients can lead to substantial changes in the
estimated relative effects of explanatory variables. For example, this
is one major observation we arrived at in our analysis of farmers'
decisions in the Philippines regarding the adoption of modern technology
in agriculture. Differential effects of price subsidies, extension work,
type of farm ownership and other factors change in character once
corrections are introduced for biases due to measurement errors. Modern
technology in our example concerns high yielding varieties of rice,
fertilizers and pesticides.
Note that, through appropriate transformations to what we would
call canonical form, we can show that the problem of omitted variables
can be treated equivalently in terms of measurement errors. Thus, the
practical lessons coming out of our study of the latter carry over, with
appropriate transformations, to the study of the former. Incidentally,
this same comment applies to the analytical phase of Dr Shabbir's
study. The analytical results reported in Tayyeb's paper can also
be derived through the interpretation of the problem in terms of
measurement errors. Related analytical results are also
reported in Yatchew and Griliches as well as in Kiefer.
Moving on to other technical issues regarding the paper, let me
first point out that there is some ambiguity in the literature in the
use of the phrase "multinomial logit model". The model that
Tayyeb labels in the paper as multinomial logit is indeed the
appropriate one--it allows for changes in factor effects across
alternatives or choices. That is, his model described in his Equation
(1.1) has [alpha]-coefficients which are subscripted by "i".
In models of this type, we can have x-variables which are constant,
regardless of choices made. Examples of these are characteristics of
individuals such as educational background, income, gender, and so on.
On the other hand, there are other variables which vary across choices.
These are attributes related to the choices themselves, such as, for
example, compensation for the occupations covered in Tayyeb's
study. If the [alpha]-parameters in the model do not change across
choices, coefficients of variables in the first category will not be
identifiable.
The next technical issue that I would like to raise concerns model
estimation. There are various procedures that can be used. One is the
approach discussed in this paper: least squares based on the linearity of
the log odds ratio vis-a-vis the explanatory variables. A second
procedure is a weighted least squares variation of the first. The log
odds ratio itself is not observable and before relationships, like (3)
in the paper, can be estimated, estimates of the log odds ratio must be
calculated from the data (basically subsample proportions). These
calculated values are then used as "observations" for the
dependent variable in (3). The fact that these are estimates introduces
heteroskedasticity in the version of (3) for the estimated log-odds
ratio.
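The second procedure can be sketched for one cell of grouped data. The function name `empirical_log_odds` and the counts are hypothetical. The weight used is the inverse of the usual large-sample (delta method) approximation to the variance of an empirical log-odds, 1/n_i + 1/n_0; this sketch assumes every count is positive and ignores the correlation between the log-odds of different categories within the same cell, which a full generalised least squares treatment would take into account.

```python
import math

def empirical_log_odds(counts):
    """Empirical log-odds and weights for one cell of grouped data.

    counts[i] = number of individuals in the cell choosing category i,
    with category 0 as the numeraire. Returns, for i = 1..L, the pair
    (ln(n_i / n_0), 1 / (1/n_i + 1/n_0)).
    """
    n0 = counts[0]
    result = []
    for n_i in counts[1:]:
        log_odds = math.log(n_i / n0)           # "observation" for the dependent variable
        weight = 1.0 / (1.0 / n_i + 1.0 / n0)   # inverse of the approximate variance
        result.append((log_odds, weight))
    return result

# Hypothetical cell: 40 individuals in category 0, 20 in category 1, 10 in category 2
cell = empirical_log_odds([40, 20, 10])
```

Cells with few observations in either category receive small weights, which is how the weighted variant addresses the heteroskedasticity noted above.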
A third method of estimation is the maximisation of the
"pseudo" likelihood function based on the model as specified.
I call the likelihood a "pseudo" construct because this is
not the correct likelihood that corresponds to the appropriate data
generating process if the model has been misspecified (because of
omission of variables).
All three procedures are distinct from each other. And they would
be affected by omission of variables in different ways.
It would be interesting for Tayyeb to work out the analytical bias
expressions for these three estimators and to show, in his empirical
applications, how these three estimates differ from each other and from
the bias-corrected least-squares procedure that he proposes.
Ladies and gentlemen, it is my pleasure to have served as
discussant of Tayyeb's paper. Let me conclude my comments by
stating once more that Tayyeb's paper presents new analytical
results which are important not only on their own technical merits but
also in their practical implications in the econometric study of
occupational choice and other processes dealing with qualitative and
other limited dependent variables.
Roberto S. Mariano
University of Pennsylvania, Philadelphia, USA.
REFERENCE
Mariano, Roberto S., C. R. Reyes and P. C. Lim (1989) Measurement
Errors in Limited-dependent Variable Models: Theory and Applications to
Adoption of Technology in Philippine Agriculture. Philadelphia:
Department of Economics, University of Pennsylvania. (Mimeographed.)
Author's Note: An earlier version of this paper has benefited
from very valuable comments by Paul Taubman, University of Pennsylvania
and Fatah Ullah Bagheri, University of North Dakota.
REFERENCES
Behrman, Jere R., and Barbara L. Wolfe (1984) The Socio-economic
Impact of Schooling in a Developing Country. Review of Economics and
Statistics 65:2.
Behrman, Jere R., Z. Hrubec, Paul Taubman and T. Wales (1980)
Socioeconomic Success. New York: North Holland.
Chamberlain, Gary (1984) Panel Data. In Z. Griliches and M. D.
Intriligator (eds) Handbook of Econometrics. Volume 2. New York: North
Holland.
Crawford, D. L., and R. Pollak (1988) Order and Inference in
Qualitative Response Models. Discussion Paper, NBER.
Griliches, Zvi (1979) Sibling Models and Data in Economics:
Beginning of a Survey. Journal of Political Economy 87:5.
Lee, L. (1980) Specification Error in Multinomial Logit Models:
Analysis of the Omitted Variable Bias. Minneapolis, Minnesota: Center
for Economic Research, Department of Economics, University of Minnesota.
(Discussion Paper No. 80-131).
McFadden, D. (1974) Conditional Logit Analysis of Qualitative
Choice Behavior. In P. Zarembka (ed) Frontiers in Econometrics. New
York: Academic Press.
Nerlove, M., and J. Press (1973) Multivariate and Log Linear
Probability Models in Econometrics. Center for Statistics and
Probability, Northwestern University. (Discussion Paper No. 1.)
Pindyck, R., and D. Rubinfeld (1981) Econometric Models and
Economic Forecasts. New York: McGraw Hill.
Rasch, G. (1960) Probabilistic Models for Some Intelligence and
Attainment Tests. Copenhagen: Denmark's Paedagogiske Institute.
Schmidt, P., and R. Strauss (1975) The Prediction of Occupation
Using Multiple Logit Models. International Economic Review 16: 471-486.
Shabbir, Tayyeb (1987) Across and Intrahousehold Effects in a Model
of Earnings and Schooling with Controls for Latent Factors. Unpublished
Ph.D. Dissertation. Philadelphia: University of Pennsylvania.
Shabbir, Tayyeb (1989) Latent Structure of Earnings Models. The
Pakistan Development Review 28:4.
Taubman, Paul (ed) (1977) Kinometrics: Determinants of
Socioeconomic Success Within and Between Families. New York:
North-Holland Publishing Company.
(1) This extension from continuous (earnings functions etc.) to
discrete dependent variables (MNL etc.) is not merely a question of
extending the scope of a particular methodology since, as is shown in
this paper, there are significant differences in the analytical results
in the two cases; e.g. in the case of a discrete dependent variable, the
direction of the bias cannot be determined without additional
assumptions (of normality) regarding the distribution of the latent
variable, z. Also, whereas the assumption of 'unconditional
independence' was enough in the continuous dependent variable case
to obtain unbiased estimates, here we require 'conditional
independence'.
(2) Some of the analysis presented in Lee (1980) is based on the
results derived in Nerlove and Press (1973).
(3) In this regard, the appendix of Schmidt and Strauss (1975)
gives further details.
(4) For an interpretation of these coefficient estimates, see
Pindyck and Rubinfeld (1981). However, there may be an inference problem
in such models when, as is the case here, there are more than two
occupational categories to choose from. For a discussion of the above
and related problems as well as some suggested solutions, see Crawford
and Pollak (1988).
(5) In our particular model where $z$ is a latent variable which is
constrained to have unity coefficient, the nature and the direction of
bias in $\hat{\alpha}_{i1}$ would then depend only on $\delta_1$, which
is the coefficient of $x$ in the auxiliary regression of $z$ on $x$ and $y$
(i.e., $y$ is also being controlled for).
Now, it is likely that $\delta_1 > 0$. With reference to
the specific MNL model given as Equation (4) in the text, let us
interpret $x$ as ED or years of schooling and interpret $z$ as a measure of
ability such as IQ or a dimension of shared family environment such as
parental schooling levels. Since the regressand $y$ represents different
occupational categories, $\delta_1$ is simply a measure of the
within-occupational-category (linear) association between ED and IQ (or
the latent variable $z$, to be more exact).
Thus, if we agree that $\delta_1 > 0$, and since
$\beta_i = 1$ by virtue of our choice of the units of measurement for $z$, we will expect $\hat{\alpha}_{i1}$, the coefficient estimate of $x$, to be
upward biased.
(6) See the discussion in footnote 5 given earlier.
(7) In fact, the structure of z may be more complicated; it may
also contain individual specific components which would require more
complicated model specification and more complex estimation techniques
than those suggested in this paper. In the context of the related
literature on earnings functions, some of these issues have been
discussed in Griliches (1979) or Behrman et al. (1980). However, I feel
that such relatively more complicated models that are able to ask finer
questions often can do so only after making correspondingly heuristic assumptions.
Tayyeb Shabbir is Senior Research Economist at the Pakistan
Institute of Development Economics, Islamabad.