Shifting the interpretive framework of binary coded dummy variables.
Gober, R. Wayne
ABSTRACT
The traditional binary coding scheme is the starting point, and
often the ending point, for the coding and interpretation of dummy
variable coefficients for qualitative variables in regression analysis.
The binary coding scheme produces an interpretive framework for the
coefficients that measure the net effect of being in a given category as
compared to an omitted category. This may result in coefficients that
are as arbitrary as the selection of the omitted categories. Two methods
for shifting the binary coded coefficients are presented to assist in
establishing a more meaningful interpretive framework. The shifted
frameworks allow for interpretation of the coefficients about an
"average" of the dependent variable. One method allows for
each coefficient to be interpreted as a comparison to the unweighted
average of the dependent variable when averaged over all subcategory
means. The second method allows each coefficient to be interpreted
relative to the overall mean of the dependent variable. Since the
shifted framework coefficients are compared to an "average,"
the coefficients are insensitive to the choice of omitted categories. The effort
to shift the interpretive framework is minimal, and the shift can be made
without the use of a computer program. The shifted frameworks can also be
obtained by incorporating alternative coding schemes in a computer
program.
INTRODUCTION
The use of dummy variables to represent qualitative variables in
regression analysis has become quite prevalent in introductory business
and economic statistics courses (Daniel & Terrell, 1992; Anderson,
Sweeney & Williams, 2002). The specific information on coding the
dummy variables is typically presented using a binary coding scheme
(0,1). The binary scheme assigns members of a particular category for
the qualitative variable a code of 1 and members not in that particular
category receive a code of 0. Usually, the zero coded category is
selected to serve as the reference or comparison point for the
interpretation of the regression coefficients. These coefficients will
express the difference between a selected category and the reference
category for the qualitative variable. The choice of a reference
category is arbitrary and may present problems of interpretation. When
several binary coded qualitative variables are used in a regression
model, a reference category must be selected for each qualitative variable as
its comparison point. The resulting regression coefficients may yield
unclear and sometimes awkward interpretations as to which categories
have been designated for comparison.
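As a minimal sketch of the binary scheme described above, the following Python fragment builds 0/1 indicator columns and drops one of them to act as the reference category. The category labels, the sample values, and the helper name `binary_code` are hypothetical and serve only as an illustration.

```python
def binary_code(values, categories):
    """Return one 0/1 indicator column per category, keyed by category label."""
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical observations of a three-category qualitative variable.
observed = ["A", "C", "B", "C"]
dummies = binary_code(observed, ["A", "B", "C"])

# Omitting one column (here "C") makes that category the zero-coded reference:
# its members score 0 on every remaining dummy variable.
reference = "C"
design = {c: col for c, col in dummies.items() if c != reference}
```

Choosing a different value for `reference` changes every remaining coefficient in a fitted model, which is the arbitrariness the shifting processes are meant to remove.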
The purpose of this paper is to illustrate processes for shifting
the interpretive framework of binary coded regression coefficients. A
major reason for the shifting processes is to provide coefficients that
lend themselves to more meaningful interpretations. Starting with
binary-coded coefficients, usually generated with the assistance of a
statistical computer package, the shifting process can be accomplished
with or without the assistance of a computer program. The shift in the
interpretative framework is such that the contrast of a regression
coefficient for a designated category is made to an "average"
value for the dependent variable and not to a specified zero coded
category. While the shifting processes will yield numerically different
coefficients, the overall fit and significance of the regression model
remain unchanged. A main advantage of shifting the interpretative
framework of binary coded dummy variables to an "average" is
that the coefficients are no longer sensitive to which class is treated
as the omitted class.
FRAMEWORK SHIFTING WITHOUT A COMPUTER PACKAGE
The process of shifting the interpretive framework of binary coded
coefficients can be carried out without a computer program by adding
a constant, k, to the coefficients within each set of coefficients for a
qualitative variable and subtracting k from the regression equation
constant or intercept. The general relationship for determining k is
$$\sum_i w_i\, b_i^{*} = \sum_i w_i\,(b_i + k) = 0 \tag{1}$$
where $b_i$ represents the binary-coded regression coefficients,
$b_i^{*}$ represents the shifted regression coefficients, and $w_i$
(written simply as w below) represents a weight for the importance of each
coefficient within a set of regression coefficients for a qualitative
variable. The resulting value of k yields the condition that the new set of
coefficients, $b_i^{*}$, will have a weighted average of zero.
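Solving Equation (1) for k gives a closed form that covers both shifting processes described in this section. The following is a sketch of that algebra rather than part of the original derivation. It assumes the sum runs over all m categories of a single qualitative variable, including the omitted category whose binary-coded coefficient is 0; the symbols m and $p_i$ (the sample proportion of cases in category i) are notation introduced here.

```latex
\begin{align*}
\sum_{i=1}^{m} w_i\,(b_i + k) = 0
  \quad &\Longrightarrow \quad
  k = -\frac{\sum_{i=1}^{m} w_i\, b_i}{\sum_{i=1}^{m} w_i}, \\
w_i = 1
  \quad &\Longrightarrow \quad
  k = -\frac{1}{m}\sum_{i=1}^{m} b_i, \\
w_i = p_i,\ \sum_{i=1}^{m} p_i = 1
  \quad &\Longrightarrow \quad
  k = -\sum_{i=1}^{m} p_i\, b_i .
\end{align*}
```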
Suits (1983) suggested a shifting process, Shifting Process I,
which expresses the category regression coefficients as deviations from
an "average," where the "average" is the unweighted
mean of the dependent variable across all categories of a categorical
variable. In calculating the unweighted mean of means, each category
receives an equal weight of 1, regardless of the number of cases in that
category. Thus, when binary coded coefficients are shifted using
Shifting Process I, the value of w in Equation (1) is set at 1. The
unweighted mean of all group means is reported as the regression
equation constant, $b_0$, and is the reference point from which all
category differences can be calculated.
The unweighted mean has the consequence that category means
based on only a few cases are treated the same as category means based on
much larger category sizes. When the category sizes are
unequal, the unweighted mean of means and the overall mean are
different measures. Thus, the overall mean of the dependent variable
may be the more desirable "average" as the comparison measure
for the regression coefficients.
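The difference between the two "averages" is easy to see with a small numerical sketch. The category means and sizes below are hypothetical values chosen for illustration; they are not taken from the maintenance service data.

```python
# Hypothetical category means of the dependent variable and category sizes.
category_means = [2.0, 4.0]
category_sizes = [30, 10]

# Unweighted mean of means: every category counts equally.
unweighted_mean = sum(category_means) / len(category_means)

# Overall mean: every case counts equally, so large categories dominate.
overall_mean = (
    sum(m * n for m, n in zip(category_means, category_sizes))
    / sum(category_sizes)
)

print(unweighted_mean, overall_mean)  # 3.0 2.5
```

The two measures coincide only when all category sizes are equal.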
Sweeney and Ulveling (1972) suggested a process, referred to as
Shifting Process II, for shifting the interpretive framework of the
coefficients to an "average," where the "average" is
the overall mean of the dependent variable. The shifting process
can be accomplished by computing the constant k, for Equation (1), using
the sample proportion, p, of cases for the categories within each
qualitative variable. For Shifting Process II, the value of w in
Equation (1) is set at p. Each coefficient is compared to the regression
equation constant, $b_0$, which is the overall mean of the dependent
variable.
Framework Shifting Illustration
The interpretive framework shifting processes will be illustrated
by data collected for a maintenance service company. A request was made
for an analysis of the maintenance service repair time, in hours, based
on the type of repair and the person performing the repair. For a sample
of 40 repair times, a summary of the qualitative variables, repair type
and repair person, is presented in Table 1.
As mentioned previously, a binary coded dummy variable regression
equation is usually the framework selected for the interpretation of the
coefficients for introductory statistics courses. The binary coded dummy
variable regression equation is necessary for interpretative framework
shifting. The two qualitative variables, repair type and repair person,
can be completely represented by a set of binary coded dummy variables
that have a specific value when an observation is found in a given
category. The binary-coded dummy variables for repair type, D1 and D2,
and for repair person, D3, D4 and D5, are defined in Table 2.
When an observation is in the repair type electrical category, the
dummy set is defined as D1 = 1 and D2 = 0. For repair person, Jake, the
dummy set is defined as D3 = 1, D4 = 0 and D5 = 0. Other categories are
defined in a similar manner. Using the binary coded dummy variables, the
following general regression equation may be formed:
$$\hat{Y} = b_0 + b_1\,D1 + b_2\,D2 + b_3\,D3 + b_4\,D4 + b_5\,D5 \tag{3}$$
With the equation in this form, there are two more coefficients to
be estimated than there are independent normal equations. One of the
extra coefficients is associated with the repair type dummy variables,
and one with the repair person dummy variables. In general, each
qualitative variable represented by dummy variables gives rise to one
superfluous coefficient. The remedy typically used in the
introductory course in statistics is to constrain one coefficient from
each dummy set to a value of 0. For example, when $b_2 = 0$ and
$b_5 = 0$, Equation (3) reduces to
$$\hat{Y} = b_0 + b_1\,D1 + b_3\,D3 + b_4\,D4 \tag{4}$$
When D2 = 1 and D5 = 1, then D1 = D3 = D4 = 0. Since D2 and D5 are
not in Equation (4), the equation reduces to the following when an
observation is a member of both excluded categories:
$$\hat{Y} = b_0 \tag{5}$$
Equation (5) represents the regression estimate for a mechanical
repair type with Bob as the repair person. From Equation (4), the
regression coefficient, $b_1$, is the net amount by which the
intercept, $b_0$, must be adjusted to account for repair type electrical
instead of mechanical. A similar statement can be made of the remaining
dummy coefficients for repair person with respect to the omitted class,
repair person Bob. In general, this procedure produces coefficients for
each dummy variable that measure the net effect on the intercept of the
equation of membership in that class relative to the omitted class.
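The redundancy described above, one superfluous coefficient per qualitative variable, can be checked numerically. The sketch below uses NumPy and a hypothetical layout with one row for each repair type and repair person combination; it is an illustration of the rank argument, not the data used in the paper.

```python
from itertools import product

import numpy as np

repair_types = ["Electrical", "Mechanical"]   # D1, D2
repair_persons = ["Jake", "Dave", "Bob"]      # D3, D4, D5

# One hypothetical row per combination: intercept, D1, D2, D3, D4, D5.
rows = []
for rtype, person in product(repair_types, repair_persons):
    type_dummies = [float(rtype == c) for c in repair_types]
    person_dummies = [float(person == c) for c in repair_persons]
    rows.append([1.0] + type_dummies + person_dummies)
X_full = np.array(rows)

# Six columns, but D1 + D2 and D3 + D4 + D5 each reproduce the intercept column.
rank_full = np.linalg.matrix_rank(X_full)

# Dropping D2 and D5 (columns 2 and 5) leaves intercept, D1, D3 and D4.
X_reduced = X_full[:, [0, 1, 3, 4]]
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1] - rank_full)           # number of superfluous coefficients
print(rank_reduced == X_reduced.shape[1])    # reduced design has full column rank
```

Under these assumptions the full design is short by two in rank, matching the two coefficients that Equation (4) constrains to zero.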
For the maintenance service problem, the binary coded dummy
variable regression equation to estimate the repair time is usually
determined by means of a computer program. The equation is
$$\hat{Y} = 4.2645 + 0.5925\,D1 - 0.5762\,D3 - 1.4159\,D4 \tag{6a}$$
Including the two excluded dummy variables, D2 and D5, the equation
can be restated as
$$\hat{Y} = 4.2645 + 0.5925\,D1 + 0\,D2 - 0.5762\,D3 - 1.4159\,D4 + 0\,D5 \tag{6b}$$
Interpretation of the coefficients is somewhat more difficult when
two or more qualitative variables are included. The coefficient for D1,
$b_1$, is interpreted as the difference in repair time between the electrical
repair type and the mechanical repair type. The coefficient
for D3, $b_3$, is interpreted as the difference between repair
person Jake and repair person Bob. A similar statement can be
made for the other coefficient, $b_4$. To enhance the understanding
and use of the dummy variable coefficients in Equation (6b), the
interpretive framework for comparison of the coefficients can be shifted
to an "average" for the dependent variable.
Based on Suits' suggestion, referred to as Shifting Process I,
each coefficient, $b_i$, receives a weight of 1, i.e., w = 1 in Equation
(1). Since two sets of dummy variables are included in the maintenance
service problem, a constant must be computed for each set, $k_1$ and
$k_2$, and added to the coefficients of the respective set. The
sum of the constants, k, is subtracted from $b_0$. Referring to
Equation (6b), for repair type, the constant $k_1$ is computed
as -(0.5925 + 0) / 2 and for repair person, the constant $k_2$ is
computed as -(-0.5762 - 1.4159 + 0) / 3. The required constants are
$k_1$ = -0.2963 and $k_2$ = 0.6640. The sum of the constants,
k, is 0.3677. The transformed or shifted equation is obtained by adding
each constant, $k_1$ and $k_2$, to the respective set of
coefficients and subtracting the sum of the constants, k, from the
regression equation intercept or constant. The Suits shifted
equation for Equation (6b) is
$$\hat{Y} = 3.8968 + 0.2963\,D1 - 0.2963\,D2 + 0.0878\,D3 - 0.7519\,D4 + 0.6640\,D5 \tag{7}$$
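The arithmetic of Shifting Process I can be sketched in a few lines of Python. The helper name `shift_set` is hypothetical; the coefficients are those reported for the binary coded equation, with D3 and D4 taken as the Jake and Dave dummies and 0 entered for the omitted categories. Because the code carries full precision while the text rounds $k_1$ and $k_2$ first, results agree with Equation (7) only to about four decimal places.

```python
def shift_set(coefs, weights):
    """Return (k, shifted) such that sum(w * (b + k)) == 0 for one dummy set."""
    k = -sum(w * b for w, b in zip(weights, coefs)) / sum(weights)
    return k, [b + k for b in coefs]

b0 = 4.2645
repair_type = [0.5925, 0.0]                # D1 Electrical, D2 Mechanical (omitted)
repair_person = [-0.5762, -1.4159, 0.0]    # D3 Jake, D4 Dave, D5 Bob (omitted)

# Shifting Process I: every category in a set receives weight 1.
k1, type_shifted = shift_set(repair_type, [1, 1])
k2, person_shifted = shift_set(repair_person, [1, 1, 1])

# The sum of the set constants is subtracted from the intercept.
b0_shifted = b0 - (k1 + k2)

print(round(k1, 4), round(k2, 4))
print(round(b0_shifted, 4), [round(b, 4) for b in type_shifted + person_shifted])
```

Within each set the shifted coefficients sum to zero, which is the w = 1 case of Equation (1).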
The coefficients now indicate the extent to
which repair time in the respective repair type and
repair person categories varies from the unweighted average of repair
time, when averaged over all subcategory means. For the
repair type electrical coefficient, $b_1$ = 0.2963, an electrical
repair type adds 0.2963 hours to the unweighted average, 3.8968 hours of
repair time. Also, repair person Dave subtracts 0.7519 hours from the
unweighted average.
To shift the interpretive framework of the coefficients to an
"average" that is the overall mean of the dependent variable,
referred to as Shifting Process II, Sweeney and Ulveling suggested using
the sample proportions for categories of each qualitative variable as
weights in Equation (1). Using the proportions in Table 1 and the
coefficients in Equation (7), for repair type, $k_1$ is
computed as -(0.600 * 0.2963 + 0.400 * -0.2963) and, for repair person,
$k_2$ is computed
as -(0.300 * 0.0878 + 0.450 * -0.7519 + 0.250 * 0.6640). The required
constants are $k_1$ = -0.0593 and $k_2$ = +0.1460. The sum of
the constants, k, is +0.0867. As for Process I, k is subtracted from the
constant and each constant, $k_1$ and $k_2$, is added to the
coefficients of its respective dummy set in
Equation (7). The Sweeney and Ulveling shifted equation is
$$\hat{Y} = 3.8100 + 0.2370\,D1 - 0.3556\,D2 + 0.2339\,D3 - 0.6059\,D4 + 0.8100\,D5 \tag{8}$$
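The same calculation with proportions as weights can be sketched as follows. The starting values are the rounded coefficients of Equation (7) and the proportions of Table 1; variable names are illustrative only, and the results match Equation (8) to within rounding.

```python
# Sample proportions from Table 1.
p_type = [0.60, 0.40]            # Electrical, Mechanical
p_person = [0.30, 0.45, 0.25]    # Jake, Dave, Bob

# Rounded coefficients of the Suits shifted equation, Equation (7).
b0_eq7 = 3.8968
type_eq7 = [0.2963, -0.2963]             # D1, D2
person_eq7 = [0.0878, -0.7519, 0.6640]   # D3, D4, D5

# Shifting Process II: k is minus the proportion-weighted sum of each set.
k1 = -sum(p * b for p, b in zip(p_type, type_eq7))
k2 = -sum(p * b for p, b in zip(p_person, person_eq7))

b0_eq8 = b0_eq7 - (k1 + k2)
type_eq8 = [b + k1 for b in type_eq7]
person_eq8 = [b + k2 for b in person_eq7]

print(round(k1, 4), round(k2, 4), round(k1 + k2, 4))
```

Note that $k_1$ is negative here: the larger category (electrical) has the positive coefficient, so the weighted average of the repair type set is above zero before the shift.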
The interpretive framework for comparison of the dummy coefficients
now represents the net effect of being in the category associated with
the dummy variable as compared to the overall mean or grand mean of the
dependent variable, $\bar{Y} = b_0$. For example, a mechanical repair type
subtracts 0.3556 hours from the average repair time, 3.8100 hours, and
repair person Bob adds 0.8100 hours to the average repair time.
One further note is that the binary coded dummy
regression coefficients in Equation (6b) can be shifted directly to the
"average" that is the overall mean of the dependent variable,
Equation (8), by applying Shifting Process II to Equation (6b) rather than
to Equation (7).
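A short sketch can illustrate both the direct shift and the fact that shifting leaves the fitted values unchanged. It starts from the binary coded coefficients (with 0 for the omitted categories, and D3 and D4 taken as Jake and Dave), applies the Table 1 proportions, and compares predictions for all six category combinations. The dictionary layout and variable names are assumptions made for this illustration.

```python
from itertools import product

# Binary coded coefficients, omitted categories entered as 0.
b0 = 4.2645
type_b = {"Electrical": 0.5925, "Mechanical": 0.0}
person_b = {"Jake": -0.5762, "Dave": -1.4159, "Bob": 0.0}

# Sample proportions from Table 1.
p_type = {"Electrical": 0.60, "Mechanical": 0.40}
p_person = {"Jake": 0.30, "Dave": 0.45, "Bob": 0.25}

# Shifting Process II applied directly to the binary coded coefficients.
k1 = -sum(p_type[c] * type_b[c] for c in type_b)
k2 = -sum(p_person[c] * person_b[c] for c in person_b)
b0_star = b0 - (k1 + k2)
type_star = {c: b + k1 for c, b in type_b.items()}
person_star = {c: b + k2 for c, b in person_b.items()}

# Fitted repair time for each combination, before and after the shift.
max_gap = max(
    abs((b0 + type_b[t] + person_b[p])
        - (b0_star + type_star[t] + person_star[p]))
    for t, p in product(type_b, person_b)
)

print(round(b0_star, 4), max_gap < 1e-9)
```

Each observation belongs to exactly one category per set, so the constants added to the coefficients cancel against the amount removed from the intercept, and every prediction is preserved.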
FRAMEWORK SHIFTING WITH A COMPUTER PACKAGE
The interpretive frameworks of the dummy variable coefficients resulting
from Shifting Processes I and II may also be obtained by using coding
schemes that are alternatives to the binary coding scheme (Parker &
Wrighton, 1975). The alternative coding schemes require that one
category for each qualitative variable be excluded when calculating the
regression equation. Referring to the maintenance service problem,
repair type mechanical and repair person Bob are selected as the
excluded categories.
When using the binary coding scheme the reference category is
always coded zero. An alternative scheme, referred to as Alternative
Scheme I, is to uniformly code the reference category with the value of
-1. A value of 1 is assigned to categories in the same manner as the
binary coding scheme. When an observation is in the repair type
electrical category, D1 = 1, and when the repair type is mechanical, D1
= -1. For repair person, the selected dummy variables are D3 and D4.
These dummy variables are defined in Table 3.
For Alternative Coding Scheme I, the following computer regression
equation is generated:
$$\hat{Y} = 3.8968 + 0.2963\,D1 + 0.0878\,D3 - 0.7519\,D4 \tag{9}$$
Equation (9) does not contain the coefficients $b_2$ for D2
and $b_5$ for D5. These coefficients are easily determined as
follows:
$$b_2 = -b_1 = -0.2963 \quad \text{and} \quad b_5 = -(b_3 + b_4) = -(0.0878 - 0.7519) = 0.6641,$$
where 0.6641 agrees with the value 0.6640 obtained earlier to within rounding.
When these coefficients are included in Equation (9) the resulting
equation is the same as Equation (7). The coefficients are equivalent to
Shifting Process I, as suggested by Suits.
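The equivalence between Alternative Scheme I and Shifting Process I holds for any data set, which can be checked with a small simulation. The sketch below uses NumPy; the cell counts are hypothetical (only the margins match Table 1) and the repair times are simulated, so the coefficients will not reproduce those in the paper. The helper `fit` is an assumed name for an ordinary least squares fit with an intercept.

```python
import numpy as np

# Hypothetical cell counts; only the margins (24/16 and 12/18/10) follow Table 1.
cells = {("Electrical", "Jake"): 7, ("Electrical", "Dave"): 11, ("Electrical", "Bob"): 6,
         ("Mechanical", "Jake"): 5, ("Mechanical", "Dave"): 7, ("Mechanical", "Bob"): 4}
rtype = np.array([t for (t, p), n in cells.items() for _ in range(n)])
person = np.array([p for (t, p), n in cells.items() for _ in range(n)])

# Simulated repair times, not the paper's data.
y = np.random.default_rng(0).normal(4.0, 1.0, size=rtype.size)

def fit(columns):
    """Least squares coefficients for an intercept plus the given columns."""
    X = np.column_stack([np.ones(y.size)] + columns)
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Binary coding: Mechanical and Bob are the zero-coded reference categories.
d1 = (rtype == "Electrical").astype(float)
d3 = (person == "Jake").astype(float)
d4 = (person == "Dave").astype(float)
b0, b1, b3, b4 = fit([d1, d3, d4])

# Shifting Process I by hand (w = 1, omitted coefficients equal 0).
k1 = -(b1 + 0.0) / 2
k2 = -(b3 + b4 + 0.0) / 3
shifted = np.array([b0 - (k1 + k2), b1 + k1, b3 + k2, b4 + k2])

# Alternative Scheme I: the reference category is coded -1 on each dummy of its set.
e1 = np.where(rtype == "Mechanical", -1.0, d1)
e3 = np.where(person == "Bob", -1.0, d3)
e4 = np.where(person == "Bob", -1.0, d4)
effect = fit([e1, e3, e4])

print(np.allclose(shifted, effect))
```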
Another coding scheme, Alternative Scheme II, for a qualitative
variable is to code a selected category, j, as 1, and the excluded
category, e, as the negative of the ratio, $p_j$, of the number of cases in the
selected category, $n_j$, to the number of cases in the excluded
category, $n_e$, where $p_j = n_j / n_e$. For the
maintenance service problem, when a case is in the repair type
electrical category, D1 = 1 and for a mechanical repair type, D1 = -24/16 = -1.5.
Dummy variables D3 and D4 can be used to represent repair person, with
Bob coded -12/10 = -1.2 on D3 and -18/10 = -1.8 on D4.
Alternative Scheme II dummy variables are defined in Table 4.
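The computer regression equation resulting from Alternative Scheme
II is
$$\hat{Y} = 3.8100 + 0.2370\,D1 + 0.2339\,D3 - 0.6059\,D4 \tag{10}$$
As before, Equation (10) does not contain the coefficients for D2
and D5. These coefficients can be calculated from the codes assigned to
the excluded categories as follows:
$$b_2 = -\frac{n_{\text{Electrical}}}{n_{\text{Mechanical}}}\, b_1 = -1.5 \times 0.2370 = -0.3555$$
and
$$b_5 = -\frac{n_{\text{Jake}}}{n_{\text{Bob}}}\, b_3 - \frac{n_{\text{Dave}}}{n_{\text{Bob}}}\, b_4 = -1.2 \times 0.2339 - 1.8 \times (-0.6059) = 0.8099$$
These values agree with the coefficients -0.3556 and 0.8100 for D2 and D5
obtained by Shifting Process II to within rounding.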
When these coefficients are included in Equation (10), the
resulting equation is the same as Equation (8). The coefficients are
equivalent to Shifting Process II as suggested by Sweeney and Ulveling.
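Two properties of Alternative Scheme II can be checked by simulation: the fitted intercept equals the overall mean of the dependent variable, and the coefficients match those of Shifting Process II applied to the binary coded fit. As a sketch under stated assumptions, the code below uses NumPy with hypothetical cell counts (only the margins follow Table 1) and simulated repair times; `fit` is an assumed helper name.

```python
import numpy as np

# Hypothetical cell counts; only the margins (24/16 and 12/18/10) follow Table 1.
cells = {("Electrical", "Jake"): 7, ("Electrical", "Dave"): 11, ("Electrical", "Bob"): 6,
         ("Mechanical", "Jake"): 5, ("Mechanical", "Dave"): 7, ("Mechanical", "Bob"): 4}
rtype = np.array([t for (t, p), n in cells.items() for _ in range(n)])
person = np.array([p for (t, p), n in cells.items() for _ in range(n)])
y = np.random.default_rng(1).normal(4.0, 1.0, size=rtype.size)  # simulated data

def fit(columns):
    """Least squares coefficients for an intercept plus the given columns."""
    X = np.column_stack([np.ones(y.size)] + columns)
    return np.linalg.lstsq(X, y, rcond=None)[0]

n = {c: float(np.sum(rtype == c)) for c in ("Electrical", "Mechanical")}
n.update({c: float(np.sum(person == c)) for c in ("Jake", "Dave", "Bob")})

# Binary coded fit with Mechanical and Bob as reference categories.
d1 = (rtype == "Electrical").astype(float)
d3 = (person == "Jake").astype(float)
d4 = (person == "Dave").astype(float)
b0, b1, b3, b4 = fit([d1, d3, d4])

# Shifting Process II by hand: weights are the sample proportions.
k1 = -(n["Electrical"] / y.size) * b1
k2 = -((n["Jake"] / y.size) * b3 + (n["Dave"] / y.size) * b4)
shifted = np.array([b0 - (k1 + k2), b1 + k1, b3 + k2, b4 + k2])

# Alternative Scheme II: the excluded category is coded -n_j / n_e.
w1 = np.where(rtype == "Mechanical", -n["Electrical"] / n["Mechanical"], d1)  # -1.5
w3 = np.where(person == "Bob", -n["Jake"] / n["Bob"], d3)                     # -1.2
w4 = np.where(person == "Bob", -n["Dave"] / n["Bob"], d4)                     # -1.8
weighted = fit([w1, w3, w4])

print(np.allclose(shifted, weighted), np.isclose(weighted[0], y.mean()))
```

Each coded column sums to zero over the sample, which is why the least squares intercept coincides with the overall mean.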
The relationships among the interpretive frameworks are summarized in
Figure 1.
[FIGURE 1 OMITTED]
SUMMARY
Two processes for shifting the interpretive framework of
binary-coded dummy variable regression coefficients are summarized in
this paper. The frameworks assist in a more meaningful interpretation of
the coefficients and allow for interpretation of the coefficients about
an "average" of the dependent variable. One method suggested
by Suits allows each coefficient to be interpreted as a comparison
to the unweighted average of the dependent variable
over all subcategory means. The method suggested by Sweeney and Ulveling
allows each coefficient to be interpreted relative to the overall mean of
the dependent variable. The effort to shift the interpretive framework
is minimal and should be worthwhile. Comparing the coefficients to
an "average" makes the coefficients insensitive to the
selection of the category to be omitted. The processes of shifting can
be accomplished with or without the assistance of a computer package.
The shifted interpretive frameworks may be employed by practitioners who
disseminate the shifted coefficients to individuals who are
heterogeneous in regard to the use and interpretation of the regression
model. As an additional note, if quantitative independent variables are
to be included in the regression model, each quantitative variable
should be coded as deviations from its mean.
REFERENCES
Anderson, D., D. Sweeney & T. Williams. (2002). Statistics for
Business and Economics (Eighth Edition). Cincinnati, OH: South-Western
Publishing.
Daniel, W. & J. Terrell. (1992). Business Statistics. Boston,
MA: Houghton Mifflin Company.
Hardy, M. (1993). Regression With Dummy Variables. Sage University
Paper series on Quantitative Applications in the Social Sciences,
07-093. Newbury Park, CA: Sage Publications.
Parker, C. & F. Wrighton. (1975). Alternative Coding Schemes
for Covariance Analysis of Survey Data. Unpublished working paper,
Louisiana Tech University.
Suits, D. B. (1983). Dummy Variables: Mechanics v. Interpretation.
The Review of Economics and Statistics, 66, 177-180.
Sweeney, R. & E. Ulveling. (1972). A Transformation for
Simplifying the Interpretation of Coefficients of Binary Variables in
Regression Analysis. The American Statistician, 26(5), 30-32.
R. Wayne Gober, Middle Tennessee State University
TABLE 1
Summary of the Cases for the Maintenance Service Qualitative Variables and Categories

               Repair Type                 Repair Person
Cases          Electrical   Mechanical     Jake     Dave     Bob
Frequency      24           16             12       18       10
Proportion     0.600        0.400          0.300    0.450    0.250
TABLE 2
Binary Coding Scheme for the Maintenance Service Company Qualitative Variables

Repair Type    D1    D2        Repair Person    D3    D4    D5
Electrical      1     0        Jake              1     0     0
Mechanical      0     1        Dave              0     1     0
                               Bob               0     0     1
TABLE 3
Alternative Coding Scheme I for the Maintenance Service Company Qualitative Variables

Repair Type    D1      D2                 Repair Person    D3      D4      D5
Electrical      1      (omitted class)    Jake              1       0      (omitted class)
Mechanical     -1                         Dave              0       1
                                          Bob              -1      -1
TABLE 4
Alternative Coding Scheme II for the Maintenance Service Company Qualitative Variables

Repair Type    D1      D2                 Repair Person    D3      D4      D5
Electrical      1      (omitted class)    Jake              1       0      (omitted class)
Mechanical     -1.5                       Dave              0       1
                                          Bob              -1.2    -1.8