Article Information

Title: Shifting the interpretive framework of binary coded dummy variables.
Author: Gober, R. Wayne
Journal: Academy of Information and Management Sciences Journal
Print ISSN: 1524-7252
Publication year: 2003
Issue: January
Language: English
Publisher: The DreamCatchers Group, LLC
Abstract: The traditional binary coding scheme is the starting point, and often the ending point, for the coding and interpretation of dummy variable coefficients for qualitative variables in regression analysis. The binary coding scheme produces an interpretive framework for the coefficients that measure the net effect of being in a given category as compared to an omitted category. This may result in coefficients that are as arbitrary as the selection of the omitted categories. Two methods for shifting the binary coded coefficients are presented to assist in establishing a more meaningful interpretive framework. The shifted frameworks allow for interpretation of the coefficients about an "average" of the dependent variable. One method allows for each coefficient to be interpreted as a comparison to the unweighted average of the dependent variable when averaged over all subcategory means. The second method allows for an interpretation of the coefficients to the overall mean of the dependent variable. Since the shifted framework coefficients are compared to an "average," the coefficients are insensitive to the omitted categories. The effort to shift the interpretative framework is minimal and can be effected without the use of a computer program. The shifted frameworks can be determined by incorporating alternative coding schemes using a computer program.

Shifting the interpretive framework of binary coded dummy variables.

Gober, R. Wayne


INTRODUCTION

The use of dummy variables to represent qualitative variables in regression analysis has become quite prevalent in introductory business and economic statistics courses (Daniel & Terrell, 1992; Anderson, Sweeney & Williams, 2002). The coding of the dummy variables is typically presented using a binary coding scheme (0,1). The binary scheme assigns a code of 1 to members of a particular category of the qualitative variable and a code of 0 to members not in that category. Usually, the zero coded category is selected to serve as the reference or comparison point for the interpretation of the regression coefficients. These coefficients express the difference between a selected category and the reference category for the qualitative variable. The choice of a reference category is arbitrary and may present problems of interpretation. When a number of binary coded qualitative variables are used in a regression model, a reference category is selected for each qualitative variable as its comparison point. The resulting regression coefficients may yield unclear and sometimes awkward interpretations as to which categories have been designated for comparisons.

The purpose of this paper is to illustrate processes for shifting the interpretive framework of binary coded regression coefficients. A major reason for the shifting processes is to provide coefficients that lend themselves to more meaningful interpretations. Starting with binary-coded coefficients, usually generated with the assistance of a statistical computer package, the shifting process can be accomplished with or without the assistance of a computer program. The shift in the interpretative framework is such that the contrast of a regression coefficient for a designated category is made to an "average" value for the dependent variable and not to a specified zero coded category. While the shifting processes will yield numerically different coefficients, the overall fit and significance of the regression model remain unchanged. A main advantage of shifting the interpretative framework of binary coded dummy variables to an "average" is that the coefficients are no longer sensitive to which class is treated as the omitted class.

FRAMEWORK SHIFTING WITHOUT A COMPUTER PACKAGE

The process of shifting the interpretive framework of binary coded coefficients can be made without the use of a computer program by adding a constant, k, to the coefficients within each set of coefficients for a qualitative variable and subtracting k from the regression equation constant or intercept. The general relationship for determining k is

$$\sum w\, b_i^{*} = \sum w\,(b_i + k) = 0 \tag{1}$$

where $b_i$ are the binary-coded regression coefficients, $b_i^{*} = b_i + k$ are the shifted regression coefficients, and $w$ is the weight for the importance of each coefficient within a set of regression coefficients for a qualitative variable. The resulting value of $k$ yields the condition that the new set of coefficients, $b_i^{*}$, has a weighted average of zero.
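Solving Equation (1) for the constant makes the computation explicit. The short derivation below is a sketch that assumes the weights may differ by category, so a subscript is added to $w$ for clarity (the subscript is not part of the original notation):

```latex
\begin{align*}
\sum_i w_i\,(b_i + k) &= 0 \\
\sum_i w_i\, b_i + k \sum_i w_i &= 0 \\
k &= -\,\frac{\sum_i w_i\, b_i}{\sum_i w_i}
\end{align*}
```

With equal weights of 1 over $m$ categories this reduces to $k = -\tfrac{1}{m}\sum_i b_i$; with weights that sum to one it reduces to $k = -\sum_i w_i b_i$.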

Suits (1983) suggested a shifting process, Shifting Process I, which expresses the category regression coefficients as deviations from an "average," where the "average" is the unweighted mean of the dependent variable across all categories for a categorical variable. In calculating the unweighted mean of means, each category receives an equal weight of 1, regardless of the number of cases in that category. Thus, when binary coded coefficients are shifted using Shifting Process I, the value of $w$ in Equation (1) is set at 1. The unweighted mean of all group means is reported as the regression equation constant, $b_0$, and is the reference point from which all category differences can be calculated.

The unweighted mean has the consequence that category means based on only a few cases are treated the same as category means based on much larger category sizes. Obviously, when the numbers of cases in the categories are unequal, the unweighted mean of means and the overall mean are different measures. Thus, the overall mean of the dependent variable may be the more desirable "average" as the comparison measure for the regression coefficients.
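The difference between the two "averages" can be seen in a small numeric sketch. The category means below are hypothetical values chosen only for illustration (the paper does not report category means); the counts are the repair person frequencies from Table 1.

```python
# Hypothetical category means (hours), assumed for illustration only.
category_means = [4.0, 3.0, 5.0]
# Repair person frequencies from Table 1: Jake, Dave, Bob.
category_counts = [12, 18, 10]

# Unweighted mean of means: every category counts once.
unweighted = sum(category_means) / len(category_means)

# Overall mean: every case counts once, so categories are weighted by size.
overall = (sum(m * n for m, n in zip(category_means, category_counts))
           / sum(category_counts))

print(unweighted)  # 4.0
print(overall)     # 3.8
```

Because Dave has the most cases and, in this made-up example, the lowest mean, the overall mean is pulled below the unweighted mean of means.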

Sweeney and Ulveling (1972) suggested a process, referred to as Shifting Process II, for shifting the interpretative framework of the coefficients to an "average," where the "average" is the overall mean of the dependent variable. The shifting process can be accomplished by computing the constant $k$, for Equation (1), using the sample proportion, $p$, of cases for the categories within each qualitative variable. For Shifting Process II, the value of $w$ in Equation (1) is set at $p$. Each coefficient is compared to the regression equation constant, $b_0$, which is the overall mean of the dependent variable.

Framework Shifting Illustration

The interpretive framework shifting processes will be illustrated by data collected for a maintenance service company. A request was made for an analysis of the maintenance service repair time, in hours, based on the type of repair and the person performing the repair. For a sample of 40 repair times, a summary of the qualitative variables, repair type and repair person, is presented in Table 1.

As mentioned previously, a binary coded dummy variable regression equation is usually the framework selected for the interpretation of the coefficients for introductory statistics courses. The binary coded dummy variable regression equation is necessary for interpretative framework shifting. The two qualitative variables, repair type and repair person, can be completely represented by a set of binary coded dummy variables that have a specific value when an observation is found in a given category. The binary-coded dummy variables for repair type, D1 and D2, and for repair person, D3, D4 and D5, are defined in Table 2.

When an observation is in the repair type electrical category, the dummy set is defined as D1 = 1 and D2 = 0. For repair person Jake, the dummy set is defined as D3 = 1, D4 = 0 and D5 = 0. Other categories are defined in a similar manner. Using the binary coded dummy variables, the following general regression equation may be formed:

$$\hat{Y} = b_0 + b_1\,D1 + b_2\,D2 + b_3\,D3 + b_4\,D4 + b_5\,D5 \tag{3}$$
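The binary coding of Table 2 that feeds Equation (3) can be sketched in a few lines. The function name `binary_dummies` and the list constants are hypothetical names introduced here for illustration; the category order follows Table 2.

```python
REPAIR_TYPES = ["Electrical", "Mechanical"]   # columns D1, D2
REPAIR_PERSONS = ["Jake", "Dave", "Bob"]      # columns D3, D4, D5

def binary_dummies(repair_type, repair_person):
    """Return [D1, D2, D3, D4, D5] for one observation (1 = member, 0 = not)."""
    type_codes = [1 if repair_type == t else 0 for t in REPAIR_TYPES]
    person_codes = [1 if repair_person == p else 0 for p in REPAIR_PERSONS]
    return type_codes + person_codes

print(binary_dummies("Electrical", "Jake"))  # [1, 0, 1, 0, 0]
print(binary_dummies("Mechanical", "Bob"))   # [0, 1, 0, 0, 1]
```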

With the equation in this form, there are two more coefficients to be estimated than there are independent normal equations. One of the extra coefficients is associated with the repair type dummy variables, and one with the repair person dummy variables. In general, each qualitative variable represented by dummy variables gives rise to one superfluous coefficient. The remedy typically utilized in the introductory course in statistics is to constrain a coefficient from each dummy set to a value of 0. For example, when $b_2 = 0$ and $b_5 = 0$, Equation (3) reduces to

$$\hat{Y} = b_0 + b_1\,D1 + b_3\,D3 + b_4\,D4 \tag{4}$$
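The redundancy that motivates the constraint can be checked directly: within each dummy set, the columns add up to the intercept column, so one column per set carries no new information. The sketch below enumerates the six type/person combinations; `full_row` is a hypothetical helper name, not something from the paper.

```python
from itertools import product

TYPES = ["Electrical", "Mechanical"]
PERSONS = ["Jake", "Dave", "Bob"]

def full_row(repair_type, repair_person):
    """Intercept column followed by all five binary dummies D1..D5."""
    d_type = [int(repair_type == t) for t in TYPES]
    d_person = [int(repair_person == p) for p in PERSONS]
    return [1] + d_type + d_person

rows = [full_row(t, p) for t, p in product(TYPES, PERSONS)]

# D1 + D2 and D3 + D4 + D5 both reproduce the intercept column in every row.
# These two exact linear dependencies are why two coefficients are superfluous.
type_dependent = all(r[1] + r[2] == r[0] for r in rows)
person_dependent = all(r[3] + r[4] + r[5] == r[0] for r in rows)
print(type_dependent, person_dependent)  # True True
```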

When D2 = 1 and D5 = 1, then D1 = D3 = D4 = 0. Since D2 and D5 are not in Equation (4), the equation reduces to the following when an observation is a member of both excluded categories:

$$\hat{Y} = b_0 \tag{5}$$

Equation (5) represents the regression estimate for a mechanical repair type with Bob as the repair person. From Equation (4), the regression coefficient, $b_1$, is the net amount by which the intercept, $b_0$, must be adjusted to account for repair type electrical instead of mechanical. A similar statement can be made for the remaining dummy coefficients for repair person with respect to the omitted class, repair person Bob. In general, this procedure produces a coefficient for each dummy variable that measures the net effect on the intercept of membership in that class relative to the omitted class.

For the maintenance service problem, the binary coded dummy variable regression equation to estimate the repair time is often determined by means of a computer program. The equation is

$$\hat{Y} = 4.2645 + 0.5925\,D1 - 0.5762\,D3 - 1.4159\,D4 \tag{6a}$$

Including the two excluded dummy variables, D2 and D5, the equation can be restated as

$$\hat{Y} = 4.2645 + 0.5925\,D1 + 0\,D2 - 0.5762\,D3 - 1.4159\,D4 + 0\,D5 \tag{6b}$$

Interpretation of the coefficients is somewhat more difficult when two or more qualitative variables are included. The coefficient for D1, $b_1$, is interpreted as the difference between the electrical repair type and the mechanical repair type. The coefficient for D3, $b_3$, is interpreted as the difference between repair person Jake and repair person Bob. A similar statement can be made for the other coefficient, $b_4$. To enhance the understanding and use of the dummy variable coefficients in Equation (6b), the interpretive framework for comparison of the coefficients can be shifted to an "average" for the dependent variable.
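To make the reference-category reading concrete, the sketch below evaluates the binary-coded equation for each type/person combination. It assumes that -0.5762 is Jake's coefficient and -1.4159 is Dave's, which is what Bob being the omitted class and the later shifted equations imply; the names `predicted_hours`, `TYPE_COEF` and `PERSON_COEF` are hypothetical.

```python
# Binary-coded coefficients; the omitted classes (Mechanical, Bob) carry 0.
INTERCEPT = 4.2645
TYPE_COEF = {"Electrical": 0.5925, "Mechanical": 0.0}
PERSON_COEF = {"Jake": -0.5762, "Dave": -1.4159, "Bob": 0.0}

def predicted_hours(repair_type, repair_person):
    """Estimated repair time in hours for one type/person combination."""
    return INTERCEPT + TYPE_COEF[repair_type] + PERSON_COEF[repair_person]

for t in TYPE_COEF:
    for p in PERSON_COEF:
        print(f"{t:<10} {p:<5} {predicted_hours(t, p):.4f}")
```

The mechanical/Bob row reproduces the intercept, 4.2645 hours, and every other row is read as a departure from that particular combination, which is exactly the dependence on the omitted classes that the shifting processes remove.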

Based on Suits' suggestion, referred to as Shifting Process I, each coefficient, $b_i$, receives a weight of 1, i.e., $w = 1$ in Equation (1). Since two sets of dummy variables are included in the maintenance service problem, a constant must be computed for each set, $k_1$ and $k_2$, and added to the coefficients of the respective set. The sum of the constants, $k$, is subtracted from $b_0$. Referring to Equation (6b), for repair type, the constant $k_1$ is computed as $-(0.5925 + 0)/2$ and, for repair person, the constant $k_2$ is computed as $-(-0.5762 - 1.4159 + 0)/3$. The required constants are $k_1 = -0.2963$ and $k_2 = 0.6640$. The sum of the constants, $k$, is 0.3677. The transformed or shifted equation is obtained by adding each constant, $k_1$ and $k_2$, to the respective set of coefficients and subtracting the sum of the constants, $k$, from the regression equation intercept or constant. The Suits shifted equation for Equation (6b) is

$$\hat{Y} = 3.8968 + 0.2963\,D1 - 0.2963\,D2 + 0.0878\,D3 - 0.7519\,D4 + 0.6640\,D5 \tag{7}$$
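The arithmetic behind Equation (7) can be reproduced with a short sketch. It assumes the binary-coded coefficients are 0.5925 for D1, -0.5762 for D3 and -1.4159 for D4, with 0 for the omitted D2 and D5; the helper name `shift` is hypothetical.

```python
def shift(coefs, weights):
    """Return (k, shifted) with k = -sum(w*b)/sum(w) and shifted = b + k."""
    k = -sum(w * b for w, b in zip(weights, coefs)) / sum(weights)
    return k, [b + k for b in coefs]

b0 = 4.2645
repair_type = [0.5925, 0.0]               # D1, D2
repair_person = [-0.5762, -1.4159, 0.0]   # D3, D4, D5

# Shifting Process I: every category receives weight 1.
k1, type_shifted = shift(repair_type, [1, 1])
k2, person_shifted = shift(repair_person, [1, 1, 1])
b0_shifted = b0 - (k1 + k2)

print(f"k1 = {k1:.4f}, k2 = {k2:.4f}, intercept = {b0_shifted:.4f}")
print([f"{b:.4f}" for b in type_shifted + person_shifted])
```

Because the sketch carries full precision rather than rounding the constants first, its results can differ from the printed equation by one unit in the fourth decimal place.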

The coefficients now indicate the extent to which repair time in the respective repair type and repair person categories varies from the unweighted average of repair time, when averaged over all subcategory means. For the repair type electrical coefficient, $b_1 = 0.2963$, an electrical repair type adds 0.2963 hours to the unweighted average, 3.8968 hours of repair time. Also, repair person Dave subtracts 0.7519 hours from the unweighted average.
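The claim that shifted coefficients do not depend on the omitted class can be illustrated numerically. In the sketch below the second parameterization (electrical and Jake omitted) is derived algebraically from the binary-coded coefficients under the assumption that the fitted values are unchanged; it is not output from a separate regression run, and `suits_shift` is a hypothetical name.

```python
def suits_shift(b0, coef_sets):
    """Shifting Process I: centre each coefficient set on its unweighted mean."""
    shifted_sets, total_k = [], 0.0
    for coefs in coef_sets:
        k = -sum(coefs) / len(coefs)
        shifted_sets.append([b + k for b in coefs])
        total_k += k
    return b0 - total_k, shifted_sets

# Mechanical (D2) and Bob (D5) omitted, as in the text.
ref_a = (4.2645, [[0.5925, 0.0], [-0.5762, -1.4159, 0.0]])

# Same model with electrical (D1) and Jake (D3) omitted instead: each set is
# re-based on its new reference and the intercept absorbs the difference.
ref_b = (4.2645 + 0.5925 - 0.5762,
         [[0.0, -0.5925], [0.0, -1.4159 + 0.5762, 0.5762]])

shifted_a = suits_shift(*ref_a)
shifted_b = suits_shift(*ref_b)
print(shifted_a)
print(shifted_b)
```

Both starting points lead to the same intercept and the same five coefficients, up to floating-point noise.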

To shift the interpretive framework of the coefficients to an "average" that is the overall mean of the dependent variable, referred to as Shifting Process II, Sweeney and Ulveling suggested using the sample proportions for the categories of each qualitative variable as weights in Equation (1). Using Table 1 and the coefficients of Equation (7), for repair type, $k_1$ is computed as $-(0.6 \times 0.2963 + 0.4 \times (-0.2963))$ and, for repair person, $k_2$ is computed as $-(0.300 \times 0.0878 + 0.450 \times (-0.7519) + 0.250 \times 0.6640)$. The required constants are $k_1 = -0.0593$ and $k_2 = +0.1460$. The sum of the constants, $k$, is +0.0867. As for Process I, $k$ is subtracted from the constant and each constant, $k_1$ and $k_2$, is added to the respective set of dummy regression coefficients in Equation (7). The Sweeney and Ulveling shifted equation is

$$\hat{Y} = 3.8100 + 0.2370\,D1 - 0.3556\,D2 + 0.2339\,D3 - 0.6059\,D4 + 0.8100\,D5 \tag{8}$$
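The proportion-weighted shift from Equation (7) to Equation (8) can be checked with the sketch below. The proportions come from Table 1; `weighted_shift` is a hypothetical helper, and it relies on the proportions within each set summing to one.

```python
def weighted_shift(coefs, props):
    """Shifting Process II: k = -sum(p*b), valid when the proportions sum to 1."""
    k = -sum(p * b for p, b in zip(props, coefs))
    return k, [b + k for b in coefs]

b0 = 3.8968
type_coefs, type_props = [0.2963, -0.2963], [0.6, 0.4]                   # D1, D2
person_coefs, person_props = [0.0878, -0.7519, 0.6640], [0.30, 0.45, 0.25]  # D3..D5

k1, type_shifted = weighted_shift(type_coefs, type_props)
k2, person_shifted = weighted_shift(person_coefs, person_props)
b0_shifted = b0 - (k1 + k2)

print(f"k1 = {k1:.4f}, k2 = {k2:.4f}, intercept = {b0_shifted:.4f}")
print([f"{b:.4f}" for b in type_shifted + person_shifted])
```

After the shift, the proportion-weighted sum of the coefficients in each set is zero, which is what allows the intercept to be read as the overall mean.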

The interpretive framework for comparison of the dummy coefficients now represents the net effect of being in the category associated with the dummy variable as compared to the overall mean or grand mean of the dependent variable, $\bar{Y} = b_0$. For example, a mechanical repair type subtracts 0.3556 hours from the average repair time, 3.8100 hours, and repair person Bob adds 0.8100 hours to the average repair time.
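Shifting changes the coefficients but not the fitted values. The sketch below compares the estimates from the binary-coded equation and from Equations (7) and (8) for all six type/person combinations; it assumes -0.5762 and -1.4159 are the binary-coded coefficients for Jake and Dave, and the constant and function names are hypothetical.

```python
from itertools import product

# Each entry: (intercept, repair type coefficients, repair person coefficients)
EQ_BINARY = (4.2645, {"Electrical": 0.5925, "Mechanical": 0.0},
             {"Jake": -0.5762, "Dave": -1.4159, "Bob": 0.0})
EQ_7 = (3.8968, {"Electrical": 0.2963, "Mechanical": -0.2963},
        {"Jake": 0.0878, "Dave": -0.7519, "Bob": 0.6640})
EQ_8 = (3.8100, {"Electrical": 0.2370, "Mechanical": -0.3556},
        {"Jake": 0.2339, "Dave": -0.6059, "Bob": 0.8100})

def fitted(equation, repair_type, repair_person):
    """Estimated repair time under one parameterization."""
    b0, type_coef, person_coef = equation
    return b0 + type_coef[repair_type] + person_coef[repair_person]

cells = list(product(EQ_BINARY[1], EQ_BINARY[2]))
max_gap = max(
    abs(fitted(eq, t, p) - fitted(EQ_BINARY, t, p))
    for eq in (EQ_7, EQ_8) for t, p in cells
)
print(f"largest difference in fitted hours: {max_gap:.4f}")
```

The three parameterizations agree to within the rounding of the published four-decimal coefficients.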

One further note is that the binary coded dummy regression coefficients of Equation (6b) can be shifted directly to the "average" that is the overall mean of the dependent variable, Equation (8), by using Shifting Process II.
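As a worked check of this direct route, applying the proportion weights from Table 1 to the binary-coded coefficients (0.5925 for D1, -0.5762 for D3, -1.4159 for D4, and 0 for the omitted D2 and D5) gives the following; the symbol $b_0^{*}$ for the shifted intercept is introduced here for illustration.

```latex
\begin{align*}
k_1 &= -(0.6 \times 0.5925 + 0.4 \times 0) = -0.3555 \\
k_2 &= -\bigl(0.30 \times (-0.5762) + 0.45 \times (-1.4159) + 0.25 \times 0\bigr) \approx 0.8100 \\
b_0^{*} &= 4.2645 - (k_1 + k_2) \approx 4.2645 - 0.4545 = 3.8100
\end{align*}
```

Adding $k_1$ to the repair type coefficients gives 0.2370 and -0.3555, and adding $k_2$ to the repair person coefficients gives 0.2338, -0.6059 and 0.8100, which agree with Equation (8) to within rounding.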

FRAMEWORK SHIFTING WITH A COMPUTER PACKAGE

The interpretive frameworks of dummy variable coefficients resulting from Shifting Processes I and II may also be obtained by using coding schemes that are alternatives to the binary coding scheme (Parker & Wrighton, 1975). The alternative coding schemes require that one category for each qualitative variable be excluded when calculating the regression equation. Referring to the maintenance service problem, repair type mechanical and repair person Bob are selected as the excluded categories.

When using the binary coding scheme the reference category is always coded zero. An alternative scheme, referred to as Alternative Scheme I, is to uniformly code the reference category with the value of -1. A value of 1 is assigned to categories in the same manner as the binary coding scheme. When an observation is in the repair type electrical category, D1 = 1, and when the repair type is mechanical, D1 = -1. For repair person, the selected dummy variables are D3 and D4. These dummy variables are defined in Table 3.
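Alternative Scheme I can be sketched as a small coding function. The name `effect_codes` and its arguments are hypothetical; the function returns one column per retained category, matching the layout of Table 3.

```python
def effect_codes(category, categories, reference):
    """Alternative Scheme I: members of a retained category get 1 in that
    category's column, the reference category gets -1 in every column,
    and all other entries are 0."""
    kept = [c for c in categories if c != reference]
    if category == reference:
        return [-1] * len(kept)
    return [1 if category == c else 0 for c in kept]

PERSONS = ["Jake", "Dave", "Bob"]
print(effect_codes("Jake", PERSONS, "Bob"))  # [1, 0]   -> D3, D4
print(effect_codes("Bob", PERSONS, "Bob"))   # [-1, -1] -> D3, D4
print(effect_codes("Mechanical", ["Electrical", "Mechanical"], "Mechanical"))  # [-1]
```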

For Alternative Coding Scheme I, the following computer regression equation is generated

$$\hat{Y} = 3.8968 + 0.2963\,D1 + 0.0878\,D3 - 0.7519\,D4 \tag{9}$$

Equation (9) does not contain the coefficients $b_2$ for D2 and $b_5$ for D5. These coefficients are easily determined as follows:

$$b_2 = -b_1 = -0.2963 \quad \text{and} \quad b_5 = -(b_3 + b_4) = -(0.0878 - 0.7519) \approx 0.6640.$$

When these coefficients are included in Equation (9) the resulting equation is the same as Equation (7). The coefficients are equivalent to Shifting Process I, as suggested by Suits.

Another coding scheme, Alternative Scheme II, for a qualitative variable is to code a selected category, j, as 1, and the excluded category, e, as the negative of the ratio of the number of cases in the selected category, $n_j$, to the number of cases in the excluded category, $n_e$; that is, the excluded category receives the code $p_j = -n_j / n_e$. For the maintenance service problem, when a case is in the repair type electrical category, D1 = 1, and for a mechanical repair type, D1 = -24/16 = -1.5. Dummy variables D3 and D4 can be used to represent repair person. Alternative Scheme II dummy variables are defined in Table 4.
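The codes of Alternative Scheme II can be generated from the category frequencies in Table 1. The sketch below takes the sign convention from Table 4, where the excluded category's codes are negative; `scheme_two_codes` is a hypothetical name.

```python
def scheme_two_codes(category, counts, reference):
    """Alternative Scheme II: retained categories are coded 1 in their own
    column; the reference category is coded -n_j / n_e in column j."""
    kept = [c for c in counts if c != reference]
    if category == reference:
        return [-counts[c] / counts[reference] for c in kept]
    return [1.0 if category == c else 0.0 for c in kept]

type_counts = {"Electrical": 24, "Mechanical": 16}
person_counts = {"Jake": 12, "Dave": 18, "Bob": 10}

print(scheme_two_codes("Mechanical", type_counts, "Mechanical"))  # [-1.5]
print(scheme_two_codes("Bob", person_counts, "Bob"))              # [-1.2, -1.8]

# Summed over all 40 cases, each repair person column is zero.
column_sums = [
    sum(n * scheme_two_codes(person, person_counts, "Bob")[j]
        for person, n in person_counts.items())
    for j in range(2)
]
print(column_sums)
```

Each coded column sums to zero over the sample, so the dummy variables are centered on their own means and the intercept of the fitted equation is the overall mean of the dependent variable.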

The computer regression equation resulting from Alternative Scheme II is

$$\hat{Y} = 3.8100 + 0.2370\,D1 + 0.2339\,D3 - 0.6059\,D4 \tag{10}$$

As before, Equation (10) does not contain the coefficients for D2 and D5. These coefficients can be calculated as follows:

$$b_2 = p_{\text{electrical}}\, b_1 = -1.5 \times 0.2370 \approx -0.3556$$

and

$$b_5 = p_{\text{Jake}}\, b_3 + p_{\text{Dave}}\, b_4 = -1.2 \times 0.2339 - 1.8 \times (-0.6059) \approx 0.8100$$

When these coefficients are included in Equation (10), the resulting equation is the same as Equation (8). The coefficients are equivalent to Shifting Process II as suggested by Sweeney and Ulveling. The relationships among the interpretive frameworks are summarized in Figure 1.
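Recovering the excluded coefficients from Equation (10) can be sketched as follows, using the Equation (10) coefficients and the Table 1 frequencies. The helper name `omitted_coefficient` is hypothetical, and the negative ratio follows the sign convention of Table 4.

```python
def omitted_coefficient(included, counts, reference):
    """Coefficient implied for the excluded category under Alternative
    Scheme II: the sum over retained categories of (-n_j / n_e) * b_j."""
    n_e = counts[reference]
    return sum(-counts[name] / n_e * b for name, b in included.items())

type_counts = {"Electrical": 24, "Mechanical": 16}
person_counts = {"Jake": 12, "Dave": 18, "Bob": 10}

b2 = omitted_coefficient({"Electrical": 0.2370}, type_counts, "Mechanical")
b5 = omitted_coefficient({"Jake": 0.2339, "Dave": -0.6059}, person_counts, "Bob")
print(f"b2 = {b2:.4f}, b5 = {b5:.4f}")

# Frequency-weighted sum of the repair type coefficients, which should vanish.
type_check = 24 * 0.2370 + 16 * b2
print(f"{type_check:.6f}")
```

Since the inputs are rounded to four decimals, the recovered values can differ from Equation (8) in the last digit.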

[FIGURE 1 OMITTED]

SUMMARY

Two processes for shifting the interpretive framework of binary-coded dummy variable regression coefficients are summarized in this paper. The frameworks assist in a more meaningful interpretation of the coefficients and allow for interpretation of the coefficients about an "average" of the dependent variable. The method suggested by Suits allows each coefficient to be interpreted as a comparison to the unweighted average of the dependent variable over all subcategory means. The method suggested by Sweeney and Ulveling allows the coefficients to be interpreted relative to the overall mean of the dependent variable. The effort to shift the interpretative framework is minimal and should be worth the effort. Comparing the coefficients to an "average" makes the coefficients insensitive to the selection of the category to be omitted. The processes of shifting can be accomplished with or without the assistance of a computer package. The shifted interpretive frameworks may be employed by practitioners who will disseminate the shifted coefficients to individuals who are heterogeneous in regard to the use and interpretation of the regression model. As an additional note, if quantitative independent variables are to be included in the regression model, each quantitative variable should be coded as deviations from its mean.
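The closing note on quantitative variables amounts to centering each one before fitting. A minimal sketch, using made-up predictor values that are not from the paper:

```python
# Hypothetical values of one quantitative predictor (assumed for illustration).
x = [2.0, 4.0, 6.0, 8.0]

x_mean = sum(x) / len(x)
x_centered = [value - x_mean for value in x]   # deviations from the mean

print(x_mean)      # 5.0
print(x_centered)  # [-3.0, -1.0, 1.0, 3.0]
```

The centered variable averages zero, so it contributes nothing at the "average" case and the intercept keeps its interpretation as an average of the dependent variable.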

REFERENCES

Anderson, D., D. Sweeney & T. Williams. (2002). Statistics for Business and Economics (Eighth Edition). Cincinnati, OH: South-Western Publishing.

Daniel, W. & J. Terrell. (1992). Business Statistics. Boston, MA: Houghton Mifflin Company.

Hardy, M. (1993). Regression With Dummy Variables. Sage University Paper series on Quantitative Applications in the Social Sciences, 07-093. Newbury Park, CA: Sage Publications.

Parker, C. & F. Wrighton. (1975). Alternative Coding Schemes for Covariance Analysis of Survey Data. Unpublished working paper, Louisiana Tech University.

Suits, D. B. (1983). Dummy Variables: Mechanics v. Interpretation. The Review of Economics and Statistics, 66, 177-180.

Sweeney, R. & E. Ulveling. (1972). A Transformation for Simplifying the Interpretation of Coefficients of Binary Variables in Regression Analysis. The American Statistician, 26(5), 30-32.

R. Wayne Gober, Middle Tennessee State University

TABLE 1

Summary of the Cases for the Maintenance Service Qualitative Variables and Categories

              Repair Type                 Repair Person
Cases         Electrical   Mechanical     Jake     Dave     Bob

Frequency     24           16             12       18       10
Proportion    0.600        0.400          0.300    0.450    0.250

TABLE 2

Binary Coding Scheme for the Maintenance Service Company Qualitative Variables

              Dummy Variable                     Dummy Variable
Repair Type   D1    D2        Repair Person      D3    D4    D5

Electrical    1     0         Jake               1     0     0
Mechanical    0     1         Dave               0     1     0
                              Bob                0     0     1

TABLE 3

Alternative Coding Scheme I for the Maintenance Service Company Qualitative Variables

              Dummy   Dummy                              Dummy   Dummy   Dummy
Repair Type   D1      D2               Repair Person     D3      D4      D5

Electrical     1      omitted class    Jake               1       0      omitted class
Mechanical    -1                       Dave               0       1
                                       Bob               -1      -1

TABLE 4

Alternative Coding Scheme II for the Maintenance Service Company Qualitative Variables

              Dummy   Dummy                              Dummy   Dummy   Dummy
Repair Type   D1      D2               Repair Person     D3      D4      D5

Electrical     1      omitted class    Jake               1       0      omitted class
Mechanical    -1.5                     Dave               0       1
                                       Bob               -1.2    -1.8