Article information

Title: The social policy simulation database and model: an example of survey and administrative data integration.
Authors: Wolfson, Michael; Gribble, Stephen; Bordt, Michael; et al.
Journal: Survey of Current Business
Print ISSN: 0039-6222
Publication year: 1989
Issue: May
Language: English
Publisher: U.S. Government Printing Office
Abstract: WHENEVER governments propose a change in the personal income tax
Keywords: Income distribution; Information services; Market surveys; Simulation; Simulation methods; Social policy; Social surveys

The social policy simulation database and model: an example of survey and administrative data integration.

Wolfson, Michael; Gribble, Stephen; Bordt, Michael; et al.

Introduction

WHENEVER governments propose a change in the personal income tax laws or whenever special task forces or (in Canada) Royal Commissions propose changes in the structure of major transfer programs, the technique of microsimulation modeling is typically used to assess the impact of the proposals. For example, to assess the distributional impact of a change in income tax exemptions on different types of families, the Canadian Department of Finance uses a microsimulation model that recomputes the income tax liabilities for a sample of about 400,000 tax returns for a recent year; for each of these returns, the model calculates the tax liability under the proposed policy. Similarly, the Canadian Ministry of Employment and Immigration has a microsimulation model for the unemployment insurance system that is based on a sample of their internal administrative data files.

However, in Canada, microsimulation models have been available in only a few large institutions because of the relatively high cost of developing and maintaining this analytical capability. These models are located in the Federal Departments of Finance, of Employment and Immigration, and of Health and Welfare. Interested groups outside these Departments (even other Federal Departments and Provincial governments) generally do not have access to these models; consequently, they have no way to assess the published estimates of the distributional impacts of policy proposals, no way to explore the impacts in greater detail, and no way to develop comparable figures for their own proposals.

This situation is unlike that in the United States, where various independent agencies (such as the Congressional Budget Office, nongovernmental research institutions, and private consulting firms) have sophisticated microsimulation capabilities. These agencies regularly provide independent analyses and forecasts of proposed changes in microeconomic policy.

The first release of the Social Policy Simulation Database and Model (SPSD/M) from Statistics Canada in the fall of 1988 has changed this situation. Using the SPSD/M, anyone can perform microsimulation impact analyses of tax and transfer program changes on a personal computer. Moreover, the level of sophistication of the SPSM approaches, or in some cases exceeds, that of the current models used by the Federal Departments.

The SPSD/M represents a product different from the traditional products of a national statistical agency, which are typically publications with many tables of numbers. The SPSD/M consists of a specially designed database integrated with a retrieval and analytical software package. The database was explicitly tailored to the intended analytical applications, unlike the more common situation in which the analysis is constrained by the data already available.

To meet the objective of public accessibility, the designers of the SPSD/M had to ensure that no individual respondents on the file are identifiable pursuant to the Statistics Act and that the database and software package are usable in a range of computing environments, especially personal computers. This paper describes the construction of the Social Policy Simulation Database and the uses of the associated Social Policy Simulation Model.

The Data Sets in the Social Policy Simulation Database (SPSD)

To provide realistic, albeit synthetic, data on individuals in household contexts, the SPSD was constructed from four major sources of microdata. The sources of these data sets are (1) the Survey of Consumer Finances, (2) personal income tax returns, (3) unemployment insurance claim histories, and (4) the Family Expenditure Survey.

The Survey of Consumer Finances (SCF)

The "host" data set is derived from Statistics Canada's 1984 Survey of Consumer Finances. This survey is Statistics Canada's main source of data on the distribution of income for individuals and families. (Its content is similar to that of the U.S. Census Bureau's March supplement to the Current Population Survey.) The data collected from each household consists of demographic information, such as the family structure of the household, the labor force status and the previous year's income by source for each household member who is 15 years of age or older, and specific characteristics of the dwelling. The 1984 survey collected data from about 98,000 individuals in approximately 36,000 households. Although this survey is rich in data on family structure and income sources, it lacks detailed information on unemployment histories, tax deductions, and consumer expenditures.

Personal income tax returns

This data set is from the 3-percent sample of personal income tax returns used in Revenue Canada's annual Taxation Statistics (the "Green Book") and by the Department of Finance's personal income tax model. This sample consists of about 400,000 records. (It contains information comparable to that published in the U.S. Statistics of Income.)

Unemployment insurance claim histories

Unemployment insurance (UI) is a complex insurance and temporary income maintenance program. The administrative data collected from the program serves to track the weekly benefits and claim activities of UI recipients and to establish eligibility and entitlements by monitoring previous employment patterns and program participation of repeat, or reentrant, claims. The UI claim histories imputed to the SPSD were based on a 1-percent sample of administrative records from the population with some UI claim activity in 1984. The sample consists of about 30,000 records of individuals and represents about 40,000 claims.

The Family Expenditure Survey (FAMEX)

This survey is Statistics Canada's periodic survey that provides detailed data on household expenditure patterns. This data set contains about 10,000 household records. (It is similar to the U.S. Consumer Expenditure Survey.)

The microdata from these sources are confidential. Until now, the data from these microdata sets have been disseminated either as separate public use samples in the cases of the SCF and the FAMEX (in both of which some records and a fair number of variables are suppressed) or as summary tables of income tax data from Taxation Statistics and of UI claim histories. In the SPSD, the data sets from these four sources have been transformed into a single public use microdata set that retains the full household "hierarchical" structure. At the same time, the SPSD also maintains the confidentiality of the individual records since all the constituent microdata sets contain nonidentifiable records and exact matching is not used to merge these data sets.

Techniques Used to Construct the SPSD

Several key techniques were used to join the four microdata sets in constructing the SPSD.

Controlled blurring

The SPSD is much richer than the already released SCF because it is a fully hierarchical file: Each individual has a complete family and household context.

"Controlled blurring," or selective randomization, of portions of the data is used to protect the identities of the individual microdata records, so the data can be released to the public.
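As a rough illustration of selective randomization, the sketch below redraws each age uniformly within its 5-year group and redraws the sex of children, leaving everything else on the record untouched. It is not the SPSD procedure: the field names, the under-15 definition of a child, and the absence of any plausibility constraints are all assumptions made for the example.

```python
import random

def blur_person(person, rng):
    """Return a copy of a person record with age (and, for children, sex) randomized."""
    blurred = dict(person)
    # Redraw the age uniformly within its 5-year group, e.g. 37 -> one of 35..39.
    group_start = (person["age"] // 5) * 5
    blurred["age"] = rng.randint(group_start, group_start + 4)
    # Redraw the sex of children only; the under-15 cutoff is an assumption.
    if person["age"] < 15:
        blurred["sex"] = rng.choice(["F", "M"])
    return blurred

rng = random.Random(0)
household = [
    {"age": 37, "sex": "F"},
    {"age": 8, "sex": "M"},
]
blurred_household = [blur_person(p, rng) for p in household]
```

Because each age stays inside its original 5-year group, any tabulation by 5-year age group is unchanged by the blurring.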

If randomization is suitably structured, it does not adversely affect the usefulness of the database for the policy simulations for which it has been designed. Moreover, precise information about households is not generally required for the anticipated uses of the SPSD/M. For example, the precise age and sex composition and the geographic location of a household greatly increase the identifiability of a microrecord. As a result, the sex of children and the ages of household members within 5-year age-groups have been randomized (subject to some constraints to ensure "plausible" results). Similarly, by randomly reassigning the province and urban-size class codes, the geographical location of "unusual" household types (e.g., large size or multifamily) has been blurred.

Integrated weighting

This technique is used to reduce bias by forcing agreement between the sample data and the known control totals. The SCF (host survey) weights are adjusted to ensure that the population by age, sex, and Province represented by the survey corresponds to the "known" population by age, sex, and Province from the census. In addition, the survey weights are adjusted to be consistent with the control totals at the family level, such as the number of families by size and the labor force participation status of the adult family members. The procedure is a generalization of iterative proportional adjustment, or "raking" (see Deming and Stephan 1940 and Lemaitre and Dufour 1987).

Categorical matching

This technique is used to merge two data sets. A variety of methods can be used for the synthetic matching of the information from a record of the donor data set to any given record of the host data set; all of these methods are based on determining which records of the host and donor data sets are most similar according to the policy relevant criteria common to both data sets (e.g., dwelling tenure, employment status, and income class). In the SPSD, the similarity of host and donor records was determined by dividing records from each data set into very fine categories, hence the term "categorical matching." The information from the donor records may then be attributed synthetically to the records of the host data set that have the most similar characteristics without increasing the identifiability of the donor or of the host records. There is a substantial literature on the methods and the experiences of synthetic matching, or linking, of two files; for example, see Rodgers 1984, Rubin 1986, Paass 1986 and 1988, and Singh, Armstrong, and Lemaitre 1988. (Exact matches, though sometimes technically feasible, have been avoided for confidentiality reasons.)

Specifically, categorical matching is used to add FAMEX data, UI data, and Green Book income data for high-income recipients to the SPSD. The two data sets are partitioned into identically defined "bins" of records (for example, by province, income range, and tenure). Within each corresponding pair of donor and host bins, the individual records are sorted, based on one of the continuous variables common to the two data sets (e.g., income). According to their rank order, records in a given bin are then matched one-for-one across the two data sets. Because the number of records for the two data sets in a given bin is usually not equal and because record weights are present on one or both data sets, selectively duplicating records from one or both data sets is usually necessary. Research by the staff of the Methodology Branch of Statistics Canada indicates that exactly matched files can be used to analyze and to improve the quality of public use SPSD synthetic matches (Armstrong 1989).

Conversion

This technique is being used to adjust for the underreporting of UI and welfare benefits. Research has suggested that the underreporting of UI and welfare income is probably due to item nonresponse. Selected records are identified as probable item nonrespondents, using a statistical analysis (i.e., logistic regression) to predict the probability of reporting income from UI or welfare. These records are "converted" from zero receipts to some positive amount of UI or welfare income, and then the appropriate amount of income is imputed (Dufour 1988).

Microrecord aggregation

This technique is used to improve the representation of high-income recipients by adding tax return information on incomes by source to the SCF, or host data set, so the patterns of income composition by source at the individual microdata level are retained in the SPSD. The SCF has reporting and sampling biases that result in a lower number of high-income individuals and in a lower level of income per high-income individual than indicated by personal income tax records. These underreporting biases are corrected by synthetically matching specially adapted income tax return data to completely replace the income components on these host SCF records. The technique of microrecord aggregation provides plausible, but unidentifiable, sets of income items from the Green Book.

The Green Book file contains about 25,000 high-income tax records, which are drawn from about 135,000 high-income tax returns. With the process of microrecord aggregation, these records are clustered into sets of at least five similar records. Then weighted average values of income for each source are computed for the individual records in each cluster. Specifically, one record is randomly selected from each cluster and is given a weight of 80 percent; the remaining records in the cluster are averaged with a total weight of 20 percent. These cluster-weighted averages are considered to be nonidentifiable, just as a table of statistics based on at least five observations per cell is considered nonidentifiable; however, these cluster averages also retain many of the essential characteristics of the actual microrecords on which they are based. Microrecord aggregation simply treats these averages as if they were actual microdata records. The microrecord aggregation of these records results in a file of about 5,000 synthetic microdata records of high-income tax returns.
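The 80/20 blending step can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually applied to the Green Book file: the article does not say how "similar" records are clustered, so the sketch simply sorts by total income and cuts the sorted list into runs of five; the field names are hypothetical, and sampling weights are ignored.

```python
import random

def make_clusters(records, size=5):
    """Group similar records; here similar means adjacent in total income (an assumption)."""
    ordered = sorted(records, key=lambda r: sum(r.values()))
    clusters = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    # Fold a short tail into the previous cluster so that, given at least
    # `size` records overall, no cluster is smaller than `size`.
    if len(clusters) > 1 and len(clusters[-1]) < size:
        tail = clusters.pop()
        clusters[-1].extend(tail)
    return clusters

def aggregate_cluster(cluster, rng, lead_weight=0.8):
    """Blend one randomly chosen record (80%) with the mean of the others (20%)."""
    lead_index = rng.randrange(len(cluster))
    lead = cluster[lead_index]
    others = [r for i, r in enumerate(cluster) if i != lead_index]
    synthetic = {}
    for source in lead:
        mean_others = sum(r[source] for r in others) / len(others)
        synthetic[source] = lead_weight * lead[source] + (1 - lead_weight) * mean_others
    return synthetic

rng = random.Random(1)
# Twelve made-up high-income records with two income sources each.
records = [{"wages": 100_000 + 5_000 * i, "dividends": 2_000 * i} for i in range(12)]
clusters = make_clusters(records)
synthetic = [aggregate_cluster(c, rng) for c in clusters]
```

Each synthetic record is a convex combination of real records in its cluster, so every income component stays within the range observed in that cluster while matching no single return exactly. Clusters of five are also consistent with the reduction from about 25,000 records to about 5,000 synthetic records.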

To match these 5,000 synthetic Green Book records to the 300 high-income SCF records, the 300 SCF records are duplicated until there are 5,000 host SCF records. Then each of these SCF high-income records is categorically matched to a similar synthetic, aggregated Green Book high-income record. In this way, detailed information on income composition by source is fully absorbed and retained in the host data set.

Stochastic imputation

This technique is used to generate synthetic data values for individual records in one data set by randomly drawing from the distributions or the density functions derived from a second data set. Specifically, it is used to add to the SPSD the personal income tax information about various itemized deductions, exemptions, and tax credits that is required for the calculation of income tax liability. In adding this information, one priority is to ensure that the distribution of each of these deductions (including the number of tax filers reporting the deduction, exemption, or credit; the average amount claimed; and the univariate size distribution of the amounts claimed) agrees with the published results. For example, each individual in the SCF is assigned a charitable donation that is based on the distribution of itemized donations on income tax returns within a given Province and by age, sex, and income group. The other priority is to maintain the confidentiality of the underlying income tax information.

The source data for stochastic imputation were derived from the Green Book sample, using all 400,000 records. To join the Green Book income tax data with the host SCF sample, a set of common classifications was defined for the following variables: Province, age, sex, marital status, total income range, employment income range, and number of children claimed for the child care expense deduction. These variables were chosen because of their policy relevance and the feasibility of defining them similarly for both data sets. A model of the personal income tax system (the same one subsequently used for policy analysis) was applied to identify the probable tax filers and to impute marital tax status (Canada does not have joint filing) for the host SCF data set.

Using a complex set of distributional statistics generated from the Green Book file of income tax returns, it is possible to recreate the same distribution of values on the host SCF data set. For each individual record in the host data set, random numbers based on the characteristics for each of the itemized deductions and tax credits are drawn to determine which of the items were claimed. If some of the items were claimed, a synthetic value is drawn from each distribution that represents the tax returns of a similar group of people.
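The two-stage draw described above (first whether an item is claimed, then how much) might look like the sketch below. It is only an illustration: the article does not specify the form of the distributional statistics, so the sketch assumes a claim rate plus five equally spaced quantiles for each classification cell, and the cell key and all numbers are invented.

```python
import random

# Hypothetical summary statistics for one deduction (charitable donations),
# keyed by a classification cell. All values are illustrative only.
CELL_STATS = {
    ("ON", "35-44", "F", "30-40k"): {
        "claim_rate": 0.30,
        # Amounts at the 0th, 25th, 50th, 75th and 100th percentiles of claimants.
        "quantiles": [10.0, 60.0, 150.0, 400.0, 2000.0],
    },
}

def draw_amount(quantiles, u):
    """Inverse-CDF draw: interpolate linearly between equally spaced quantiles."""
    position = u * (len(quantiles) - 1)
    lower = min(int(position), len(quantiles) - 2)
    fraction = position - lower
    return quantiles[lower] + fraction * (quantiles[lower + 1] - quantiles[lower])

def impute_deduction(cell, rng):
    """Return 0.0 for a non-claimant, otherwise a synthetic claimed amount."""
    stats = CELL_STATS[cell]
    # First random number: does this individual claim the item at all?
    if rng.random() >= stats["claim_rate"]:
        return 0.0
    # Second random number: how much is claimed, drawn from the cell's distribution.
    return draw_amount(stats["quantiles"], rng.random())

rng = random.Random(42)
cell = ("ON", "35-44", "F", "30-40k")
imputed = [impute_deduction(cell, rng) for _ in range(10_000)]
claim_share = sum(1 for x in imputed if x > 0) / len(imputed)
```

Over many records the share of claimants converges on the cell's claim rate and the claimed amounts reproduce the cell's size distribution, while no imputed value is copied from an actual tax return.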

Special Problems and Procedures

Categorical matching of unemployment insurance (UI) data

Each of the 30,000 UI claimants' records was categorically matched to the SCF records that had some reported or "converted" UI income during the year. The UI claim history variables (which include the type of claim, e.g., regular, retirement, maternity, or fishing, and the amount of UI benefits received) and the administrative data on the claimant's age, Province, and sex were used for constructing the matching categories. After duplication to ensure that there were an equal number of records in the corresponding cells of the UI and host data sets, the records were matched, based on their rank order of UI benefits within the cell. The cell match and the duplication increased the number of SCF records representing the UI claimant population from 10,000 to 30,000.
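The cell-and-rank matching just described can be sketched as follows. The details are assumptions made for illustration: the field names are hypothetical, records are duplicated by spreading them evenly in rank order, a duplicated host record has its weight split equally among its copies, donor weights are ignored, and cells with no donor are simply skipped.

```python
from collections import Counter, defaultdict

def stretch(records, n):
    """Repeat records in rank order so that the list has exactly n entries."""
    return [records[i * len(records) // n] for i in range(n)]

def categorical_match(hosts, donors, bin_keys, rank_key):
    """Pair host and donor records by rank within identically defined cells."""
    host_bins, donor_bins = defaultdict(list), defaultdict(list)
    for h in hosts:
        host_bins[tuple(h[k] for k in bin_keys)].append(h)
    for d in donors:
        donor_bins[tuple(d[k] for k in bin_keys)].append(d)

    matches = []
    for key, host_group in host_bins.items():
        donor_group = donor_bins.get(key)
        if not donor_group:
            continue  # no donor in this cell; a fuller procedure might coarsen the cells
        host_group = sorted(host_group, key=lambda r: r[rank_key])
        donor_group = sorted(donor_group, key=lambda r: r[rank_key])
        # Duplicate the shorter side until both sides have the same count.
        n = max(len(host_group), len(donor_group))
        stretched_hosts = stretch(host_group, n)
        stretched_donors = stretch(donor_group, n)
        copies = Counter(id(h) for h in stretched_hosts)
        for h, d in zip(stretched_hosts, stretched_donors):
            matches.append({
                "host_id": h["id"],
                "donor_id": d["id"],
                # Split the host weight across its copies so totals are preserved.
                "weight": h["weight"] / copies[id(h)],
            })
    return matches

hosts = [
    {"id": "h1", "province": "ON", "claim_type": "regular", "ui_benefits": 1000, "weight": 300},
    {"id": "h2", "province": "ON", "claim_type": "regular", "ui_benefits": 3000, "weight": 300},
]
donors = [
    {"id": "d1", "province": "ON", "claim_type": "regular", "ui_benefits": 900},
    {"id": "d2", "province": "ON", "claim_type": "regular", "ui_benefits": 2000},
    {"id": "d3", "province": "ON", "claim_type": "regular", "ui_benefits": 3500},
]
matches = categorical_match(hosts, donors, ("province", "claim_type"), "ui_benefits")
```

In this toy cell, two host records meet three donor records, so the lower-ranked host is duplicated and its weight is halved; the total weight represented by the cell is unchanged.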

The content of this data set was specially designed. Because the SPSD needed benefit payments on a calendar year, rather than on a claim, basis for consistent analysis and for input to the income tax module, constructing this component of the database required simultaneously the development of a UI simulation module and the identification of a limited set of program relevant UI variables that could serve as input to the UI simulation module. Moreover, this data set had to be rich enough to capture the weekly labor force history relevant to the application of UI program regulations, but it also had to be nonidentifiable. These objectives were accomplished by thinking in terms of an event history; therefore, the durations of various activities, rather than the weekly activity records, became the focus. The staffs of the Department of Employment and Immigration and of the Forget Royal Commission on Unemployment Insurance (Canada 1986) were very helpful in designing this data set.

Family Expenditure Survey (FAMEX) data imputations

The match using FAMEX data is principally designed to support the modeling of commodity tax incidence at the household level. The selection and the grouping of FAMEX income and expenditure variables were based on the requirements of the commodity tax model and, thus, on the structure and composition of personal expenditures in the Canadian medium-level aggregation input-output tables. Expenditures that include some indirect taxes and duties were placed in the corresponding input-output personal expenditure category. Expenditures that did not include an indirect tax or that included an indeterminate indirect tax were placed in a residual category (e.g., real estate commissions).

Additional variables (e.g., income, taxes, and savings) were also matched to complete the basic household accounting identity in which income plus other money receipts equals expenditure plus saving. Completing the household accounting identity allowed various simulation options, for example, the allocation of a change in disposable income between saving and consumption. Although a number of conceptual differences still remain between FAMEX and the system of national accounts on which the input-output tables are based, the SPSD and the national accounts household sector aggregate expenditure estimates for 1984 are reasonably close (see Adler and Wolfson 1988).

Suppression of data

Public use versions of the host SCF data set already exist. The data for all households that are, in whole or in part, already suppressed in any of the public use SCF files have also been suppressed in the SPSD.

Household duplication

Duplicates, or "clones," of individual SCF records have been created because of the categorical matching of synthetic high-income tax records and of UI claims records. If the record for at least one individual in a household has been duplicated, duplication of the records for all the other members of the household (with a corresponding reduction in the sample weight) is required. This duplication ensures that the records for all the members of the household continue to have the same weight.

The Social Policy Simulation Model (SPSM)

The SPSD is the primary input to the SPSM, and a personal computer with a hard disk is the minimal hardware required for the SPSM. The SPSM also requires a set of commodity tax rates, a set of parameters for all modeled tax and transfer programs (e.g., benefit levels, takeup rates, and tax brackets), and a set of parameters to control the flow of execution of the model.

The commodity tax rate parameters are supplied as default values. In addition, a separate, but concordant, input-output model (which includes a complete set of input-output tables) is provided as part of the SPSD/M package. Using this model, users can alter the retail sales tax rates and various "hidden" taxes (such as duties and intermediate-level commodity tax rates) and then derive (under alternative shifting assumptions) the equivalent retail sales tax rates. This capability is very important in Canada, where the Federal intermediate-level commodity tax generates more revenue than the corporate income tax.

The flexibility of the SPSM makes it possible, in one "run," to do one simulation; to do two simulations, comparing a base case with a variant, or reform, scenario; or, in effect, to do four simulations, comparing effective marginal tax rates (the change in taxes minus transfers, divided by the change in income) of a base scenario with those of variant scenarios. (Note that a family's effective marginal tax rate depends not only on which particular source of income is varied but also on which member of the family receives an increment in that source of income. The SPSM can fully analyze such questions.)
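The effective marginal tax rate calculation can be illustrated with a small sketch. The tax and transfer rules below are invented toy rules, not Canadian provisions or SPSM code, and the 100-dollar increment is likewise an assumption; the point is only that each rate requires evaluating the tax and transfer system twice, once at the original income and once at the incremented income.

```python
def taxes(income):
    """Toy tax: 20% of income above a 10,000 exemption (hypothetical rule)."""
    return 0.20 * max(0.0, income - 10_000)

def transfers(income):
    """Toy benefit: 3,000, reduced by 25 cents per dollar above 15,000 (hypothetical rule)."""
    return max(0.0, 3_000 - 0.25 * max(0.0, income - 15_000))

def effective_marginal_tax_rate(income, increment=100.0):
    """Change in (taxes minus transfers) divided by the change in income."""
    net_before = taxes(income) - transfers(income)
    net_after = taxes(income + increment) - transfers(income + increment)
    return (net_after - net_before) / increment

emtr_low = effective_marginal_tax_rate(12_000)  # taxed, benefit not yet reduced
emtr_mid = effective_marginal_tax_rate(20_000)  # taxed while the benefit is withdrawn
```

Under these toy rules the rate is about 0.20 at an income of 12,000 and about 0.45 at 20,000, where the benefit withdrawal stacks on top of the tax. Comparing a base scenario with a variant repeats the same pair of evaluations under the variant's rules, which is why the comparison amounts to four simulations.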

The capability to do simulations of effective marginal tax rates partly addresses one major source of uncertainty in the model results: behavioral response to significant changes in tax provisions or transfer programs. According to economic theory, effective marginal tax rates are a major determinant of behavioral responses to changes in policy. Although the SPSM does not attempt to model such responses, it allows the user to display the individuals and households that are most likely to alter their behavior, as indicated by significant changes in these rates.

The SPSM provides four kinds of outputs at the individual, at the family, or at the household level. First, the SPSM has its own cross-tabulation facility; alternatively, microdata files containing any outputs or intermediate variables that are used by the model can be written out. Second, these files can be in a compressed format ("results files") for subsequent use in other SPSM runs. Third, the files can be in a form readily usable by the personal computer SAS statistical software package. Fourth, standard format ASCII files can also be an output. This flexibility in output formats allows the convenient use of other standard packages. For example, a spreadsheet interface that can convert tabular output from the SPSM into a Lotus 1-2-3 or Symphony worksheet is provided.

Additionally, the SPSM software has been designed so that the user can modify it. Most users will want to run the model in the "black box" mode in which the range of parameters and simulation capacities is given. However, we fully expect that new policy options that we have not anticipated in the model will inevitably arise. Therefore, we have provided the facilities for modifying or adding routines to the model so that sophisticated users can customize the SPSM for a "glass box" mode of use.

Conclusions

The SPSD/M continues to be a work in progress. The first commercial release of a 1984 version of the SPSD/M was in December 1988. A 1986 version will be available in the spring of 1989.

The process of developing the SPSD/M has already had some valuable spinoffs. For example, the experience has contributed to a revision in the weighting system for Statistics Canada's monthly labor force survey; this revision is based on similar integrated weighting techniques that are being implemented. In the national accounting context, the SPSD has provided a microfoundation for the household sector (Ruggles and Ruggles 1986 and Adler and Wolfson 1988).

Moreover, the SPSM has already produced results that have been useful for policy analyses, such as the Forget Royal Commission's examination of the unemployment insurance system (Canada 1986), an Ontario special task force's review of social assistance (Ontario 1988), an analysis of the impact of Federal personal income tax reform (Maslove 1988), and the projections of the impact of Canada's aging population on the fiscal structure of the Federal Government (Fellegi 1988).

In developing the SPSD, many methodological refinements have been implemented to adjust for gaps and inaccuracies in the source data. Further improvements are possible and will continue to be made as work continues on the SPSD/M.

References

Adler, H.J., and M.C. Wolfson (1988), "A Prototype Micro-Macro Link for the Canadian Household Sector," The Review of Income and Wealth, Series No. 34 (December 1988).

Armstrong, J. (1989), "An Evaluation of Statistical Matching and Imputation Techniques," Paper presented at the annual meeting of the Statistical Society of Canada, Ottawa, May 1989.

Canada, Forget Royal Commission on Unemployment Insurance (1986), Report of the Commission of Inquiry on Unemployment Insurance, Ottawa: Queen's Printer.

Deming, W.E., and F.F. Stephan (1940), "On a Least Squares Adjustment of a Sampled Frequency Table When the Expected Marginal Totals Are Known," Annals of Mathematical Statistics 11 (1940): 427-44.

Dufour, J. (1988), "Quelques Methodes Palliatives au Probleme de Non-Reportage," Mimeo, Social Survey Methods Division, Ottawa: Statistics Canada.

Fellegi, I. (1988), "Can We Afford An Aging Society," Canadian Economic Observer, Ottawa: Statistics Canada (October 1988).

Lemaitre, G., and J. Dufour (1987), "An Integrated Method for Weighting Persons and Families," Survey Methodology 13 (1987): 199-207.

Maslove, A.M. (1988), "Distributional Impacts of Personal Income Tax Reform, 1984 to 1988," Institute for Research on Public Policy Discussion Paper No. 88.C.1, Ottawa.

Ontario (1988), Report of the Special Inquiry on Social Assistance, Toronto: Queen's Park.

Paass, G. (1986), "Statistical Match: Evaluation of Existing Procedures and Improvements By Using Additional Information," Microanalytic Simulation Models to Support Social and Financial Policy, Edited by G.H. Orcutt, J. Merz, and H. Quinke, Amsterdam: Elsevier Science Publishers.

Paass, G. (1989), "Stochastic Generation of a Synthetic Sample from Marginal Information," Journal of Business and Economic Statistics. Forthcoming.

Rodgers, W.L. (1984), "An Evaluation of Statistical Matching," Journal of Business and Economic Statistics 2 (January 1984).

Rubin, D.B. (1986), "Statistical Matching Using File Concatenation With Adjusted Weights and Multiple Imputations," Journal of Business and Economic Statistics 4 (January 1986).

Ruggles, R., and N. Ruggles (1986), "The Integration of Micro and Macro Data for the Household Sector," The Review of Income and Wealth, Series No. 32 (September 1986).

Singh, A.C., J.B. Armstrong, and G.E. Lemaitre (1988), "Log-Linear Imputation and Its Application to File Merging," Paper presented at the annual meeting of the American Statistical Association, New Orleans, August 1988.