The social policy simulation database and model: an example of survey and administrative data integration.
Wolfson, Michael; Gribble, Stephen; Bordt, Michael; et al.
Introduction
Whenever governments propose a change in the personal income tax
laws or whenever special task forces or (in Canada) Royal Commissions
propose changes in the structure of major transfer programs, the
technique of microsimulation modeling is typically used to assess the
impact of the proposals. For example, to assess the distributional
impact of a change in income tax exemptions on different types of
families, the Canadian Department of Finance uses a microsimulation
model that recomputes the income tax liabilities for a sample of about
400,000 tax returns for a recent year; for each of these returns, the
model calculates the tax liability under the proposed policy.
Similarly, the Canadian Ministry of Employment and Immigration has a
microsimulation model for the unemployment insurance system that is
based on a sample of its internal administrative data files.
However, in Canada, microsimulation models have been available in
only a few large institutions because of the relatively high cost of
developing and maintaining this analytical capability. These models are
located in the Federal Departments of Finance, of Employment and
Immigration, and of Health and Welfare. Interested groups outside these
Departments (even other Federal Departments and Provincial governments)
generally do not have access to these models; consequently, they have no
way to assess the published estimates of the distributional impacts of
policy proposals, no way to explore the impacts in greater detail, and
no way to develop comparable figures for their own proposals.
This situation is unlike that in the United States, where various
independent agencies (such as the Congressional Budget Office,
nongovernmental research institutions, and private consulting firms) have
sophisticated microsimulation capabilities. These agencies regularly
provide independent analyses and forecasts of proposed changes in
microeconomic policy.
The first release of the Social Policy Simulation Database and
Model (SPSD/M) from Statistics Canada in the fall of 1988 has changed
this situation. Using the SPSD/M, anyone can perform microsimulation
impact analyses of tax and transfer program changes on a personal
computer. Moreover, the level of sophistication of the SPSM approaches,
or in some cases exceeds, that of the current models used by the Federal
Departments.
The SPSD/M represents a product different from the traditional
products of a national statistical agency, which are typically
publications with many tables of numbers. The SPSD/M consists of a specially
designed database integrated with a retrieval and analytical software
package. The database was explicitly tailored to the intended
analytical applications, unlike the more common situation in which the
analysis is constrained by the data already available.
To meet the objective of public accessibility, the designers of the
SPSD/M had to ensure, pursuant to the Statistics Act, that no individual
respondents on the file are identifiable and that the database and
software package are usable in a range of computing environments,
especially personal computers. This paper describes the construction of
the Social Policy Simulation Database and the uses of the associated
Social Policy Simulation Model.
The Data Sets in the Social Policy Simulation Database (SPSD)
To provide realistic, albeit synthetic, data on individuals in
household contexts, the SPSD was constructed from four major sources of
microdata. The sources of these data sets are (1) the Survey of
Consumer Finances, (2) personal income tax returns, (3) unemployment
insurance claim histories, and (4) the Family Expenditure Survey.
The Survey of Consumer Finances (SCF)
The "host" data set is derived from Statistics
Canada's 1984 Survey of Consumer Finances. This survey is
Statistics Canada's main source of data on the distribution of
income for individuals and families. (Its content is similar to that of
the U.S. Census Bureau's March supplement to the Current
Population Survey.) The data collected from each household consist of
demographic information, such as the family structure of the household,
the labor force status and the previous year's income by source for
each household member who is 15 years of age or older, and specific
characteristics of the dwelling. The 1984 survey collected data from
about 98,000 individuals in approximately 36,000 households. Although
this survey is rich in data on family structure and income sources, it
lacks detailed information on unemployment histories, tax deductions,
and consumer expenditures.
Personal income tax returns
This data set is from the 3-percent sample of personal income tax
returns used in Revenue Canada's annual Taxation Statistics (the
"Green Book") and by the Department of Finance's personal
income tax model. This sample consists of about 400,000 records. (It
contains information comparable to that published in the U.S.
Statistics of Income.)
Unemployment insurance claim histories
Unemployment insurance (UI) is a complex insurance and temporary
income maintenance program. The administrative data collected from
the program serve to track the weekly benefits and claim activities of
UI recipients and to establish eligibility and entitlements by
monitoring previous employment patterns and program participation of
repeat, or reentrant, claims. The UI claim histories imputed to the
SPSD were based on a 1-percent sample of administrative records from the
population with some UI claim activity in 1984. The sample consists of
about 30,000 records of individuals and represents about 40,000 claims.
The Family Expenditure Survey (FAMEX)
This survey is Statistics Canada's periodic survey that
provides detailed data on household expenditure patterns. This data set
contains about 10,000 household records. (It is similar to the U.S.
Consumer Expenditure Survey.)
The microdata from these sources are confidential. Until now, the
data from these microdata sets have been disseminated either as separate
public use samples in the cases of the SCF and the FAMEX (in both of
which some records and a fair number of variables are suppressed) or as
summary tables of income tax data from Taxation Statistics and of UI
claim histories. In the SPSD, the data sets from these four sources
have been transformed into a single public use microdata set that
retains the full household "hierarchical" structure. At the same
time, the SPSD also maintains the confidentiality of the individual
records since all the constituent microdata sets contain nonidentifiable
records and exact matching is not used to merge these data sets.
Techniques Used to Construct the SPSD
Several key techniques were used to join the four microdata sets in
constructing the SPSD.
Controlled blurring
The SPSD is much richer than the already released SCF because it is
a fully hierarchical file: Each individual has a complete family and
household context.
"Controlled blurring," or selective randomization, of portions
of the data is used to protect the identities of the individual
microdata records, so the data can be released to the public.
If randomization is suitably structured, it does not adversely
affect the usefulness of the database for the policy simulations for
which it has been designed. Moreover, precise information about
households is not generally required for the anticipated uses of the
SPSD/M. For example, the precise age and sex composition and the
geographic location of a household greatly increase the identifiability
of a microrecord. As a result, the sex of children and the ages of
household members within 5-year age-groups have been randomized (subject
to some constraints to ensure "plausible" results).
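The randomization of ages and of children's sex can be sketched as follows. This is only an illustration under stated assumptions: the field names ("age", "sex"), the cutoff of 15 years for treating a member as a child, and the omission of the plausibility constraints are all choices made for this example and are not taken from the SPSD itself.

```python
import random

def blur_age(age, rng):
    """Replace an age with a random age drawn from the same 5-year group."""
    low = (age // 5) * 5
    return rng.randint(low, low + 4)  # randint includes both endpoints

def blur_household(members, rng, child_age_limit=15):
    """Return a blurred copy of a household; the input is left unchanged.

    child_age_limit is an assumption of this sketch, not an SPSD parameter.
    """
    blurred = []
    for member in members:
        new = dict(member)
        if member["age"] < child_age_limit:
            # Only the sex of children is randomized.
            new["sex"] = rng.choice(["F", "M"])
        new["age"] = blur_age(member["age"], rng)
        blurred.append(new)
    return blurred

household = [
    {"age": 42, "sex": "F"},
    {"age": 44, "sex": "M"},
    {"age": 7, "sex": "M"},
]
print(blur_household(household, random.Random(0)))
```

Each blurred age stays inside its original 5-year group, so tabulations by 5-year age-group are unaffected by the randomization.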
Similarly, by randomly reassigning the province and urban-size class
codes, the geographical location of "unusual" household types
(e.g., large size or multifamily) has been blurred.
Integrated weighting
This technique is used to reduce bias by forcing agreement between
the sample data and the known control totals. The SCF (host survey)
weights are adjusted to ensure that the population by age, sex, and
Province represented by the survey corresponds to the "known"
population by age, sex, and Province from the census. In addition, the
survey weights are adjusted to be consistent with the control totals at
the family level, such as the number of families by size and the labor
force participation status of the adult family members. The procedure
is a generalization of iterative proportional adjustment, or
"raking" (see Deming and Stephan 1940 and Lemaitre and Dufour 1987).
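As a point of reference, the classical raking procedure that the integrated method generalizes can be sketched as follows. This is not the integrated person-and-family procedure itself; it adjusts person weights to one set of margins at a time, and the record layout and control totals are hypothetical.

```python
def rake(records, margins, iterations=50):
    """Classical iterative proportional adjustment of record weights.

    records: list of dicts, each with a 'weight' and categorical variables.
    margins: {variable: {category: control_total}}.
    Returns the adjusted weights in the same order as the records.
    """
    weights = [r["weight"] for r in records]
    for _ in range(iterations):
        for var, targets in margins.items():
            # Current weighted total for each category of this variable.
            totals = {}
            for r, w in zip(records, weights):
                totals[r[var]] = totals.get(r[var], 0.0) + w
            # Scale every weight so the category totals hit the controls.
            weights = [
                w * targets[r[var]] / totals[r[var]]
                for r, w in zip(records, weights)
            ]
    return weights

people = [
    {"sex": "F", "province": "ON", "weight": 1.0},
    {"sex": "F", "province": "QC", "weight": 1.0},
    {"sex": "M", "province": "ON", "weight": 1.0},
    {"sex": "M", "province": "QC", "weight": 1.0},
]
controls = {
    "sex": {"F": 60.0, "M": 40.0},
    "province": {"ON": 70.0, "QC": 30.0},
}
print(rake(people, controls))
```

The sketch assumes that the control totals for the different variables sum to the same population and that every category in the controls appears in the sample; otherwise the iteration cannot satisfy all margins at once.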
Categorical matching
This technique is used to merge two data sets. A
variety of methods can be used for the synthetic matching of the
information from a record of the donor data set to any given record of
the host data set; all of these methods are based on determining which
records of both host and donor data sets are the most closely similar
according to the policy relevant criteria common to both data sets
(e.g., dwelling tenure, employment status, and income class). In the
SPSD, the similarity of host and donor records was determined by
dividing records from each data set into very fine categories, hence the
term "categorical matching." The information from the donor
records may then be attributed synthetically to the records of the host
data set that have the most closely similar characteristics without
increasing the identifiability of the donor or of the host records.
There is a substantial literature on the methods and the experiences of
synthetic matching, or linking, of two files; for example, see Rodgers
1984, Rubin 1986, Paass 1986 and 1988, and Singh, Armstrong, and
Lemaitre 1988. (Exact matches, though sometimes technically feasible,
have been avoided for confidentiality reasons.)
Specifically, categorical matching is used to add FAMEX data, UI
data, and Green Book income data for high-income recipients to the SPSD.
Two data sets are partitioned into identically defined "bins" of
records (for example, by province, income range, and tenure). Within
each corresponding pair of donor and host bins, the individual records
are sorted, based on one of the continuous variables common to the two
data sets (e.g., income). According to their rank order, records in a
given bin are then matched one-for-one across the two data sets.
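The binning and rank-order matching can be sketched as follows. The field names are hypothetical, record weights are ignored, and when a bin holds unequal numbers of host and donor records the sketch simply reuses (or skips) donors in proportion to rank; that shortcut is an assumption of this example and stands in for a weight-aware duplication of records.

```python
from collections import defaultdict

def categorical_match(hosts, donors, bin_keys, rank_key, donor_fields):
    """Attach donor_fields from donors to hosts by bin and rank order."""
    donor_bins = defaultdict(list)
    for d in donors:
        donor_bins[tuple(d[k] for k in bin_keys)].append(d)
    host_bins = defaultdict(list)
    for h in hosts:
        host_bins[tuple(h[k] for k in bin_keys)].append(h)

    matched = []
    for key, bin_hosts in host_bins.items():
        bin_donors = sorted(donor_bins.get(key, []), key=lambda d: d[rank_key])
        if not bin_donors:
            continue  # no donor in this bin; host is left out of the result
        bin_hosts = sorted(bin_hosts, key=lambda h: h[rank_key])
        n, m = len(bin_hosts), len(bin_donors)
        for i, host in enumerate(bin_hosts):
            donor = bin_donors[i * m // n]  # donor at the same relative rank
            merged = dict(host)
            merged.update({f: donor[f] for f in donor_fields})
            matched.append(merged)
    return matched

hosts = [
    {"id": 1, "province": "ON", "tenure": "own", "income": 50000},
    {"id": 2, "province": "ON", "tenure": "own", "income": 20000},
    {"id": 3, "province": "QC", "tenure": "rent", "income": 30000},
]
donors = [
    {"province": "ON", "tenure": "own", "income": 21000, "food": 4000},
    {"province": "ON", "tenure": "own", "income": 52000, "food": 7000},
    {"province": "QC", "tenure": "rent", "income": 29000, "food": 3500},
]
result = categorical_match(hosts, donors, ["province", "tenure"], "income", ["food"])
print(result)
```

Only the donor fields named in donor_fields are copied, so the host record keeps its own income and classification variables.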
Because the number of records for the two data sets in a given bin is
usually not equal and because record weights are present on one or both
data sets, selectively duplicating records from one or both data sets is
usually necessary. Research by the staff of the Methodology Branch of
Statistics Canada indicates that exactly matched files can be used to
analyze and to improve the quality of public use SPSD synthetic matches
(Armstrong 1989).
Conversion
This technique is being used to adjust for the underreporting of
UI and welfare benefits. Research has suggested that the underreporting
of UI and welfare income is probably due to item nonresponse. Selected
records are identified as probable item nonrespondents, using a
statistical analysis (i.e., logistic regression) to predict the
probability of reporting income from UI or welfare. These records
are "converted" from zero receipts to some positive amount of
UI or welfare income, and then the appropriate amount of income is
imputed (Dufour 1988).
Microrecord aggregation
This technique is used to improve the representation of high-income
recipients by adding tax return information on incomes by source to the
SCF, or host data set, so the patterns of income composition by source
at the individual microdata level are retained in the SPSD. The SCF has
reporting and sampling biases that result in a lower number of
high-income individuals and in a lower level of income per high-income
individual than indicated by personal income tax records. These
underreporting biases are corrected by synthetically matching specially
adapted income tax return data to completely replace the income
components on these host SCF records. The technique of microrecord
aggregation provides plausible, but unidentifiable, sets of income items
from the Green Book.
The Green Book file contains about 25,000 high-income tax records,
which are drawn from about 135,000 high-income tax returns. With the
process of microrecord aggregation, these records are clustered into
sets of at least five similar records. Then weighted average values of
income for each source are computed for the individual records in each
cluster. Specifically, one record is randomly selected from each
cluster and is given a weight of 80 percent; the remaining records in
the cluster are averaged with a total weight of 20 percent. These
cluster-weighted averages are considered to be nonidentifiable, just as
a table of statistics based on at least five observations per cell is
considered nonidentifiable; however, these cluster averages also retain
many of the essential characteristics of the actual microrecords on
which they are based. Microrecord aggregation simply treats these
averages as if they were actual microdata records. The microrecord
aggregation of these records results in a file of about 5,000 synthetic
microdata records of high-income tax returns.
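The 80/20 weighted averaging within a cluster can be sketched as follows. The paper does not say how "similar" records are grouped, so the clustering here (sorting by total income and cutting into groups of five, with any short final group folded into the previous one) is purely an assumption of this example, as are the income-source names.

```python
import random

def cluster_by_total(records, size=5):
    """Group records into clusters of at least `size` (assumed rule)."""
    ordered = sorted(records, key=lambda r: sum(r.values()))
    clusters = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    if len(clusters) > 1 and len(clusters[-1]) < size:
        tail = clusters.pop()
        clusters[-1].extend(tail)  # fold a short final group into the previous
    return clusters

def aggregate_cluster(cluster, rng, lead_weight=0.8):
    """Build one synthetic record from a cluster of at least five records."""
    if len(cluster) < 5:
        raise ValueError("each cluster needs at least five records")
    lead = rng.choice(cluster)
    others = [r for r in cluster if r is not lead]
    synthetic = {}
    for source in lead:
        rest_mean = sum(r[source] for r in others) / len(others)
        # 80 percent on the selected record, 20 percent on the average of the rest.
        synthetic[source] = lead_weight * lead[source] + (1 - lead_weight) * rest_mean
    return synthetic

rng = random.Random(1)
returns = [{"wages": 100000.0 + 1000 * i, "dividends": 5000.0 * i} for i in range(12)]
synthetic_records = [aggregate_cluster(c, rng) for c in cluster_by_total(returns)]
print(synthetic_records)
```

Because each synthetic record is dominated by one actual record, the joint pattern of income by source is largely preserved, while the 20-percent blend with the other cluster members keeps any single return from being reproduced exactly.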
To match these 5,000 synthetic Green Book records to the 300
high-income SCF records, the 300 SCF records are duplicated until there
are 5,000 host SCF records. Then each of these SCF high-income records
is categorically matched to a similar synthetic, aggregated Green Book
high-income record. In this way, detailed information on income
composition by source is fully absorbed and retained in the host data
set.
Stochastic imputation
This technique is used to generate synthetic data values for
individual records in one data set by randomly drawing from the
distributions or the density functions derived from a second data set.
Specifically, it is used to add personal income tax information about
various itemized deductions, exemptions, and tax credits that is
required for the calculation of income tax liability to the SPSD. In
adding this information, one priority is to ensure that the distribution
of each of these deductions, including the numbers of tax filers
reporting the deduction (or exemption or credit), the average amount
claimed, and the univariate size distribution of the amounts
claimed, agrees with the published results. This technique is used to
assign, for example, a charitable donation that is based on the
distribution of itemized donations on income tax returns within a given
Province and by age, sex, and income group to each individual in the
SCF. The other priority is to maintain the confidentiality of the
underlying income tax information.
The source data for stochastic imputation were derived from the
Green Book sample, using all 400,000 records. To join the Green Book
income tax data with the host SCF sample, a set of common
classifications was defined for the following variables: Province, age,
sex, marital status, total income range, employment income range, and
number of children claimed for the child care expense deduction. These
variables were chosen because of their policy relevance and the
feasibility of defining them similarly for both data sets. A model of
the personal income tax system (the same one subsequently used for
policy analysis) was applied to identify the probable tax filers and to
impute marital tax status (Canada does not have joint filing) for the
host SCF data set.
Using a complex set of distributional statistics generated from the
Green Book file of income tax returns, it is possible to recreate the
same distribution of values on the host SCF data set. For each
individual record in the host data set, random numbers based on the
characteristics for each of the itemized deductions and tax credits are
drawn to determine which of the items were claimed. If some of the items
were claimed, a synthetic value is drawn from each distribution that
represents the tax returns of a similar group of people.
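The two-stage draw described above (first whether an item is claimed, then how much) can be sketched as follows. The cell definition, the claim rate, and the quantile points are invented for illustration; the actual distributional statistics derived from the Green Book file are more elaborate than the equally spaced quantiles assumed here.

```python
import random

def draw_amount(quantiles, rng):
    """Inverse-CDF draw, interpolating between equally spaced quantile points."""
    u = rng.random() * (len(quantiles) - 1)
    i = min(int(u), len(quantiles) - 2)  # guard against rounding at the top end
    frac = u - i
    return quantiles[i] + frac * (quantiles[i + 1] - quantiles[i])

def impute_deduction(person, cell_stats, rng):
    """Return a synthetic deduction amount (0.0 if the item is not claimed)."""
    cell = cell_stats[(person["province"], person["age_group"], person["income_group"])]
    if rng.random() >= cell["claim_rate"]:
        return 0.0  # this person is simulated as not claiming the item
    return draw_amount(cell["quantiles"], rng)

# Hypothetical statistics for one cell of charitable donations.
cell_stats = {
    ("ON", "35-44", "mid"): {
        "claim_rate": 0.3,
        "quantiles": [10.0, 50.0, 120.0, 400.0, 2000.0],  # min, quartiles, max
    },
}
person = {"province": "ON", "age_group": "35-44", "income_group": "mid"}
print(impute_deduction(person, cell_stats, random.Random(42)))
```

Over many records in the same cell, the share of claimants approaches the cell's claim rate and the imputed amounts follow the cell's size distribution, yet no imputed value is copied from an individual tax return.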
Special Problems and Procedures
Categorical matching of unemployment insurance (UI) data
Each of the 30,000 UI claimants' records was categorically
matched to the SCF records that had some reported or "converted" UI
income during the year. The UI claim history variables, which include
the type of claim (e.g., regular, retirement, maternity, or fishing) and
the amount of UI benefits received, and the administrative data on the
claimant's age, Province, and sex were used for constructing the
matching categories. After duplication to ensure that there were an
equal number of records in the corresponding cells of the UI and
host data sets, the records were matched, based on their rank order
of UI benefits within the cell. The cell match and the duplication
increased the number of SCF records representing the UI claimant population from 10,000 to 30,000.
The content of this data set was specially designed. Because the
SPSD needed benefit payments on a calendar year, rather than on a claim,
basis for consistent analysis and for input to the income tax module,
constructing this component of the database required simultaneously the
development of a UI simulation module and the identification of a
limited set of program relevant UI variables that could serve as input
to the UI simulation module. Moreover, this data set had to be rich
enough to capture the weekly labor force history relevant to the
application of UI program regulations, but it also had to be
nonidentifiable. These objectives were accomplished by thinking in
terms of an event history; therefore, the durations of various
activities, rather than the weekly activity records, became the focus.
The staffs of the Department of Employment and Immigration and of the
Forget Royal Commission on Unemployment Insurance (Canada 1986) were
very helpful in designing this data set.
Family Expenditure Survey (FAMEX) data imputations
The match using FAMEX data is principally designed to support the
modeling of commodity tax incidence at the household level. The
selection and the grouping of FAMEX income and expenditure variables
were based on the requirements of the commodity tax model and, thus, on
the structure and composition of personal expenditures in the Canadian
medium-level aggregation input-output tables. Expenditures that include
some indirect taxes and duties were placed in the corresponding
input-output personal expenditure category. Expenditures that did not
include an indirect tax or that included an indeterminate indirect tax
were placed in a residual category (e.g., real estate commissions).
Additional variables (e.g., income, taxes, and savings) were also
matched to complete the basic household accounting identity in which
income plus other money receipts equals expenditure plus saving.
Completing the household accounting identity allowed various simulation
options (for example, the allocation of a change in disposable income
between saving and consumption). Although a number of conceptual
differences still remain between FAMEX and the system of national
accounts on which the input-output tables are based, the SPSD and the
national accounts household sector aggregate expenditure estimates for
1984 are reasonably close (see Adler and Wolfson 1988).
Suppression of data
Public use versions of the host SCF data set already exist. The
data for all households that are, in whole or in part, already suppressed
in any of the public use SCF files have also been suppressed in the
SPSD.
Household duplication
Duplicates, or "clones," of individual SCF records have been
created because of the categorical matching of synthetic high-income tax
records and of UI claims records. If the record for at least one
individual in a household has been duplicated, duplication of the
records for all the other members of the household (with a corresponding
reduction in the sample weight) is required. This duplication ensures
that the records for all the members of the household continue to have
the same weight.
The Social Policy Simulation Model (SPSM)
The SPSD is the primary input to the SPSM, and a personal computer
with a hard disk is the minimal hardware required for the SPSM. The
SPSM also requires a set of commodity tax rates, a set of parameters for
all modeled tax and transfer programs (e.g., benefit levels, takeup
rates, and tax brackets), and a set of parameters to control the flow of
execution of the model.
The commodity tax rate parameters are supplied as default values.
In addition, a separate, but concordant, input-output model (which
includes a complete set of input-output tables) is provided as part of
the SPSD/M package. Using this model, users can alter the retail sales
tax rates and various "hidden" taxes (such as duties and
intermediate-level commodity tax rates) and then derive (under
alternative shifting assumptions) the equivalent retail sales tax rates.
This capability is very important in Canada where the Federal
intermediate-level commodity tax generates more revenue than the
corporate income tax.
The flexibility of the SPSM makes it possible, in one "run,"
to do one simulation; to do two simulations, comparing a base case with a
variant, or reform, scenario; or, in effect, to do four
simulations, comparing effective marginal tax rates (change in taxes
minus transfers divided by change in income) of a base scenario with
those of variant scenarios. (Note that a family's effective
marginal tax rate depends not only on which particular source of income
is varied but also on which member of the family receives an increment in that source of income. The SPSM can fully analyze such questions.)
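The effective marginal tax rate defined above (change in taxes minus transfers, divided by change in income) can be illustrated with a toy calculation. The tax and benefit rules below are entirely hypothetical and far simpler than those in the SPSM; the sketch also treats a single person, whereas the SPSM works at the family level and lets the increment go to a chosen family member and income source.

```python
def taxes_minus_transfers(earnings):
    """Hypothetical system: a 20% tax on earnings above 10,000 and a
    3,000 benefit that is reduced by 25 cents per dollar of earnings."""
    tax = 0.20 * max(0.0, earnings - 10_000)
    benefit = max(0.0, 3_000 - 0.25 * earnings)
    return tax - benefit

def effective_marginal_tax_rate(net_tax, income, increment=100.0):
    """Change in taxes minus transfers per extra dollar of income."""
    return (net_tax(income + increment) - net_tax(income)) / increment

for income in (5_000, 11_000, 20_000):
    rate = effective_marginal_tax_rate(taxes_minus_transfers, income)
    print(income, round(rate, 2))
```

In this toy system the rate is highest in the range where the tax and the benefit reduction overlap, which is the kind of interaction between programs that an effective marginal tax rate simulation is meant to expose.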
The capability to do simulations of effective marginal tax rates
partly addresses one major source of uncertainty in the model
results: behavioral response to significant changes in tax provisions or
transfer programs. According to economic theory, effective marginal tax
rates are a major determinant of behavioral responses to changes in
policy. Although the SPSM does not attempt to model such responses, it
allows the user to display the individuals and households that are most
likely to alter their behavior, as indicated by significant changes in
these rates.
The SPSM provides four kinds of outputs at the individual, at the
family, or at the household level. First, the SPSM has its own
cross-tabulation facility; alternatively, microdata files containing any
outputs or intermediate variables that are used by the model can be
written out. Second, these files can be in a compressed format
("results files") for subsequent use in other SPSM runs.
Third, the files can be in a form readily usable by the personal
computer SAS statistical software package. Fourth, standard format
ASCII files can also be an output. This flexibility in output formats
allows the convenient use of other standard packages. For example, a
spreadsheet interface that
can convert tabular output from the SPSM into a Lotus 1-2-3 or Symphony
worksheet is provided.
Additionally, the SPSM software has been designed so that the user
can modify it. Most users will want to run the model in the "black
box" mode in which the range of parameters and simulation
capacities is given. However, we fully expect that new policy options
that we have not anticipated in the model will inevitably arise.
Therefore, we have provided the facilities for modifying or adding
routines to the model so that sophisticated users can customize the SPSM
for a "glass box" mode of use.
Conclusions
The SPSD/M continues to be a work in progress. The first
commercial release of a 1984 version of the SPSD/M was in December 1988.
A 1986 version will be available in the spring of 1989.
The process of developing the SPSD/M has already had some valuable
spinoffs. For example, the experience has contributed to a revision in
the weighting system for Statistics Canada's monthly labor force
survey; this revision is based on similar integrated weighting
techniques that are being implemented. In the national accounting
context, the SPSD has provided a microfoundation for the household
sector (Ruggles and Ruggles 1986 and Adler and Wolfson 1988).
Moreover, the SPSM has already produced results that have been
useful for policy analyses, such as the Forget Royal Commission's
examination of the unemployment insurance system (Canada 1986), an
Ontario special task force's review of social assistance (Ontario
1988), an analysis of the impact of Federal personal income tax reform
(Maslove 1988), and the projections of the impact of Canada's aging
population on the fiscal structure of the Federal Government (Fellegi
1988).
In developing the SPSD, many methodological refinements have been
implemented to adjust for gaps and inaccuracies in the source data.
Further improvements are possible and will continue to be made as work
continues on the SPSD/M.
References
Adler, H.J., and M.C. Wolfson (1988), "A Prototype Micro-Macro Link
for the Canadian Household Sector," The Review of Income and Wealth,
Series No. 34 (December 1988).
Armstrong, J. (1989), "An Evaluation of Statistical Matching and
Imputation Techniques," Paper presented at the annual meeting of the
Statistical Society of Canada, Ottawa, May 1989.
Canada, Forget Royal Commission on Unemployment Insurance (1986),
Report of the Commission of Inquiry on Unemployment Insurance,
Ottawa: Queen's Printer.
Deming, W.E., and F.F. Stephan (1940), "On a Least Squares Adjustment
of a Sampled Frequency Table When the Expected Marginal Totals Are
Known," Annals of Mathematical Statistics 11 (1940): 427-44.
Dufour, J. (1988), "Quelques Methodes Palliatives au Probleme de
Non-Reportage," Mimeo, Social Survey Methods Division, Ottawa:
Statistics Canada.
Fellegi, I. (1988), "Can We Afford An Aging Society," Canadian
Economic Observer, Ottawa: Statistics Canada (October 1988).
Lemaitre, G., and J. Dufour (1987), "An Integrated Method for
Weighting Persons and Families," Survey Methodology 13 (1987): 199-207.
Maslove, A.M. (1988), "Distributional Impacts of Personal Income Tax
Reform, 1984 to 1988," Institute for Research on Public Policy
Discussion Paper No. 88.C.1, Ottawa.
Ontario (1988), Report of the Special Inquiry on Social Assistance,
Toronto: Queen's Park.
Paass, G. (1986), "Statistical Match: Evaluation of Existing
Procedures and Improvements By Using Additional Information,"
Microanalytic Simulation Models to Support Social and Financial
Policy, Edited by G.H. Orcutt, J. Merz, and H. Quinke, Amsterdam:
Elsevier Science Publishers.
Paass, G. (1989), "Stochastic Generation of a Synthetic Sample from
Marginal Information," Journal of Business and Economic Statistics.
Forthcoming.
Rodgers, W.L. (1984), "An Evaluation of Statistical Matching,"
Journal of Business and Economic Statistics 2 (January 1984).
Rubin, D.B. (1986), "Statistical Matching Using File Concatenation
With Adjusted Weights and Multiple Imputations," Journal of Business
and Economic Statistics 4 (January 1986).
Ruggles, R., and N. Ruggles (1986), "The Integration of Micro and
Macro Data for the Household Sector," The Review of Income and
Wealth, Series No. 32 (September 1986).
Singh, A.C., J.B. Armstrong, and G.E. Lemaitre (1988), "Log-Linear
Imputation and Its Application to File Merging," Paper presented at
the annual meeting of the American Statistical Association, New
Orleans, August 1988.