Economic analysis and statistical disclosure limitation.
Abowd, John M. ; Schmutte, Ian M.
ABSTRACT This paper explores the consequences for economic research
of methods used by data publishers to protect the privacy of their
respondents. We review the concept of statistical disclosure limitation
for an audience of economists who may be unfamiliar with these methods.
We characterize what it means for statistical disclosure limitation to
be ignorable. When it is not ignorable, we consider the effects of
statistical disclosure limitation for a variety of research designs
common in applied economic research. Because statistical agencies do not
always report the methods they use to protect confidentiality, we also
characterize settings in which statistical disclosure limitation methods
are discoverable; that is, they can be learned from the released data.
We conclude with advice for researchers, journal editors, and
statistical agencies.
**********
This paper is about the potential effects of statistical disclosure
limitation (SDL) on empirical economic modeling. We study the methods
that public and private providers use before they publish data. Advances
in SDL have unambiguously made more data available than ever before,
while protecting the privacy and confidentiality of identifiable
information on individuals and businesses. But modern SDL intrinsically
distorts the underlying data in ways that are generally not clear to the
researcher and that may compromise economic analyses, depending on the
specific hypotheses under study. In this paper, we describe how SDL
works. We provide tools to evaluate the effects of SDL on economic
modeling, as well as some concrete guidance to researchers, journal
editors, and data providers on assessing and managing SDL in empirical
research.
Some of the complications arising from SDL methods are highlighted
by J. Trent Alexander, Michael Davern, and Betsey Stevenson (2010).
These authors show that the percentage of men and women by age in
public-use microdata samples (PUMS) from Census 2000 and selected
American Community Surveys (ACS) differs dramatically from published
tabulations based on the complete census and the full ACS for
individuals age 65 and older. This result was caused by an acknowledged
misapplication of confidentiality protection procedures at the Census
Bureau. As such, it does not reflect a failure of this specific approach
to SDL. Indeed, it highlights the value to the Census Bureau of making
public-use data available--researchers draw attention to problems in the
data and data processing. Correcting these problems improves future data
publications.
This episode reflects a deeper tension in the relationship between
the federal statistical system and empirical researchers. The Census
Bureau does not release detailed information on the specific SDL methods
and parameters used in the decennial census and ACS public-use data
releases, which include data swapping, coarsening, noise infusion, and
synthetic data. Although the agency originally announced that it would
not release new public-use microdata samples that corrected the errors
discovered by Alexander, Davern, and Stevenson (2010), shortly after that
announcement it did release corrections for all the affected Census 2000
and ACS PUMS files. (1) The episode heightened concern about the
application of these SDL procedures without prior input from the data
analysts outside the Census Bureau who specialize in the use of these
PUMS files.
More broadly, this episode reveals the extent to which modern SDL
procedures are a black box whose effect on empirical analysis is not
well understood.
In this paper, we pry open the black box. First, we characterize
the interaction between modern SDL methods and commonly used econometric
models in more detail than has been done elsewhere. We formalize the
data publication process by modeling the application of SDL to the
underlying confidential data. The data provider collects data from a
frame defining an underlying, finite population, edits these data to
improve their quality, applies SDL, then releases tabular and
(sometimes) microdata public-use files. Scientific analysis is conducted
on the public-use files.
Our model characterizes the consequences for estimation and
inference if the researcher ignores the SDL, treating the published data
as though they were an exact copy of the clean confidential data.
Whether SDL is ignorable or not depends on the properties of the SDL
model and on the analysis of interest. We illustrate ignorable and
nonignorable SDL for a variety of analyses that are common in applied
economics.
A key problem with the approach of most statistical agencies to
modern SDL systems is that they do not publish critical parameters.
Without knowing these parameters, it is not possible to determine
whether the magnitude of nonignorable SDL is substantial. As the
analysis by Alexander, Davern, and Stevenson (2010) suggests, it is
sometimes possible to "discover" the SDL methods or features
based on related estimates from the same source. This ability to infer
the SDL model from the data is useful in settings where limited
information is available. We illustrate this method with a detailed
application in section IV.B.
For many analyses, SDL methods that have been properly applied will
not substantially affect the results of empirical research. The reasons
are straightforward. First, the number of data elements subject to
modification is probably limited, at least relative to more serious data
quality problems such as reporting error, item missingness, and data
edits. Second, the effects of SDL on empirical work will be most severe
when the analysis targets subpopulations where information is most
likely to be sensitive. Third, SDL is a greater concern, as a practical
matter, for inference on model parameters. Even when SDL allows unbiased
or consistent estimators, the variance of those estimators will be
understated in analyses that do not explicitly correct for the
additional uncertainty.
Arthur Kennickell and Julia Lane (2006) explicitly warned
economists about the problems of ignoring statistical disclosure
limitation methods. Like us, they suggested specific tools for assessing
the effects of SDL on the quality of empirical research. Their
application was to the Survey of Consumer Finances, which was the first
American public-use product to use multiple imputation for editing,
missing-data imputation, and SDL (Kennickell 1997). Their analysis was
based on the efforts of statisticians to explicitly model the trade-off
between confidentiality risk and data usefulness (Duncan and Fienberg
1999; Karr and others 2006).
The problem for empirical economics is that statistical agencies
must develop a general-purpose strategy for publishing data for public
consumption. Any such publication strategy inherently advantages certain
analyses over others. Economists need to be aware of how the data
publication technology, including its SDL aspects, might affect their
particular analyses. Furthermore, economists should engage with data
providers to help ensure that new forms of SDL reflect the priorities of
economic research questions and methods. Looking to the future,
statisticians and computer scientists have developed two related ways to
address these issues more systematically: synthetic data combined with
validation servers and privacy-protected query systems. We conclude with
a discussion of how empirical economists can best prepare for this
future.
I. Conceptual Framework and Motivating Examples
In this section we lay out the conceptual framework that underlies
our analysis, including our definitions of ignorable versus nonignorable
SDL. We also offer two motivating examples of SDL use that will be
familiar to social scientists and economists: randomized response for
eliciting sensitive information from survey respondents and the effect
of topcoding in analyzing income quantiles.
I.A. Key Concepts
Our goal is to help researchers understand when the application of
SDL methods affects the analysis. To organize this discussion, we
introduce key concepts that we develop in a formal model in the online
appendix. We assume the analyst is interested in estimating features of
the model that generated the confidential data. However, the analyst
only observes the data after the provider has applied SDL. The SDL is,
therefore, a distinct part of the process that generates the published
data.
We say the SDL is ignorable if the analyst can recover the
estimates of interest and make correct inferences using the published
data without explicitly accounting for SDL--that is, by using exactly
the same model as would be appropriate for the confidential data. In
applied economic research it is common to implicitly assume that the SDL
is ignorable, and our definition is an explicit extension of the related
concept of ignorable missing data.
If the data analyst cannot recover the estimate of interest without
the parameters of the SDL model, the SDL can then be said to be
nonignorable. In this case, the analyst needs to perform an SDL-aware
analysis. However, the analyst can only do so if either (i) the data
provider publishes sufficient details of the SDL model's
application to the confidential data, or (ii) the analyst can recover
the parameters of the SDL model based on prior information and the
published data. In the first case, we call the nonignorable SDL known.
In the second case, we call the nonignorable SDL discoverable.
I.B. Motivating Examples
Consider two examples of SDL familiar to most social scientists.
The first is randomized response, which allows a respondent to answer a
sensitive question truthfully without revealing the answer to the
interviewer. This yields more accurate responses, since respondents are
more likely to answer truthfully, but at the cost of adding noise to the
data. The second example is income topcoding, which is a form of SDL
that protects the privacy of high-income households. This example
highlights the fact that the ignorability of SDL is a function not just
of the SDL method but also of the estimand of interest.
RANDOMIZED RESPONSE Stanley Warner (1965) proposed a survey
technique in which the respondent is presented with one of two questions
that can both be answered either "yes" or "no." The
interviewer does not know the question. The respondent opens an envelope
drawn from a basket of identical envelopes, reads the question silently,
responds "yes" or "no," and then destroys the
question. With a certain probability the question is sensitive (for
example, "Have you ever committed a violent crime?"), and with
a complementary probability the question is innocuous (for example,
"Is your birthday between July 1st and December 31st?").
Again, the interviewer records only the "yes" or
"no" answer and never sees the true question.
If one runs this single-question survey on a sample of 100 people
chosen randomly, the estimated proportion of "yes" answers has
an expected value equal to the probability that the respondent was asked
the sensitive question times the population probability (in our example)
of having committed a violent crime plus the complement of the
probability that the respondent was asked the sensitive question times
one-half. If the sample mean proportion of "yes" answers is 26
percent, then to recover the implied estimate for the population
probability of having committed a violent crime one needs to know the
probability that the sensitive question was asked. The standard error of
the estimated proportion of "yes" answers is 4.4 percent, but
the standard error for the estimated population proportion of having
committed a violent crime is 4.4 percent divided by the probability that
the respondent was asked the sensitive question.
Why is this a form of statistical disclosure limitation? Because no
one other than the respondent knows which question was asked, this
procedure places bounds on the amount of information that anyone,
including the interviewer, can learn about the respondent's answer
to the sensitive question. (See section II.B for a complete discussion.)
This form of SDL is obviously not ignorable. The data analyst does not
care about the 26 percent but wants to estimate the proportion of people
who have committed a violent crime. The data publisher adds the
following documentation about the SDL parameters: Only half the
respondents were asked the sensitive question; the other half were asked
a question for which half the people in the population would answer
"yes." Now the analyst can estimate that the proportion who
committed a violent crime is 2 percent, and its standard error is 8.8
percent. Notice that the SDL affected both the mean and the standard
error of the estimate.
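The arithmetic of the worked example above can be sketched in a few lines. The design parameters are those stated in the text: the sensitive question is asked with probability one-half, the innocuous question is a true "yes" with probability one-half, and 26 of 100 respondents answered "yes."

```python
# Randomized-response estimator from the worked example in the text.
import math

q = 0.5            # P(sensitive question asked), from the text
p_innocuous = 0.5  # P("yes" | innocuous question), from the text
n, yes = 100, 26

p_hat = yes / n                                # observed "yes" share, 0.26
se_p = math.sqrt(p_hat * (1 - p_hat) / n)      # ~0.044

# E[p_hat] = q * pi + (1 - q) * p_innocuous, so invert for pi:
pi_hat = (p_hat - (1 - q) * p_innocuous) / q   # 0.02
se_pi = se_p / q                               # ~0.088

print(round(pi_hat, 3), round(se_pi, 3))
```

Note that both the point estimate and its standard error must be rescaled by the known randomization probability; this is exactly the SDL-aware correction the text describes.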
CONSEQUENCES OF TOPCODING FOR QUANTILE ESTIMATION Richard
Burkhauser and others (2012) provide a simple, vivid example of the
consequences of SDL for economic analysis. Because of SDL, changes in
the upper tail of the income distribution are largely hidden from view
in research based on public-use microdata, most often the Current
Population Survey (CPS). Because income is a sensitive data item, and
large incomes can be particularly revealing in combination with other
information, the Census Bureau and the Bureau of Labor Statistics both
censor incomes above a certain threshold in their public-use files. The
topcoding of income protects privacy, but it also limits what can be
done with the data.
Burkhauser and others (2012) report that the income topcode results
in 4.6 percent of observations being censored. Thus, the topcoded data
are perfectly fine for measuring the evolution of the 90-10 quantile
ratio but completely useless for measuring the evolution of incomes
among the top 1 percent of households, as was revealed when Thomas
Piketty and Emmanuel Saez (2003) analyzed uncensored income data based
on Internal Revenue Service (IRS) tax filings. Piketty and Saez (2003)
showed that trends in income inequality look quite different in the
administrative record data than in the CPS. Using restricted-access CPS
data, Burkhauser and others (2012) showed that the difference between
the administrative and survey data was largely due to censoring in the
survey data.
If we could observe all the confidential data, Y, they would have
probability density function p_Y(y) and cumulative distribution
function F_Y(y). For studying income inequality, interest centers on
the quantiles of F_Y, defined by the inverse cumulative distribution
function Q_Y. When drawing inferences about the quantiles of the income
distribution, topcoding is irrelevant for all quantiles that fall below
the topcoding threshold, T. We say topcoding is ignorable if, for a
given quantile point of interest p ∈ [0, 1], Q_Z(p) = Q_Y(p), where
Q_Z(p) is the quantile function of the published data, Z.
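A minimal simulation makes the ignorability condition concrete. The income draws and the placement of the topcode (censoring roughly 4.6 percent of records, the share reported in the text) are illustrative assumptions, not CPS data.

```python
# Sketch: topcoding is ignorable for quantiles below the threshold T,
# nonignorable above it. Synthetic lognormal incomes, not CPS data.
import random

random.seed(0)
incomes = [random.lognormvariate(10.5, 0.8) for _ in range(100_000)]

# Place the topcode so that ~4.6 percent of records are censored.
T = sorted(incomes)[int(0.954 * len(incomes))]
published = [min(y, T) for y in incomes]

def quantile(data, p):
    s = sorted(data)
    return s[int(p * (len(s) - 1))]

# Below the topcode, published quantiles equal confidential quantiles...
assert quantile(published, 0.90) == quantile(incomes, 0.90)
# ...but the top of the distribution is censored at T.
assert quantile(published, 0.99) == T < quantile(incomes, 0.99)
```

The 90th percentile, and hence the 90-10 ratio, survives topcoding intact, while the 99th percentile of the published data is simply the topcode itself.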
This very familiar example highlights several features of ignorable
and nonignorable SDL. First, whether SDL can be ignored depends on both
the properties of the SDL mechanism and the specific estimand of
interest. Second, assessing the effect of SDL requires knowledge of the
mechanism. If the value of the topcode threshold T were not published,
it would not be possible for the researcher to assess whether a specific
quantile of interest could be learned from the published data. The
researcher might learn the topcode by inspecting the published data. In
this case, we say the topcode is a discoverable form of SDL.
The work of Jeff Larrimore and others (2008) also illustrates how,
when armed with information about SDL methods and access to the
confidential data, researchers can improve their analysis with minimal
change to the risk of harmful or unlawful data disclosure. Larrimore and
others (2008) published new data for 24 separate income series for
1976-2006 that contain the mean values of incomes above the topcode
values within cells, disaggregated by race, gender, and employment
status. They show that these cell means can be used with the public-use
CPS microdata to analyze the income distribution in ways that would
otherwise require direct access to the confidential microdata.
In the randomized response example, the SDL model is known as long
as the probability that the sensitive question was asked is disclosed.
Without disclosure of this probability, the researcher is unable to
perform an SDL-aware analysis because it is not discoverable. By
contrast, an undisclosed topcode level may still be discoverable by a
researcher through inspection of the data.
II. The Basics of Statistical Disclosure Limitation
The key principle of confidentiality is that individual information
should only be used for the statistical purposes for which it was
collected. Moreover, that information should not be used in a way that
might harm the individual (Duncan, Jabine, and de Wolf 1993, p. 3). This
principle embodies two distinct ideas. First, individuals have a
property right of privacy covering their personal information. Second,
once such personal data have been shared with a trusted curator,
individuals should be protected against uses that could lead to harm.
These ideas are reflected in the development and implementation of SDL
among data providers. For the United States, the Federal Committee on
Statistical Methodology (Harris-Kojetin and others 2005) has produced a
very thorough summary of the objectives and practices of SDL.
The constant evolution of information technology makes it
challenging to translate the principle of confidentiality into policy
and practice. The statutes that govern how statistical agencies approach
SDL explicitly prohibit any breach of confidentiality. (2) However,
statisticians and computer scientists have formally proven that it is
impossible to publish data without compromising confidentiality, at
least probabilistically. We touch in our conclusion on how public policy
should adapt in light of new ideas about SDL and privacy protection. The
current period of tension also characterizes the broader co-evolution of
science and public policy around SDL, which we briefly review.
II.A. What Does SDL Protect?
SDL may appear to protect against unrealistic, fictitious, or
overblown threats. Reports of data security breaches, in which hackers
abscond with terabytes of sensitive individual information, are
increasingly common, but it has been roughly six decades since the last
reported breach of data privacy within the federal statistical system
(Anderson and Seltzer 2007, for household data; Anderson and Seltzer
2009, for business data). One is hard-pressed to find a report of the
American Community Survey, for example, being "hacked." Yet it
is important to acknowledge that the principle of confidentiality for
statistical agencies arose from very real and deliberate attempts by
other government agencies to use the data collected for statistical
purposes in ways that were directly harmful to specific individuals and
businesses.
Laws to protect data confidentiality arose from the need to
separate the statistical and enforcement activities of the federal
government (Anderson and Seltzer 2007; 2009). These laws were
subsequently weakened and violated in a small but influential number of
cases. For example, the U.S. government obtained access to confidential
decennial census information to help locate German and Japanese
Americans during World Wars I and II, and from the economic census to
assist with war planning. The privacy laws were subsequently
strengthened, in part because businesses were quite reluctant to provide
information to the Census Bureau for fear that it could either be used
for tax or antitrust proceedings or be used by their competitors to
reveal trade secrets. The statistical agencies therefore also have a
pragmatic interest in laws that protect individual and business
information against intrusions by other parts of the federal and state
governments, since these laws directly affect willingness to participate
in censuses and surveys.
The modern proliferation of data and advances in computing
technology have led to new concerns about data privacy. We now
understand that it is possible to identify an individual from a very
small number of demographic attributes. In a much-cited study, Latanya
Sweeney (2000) shows how then-public hospital records might
be linked to survey data to compromise confidentiality. Arvind Narayanan
and Vitaly Shmatikov (2008) show that supposedly anonymous user data
published by Netflix can be re-identified. Although no harm was
documented in these cases, they highlight the potential for harm in the
world of big data.
Paul Ohm (2010) argues that for every individual there may be a
"database of ruin" that can be constructed by linking together
existing nonruinous data. That is, there may be one database with some
embarrassing or damaging information, and another database with
personally identifiable information to which it may be linked, perhaps
through a sequence of intermediate databases. In some cases, there are
clear financial incentives to seek out such a database of ruin. A
potential employer or insurer may have an interest in learning health
information that a prospective employee would rather not disclose. If
such information could be easily and cheaply gleaned by combining
publicly available data, economic intuition suggests that firms might do
so, despite the absence of documented instances of such behavior. An
alternative perspective is offered by Jane Yakowitz (2011), who argues
for legal reforms that reduce the emphasis on hypothetical threats to
privacy and expand the emphasis on the benefits from providing accurate,
timely socioeconomic data.
II.B. Concepts and Methods of SDL
Modern SDL methods are designed to allow high-quality statistical
information to be published while protecting confidentiality. Since many
applied researchers may have an incomplete awareness of and knowledge
about the ways in which SDL distorts published data, we provide an
overview of the most common SDL methods applied to economic and
demographic data. For a more technical and detailed treatment, we refer
the reader to two recent works on SDL and formal privacy models:
Statistical Confidentiality: Principles and Practice by George Duncan,
Mark Elliot, and Juan-Jose Salazar-Gonzalez (2011), and "The
Algorithmic Foundations of Differential Privacy" by Cynthia Dwork
and Aaron Roth (2014).
A TAXONOMY OF THREATS TO CONFIDENTIALITY Confidentiality may be
violated in many related ways. An identity disclosure occurs if the
identity of a specific individual is completely revealed in the data.
This can occur because a unique identifier is released or because the
information released about a respondent is enough to uniquely identify
him or her in the data. An attribute disclosure occurs when it is
possible to deduce from the published data a specific confidential
attribute of a given respondent.
Modern SDL and formal privacy systems treat disclosure risk
probabilistically. From this perspective, the problem is not merely that
published data might perfectly identify a respondent or his or her
attributes. Rather, it is that the published data might allow a user to
infer a respondent's identity or attributes with high probability.
This concept, known as inferential disclosure, was introduced by Tore
Dalenius (1977) and formalized by Duncan and Diane Lambert (1986) in
statistics, and by Shafi Goldwasser and Silvio Micali (1982) in computer
science.
Suppose the published data are denoted Z. A confidential variable
y_i is associated with a specific respondent i. The prior beliefs of a
user about the value of y_i are represented by a probability
distribution, p(y_i), that reflects information from all other
sources. Then p(y_i|Z) represents the updated--posterior--beliefs
of the user about the value of y_i after the data Z are published.
An inferential disclosure has occurred if the posterior probability of
the true value is too large relative to its prior probability.
Our example of randomized response from section I.B provides
intuition about inferential disclosure. The probability that the
respondent will answer "yes" given that the truth is
"yes" is 75 percent. The probability that the respondent will
answer "yes" given that the truth is "no" is 25
percent. These two probabilities are entirely determined by the
probability that the respondent was asked the sensitive question and the
probability that the answer to the innocuous question is
"yes." They do not depend on the unknown population
probability of having committed a violent crime. The ratio of these two
probabilities is the Bayes factor--the ratio of the posterior odds that
the truth is "yes" versus "no" given the survey
answer "yes" to the prior odds of "yes" versus
"no." The interviewer learns from a "yes" answer
that the respondent is three times as likely as a random person to have
committed a violent crime, and that is all the interviewer learns. Had
the violent crime question been asked directly, the interviewer could
have updated his posterior beliefs by a much larger factor--potentially
infinite if the respondent answers truthfully.
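The Bayes-factor calculation described above follows directly from the design probabilities given in the text (sensitive question asked with probability one-half; innocuous question a true "yes" with probability one-half):

```python
# Bayes-factor bound implied by the randomized-response design in the text.
q, p_innocuous = 0.5, 0.5   # design probabilities, from the text

# Likelihood of a "yes" answer given the truth about the sensitive item:
p_yes_given_true = q * 1.0 + (1 - q) * p_innocuous   # 0.75
p_yes_given_false = q * 0.0 + (1 - q) * p_innocuous  # 0.25

# Ratio of posterior odds to prior odds after observing a "yes":
bayes_factor = p_yes_given_true / p_yes_given_false  # 3.0
print(bayes_factor)
```

The bound is a property of the mechanism alone: no matter what the interviewer believed beforehand, a "yes" answer shifts the odds by at most a factor of three.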
Moving forward, it is important to keep the concept of inferential
disclosure in mind for two reasons. First, it leads to a key intuition:
It is impossible to publish useful data without incurring some threat to
confidentiality. A privacy protection scheme that provably eliminates
all inferential disclosures is equivalent to a full encryption of the
confidential data and therefore useless for analysis. (3) Second, to be
effective against inferential disclosure, certain SDL methods require
that statistical agencies also conceal the details of their
implementation. For example, with swapping, knowledge of the swap rate
would increase inferential disclosure risk by improving the user's
knowledge of the full data publication process. We will argue later that
researchers, and agencies, should prefer SDL methods whose details can
be made publicly available.
II.C. SDL Methods for Microdata
SUPPRESSION Suppression is one of the most common forms of SDL.
Suppression can be used to eliminate an entire record from the data or
to eliminate an entire attribute. Record-level suppression is ignorable
under the same assumptions that lead to ignorable missing data models in
general. However, if the suppression rule is based on data items deemed
to be sensitive, then it is very unlikely that the data were suppressed
at random. In that case, knowledge of the suppression rule along with
auxiliary information from the underlying microdata is extremely useful
in assessing the effect of suppression on any specific application.
Sometimes suppression is combined with imputation; this occurs when
sensitive information is suppressed and then replaced with an imputed
value.
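A toy calculation illustrates why suppression keyed to a sensitive item is rarely ignorable. The records and the suppression rule (drop incomes above 200,000) are illustrative assumptions, not any agency's actual rule.

```python
# Sketch: suppression based on the sensitive value itself is not
# "missing at random," so naive analysis of the published file is biased.
import statistics

confidential = [30_000, 45_000, 60_000, 120_000, 250_000, 400_000]

# Record-level suppression keyed to the sensitive item (assumed rule):
published = [y for y in confidential if y <= 200_000]

mean_true = statistics.fmean(confidential)  # includes suppressed records
mean_pub = statistics.fmean(published)      # biased downward
assert mean_pub < mean_true                 # suppression here is nonignorable
```

Knowing the suppression rule, a researcher could at least bound the bias; without it, the published mean silently understates the confidential one.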
AGGREGATION Aggregation refers to the coarsening of values a
variable can take, or the combination of information from multiple
variables. The canonical example is the Census Bureau's practice of
aggregating geographic units into Public-Use Microdata Areas (PUMAs).
Likewise, data on occupation are often reported in broad aggregates. The
aggregation levels are deliberately set in such a way that the number of
individuals represented in the data have some combination of attributes
that exceeds a certain threshold. Aggregation is what prevents a user
from, say, looking up the income of a 42-year-old economist living in
Washington, D.C. Other forms of aggregation are quite familiar to
empirical researchers, such as topcoding income, and reporting income in
bins rather than in levels. These methods are well understood by
researchers, and their effects on empirical work have been carefully
studied. In many cases, it is easy to determine whether aggregation is a
problem for a particular research application; in such cases, one
possible solution is to obtain access to the confidential, disaggregated
data.
NOISE INFUSION Noise infusion is a method in which the underlying
microdata are distorted using either additive or multiplicative noise.
The infusion of noise is not generally ignorable. If applied correctly,
noise infusion can preserve conditional and unconditional means and
covariances, but it always inflates variances and leads to attenuation
bias in estimated regression coefficients and correlations among the
attributes (Duncan, Elliot, and Salazar-Gonzalez 2011, p. 113). To
assess the effects for any particular application, researchers need to
know which variables have been infused with noise along with information
about any relevant parameters governing the distribution of noise. If
such information is not published, it may be possible to infer the noise
distribution from the public-use data if there are multiple releases of
information based on the same underlying frame. We illustrate this
possibility in our analysis of the public-use Quarterly Workforce
Indicators (QWI), Quarterly Census of Employment and Wages (QCEW), and
County Business Patterns (CBP) data in section IV.B.
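The attenuation mechanism can be demonstrated in a short simulation. The noise scale and the regression are illustrative assumptions; no agency's actual noise-infusion parameters are used.

```python
# Sketch: additive noise infusion on a regressor attenuates the OLS slope
# toward zero while the mean is preserved. Pure simulation.
import random

random.seed(1)
n = 50_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

sigma_noise = 1.0  # assumed SDL noise scale
x_pub = [xi + random.gauss(0, sigma_noise) for xi in x]

def ols_slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    return cov / var

b_true = ols_slope(x, y)       # ~2.0 on the confidential data
b_noisy = ols_slope(x_pub, y)  # attenuated toward ~1.0 here, since the
                               # reliability ratio var(x)/(var(x)+1) = 0.5
```

With the noise variance known, the analyst could rescale b_noisy by the reliability ratio; without it, the attenuation is invisible, which is the discoverability problem the text emphasizes.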
DATA SWAPPING Data swapping is the practice of switching the values
of a selected set of attributes for one data record with the values
reported in another record. The goal is to protect the confidentiality
of sensitive values while maintaining the validity of the data for
specific analyses. To implement swapping, the agency develops an index
based on the probability that an individual record can be re-identified.
(4) Sensitive records are compared to "nearby" records on the
basis of a few variables. If there is a match, the values of some or all
of the other variables are swapped. Usually, the geographic identifiers
are swapped, thus effectively relocating the records in each
other's location.
For example, in Athens, Georgia, there may be only one male
household head with 10 children. If that man participates in the ACS and
reports his income, it would be possible for anyone to learn his income
by simply reading the unswapped ACS. To protect confidentiality, the
entire data record can be swapped with the record of another household
in a different geographic area with a similar income.
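The swap described above can be sketched as follows. The records and the matching rule (nearest income) are toy assumptions; actual agency swapping algorithms are confidential and considerably more elaborate.

```python
# Sketch of geographic data swapping: the sensitive record trades its
# location with a "nearby" record (here, the closest-income match).
records = [
    {"id": 1, "area": "Athens, GA", "income": 95_000, "children": 10},
    {"id": 2, "area": "Macon, GA",  "income": 93_000, "children": 2},
    {"id": 3, "area": "Albany, GA", "income": 40_000, "children": 1},
]

def swap_geography(data, sensitive_id):
    """Swap the area of the sensitive record with its closest-income match."""
    out = [dict(r) for r in data]  # leave the input untouched
    target = next(r for r in out if r["id"] == sensitive_id)
    partner = min((r for r in out if r["id"] != sensitive_id),
                  key=lambda r: abs(r["income"] - target["income"]))
    target["area"], partner["area"] = partner["area"], target["area"]
    return out

published = swap_geography(records, sensitive_id=1)
# The marginal distribution of areas is preserved...
assert sorted(r["area"] for r in published) == sorted(r["area"] for r in records)
# ...but the joint distribution of (area, children) is distorted.
```

The unusual household is no longer identifiable in Athens, but any analysis relating geography to family structure now uses a deliberately falsified joint distribution.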
Swapping preserves the marginal distribution of the variables used
to match the records at the cost of all joint and conditional
distributions involving the swapped variables. The computer science
community has frequently criticized this approach to confidentiality
protection because it does not meet the "cryptography"
standard: an encryption algorithm is provably secure when all details
and parameters, except the encryption key, can be made public without
compromising the algorithm. SDL algorithms like swapping are not
provably effective when too many of their parameters are public. That is
why the agencies do not publish them or release more than a few details
of their swapping procedures.
The lack of published details is what makes input data swapping so
insidious for empirical research. Matching variables, the definition of
"nearby," and the rate at which sensitive and nonsensitive
records are swapped can all affect the data analyses that use those
variables, so parameter confidentiality makes it difficult to analyze
the effects of swapping. Furthermore, even restricted-access
arrangements that permit use of the confidential data may still require
the use of the swapped version, even if other SDL modifications of the
data have been removed. Some providers even destroy the unswapped data.
SYNTHETIC MICRODATA Synthetic microdata involve the publication of
a data set with the same structure as the confidential data, in which
the published data are drawn from the same data-generating process as
the confidential data but some or all of the confidential data have been
suppressed and imputed. The confidential data, Y, are generated by a
model, p(Y|θ), parameterized by θ. The synthetic microdata are drawn
from p(Ỹ|Y), the posterior predictive distribution for new data given
the observed data, which has been estimated by the statistical agency.
When originally proposed by Roderick Little (1993) and Donald Rubin
(1993), synthetic data methods mimicked procedures that already existed
for missing-data problems. Synthetic data methods impose an explicit
cost on the researcher--imputed data replacing actual data--in exchange
for an explicit benefit, namely the correct estimation and inference
procedures that are available for the synthetic data. The Little-Rubin
forms of synthetic data analysis are guaranteed to be SDL-aware. If the
researcher's hypothesis is among those for which correct inference
procedures are available, then the synthetic data are provably
analytically valid. John Abowd and Simon Woodcock (2001), Trivellore
Raghunathan, Jerome Reiter, and Rubin (2003), and Reiter (2004) have
refined the Little-Rubin methods, allowing them to be applied to complex
survey data and combined with other missing data imputations. They have
also shown that the class of hypotheses with provable analytical
validity is limited by the models used to estimate p([??]|Y).
Synthetic data can only be used by themselves for certain types of
research questions--those for which they are analytically valid. This
set of hypotheses depends on the model used to generate the synthetic
data. For example, if the confidential data are 10 discrete variables
and the synthetic data are generated from a model that includes all
possible two-way interactions among these variables, then any research
question involving only two variables can be analyzed in a correct,
SDL-aware manner from the synthetic data. The analyst does not need
access to the confidential data. But no model involving three or more
variables can be analyzed correctly from the synthetic data. Such models
require that the analyst have access to the confidential data. When the
model used to produce the synthetic data is publicly available,
researchers can assess whether a given synthetic data set is appropriate
for a specific question.
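The two-variable example can be made concrete. A sketch (ours, not a production synthesizer), assuming a toy model that captures only the joint distribution of the first two variables plus the marginal of a third:

```python
# Synthesis from a model with only two-way structure: queries involving
# (x1, x2) are answered validly by the synthetic data; the relationship
# of x3 to (x1, x2) jointly is destroyed because the model omits it.
import random
from collections import Counter

rng = random.Random(42)
n = 20000
# Confidential data: x3 depends on x1 AND x2 jointly (a three-way feature).
conf = []
for _ in range(n):
    x1, x2 = rng.random() < 0.5, rng.random() < 0.5
    x3 = rng.random() < (0.8 if (x1 and x2) else 0.2)
    conf.append((x1, x2, x3))

# "Synthesizer": estimate p(x1, x2) and the marginal p(x3), then draw
# x3 independently -- i.e., the model contains only two-way structure.
joint12 = Counter((a, b) for a, b, _ in conf)
p3 = sum(c for *_, c in conf) / n
cells, weights = zip(*joint12.items())
synth = []
for _ in range(n):
    a, b = rng.choices(cells, weights=weights)[0]
    synth.append((a, b, rng.random() < p3))

def prop(data, cond):
    sel = [row for row in data if cond(row)]
    return sum(r[2] for r in sel) / len(sel)

# The three-way feature survives in the confidential data only.
print(round(prop(conf, lambda r: r[0] and r[1]), 2))   # high
print(round(prop(synth, lambda r: r[0] and r[1]), 2))  # flattened to p(x3)
```

A researcher whose hypothesis involves only (x1, x2) loses nothing; one whose hypothesis involves x3 conditional on both needs the confidential data.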
Synthetic data can also be used as a framework for the development
of models, code, and hypotheses. For example, researchers can sometimes
develop models using the synthetic data, which are public, and then run
those models on the confidential data. These applications form part of a
feedback loop in which external researchers help provide improvements to
the synthetic data model. We discuss synthetic data and the feedback
loop in more detail in section VI.A.
FORMAL PRIVACY MODELS Formal privacy models emerged from database
security and cryptography. The idea is to model the publication of data
by the statistical agency using a randomized mechanism that answers
statistical questions after adding noise to the properly computed answer
in the confidential data. This is known in SDL as output distortion.
Breaches of privacy are modeled as a game between users, who try to make
inferential disclosures from the published data, and the statistical
agency, which tries to limit these disclosures.
Dwork (2006) and Dwork and others (2006) formalized the privacy
protection associated with output-distortion SDL in a model called
[epsilon]-differential privacy. For economists, Ori Heffetz and Katrina Ligett
(2014) provide a very accessible introduction. Dwork and Roth (2014), in
section 3, use our running example of randomized response to
characterize [epsilon]-differential privacy. In [epsilon]-differential
privacy, the SDL must put an upper bound, exp([epsilon]), on the Bayes
factor. In our example, [epsilon] = ln(Bayes factor bound) = ln 3 =
1.1. Bounding the Bayes factor implies that the maximum amount the
interviewer can learn from a "yes" answer is that the
respondent (in our original example) is three times as likely as a
random person in the population to have committed a violent crime.
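The arithmetic behind [epsilon] = ln 3 can be checked directly. A minimal sketch, assuming the standard randomized-response design in which the respondent answers truthfully with probability 1/2 and otherwise reports the outcome of a fair coin flip:

```python
# Warner-style randomized response: the likelihood of a "yes" report is
# 3/4 for a respondent whose true answer is "yes" and 1/4 otherwise, so
# the Bayes factor is bounded by 3 and epsilon = ln 3.
import math

def prob_yes(truly_yes: bool) -> float:
    # P(report "yes" | true answer) = 1/2 truthful + 1/2 * fair coin
    return 0.5 * (1.0 if truly_yes else 0.0) + 0.5 * 0.5

bayes_factor = prob_yes(True) / prob_yes(False)  # (3/4) / (1/4)
epsilon = math.log(bayes_factor)
print(bayes_factor)        # 3.0
print(round(epsilon, 1))   # 1.1
```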
With formal privacy-protected data publication systems, there are
provable limits to the amount of privacy loss that can be experienced in
the population even under worst-case outcomes. These systems also have
provable accuracy for a specific set of hypotheses. From a researcher
perspective, then, formal privacy systems and synthetic data are very
similar--only some hypotheses can be studied accurately, and these are
determined by the statistical queries answered in the formal privacy
model. For example, in a case where the confidential data are, once
again, 10 discrete variables, and the formal privacy system publishes a
protected version of every two-way marginal table, then, once again, any
hypothesis involving only two variables can be studied correctly.
Likewise, no hypotheses involving three or more variables can be studied
correctly without additional privacy-protected publications. Whether
these computations can be safely performed by the formal privacy system
depends on whether any privacy budget remains. If the privacy budget has
been exhausted by publishing all two-way tables, then no further
analysis of the confidential data is permitted.
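The budget mechanics can be sketched as follows (our illustration, not any agency's production system; the `PrivacyBudget` class and its interface are hypothetical). Each published table consumes part of a fixed privacy-loss budget; once it is spent, no further queries can be answered:

```python
# Laplace-mechanism release of tables under a total privacy-loss budget.
import math
import random

class PrivacyBudget:
    """Tracks cumulative privacy loss; refuses queries once exhausted."""
    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def release_table(self, table, epsilon, seed=0):
        if epsilon > self.remaining + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        rng = random.Random(seed)
        scale = 1.0 / epsilon  # counts have sensitivity 1
        noisy = {}
        for cell, count in table.items():
            # inverse-CDF draw from a Laplace(0, scale) distribution
            u = rng.random() - 0.5
            noisy[cell] = count - scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
        return noisy

budget = PrivacyBudget(total_epsilon=1.0)
two_way = {("male", "employed"): 120, ("male", "unemployed"): 30,
           ("female", "employed"): 110, ("female", "unemployed"): 25}
release = budget.release_table(two_way, epsilon=0.5)
print(budget.remaining)  # 0.5 left; a further 0.6-epsilon query is refused
```

The refusal in the last line is the formal analogue of the statement above: once the two-way tables exhaust the budget, no three-way analysis can be run.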
Synthetic data and formal privacy methods are converging. In the
SDL literature, researchers now analyze the confidentiality protection
provided by the synthetic data (Kinney and others 2011; Benedetto and
Stinson 2015; Machanavajjhala and others 2008). In the formal privacy
literature, analysts may choose to publish the privacy-protected output
as synthetic data--that is, in a format that allows an analyst to use
the protected data as if they were the confidential data (Hardt, Ligett,
and McSherry 2012). The analysis of synthetic data produced by a formal
privacy system is not automatically SDL-aware. The researcher has to use
the published features of the privacy model to correct the estimation
and the inference.
II.D. SDL Methods for Tabular Data
Tabular data present confidentiality risks when the number of
entities contributing to a particular cell in a table is small or the
influence of a few of the entities on the value of the cell is large,
such as for magnitudes like total payroll. A sensitive cell is one for
which some function of the cell's microdata falls above or below a
threshold set by an agency-specific rule. The two most common methods
for handling sensitive cells are randomized rounding, which distorts
the sensitive cell's value and may distort other cells as well, and
suppression, the more common of the two. An alternative to suppression is to
build tables after adding noise to the input microdata.
SUPPRESSION Suppression deletes the values for sensitive cells from
the published data. From the outset, it was understood that primary
suppression--not publishing easily identified data items--does not
protect anything if an agency publishes the rest of the data, including
summary statistics (Fellegi 1972). In such a case, users could infer the
missing items from what was published. Agencies that rely on suppression
for tabular data make complementary suppressions to reduce the
probability that a user can infer the sensitive items from the published
data.
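A toy example of Fellegi's point: with the row margin published, a lone primary suppression is recoverable by subtraction, which is why complementary suppressions are required (the firm names and values are hypothetical):

```python
# Published row of a magnitude table with one primary suppression (None)
# and its published row total.
cells = {"firm A": 1200, "firm B": None, "firm C": 800}
row_total = 2500

# Any user can recover the suppressed value exactly by subtraction:
recovered = row_total - sum(v for v in cells.values() if v is not None)
print(recovered)  # 500 -- the "protected" value for firm B
```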
Suppressions introduce a missing-data problem for researchers.
Whether that missing-data problem is ignorable or not depends on the
nature of the model being analyzed and the manner in which suppression
is done. An analysis using geographical variation for identification
will benefit from using data where industrial classifications were used
for the complementary suppressions, whereas an analysis that uses
industrial variation will benefit from using data where the
complementary suppressions were made using geographical classifications.
Ultimately, the preferences of the agency that chooses the complementary
suppression strategy will determine which analyses have higher data
quality. As with swap rates, agencies rarely publish details of their
methods for choosing complementary suppressions.
INPUT DISTORTION Input distortion of the microdata is another
method for protecting tabular data. Using this method, an agency
distorts the value of some or all of the inputs before any publication
tables are built, and then computes all, or almost all, of the cells
using only the distorted data.
II.E. Current Practices in the U.S. Statistical System
The SDL methods in the decentralized U.S. statistical system are
varied. The most thorough analysis of this topic is the one published by
the Federal Committee on Statistical Methodology (FCSM), which is
organized by the chief statistician of the United States in the Office
of Management and Budget (Harris-Kojetin and others 2005). We summarize
the key features of the FCSM report and, where possible, provide updated
information on certain data products used extensively by economists. It
is incumbent upon the researcher to read the relevant documentation and,
if necessary, contact the data provider to obtain nonconfidential
publications detailing how the data were collected and prepared for
publication, including which methods of SDL were applied.
The goal of the FCSM report is to characterize best practices for
SDL, and it contains a table presenting the methods employed by each
agency to protect microdata and tabular data (Harris-Kojetin and others
2005, p. 53). As of 2005, the table shows, almost all federal agencies
that published microdata reported using some form of nonignorable,
undiscoverable data perturbation. The Census Bureau's stated policy
is "for small populations or rare characteristics, noise may be
added to identifying variables, data may be swapped, or an imputation
applied to the characteristic" (Harris-Kojetin and others 2005, p.
40). Many other agencies, including the Bureau of Labor Statistics (BLS)
and National Science Foundation (NSF), contract with the Census Bureau
to conduct surveys and therefore use the same or similar guidelines for
SDL. The National Center for Education Statistics (NCES) also reports
using ad hoc perturbation of the microdata to prevent matching,
including swapping and "suppress and impute" for sensitive
data items.
In a recent technical report by Amy Lauger, Billy Wisniewski, and
Laura McKenna (2014), the Census Bureau released up-to-date information
on its SDL methods. In addition to information about discoverable SDL
methods, like geographic thresholds and topcoding, the report describes
in more detail how noise is added to microdata to protect
confidentiality. Specifically, it states that "noise is added to
the age variable for persons in households with 10 or more people,"
and that "noise is also added to a few other variables to protect
small but well-defined populations but we do not disclose those
procedures" (Lauger, Wisniewski, and McKenna 2014, p. 2).
This Census Bureau report also confirms that swapping is the
primary SDL method used in the ACS and decennial censuses. The swapping
method targets records that have high disclosure risk due to some
combination of rare attributes, such as racial isolation in a particular
location. The records at risk are matched on the basis of an unnamed set
of variables and swapped into a different geography. In the past few
years, the Census Bureau has changed the set of items it uses to
determine whether a record is at risk and should be swapped, and the
swap rate has increased slightly. The Census Bureau performed an
evaluation of the effects of swapping on the quality of published
tabular statistics, but it has not published its evaluation results due
to concerns that they might compromise the SDL procedures themselves.
One Census Bureau official whom we interviewed said the rate of
swapping is low relative to the rate at which data are edited for other
purposes. Furthermore, the official said, swapping is applied to cases
that are extreme outliers on some particular combination of variables.
Without getting more precise, the official conveyed that swapping, while
potentially of considerable concern, may have substantially less effect
on economic research than, say, missing-data imputation.
Within the last 10 years the Census Bureau has also begun producing
data based on more modern SDL methods. The Quarterly Workforce
Indicators are protected using an input noise infusion method that,
among other features, eliminates the need for cell suppression in count
tables. The Census Bureau also offers synthetic microdata from the
linked SIPP/SSA/IRS data, the Longitudinal Business Database, and the
Longitudinal Employer-Household Dynamics (LEHD) Origin-Destination
Employment Statistics (LODES). (5)
III. How SDL Affects Common Research Designs
In this section, we demonstrate how to apply the concepts of
ignorable and nonignorable SDL in common applied settings. In most
cases, SDL is nonignorable, and researchers therefore need to know some
properties of the SDL model that was applied to their data. When the SDL
model is not known, it may still be discoverable in the manner
introduced in section I.A.
III.A. Estimating Population Proportions with Noise Infusion
This example is motivated by the SDL procedure that is used to mask
ages in the Census 2000, ACS, and CPS microdata files. Although the
misapplication of the procedure has been corrected for Census 2000 and
ACS, current versions of the CPS for the mid-2000s may still be affected
by the error, and have not been reissued. See the online appendix,
section B, for more details.
Suppose the confidential data contain a binary variable (such as
gender) and a multicategory discrete variable (such as age). We are
interested in estimation and inference for the age-specific gender
distribution, where [beta], the conditional probability of being male
given age, is the parameter of interest. When age has been subjected to
SDL, using the published age to compute these conditional probabilities
will lead to problems. The estimated probability of being male
conditional on age is affected by the SDL, even though the gender
variable was not itself altered by the SDL.
Using the generalized randomized response structure, suppose that
we know the probability that the published age data are unaltered. With
probability [rho], the observed male/female value comes from the true
age category. With the complementary probability, the observed outcome
is a binary random variable with expected value [mu] [not equal to]
[beta]. For example, [mu] might be the average value of the proportion
male for all age categories at risk of being changed by the SDL model.
In any case, [mu] is unknown.
Equation B.16 in the online appendix shows that if we ignore the
SDL, the conditional probability estimator and its variance are biased.
An SDL-aware estimator for the conditional probability of being male for
a given age is [??] = [[[bar.z].sub.1] - (1 - [rho])[mu]]/[rho], where
[[bar.z].sub.1] is the estimated sample proportion of males of the
chosen age. The estimator for the conditional proportion of interest
[??] is confounded by the two SDL parameters, except in the special case
that [rho] = 1, which implies that no SDL was applied to the published
age data. If all of the observations have been subjected to SDL, then
[??] is undefined, and the expected value of [[bar.z].sub.1] is just
[mu]. In the starkest possible terms, the estimator in equation B.16 is
hopelessly underidentified in the absence of information about [rho] and
[mu].
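The correction can be verified by simulation. A sketch under assumed parameter values (the true proportion, [rho], and [mu] below are hypothetical), with the data generated by the randomized-response structure just described:

```python
# With probability rho the reported draw reflects the true cell
# proportion; otherwise it is an irrelevant Bernoulli(mu) draw.
import random

rng = random.Random(7)
beta_true, rho, mu, n = 0.40, 0.80, 0.55, 200_000

z = [(rng.random() < beta_true) if rng.random() < rho else (rng.random() < mu)
     for _ in range(n)]

zbar = sum(z) / n                         # naive estimator, biased toward mu
beta_hat = (zbar - (1 - rho) * mu) / rho  # SDL-aware correction
print(round(zbar, 2), round(beta_hat, 2))
```

With these values the naive proportion converges to 0.8(0.40) + 0.2(0.55) = 0.43, while the corrected estimator recovers 0.40; both parameters must be known for the correction to be feasible.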
If [rho] and [mu] are not known, they may still be discoverable if
the analyst has access to estimates of conditional probabilities like
[beta] from an alternative source. See the online appendix, section B,
for more details of the application to the Census 2000 and ACS PUMS that
generalizes the analysis in Alexander, Davern, and Stevenson (2010).
This procedure can be used to discover the SDL in any data set, for
example the CPS, for which alternative reliable published estimates of
the gender-specific age distribution are available.
The SDL process is still underidentified if we consider only a
single outcome like the gender-age distribution, but there are quite a
few other binary outcomes that could also be studied, conditional on
age--for example, marital status, race, and ethnicity. The differences
between Census 2000 estimates of the proportion married at age 65 and
older and their comparable Census 2000 PUMS estimates have exactly the
same functional form as online appendix equation B.17 with exactly the
same SDL parameters. Since these proportions condition on the same age
variable, all the other outcomes that also have an official Census 2000
or ACS published proportion can be used to estimate the unknown SDL
parameters. The identifying assumptions are (i) that all proportions are
conditioned on the same noisy age variable, and (ii) that the noisy age
variable can be reasonably modeled as randomized-response noise. We
implement a similar method in section IV.B.
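The discovery strategy can be sketched with hypothetical numbers. Assuming (as a simplification of the identifying assumptions above) a common contamination mean [mu] across outcomes, the published official proportions and their PUMS counterparts identify the SDL parameters by ordinary least squares:

```python
# Hypothetical (official proportion, PUMS proportion) pairs for several
# outcomes conditioned on the same noisy age variable. Under the model
# zbar_k = rho * pi_k + (1 - rho) * mu, the OLS slope identifies rho and
# the intercept (1 - rho) * mu identifies mu.
pairs = [(0.30, 0.29), (0.50, 0.45), (0.70, 0.61), (0.90, 0.77)]

n = len(pairs)
xbar = sum(p for p, _ in pairs) / n
ybar = sum(q for _, q in pairs) / n
slope = (sum((p - xbar) * (q - ybar) for p, q in pairs)
         / sum((p - xbar) ** 2 for p, _ in pairs))
rho_hat = slope
mu_hat = (ybar - slope * xbar) / (1 - slope)
print(round(rho_hat, 2), round(mu_hat, 2))
```

These fabricated pairs were generated with rho = 0.8 and mu = 0.25, and the fit recovers exactly those values; with real published tables the fit would be approximate.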
III.B. Estimating Regression Models
We next consider the effect of SDL on linear regression models.
First, we analyze SDL applied to the dependent variable, assuming that
the agency replaces sensitive values with model-based imputed values.
This form of SDL is nonignorable for parameter estimation and inference.
Parameter estimates will be attenuated and standard errors will be
underestimated. Furthermore, this form of SDL is not discoverable,
except when there are two data releases from the same frame that use
different, independent SDL processes.
Our analysis draws on the work of Barry Hirsch and Edward
Schumacher (2004) and Christopher Bollinger and Hirsch (2006), who study
the closely related problem of bias from missing-data imputation in the
CPS. Respondents to the CPS commonly fail to provide answers to certain
questions. In the published data, the missing values are imputed
semi-parametrically, conditional on a set of variables. Hirsch and
Schumacher (2004) observe that if union status is not in the
conditioning set for the imputation model, the union wage gap will be
underestimated when using imputed and nonimputed values in a regression
of log wages on union status. This bias is exacerbated by using
additional controls. The result occurs because if union status is not in
the imputation model's conditioning set, then some union workers
are imputed nonunion wages, and some nonunion workers are imputed union
wages. Bollinger and Hirsch (2006) show that these results hold very
generally.
There are two key differences in our approach. First, assessing
bias from missing-data imputation is feasible because the published data
include an indicator variable that flags which values were reported and
which were imputed. With SDL, the affected records and variables are not
flagged. Second, in the SDL application, the published data can be
imputed using the distribution of the confidential data. This means that
the agency does not have to use an ignorable missing-data model when
doing imputations for SDL. When imputing actual missing data, which was
the subject of the Bollinger and Hirsch (2006) paper, the agency does
assume that the missing data were generated by an ignorable inclusion
model. The direct consequence is that the model used to impute the
suppressed values can be conditioned on all of the confidential data,
including the rule that determines whether an item will be suppressed.
More succinctly, the analysis below demonstrates the effect of using an
imputation model (or swapping rule) that does not contain a regressor of
interest, and thus is not conflated with any bias that could arise from
nonrandomness of the suppression rule.
SDL APPLIED TO THE DEPENDENT VARIABLE The model of interest is the
function E[[y.sub.i1]|[y.sub.i2]] = [alpha] + [y.sub.i2] [beta]. In the
published data, sensitive values of the outcome variable [y.sub.i1] are
suppressed and imputed. The variable [[gamma].sub.i] indicates whether
[y.sub.i1] is suppressed and imputed. When [[gamma].sub.i] = 1, the
confidential data are published without modification. When
[[gamma].sub.i] = 0, the value for [y.sub.i1] is replaced with an
imputed value, [z.sub.i1], which is drawn from
p([y.sub.i1]|[x.sub.i], [[gamma].sub.i] = 0), the conditional
distribution of the outcome
variable given [x.sub.i] among suppressed observations. The conditioning
information used in the imputation model, [x.sub.i] = [f.sub.I]
([y.sub.i2]), is a function [f.sub.I] that maps all of the available
conditioning information in [y.sub.i2] into a vector of control
variables [x.sub.i].
The simplest example is a model in which [x.sub.i] consists of a
strict subset of variables in [y.sub.i2]. For example, in Hirsch and
Schumacher (2004), [y.sub.i2] is a set of conditioning variables that
includes an indicator for union membership, and [x.sub.i] is the same
set of conditioning variables but excluding the union membership
indicator. Like the suppression model, the features of the imputation
model, including the function [f.sub.I], are known only to the agency
and not to the analyst.
The released data are [z.sub.i1] = [y.sub.i1] if [[gamma].sub.i] =
1, and [z.sub.i1] is a draw from the imputation distribution otherwise.
For the other variables, [z.sub.2i] = [y.sub.2i]. The marginal probability
that the exact confidential data are published is Pr [[[gamma].sub.i] =
1] = [rho]. So the suppression rate is (1 - [rho]), an exact analogue of
the rate at which irrelevant data replace good data in randomized
response. Finally, note that nothing in this specification requires
independence between the decision to suppress, [[gamma].sub.i], and the
data values, [y.sub.i1] and [y.sub.i2].
The effects of statistical disclosure limitation in this context
are generically nonignorable except for two unusual cases. If no
observations are suppressed ([rho] = 1), then the SDL is ignorable
because it is irrelevant. In the second, more interesting case, the
characteristics [x.sub.i] perfectly predict [z.sub.2i], and the SDL
model is again ignorable for consistent estimation of [beta]. This case
occurs when the agency conditions on all covariates of interest,
[y.sub.2i], when imputing [y.sub.i1], and then releases [y.sub.2i]
without any additional SDL. Even in this latter case, while
the SDL is ignorable for consistent estimation of [beta], it is not
ignorable for inference. The SDL model introduces variance that is not
included in the standard estimator for the variance of [??].
The effects of SDL on estimation and inference could be assessed
and corrected if the analyst knew two key properties of the SDL model:
(i) the suppression rate, (1 - [rho]) = Pr [[gamma].sub.i] = 0]; and
(ii) the set of characteristics used to impute the suppressed
observations, [x.sub.i]. At present, almost nothing is known in the
research community about either characteristic of the SDL models used in
many data sets. See online appendix, section C.1, for details.
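The Hirsch-Schumacher mechanism can be illustrated by simulation (our sketch; the wage-gap and suppression parameters below are hypothetical). Suppressing and imputing the dependent variable from a model that omits the regressor of interest attenuates its coefficient by roughly the suppression rate:

```python
# "Suppress and impute" on the dependent variable when the imputation
# model omits the regressor of interest (union status): the estimated
# union wage gap shrinks toward rho * beta.
import random

rng = random.Random(123)
n, rho, beta = 100_000, 0.7, 0.2   # true union wage gap beta

union = [rng.random() < 0.3 for _ in range(n)]
wage = [beta * u + rng.gauss(0, 0.5) for u in union]

# A fraction (1 - rho) of wages is replaced by draws from a model that
# conditions on nothing -- union status is outside the imputation model.
wbar = sum(wage) / n
z = [w if rng.random() < rho else wbar + rng.gauss(0, 0.5) for w in wage]

def ols_slope(y, x):
    xb, yb = sum(x) / len(x), sum(y) / len(y)
    return (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
            / sum((xi - xb) ** 2 for xi in x))

gap_true = ols_slope(wage, union)  # close to beta
gap_sdl = ols_slope(z, union)      # attenuated toward rho * beta
print(round(gap_true, 2), round(gap_sdl, 2))
```

Because no flag marks the imputed records, the researcher sees only the attenuated coefficient; knowing [rho] and [x.sub.i] would permit the correction discussed above.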
SDL APPLIED TO A SINGLE REGRESSOR If SDL is applied to a single
regressor rather than to the dependent variable, the conclusions of the
analysis remain the same, as long as the imputation model does not
perfectly predict the omitted regressor. Curiously, if the regression
model only has a single regressor and the conditioning information is
the same, the bias from SDL is identical whether the SDL is applied to
the regressor or to the dependent variable. If there are multiple
regressors, with SDL applied to a single regressor, the SDL introduces
bias in all regressors. The model setup and nature of the bias are
derived explicitly in the online appendix, section C.2.
III.C. Estimating Regression Discontinuity Models
Regression discontinuity (RD) and regression kink (RK) models can
be seriously compromised when SDL has been applied to the running
variable. To illustrate some of these issues, we consider a design from
Guido Imbens and Thomas Lemieux (2008). This analysis is intended to
guide economists, who can perform our simplified SDL-aware analysis as
part of the specification testing for a general RD.
MODEL SETUP Modeling the unobservable latent outcomes is intrinsic
to the RD analysis. We incorporate the usual counterfactual data process
inherent in the RD design directly into the data model. As Imbens and
Lemieux (2008) note, this is a Rubin Causal Model (Rubin 1974; Holland
1986; Imbens and Rubin 2015). The simplest data model, corresponding to
Imbens and Lemieux (2008, pp. 616-19), has three continuous variables
and one discrete variable whose conditional distribution is degenerate
in the RD design and nondegenerate in the fuzzy RD (FRD) design. The
latent data process consists of four variables with the following
definitions: [w.sub.i] (0) = untreated outcome, [w.sub.i] (1) = treated
outcome, [t.sub.i] = treatment indicator, and [r.sub.i] = RD running
variable. The confidential data vector has the experimental design
structure, Y = ([w.sup.*.sub.i], [t.sub.i], [r.sub.i]), where
[w.sup.*.sub.i] = [w.sub.i]([t.sub.i]).
Our interest centers on the conditional expectations in the
population data model E[[w.sub.i] (0)|[r.sub.i]] = [f.sub.1] ([r.sub.i])
and E[[w.sub.i] (1)|[r.sub.i]] = [f.sub.2] ([r.sub.i]), where [f.sub.1]
([r.sub.i]) and [f.sub.2] ([r.sub.i]) are continuous functions of the
running variable, [r.sub.i]. The parameter of interest is the average
treatment effect at [tau]:
E[[w.sub.i](1) - [w.sub.i](0)|[r.sub.i] = [tau]] = [f.sub.2]([tau]) - [f.sub.1]([tau]).
NONIGNORABLE SDL IN THE RUNNING VARIABLE We focus on the setting
where SDL is only applied to the RD running variable and its associated
indicator. The published data vector is Z = ([w.sup.*.sub.i], [t.sub.i],
[z.sub.i]). The published running variable is sampled from a
distribution that depends on the true value: [z.sub.i] ~ [p.sub.z\R]
([z.sub.i] | [r.sub.i]). We assume the distribution [p.sub.z\R]
([z.sub.i]|[r.sub.i]) is the randomized response mixture model, a
generalization of simple randomized response described in the online
appendix, section D.1. The SDL process depends on two parameters: [rho],
the probability that the confidential value of the running variable is
released without added noise, and [delta], the standard deviation of a
mean zero noise term added to the running variable when subjected to
SDL.
If the agency publishes its SDL values [rho] = [[rho].sub.0] and
[delta] = [[delta].sub.0] and the true RD is strict, then the analyst
can correct the strict RD estimator directly by dividing the naive
estimate by the share of undistorted observations:
(1) [??] = [[??].sub.SRD]/[[rho].sub.0].
Clearly, this implies that the uncorrected estimate is attenuated
toward zero. Intuitively, the introduction of noise into the running
variable converts the strict RD to a fuzzy RD, with
E[[t.sub.i]|[z.sub.i], [[rho].sub.0], [[delta].sub.0]]] playing the role
of the "compliance status" function. For details, see the
online appendix, section D.2.
When the true RD is strict, the SDL is discoverable from the
compliance function even if the agency has not released the SDL
parameters. The researcher can use the fact that the compliance function
is g([z.sub.i]) = [rho]1[[z.sub.i] [greater than or equal to] [tau]] + (1
- [rho])[PHI](([z.sub.i] - [tau])/[delta]). The fuzzy RD estimator
divides the estimated jump in the outcome at [tau] by the estimated
jump in g([z.sub.i]) at [tau].
When the noise addition is independent of the outcome variables (as
is the case here), the change in the probability of treatment at the
discontinuity point, [tau], is equal to the share of undistorted
observations, [[rho].sub.0]. When [rho] = 1, there has been no SDL, and
both estimators yield the conventional sharp RD estimate. A similar
analysis shows that a sharp RK design becomes a fuzzy RK design (Card
and others 2012) in the presence of SDL. As in the case of linear
regression, it is still necessary to model the extra variability from
the SDL to get correct estimates of the variance of the estimated RD
parameter.
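A simulation sketch of this mechanism (ours; the parameter values are hypothetical): randomized-response noise in the running variable turns a strict RD into a fuzzy RD whose compliance jump at the threshold equals the undistorted share, and dividing by that jump restores the treatment effect:

```python
# Sharp RD with randomized-response noise in the running variable.
import random

rng = random.Random(2024)
n, rho, delta, tau, effect = 200_000, 0.6, 0.5, 0.0, 1.0

r = [rng.uniform(-1, 1) for _ in range(n)]    # true running variable
t = [ri >= tau for ri in r]                   # strict RD treatment rule
w = [effect * ti + rng.gauss(0, 1) for ti in t]
# Released running variable: truth with probability rho, else plus noise.
z = [ri if rng.random() < rho else ri + rng.gauss(0, delta) for ri in r]

def jump_at_tau(values, running, h=0.05):
    above = [v for v, x in zip(values, running) if tau <= x < tau + h]
    below = [v for v, x in zip(values, running) if tau - h <= x < tau]
    return sum(above) / len(above) - sum(below) / len(below)

naive = jump_at_tau(w, z)            # attenuated sharp-RD estimate
compliance_jump = jump_at_tau(t, z)  # close to rho
corrected = naive / compliance_jump  # fuzzy-RD correction, close to effect
print(round(naive, 2), round(compliance_jump, 2), round(corrected, 2))
```

The simulation also illustrates the discoverability claim: the jump in the treatment probability at [tau] estimates [rho] directly from the released data.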
IMPLICATIONS OF SDL IN THE RUNNING VARIABLE FOR FUZZY RD MODELS If
generalized randomized-response SDL is applied to the running variable,
then the SDL is ignorable for parameter estimation when using a fuzzy RD
design. The FRD compliance function must be augmented with the
contribution from SDL. When the running variable is distorted with
normally distributed noise, as we have assumed, there is no point mass
anywhere, and hence no discontinuity in the probability of treatment at
the discontinuity that is due to the SDL. The claim that the SDL is
ignorable for estimation of the treatment effect in the fuzzy RD design
follows because the only discontinuity in the estimated compliance
function is entirely due to the discontinuity in the true running
variable. (See the online appendix, section D.2.1, for details.) Imbens
and Lemieux (2008) show that the instrumental variable (IV) estimator
that uses the RD as an exclusion restriction is formally equivalent to
the fuzzy RD estimator, so the SDL is also ignorable for consistent
estimation in this case as well.
Whether or not the SDL is ignorable for consistent estimation, it
is never ignorable for inference. The estimated standard errors of the
RD and FRD treatment effects must be adjusted.
In some applications, the treatment indicator is not observed and
must be proxied by the discontinuity point, around which the RD is
strict. If the treatment indicator is not observed and SDL has been
applied to the running variable, only the sharp RD estimator is
available, and it will be attenuated by a factor [rho]. Nothing can be done
in this setting without auxiliary information about the SDL model.
NONIGNORABLE SDL IN OTHER PARTS OF THE RD DESIGN When SDL is
applied to the dependent variable rather than the running variable, the
situation is more complicated. We refer to our analysis of regression
models in section III.B. SDL applied to the dependent variable will lead
to attenuation of the estimated treatment effect unless all relevant
variables, including the running variable and its interaction with the
discontinuity point, are included in the SDL model for the dependent
variable. Hence, SDL applied to the dependent variable is more likely to
cause problems for RD than for conventional linear regression models,
since the variation around the discontinuity point is unlikely to be
included in the agency's imputation or swapping algorithms.
CONSEQUENCES OF DATA COARSENING FOR SDL The ignorability of SDL in
some circumstances was anticipated in the work of Daniel Heitjan and
Rubin (1991), which considers the problem of inference when the
published data are coarsened. Their application was to reporting errors
where, for instance, individuals round their hours to salient, whole
numbers. The same model is relevant to those types of microdata SDL that
aggregate attribute categories, like occupations or geographies, and to
topcoding.
David Lee and David Card (2008) consider the consequences of
microdata coarsening for RD designs. For example, if ages are coarsened
into years, the RD design in which age is the running variable will
group observations near the boundary with those further from the
boundary, violating the required assumption that the running variable is
continuous around the treatment threshold. Once again, depending on the
type of RD design, when SDL is accomplished through coarsening of the
running variable, it is not ignorable. An analysis that uses the
coarsened running variable with a standard RD estimator may be biased
and understate standard errors. As in Heitjan and Rubin (1991), Lee and
Card (2008) establish conditions under which a grouped-data estimator
provides a valid way to handle coarsened data. This method is agnostic
about the cause of the grouping and is therefore SDL-aware by
construction.
III.D. Estimating Instrumental Variable Models
We consider simple instrumental variable models with a single
endogenous explanatory variable, a single instrument, and no additional
regressors. Except where indicated, the intuition for these examples
carries through to a more general setting with multiple instruments and
controls.
The confidential data model of interest is the standard IV system
[y.sub.i] = [kappa] + [gamma][t.sub.i] + [[epsilon].sub.i]
[t.sub.i] = [phi] + [delta][z.sub.i] + [[eta].sub.i]
where [y.sub.i] is the outcome of interest, [t.sub.i] is a scalar
variable that may be correlated with the structural residual
[[epsilon].sub.i], and [z.sub.i] is a scalar variable that can serve as
an instrument. That is, [z.sub.i] is uncorrelated with [[epsilon].sub.i]
and [delta] [not equal to] 0. We assume the SDL described in section
III.B is applied to either the dependent variable, the endogenous
regressor, or the instrument.
With this simplified setup, the IV estimator is the ratio of the
reduced-form coefficient to the first-stage coefficient,
$\hat{\gamma}^{IV} = \hat{\beta}/\hat{\delta}$, where $\hat{\beta}$ is
the parameter estimate from the reduced form equation [y.sub.i] =
[alpha] + [beta][z.sub.i] + [v.sub.i]. We apply the results in section
III.B. First, if SDL is applied to the dependent variable, then the
point estimate of [gamma] will be attenuated. This is an immediate
consequence of the fact that plim $\hat{\beta} < \beta$, while plim
$\hat{\delta} = \delta$. Second, by parallel reasoning, if SDL is
applied to the endogenous regressor, then the point estimate of
[gamma] will be exaggerated. In this case, plim $\hat{\beta} = \beta$,
but plim $\hat{\delta} \leq \delta$. This result
implies that IV models may overstate the coefficient of interest when
SDL is applied to the endogenous regressor. It is also not possible to
use IV to correct for SDL in this case.
Finally, somewhat surprisingly, SDL is ignorable when applied to
the instrument. In this particular model, with a single instrument and
no regressors, the attenuation term is the same in the first-stage and
reduced form, and therefore cancels out of the ratio
$\hat{\beta}/\hat{\delta}$. We caution, however, that this
ignorability does not extend to the case where there are additional
exogenous regressors. In summary, our analysis suggests that
blank-and-impute SDL is generally nonignorable for instrumental
variables estimation and inference.
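The three cases can be checked in a short simulation. The sketch below uses a toy stand-in for blank-and-impute SDL, in which each value is retained with probability p and otherwise replaced by an independent draw from its marginal; the structural parameters and the retention probability p = 0.7 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
gamma, delta = 1.5, 0.8        # hypothetical structural parameters

z = rng.normal(size=n)
u = rng.normal(size=n)                  # confounder: makes t endogenous
t = 0.3 + delta * z + u + rng.normal(size=n)
y = 0.7 + gamma * t + 2.0 * u + rng.normal(size=n)

def blank_and_impute(v, p_keep, rng):
    """Toy blank-and-impute SDL: keep each value with prob p_keep,
    otherwise replace it with an independent draw from the marginal."""
    keep = rng.random(v.size) < p_keep
    return np.where(keep, v, rng.permutation(v))

def iv_estimate(y, t, z):
    # Single-instrument IV: ratio of reduced form to first stage.
    return np.cov(y, z)[0, 1] / np.cov(t, z)[0, 1]

p = 0.7
iv_clean = iv_estimate(y, t, z)                            # ~ gamma
iv_sdl_y = iv_estimate(blank_and_impute(y, p, rng), t, z)  # attenuated
iv_sdl_t = iv_estimate(y, blank_and_impute(t, p, rng), z)  # exaggerated
iv_sdl_z = iv_estimate(y, t, blank_and_impute(z, p, rng))  # ~ gamma
```

SDL on the dependent variable scales only the reduced-form covariance by p; SDL on the endogenous regressor scales only the first-stage covariance; SDL on the instrument scales both, so the factor cancels in the ratio.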
IV. Analysis of Official Tables
Tabular or aggregate data are the primary public output of most
official statistical systems. Most agencies offer a technical manual
that provides an extensive description of how the microdata inputs were
transformed into the publication tables. These manuals rarely, if
ever, include an assessment of the effects of the SDL; among the
federal statistical agencies, we could find no examples of manuals
that do.
When an agency releases measures of precision for aggregate data, these
measures do not include variation due to SDL.
There are three key forms of SDL applied to tabular summaries. All
federal agencies rely on primary and complementary suppression as the
main SDL method. When an alternative SDL method is used, the most common
ones add noise to the underlying input microdata or to the prerelease
tabulated estimates. For household-based inputs, most agencies also
perform some form of swapping before preparing tabular summaries. For
business-based inputs, we are not aware of any SDL system that uses
swapping.
IV.A. Directly Tabulating Published Microdata
An alternative to using published tabulations is to tabulate from
published microdata files. This is usually not an option for business
data, which form the bulk of our examples in this section, but it may be
an option for household data. We explore some of the pitfalls of doing
custom tabulations in the online appendix, section E.3. Researchers
should use caution when making tabulations from published microdata if
the subpopulations being studied are often suppressed in the official
tables. The presence of suppression usually signals a data quality
problem.
IV.B. Suppression versus Noise Infusion
WHEN SUPPRESSION IS NONIGNORABLE
Tabular suppression rules identify
cells that are too heavily influenced by a few observations. The
consequences for research are profound when those few observations are
the focus of a particular study or the cause of a very inconvenient
complementary suppression. It is not surprising that detailed data about
the upper 0.25 percent of the income distribution are almost all
suppressed by the Statistics of Income Division of the IRS. If a study
focuses on unusual subpopulations, dealing with suppression is a normal
part of the research design.
The most common form of suppression bias occurs when an analyst is
assembling data at a given aggregation level, such as county level by
four-digit NAICS (6) industry group from the BLS's Census of
Employment and Wages frame. Between 60 and 80 percent of the published
cells will have missing data. These data cannot reasonably be missing at
random (ignorably missing) because the rule used to determine if those
data could be published depends upon the values of the missing data. The
problem compounds as covariates from other sources are added to the
analysis.
Formally, SDL suppression is never ignorable. The probability that
a cell is suppressed depends on the values of its component microdata
records. Surprisingly, there is considerable resistance to replacing
suppression with SDL methods that infuse deliberate noise.
Noise-infusion SDL, as applied in the QWI, allows for the elimination of
cell suppression and therefore eliminates bias from missing data. The
trade-off is an increase in variance of all table entries, including
those that would not be suppressed.
Perhaps the resistance to replacing suppression with noise-infusion
arises because the bias from suppression is buried in a missing-data
problem that most applied studies address with ad hoc methods: (i)
analyze the published data as though the suppressions were ignorable, or
(ii) do the analysis at a more aggregated level (say, NAICS subsector
rather than NAICS industry group). These approaches are generally not as
good as what could be accomplished with the same data if the cause were
acknowledged and addressed.
A better solution, which is still ad hoc, is to use the frame
variable to allocate the values of higher-level aggregates into the
missing lower-level observations for the same variable. For example, in
the QWI the frame variable is quarterly payroll--it is never suppressed
at any level of aggregation--and in the QCEW and CBP the frame variable
is the number of establishments, which is also never suppressed in these
publications. The analyst can proportionally allocate the three-digit
industrial aggregate employment, say, using the four-digit proportions
of the frame variable as weights. This can be done in a sophisticated
manner so that none of the observed original data are overwritten or
contradicted by this imputation. For example, it can be done by only
imputing the values of the four-digit employment that were actually
suppressed and respecting the published three-digit employment totals
for the sum of all four-digit industries within that total. This
solution at least acknowledges that the suppression bias is
nonignorable. The values for the higher-level aggregates contain some
information about the suppressed values. Allocations based on the frame
variable assume that the distribution of every variable with missing
data across the entire population is the same as the distribution of the
frame variable.
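A minimal sketch of this allocation, with hypothetical cell values, is the following. The suppressed four-digit cells receive the residual of the published three-digit total, allocated in proportion to their shares of the never-suppressed frame variable, so the published data are never contradicted.

```python
import numpy as np

# Hypothetical four-digit cells within one three-digit industry.
emp = np.array([120.0, 45.0, 800.0, 15.0, 60.0])   # confidential employment
estabs = np.array([10.0, 4.0, 25.0, 2.0, 6.0])     # frame variable, never suppressed
suppressed = np.array([False, True, False, True, False])

published_total = emp.sum()              # three-digit total, never suppressed
emp_pub = np.where(suppressed, np.nan, emp)        # what the agency releases

# Allocate the residual of the published total across the suppressed
# cells in proportion to their shares of the frame variable.
residual = published_total - np.nansum(emp_pub)
shares = estabs * suppressed / (estabs * suppressed).sum()
emp_imputed = np.where(suppressed, residual * shares, emp_pub)

# Unsuppressed cells are untouched and the published total is respected.
```

The imputation is only as good as the maintained assumption stated above: that the suppressed variable is distributed across cells like the frame variable.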
The analyst can do better still. The best solution for any given
analysis is to combine the model of interest with a model for the
suppressed data. Bayesian hierarchical models, like the ones we used in
this paper, work well. Software tools for specifying and implementing
such models are readily available. The complete model will properly
account for the nonrandom pattern of the missing data, will incorporate
prior information about the suppression rule that can be used for
identification, and account for the additional uncertainty introduced by
suppression. See Scott Holan and others (2010) for a specific
application to BLS data.
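The full hierarchical treatment is beyond our scope here, but the key idea, building the suppression rule into the likelihood, can be conveyed with a deliberately simple stand-in: maximum likelihood for cells that are suppressed whenever their value falls below a known threshold, so that the published cells follow a left-truncated distribution. All numbers below are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
mu_true, sigma_true, cutoff = 50.0, 10.0, 45.0   # hypothetical parameters
cells = rng.normal(mu_true, sigma_true, 5_000)   # confidential cell values
published = cells[cells >= cutoff]   # stylized rule: small cells withheld

naive_mean = published.mean()        # biased upward: suppression is nonignorable

def neg_loglik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    # Likelihood of a normal left-truncated at the (known) suppression cutoff.
    return -(norm.logpdf(published, mu, sigma)
             - norm.logsf(cutoff, mu, sigma)).sum()

res = minimize(neg_loglik, x0=[published.mean(), np.log(published.std())],
               method="Nelder-Mead")
mu_hat = res.x[0]                    # suppression-aware estimate of the mean
```

The same logic, with prior information about the actual suppression rule replacing the known threshold, is what the Bayesian hierarchical models operationalize.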
WHEN NOISE INFUSION MAKES THE SDL NONIGNORABLE
Applying SDL by
input noise infusion dramatically reduces the amount of suppression in
the publication data. Since we are going to illustrate many of the
features of these systems in the example in section V, we devote our
attention here to the basic nonignorable features of input noise
infusion.
Input noise infusion models were first proposed by Timothy Evans,
Laura Zayatz, and John Slanta (1998). The noise models they proposed are
constructed so that the expectation of the noisy aggregate, given the
confidential aggregate, equals the confidential aggregate. This is the
sense in which these measures are unbiased. In addition, as the number
of entities in a cell (usually business establishments) gets large, the
variance of the aggregate that is due to noise infusion vanishes. This
is the sense in which these measures add variance to the published data
in exchange for reducing suppression bias. Finally, the noise itself is
usually generated from an independent, identically distributed random
variable, so the joint distribution of the confidential data and the
input noise factors into two independent distributions. Thus, SDL using
input noise infusion can sometimes be ignorable for estimating the
parameter of interest, but it will generally not be ignorable when
trying to form a confidence interval around that estimate. Because the
noise process affects the posterior distribution of most parameters of
interest, it is generally not ignorable.
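These properties can be checked by simulation. The noise process below is a stylized version of an Evans-Zayatz-Slanta-type mechanism: a mean-zero multiplicative distortion whose magnitude is bounded away from zero, so every input is distorted; the interval [0.10, 0.25] and the cell sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def noisy_aggregate(w, rng, a=0.10, c=0.25):
    """Stylized input noise infusion: each establishment's value is
    multiplied by (1 + delta_j), where delta_j has mean zero and its
    magnitude lies in [a, c], bounded away from zero."""
    mag = rng.uniform(a, c, w.size)
    delta = mag * rng.choice([-1.0, 1.0], w.size)
    return ((1.0 + delta) * w).sum()

# Unbiasedness: over repeated draws the noisy cell total centers on the truth.
w_small = rng.lognormal(3.0, 1.0, 5)          # a 5-establishment cell
reps = np.array([noisy_aggregate(w_small, rng) for _ in range(20_000)])
bias_small = reps.mean() / w_small.sum() - 1.0
rel_sd_small = reps.std() / w_small.sum()

# The variance due to noise vanishes as the number of establishments grows.
w_big = rng.lognormal(3.0, 1.0, 2_000)        # a 2,000-establishment cell
reps_big = np.array([noisy_aggregate(w_big, rng) for _ in range(2_000)])
rel_sd_big = reps_big.std() / w_big.sum()
```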
Fortunately, agencies have been much more open about the processes
used to produce publication tables from noise-infused inputs. A
data-quality variable generally indicates whether the published value
suffers from substantial infused noise. These flags are based on the
absolute percentage error in the published value compared to the
confidential value. It turns out, as we will see below, that they also
sometimes release enough information to estimate the variance of the
noise process itself, which is the SDL parameter that plays the role of
the randomized-response "true data" probability. When the
variance of the noise-infusion process goes to zero, the SDL becomes
ignorable for all analyses, if no other SDL replaces it.
V. SDL Discovery in Published Tables
In this section, we show that it is possible to use information
from three data sets released from very similar frames to conduct
complete SDL-aware analyses. These data sets are the QWI, the QCEW, and
the CBP. The key insight is that each data set applies a different SDL
method to the same confidential microdata. The variation across the
published data facilitates discovery of the SDL process. First, it is
possible to directly infer a key unpublished variance term from the QWI
noise infusion model. This variance term can then be used to correct
SDL-generated estimation bias. Second, we argue that the QCEW and CBP
data can be used as instruments to correct SDL-induced measurement error
in analysis based on the QWI.
V.A. Overview of the QWI, QCEW, and CBP Data Sets
The QWI is a collection of 32 employment and earnings statistics
produced by the Longitudinal Employer-Household Dynamics program at the
U.S. Census Bureau. It is based on state Unemployment Insurance (UI)
system records integrated with information on worker and workplace
characteristics. Workplace characteristics are linked from the QCEW
microdata. The frame for employers and workplaces is the universe of
QCEW records, including both the employer report and the separate
workplace reports. A QCEW workplace is an establishment in the QWI data.
Essentially, the same QCEW inputs are used by the BLS to publish its
Census of Employment and Wages (CEW) quarterly series on employment and
total payroll. (In what follows, the acronym QCEW is reserved for the
inputs and publications of the BLS in the CEW series.) CBP data sets are
also published by the Census Bureau from inputs based on its employer
Business Register.
While the QWI, QCEW, and CBP use closely related sources to publish
statistics by employer characteristics, they apply different methods for
SDL. The QWI and CBP distort the establishment-level microdata using a
multiplicative noise model and publish the aggregated totals. The QCEW
aggregates the undistorted confidential establishment-level microdata
and then suppresses sensitive cells with enough complementary
suppressions of nonsensitive cells to allow publication of most table
margins.
V.B. Published Aggregates from the QWI, QCEW, and CBP
We give just enough detail here so that the reader can see how the
Census Bureau and BLS form the aggregates for the quarterly payroll
variables that we will use to illustrate the consequences of universal
noise infusion for SDL. (More details are in the online appendix,
section F.)
Tabular aggregates are formed over a classification k = 1, ... , K
that partitions the universe of establishments into K mutually exclusive
and exhaustive cells [[OMEGA].sub.(k)t]. These partitions have detailed
geographic and industrial dimensions. For all three data sources,
geography is coded using FIPS (7) county codes. Industrial
classifications are NAICS sectors, subsectors, and industry groups. The
tabular magnitudes are computed by aggregating the values over the
establishments in the group k. For the QWI, in the absence of SDL, the
total quarterly payroll [W.sub.jt] for establishment j in group k and
quarter t would be estimated by (8)
(2) $W_{(k)t} = \sum_{j \in \Omega_{(k)t}} W_{jt}$.
For the QCEW, an identical formula uses total quarterly payroll as
measured by [W.sup.(QCEW).sub.jt]; for the CBP, the quarterly payroll
variable would be [W.sup.(CBP).sub.jt]. Published aggregates from the
QWI are computed using multiplicative noise factors [[delta].sub.j]
that have mean zero and constant variance. (More details are in the
online appendix, section G.) The published quarterly payroll is
computed as
(3) $W^{*}_{(k)t} = \sum_{j \in \Omega_{(k)t}} (1 + \delta_{j}) W_{jt}$,
where we have adopted the convention of tagging the post-SDL value
with an asterisk. The same noise factor is used to aggregate total
quarterly payroll and all other QWI variables. Total quarterly payroll
is never suppressed in the QWI. The number of establishments in a cell
is not published. If, and only if, a cell has a published value of W*,
then there is at least one establishment in that cell.
The published QCEW payroll aggregate is exactly the output of
equation 2 using QCEW inputs. The published QCEW total quarterly payroll
might be missing due to suppression. The QCEW data use item-specific
suppression. Payroll might be suppressed when employment is not, and
vice versa.
The CBP total quarterly payroll is exactly the output of equation 3
with CBP-specific inputs, including the noise factor. As with the QWI
data, the same noise factor is used for all the input variables from a
particular establishment. The published CBP aggregates have some SDL
suppressions and can therefore be missing. The number of establishments
in a cell is never suppressed, nor is the size distribution of
employers.
V.C. Regression Models with Nonignorable SDL
The noise infusion in QWI may be nonignorable. Univariate
regression of a variable from another data set onto a QWI aggregate
provides a simple illustration, which we summarize here. (See the online
appendix, section E.4, for details.)
The model of interest is appendix equation E.26, the regression of
a county-level outcome [Y.sub.(k)t] from a non-QWI source on QWI
quarterly payroll in the county [W.sup.*]. The dependent variable can be
subjected to SDL as long as it is independent of the QWI SDL, as would
be the case if the dependent variable were computed by the BLS or the
Bureau of Economic Analysis (BEA). The published aggregate data are the
[[Y.sub.(k)t], [W.sup.*.sub.(k)t]]. The undistorted values,
[W.sub.(k)t], are confidential.
The probability limit of the ordinary least squares (OLS) estimator
for the regression coefficient [beta] based on the published
data is appendix equation E.27, and the asymptotic bias ratio is
appendix equation E.28. The bias due to SDL depends on the product of
two factors: the variance of the noise-infusion process and the expected
Herfindahl index for payroll within aggregate k, as derived in the
online appendix, section E.5. If either of these factors is zero, there
is no bias in estimation. But the expected Herfindahl index is data, so
we cannot make prior restrictions on that component. This leaves only
the SDL noise variance. Clearly, the noise infusion is nonignorable in
this setting.
One option is to correct the bias analytically. If the noise
variance is known or can be estimated, the bias can be corrected
directly. An unbiased estimator for E[[W.sub.(k)t].sup.2] is available
from E[[W.sup.*.sub.(k)t].sup.2] once the variance of the
multiplicative noise factor, V[[[delta].sub.j]], is known, after which
it only remains to recover V[[W.sub.(k)t]] from the definition of
V[[W.sup.*.sub.(k)t]].
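A simulation sketch makes the attenuation and the analytic correction concrete. All parameter values below are hypothetical, and the sum of squared establishment payrolls is treated as known; in practice it is confidential and must be proxied, for example with a Herfindahl-type estimate.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 20_000                     # number of published cells
beta_true = 0.6                # hypothetical regression coefficient
v_delta = 0.18 ** 2            # variance of the multiplicative noise factor

n_estab = rng.integers(2, 8, K)
W = np.empty(K)                # true (confidential) cell payroll
S2 = np.empty(K)               # sum of squared establishment payrolls
Wstar = np.empty(K)            # published, noise-infused payroll
for k in range(K):
    w = rng.lognormal(3.0, 0.8, n_estab[k])
    delta = rng.normal(0.0, np.sqrt(v_delta), n_estab[k])  # mean-zero factors
    W[k] = w.sum()
    S2[k] = (w ** 2).sum()
    Wstar[k] = ((1.0 + delta) * w).sum()

Y = 2.0 + beta_true * W + rng.normal(0.0, 10.0, K)  # outcome from another source

# Naive OLS on the noise-infused regressor is attenuated: the SDL adds
# variance v_delta * E[sum_j w_j^2] to the regressor but not to the covariance.
beta_naive = np.cov(Y, Wstar)[0, 1] / np.var(Wstar, ddof=1)

# Analytic correction: remove the SDL-induced variance from the denominator.
v_sdl = v_delta * S2.mean()
beta_corr = np.cov(Y, Wstar)[0, 1] / (np.var(Wstar, ddof=1) - v_sdl)
```

The size of the attenuation is governed by the product of the noise variance and the concentration of payroll within cells, exactly the two factors identified above.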
The second possibility is to find instruments. Any instrument,
[Z.sub.(k)t], correlated with [W.sub.(k)t] and uncorrelated with the SDL
noise infusion process, will work, as shown in appendix equation E.29.
In the QWI setting, there are three natural candidates for such
instruments: (i) data from the QCEW for the same cell; (ii) data from
CBP from the same cell; and (iii) data from neighboring cells
(geographies or industries) in the QWI.
Data from QCEW for the same cell are based on the same
administrative record system. QWI tabulates its measures from the UI
wage records. QCEW tabulates from the associated ES-202 workplace
report. The total payroll measure has an identical statutory definition
on both administrative record systems for the state's Unemployment
Insurance. Data for CBP are tabulated from the Census Bureau's
employer Business Register. Payroll and employment come from the
employer federal tax filings, and the payroll measured from this IRS
source has a very similar statutory definition as compared to the
definition used by QWI and QCEW. Finally, QWI data from nearby
geographies or industries (depending on the aggregate represented by k)
should be correlated with the QWI variable in the regression because
they are based on the same administrative record system reports.
By construction, all of these instruments are uncorrelated with the
SDL-induced noise in the right-hand side of equation E.26. In the case
of QCEW or CBP data, any SDL-induced noise (CBP) or suppression bias
(QCEW and CBP) in the instrument is independent of the noise in QWI.
However, if many of the cells in the tabulation of the instrument are
suppressed, that will affect the validity of the instrument, as we
analyzed in section IV. B. When there are many suppressions in QCEW or
CBP for the partition under study, data from the neighboring QWI cells
can be used to complete the set of instruments.
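The instrumental variable fix can be sketched with two independently distorted versions of the same confidential payroll standing in for the QWI and CBP publications. The cell-level noise here is a stylized simplification of the establishment-level mechanism, and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
K = 20_000
beta_true = 0.6                 # hypothetical

W = rng.lognormal(4.5, 0.5, K)  # true (confidential) cell payroll
Y = 2.0 + beta_true * W + rng.normal(0.0, 10.0, K)

# Two independently distorted published versions of the same cells.
Wstar = W * (1.0 + rng.normal(0.0, 0.18, K))   # regressor, "QWI"
Zstar = W * (1.0 + rng.normal(0.0, 0.18, K))   # instrument, "CBP"

beta_ols = np.cov(Y, Wstar)[0, 1] / np.var(Wstar, ddof=1)       # attenuated
beta_iv = np.cov(Y, Zstar)[0, 1] / np.cov(Wstar, Zstar)[0, 1]   # consistent
```

Because the two noise processes are independent, the instrument is correlated with the confidential payroll but not with the SDL noise in the regressor, which is precisely the exclusion restriction needed here.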
Perhaps surprisingly, the input noise infusion to the QWI does not
bias parameter estimates if the dependent and independent variables all
come from QWI. Once drawn, the establishment-level noise factors are the
same across variables and over time. Therefore, the variance from noise
infusion affects all variables in exactly the same manner, factors out
of the OLS moment equations, and then cancels. The same feature of the
QWI also means that the time-series properties of the data are
preserved after noise infusion. We note that this feature is unique to the QWI
method of noise infusion, where the noise process is fixed over time for
each cross-sectional unit. It does not hold for other forms of noise
infusion, such as the one used by CBP.
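The preservation of time-series properties under a persistent noise factor is easy to verify. In the sketch below the noise is drawn from a uniform distribution rather than the distribution the QWI actually uses, and all magnitudes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n, T = 500, 12
base = rng.lognormal(3.0, 1.0, n)
growth = rng.normal(0.01, 0.02, (n, T))
W = base[:, None] * np.exp(np.cumsum(growth, axis=1))  # payroll paths

# QWI-style persistence: one noise factor per establishment, fixed over time.
delta = rng.uniform(-0.2, 0.2, n)
Wstar = (1.0 + delta)[:, None] * W

# Establishment-level growth rates are exactly preserved by the SDL ...
g_true = np.diff(np.log(W), axis=1)
g_pub = np.diff(np.log(Wstar), axis=1)

# ... and aggregate growth rates are almost unchanged, because the fixed
# factors appear in both the numerator and denominator of the growth rate.
G_true = np.diff(np.log(W.sum(axis=0)))
G_pub = np.diff(np.log(Wstar.sum(axis=0)))
```

Redrawing the factors each period, as a CBP-style mechanism effectively does, would break the exact cancellation in the first comparison.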
V.D. Estimating the Variance Contribution of SDL for the QWI
It is possible to recover the variance of the noise factor
V[[[delta].sub.j]], which is needed to correct directly for bias in the
univariate and multivariate regression examples using the QWI. The
details of this estimation process are presented in the online appendix,
section E.5.
Our leverage in this analysis comes from the fact that QWI and QCEW
use identical frames (QCEW establishments). Hence, we can use
[W.sup.(QCEW).sub.(k)t] as the instrument for [W.sub.(k)t], as long as
it has not been suppressed too often. Furthermore, we can use
[W.sup.(QCEW).sub.(k)t], which is published at the county level as an
instrument for any subcategory of QWI payroll, for example payroll of
females ages 55-64, even though no exact analogue is published in QCEW.
Although the data come from a different administrative record
system, the concepts underlying the CBP payroll variable are very
similar to both the QWI and QCEW inputs. The SDL system used for CBP
data is very similar to the one used for QWI, but the random noise in
CBP is independent of the random noise in QWI. Therefore, CBP data can
also be used as instruments, and they are suppressed far less often than
QCEW data. The formulas for recovering both systems' SDL parameters
are in the online appendix, section E.5.
V.E. Empirical Results
Table 1 presents the estimates of the equation used to recover the
SDL parameters fitted using matched QWI and QCEW data for the first
quarters of 2006 through 2011 by ordinary least squares. Table 2 fits
the same functions using mixed-effect models. (9) The equations are
fitted for state-level aggregations, where the error in both the
employment and payroll magnitudes is mitigated by the benchmarking;
county-level aggregations, where the agreement in the workplace codes
for county is most likely to be strong; and county by NAICS
sector-level aggregations, where there is greater scope for
differences between the coding of the microdata in QWI and QCEW.
Both tables give very similar estimates for V[[delta]] whether we use
payroll or employment as the basis. This suggests that the bias in
estimating V[[delta]] from using proxies for the Herfindahl index is
either minimal or uncorrelated between employment and payroll. Either
way, we are able to estimate with reasonable precision the range of
possibilities for V[[delta]], and these indicate that the noise
infusion neither creates substantial bias nor appreciably inflates
estimated variances.