Economic Analysis and Statistical Disclosure Limitation
Abowd, John M.; Schmutte, Ian M.
VI. The Frontiers of SDL
In this section we discuss the relationship between synthetic data
and validation servers, the nature and limits of formal privacy systems,
and the analysis of confidential data in enclaves.
VI.A. Analysis of Synthetic Data
We defined synthetic data in section II. Here we discuss the tight
relationship between synthetic data systems and validation servers, a
method of improving the accuracy of synthetic data that links the user
community and the data providers directly. In a synthetic data feedback
loop, the agency releases synthetic microdata to the research community.
Researchers analyze the synthetic data as if they were public-use
versions of the confidential data using SDL-aware analysis software.
When the analysis of the synthetic data is complete, the researchers may
request a validation, which is performed by the data providers on the
actual confidential data. The results of the validation are subjected to
conventional SDL and then released to the researcher as public-use data.
The data provider then inventories these analyses and uses them to
improve the analytical validity of the synthetic data in the next
release by testing new versions of the synthetic data on the models in
its inventory.
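To make the mechanics of the loop concrete, the following minimal Python sketch walks through the three steps: analysis on synthetic data, validation on the confidential data with conventional SDL applied to the output, and inventorying of the submitted protocol. It is our own stylized illustration; the names, the data, and the output-SDL step are invented, and it is not the Census Bureau's actual software.

```python
# Stylized sketch of a synthetic-data feedback loop. Purely illustrative:
# names and the output-SDL step (small additive noise) are invented.
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(y, X):
    """The researcher's analysis protocol: OLS with an intercept."""
    Xc = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Xc, y, rcond=None)[0]

# Step 1: the researcher analyzes the released synthetic microdata.
X_syn = rng.normal(size=(500, 2))
y_syn = rng.normal(size=500)
params_synthetic = fit_ols(y_syn, X_syn)

# Step 2: the identical protocol is submitted for validation; the data
# provider runs it on the confidential data and applies conventional
# SDL to the output coefficients before releasing them.
X_conf = rng.normal(size=(500, 2))
y_conf = X_conf @ np.array([1.0, -0.5]) + rng.normal(size=500)
params_validated = fit_ols(y_conf, X_conf) + rng.normal(scale=0.01, size=3)

# Step 3: the provider inventories the protocol so the next synthetic
# release can be tested against the models users actually run.
model_inventory = [{"protocol": "ols_v1", "validated": params_validated}]
```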
The Census Bureau has two active feedback-loop, synthetic-data
systems: the Survey of Income and Program Participation (SIPP) Synthetic
Beta (SSB) and the Synthetic Longitudinal Business Database (SynLBD).
(10) The SSB provides synthetic data for all panels of the SIPP linked
to longitudinal W-2 data. SynLBD is a synthetic version of selected
variables and all observations from the confidential Longitudinal
Business Database, the research version of the employer Business
Register, longitudinally linked.
A recent paper by Marianne Bertrand, Emir Kamenica, and Jessica Pan
(2015) provides an excellent illustration of the advantages of using
synthetic data that are part of a feedback loop. The authors use the
administrative record values for married couples' individual W-2
earnings to compute the proportion of household income that was due to
each partner. They hypothesize that there should be a regression
discontinuity at 50 percent because of their model prediction that women
should prefer to marry men with higher incomes than their own. The SSB
data have undergone extensive SDL and, for this model, the effects of
this SDL on the RD running variable were extensive and nonignorable,
with a stated "suppress and impute rate" of 100 percent. Analyses
from synthetic data show no causal effect. However, analyses from the
validation estimation on the confidential data, where the earnings
variables have not been subjected to any SDL but are imputed when
missing, show a clear discontinuity. The validated estimates are
reported in the published paper. Any researcher anywhere in the world
can use the SSB and SynLBD by following the instructions on the Cornell
University-based server that is used as the interface for analyses that
are part of the feedback process. (11)
While writing this paper, we discovered why the analysis of the
linked SIPP-IRS data by Bertrand, Kamenica, and Pan (2015) showed no
causal effect when the synthetic data were used. The reason can be seen
by examining equation 1 when the running variable has been modified for
every observation, as is the case in the SSB. The
regression-discontinuity effect is not identified in the synthetic data,
and it will not generally be identified for any RD design that uses the
many exact earnings and date variables in the SSB. If only the SSB were
available with no access to validation, RD and fuzzy RD (FRD) analyses
using these data would be pointless. However, because the SSB offers
validation using the underlying confidential data and traditional SDL on
the output coefficients, an analyst can do a specification search for the
response functions f1 and f2 using the SSB, then submit the entire
protocol from
the specification search for validation. The validated estimate of the
RD or FRD treatment effect provides the researcher's first evidence
on that effect. Thus, the use of the feedback mechanism for the
synthetic data protected the research design from pretest estimation and
false-discovery bias for the inferences on the causal RD effect, an
incredible silver lining.
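The identification failure is easy to reproduce in a toy simulation. The sketch below is our own construction, not the SSB or the published analysis: it builds a sharp discontinuity, then replaces every value of the running variable with a noisy imputation, mimicking a 100 percent suppress-and-impute rate, and the estimated jump attenuates badly.

```python
# Toy simulation: an RD effect disappears when the running variable is
# modified for every observation. Invented parameters throughout.
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
x = rng.uniform(-1, 1, n)                                # running variable
y = 2.0 * (x >= 0) + x + rng.normal(scale=0.5, size=n)   # true jump = 2

def rd_jump(run_var, outcome, h=0.05):
    """Difference in mean outcomes just above vs. just below the cutoff."""
    above = (run_var >= 0) & (run_var < h)
    below = (run_var < 0) & (run_var >= -h)
    return outcome[above].mean() - outcome[below].mean()

print(rd_jump(x, y))        # close to the true jump of 2.0

# "Synthesize" the running variable: every value is replaced by a noisy
# imputation, so treatment status no longer lines up with the observed
# cutoff and the discontinuity is smeared out.
x_synth = x + rng.normal(scale=0.2, size=n)
print(rd_jump(x_synth, y))  # badly attenuated toward zero
```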
We have already noted that the Survey of Consumer Finances (SCF)
uses synthetic data for SDL, based on the same model that is used for
edit and imputation of item missing data. The statutory custodian for
the SCF is the Federal Reserve Board of Governors. The Fed maintains a
very limited feedback loop that is described in the codebook (Federal
Reserve Board of Governors 2013).
VI.B. Formal Privacy Systems
A researcher is much more likely to encounter a formal privacy
system for SDL when interacting with a private data provider.
Differential privacy was invented at Microsoft. As early as 2009,
Microsoft had in place a system, Privacy Integrated Queries (PINQ), that
allowed researchers to analyze its internal data files (such as search
logs) with a fixed privacy budget using only analysis tools that were
differentially private at every step of the process, including data
editing (McSherry 2009). These tools ensure that every statistic seen by
the researcher, and therefore available for publication, satisfies
ε-differential privacy. When the researcher exhausts ε, no further
access to the data is provided.
PINQ computes contingency tables, linear regressions,
classification models, and other statistical analyses using provably
private algorithms. Its developer recognized that a strong privacy
guarantee comes at a substantial cost in accuracy. It was up to the
analyst to decide how to mitigate that loss of accuracy. The analyst
could spend most of the privacy budget to get some very accurate
statistics--ones for which the inferences were not substantially altered
as compared to the same inference based on the confidential data. But
then the analysis was over, and the analyst could not formulate
follow-up hypotheses because there was no remaining privacy budget.
Alternatively, the analyst could spend only a small portion of the
privacy budget on many specification searches, each highly inaccurate
compared to the same estimation on the confidential data, and then use
the remainder of the budget to compute an accurate statistic for the
chosen specification.
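The budget arithmetic can be sketched in a few lines. The following is our own illustration in the spirit of PINQ, not its actual API; the class, the clipping bounds, and the query split are invented.

```python
# Minimal sketch of epsilon-differentially private queries against a
# fixed privacy budget (Laplace mechanism). Illustrative only.
import numpy as np

rng = np.random.default_rng(7)

class PrivateData:
    def __init__(self, data, epsilon_total):
        self._data = np.asarray(data, dtype=float)
        self.budget = epsilon_total

    def dp_mean(self, lo, hi, epsilon):
        """Noisy mean of the data clipped to [lo, hi]."""
        if epsilon > self.budget:
            raise RuntimeError("privacy budget exhausted")
        self.budget -= epsilon
        clipped = np.clip(self._data, lo, hi)
        sensitivity = (hi - lo) / len(clipped)  # L1 sensitivity of the mean
        return clipped.mean() + rng.laplace(scale=sensitivity / epsilon)

db = PrivateData(rng.normal(50, 10, size=10_000), epsilon_total=1.0)
print(db.dp_mean(0, 100, epsilon=0.1))  # cheap, noisy exploratory query
print(db.dp_mean(0, 100, epsilon=0.9))  # accurate final estimate
# Any further query raises an error: once epsilon is spent, access ends.
```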
The literature on formal privacy models is still primarily
theoretical. At present, there are serious concerns about the
computational feasibility of applying formal privacy methods to large,
high-dimensional data, as well as their analytical validity for
nontrivial research questions. However, these methods make clear the
cost in terms of loss of accuracy that is inherent in protecting privacy
by distorting the analysis of the confidential data. The formal methods
also allow setting a privacy budget that can be allocated across
competing uses of the same underlying data.
Economists should have no trouble thinking about how to spend a
privacy budget optimally during a data analysis. But they might also
wonder how any real empirical analysis can survive the rigors of never
seeing the actual data. That is a legitimate worry, and one that the
formal privacy community takes very seriously. For a glimpse of one
possible future, see the work of Dwork (2014), who calls for all
custodians of private data to publish the rate at which their data
publication activities generate privacy losses and to pay a fine for
nonprivate uses (infinite privacy loss, ε = ∞). Public
and private data providers will have an increasingly difficult time
explaining why they are unwilling to comply with this call when others
begin to do so. The resulting public policy debate is very unlikely to
result in less SDL applied to the inputs or outputs of economic data
analyses.
VI.C. Analysis of Confidential Data in Enclaves
Because this paper is about the analysis of public-use data when
the publisher has used statistical disclosure limitation, we have not
discussed restricted access to the underlying confidential data.
Restricted access to the confidential data also involves SDL. First,
some agencies do not remove all of the SDL from the confidential files
they allow researchers to use in enclaves. Second, the output of the
researcher's analysis of the confidential data is considered a
custom tabulation from the agency's perspective. The output is
subjected to the same SDL methods that any other custom tabulation would
require.
VII. Discussion
Unlike many other aspects of the processes by which data are
produced, SDL is poorly understood and seldom discussed among
economists. SDL is applied widely to the data most commonly used by
economists, and the pressure on data custodians to protect privacy will
only get stronger with time. We offer suggestions to researchers,
journal editors, and statistical agencies to facilitate and advance
SDL-aware economic research.
VII.A. Suggestions for Researchers
Over the decades since SDL was invented, research methods have
changed dramatically--most notably in the applied microeconomists'
adoption of techniques that require both enormous amounts of data and
very precise model-identifying information. The combination of these two
requirements has led to much more extensive use of confidential data
with the publication of only summary results. Studies carried out this
way have very limited potential for replication or reuse of the
confidential data. Grant funding agencies have insisted that the
researchers they fund prepare a data management plan for the curation of
the data developed and analyzed using their funds, yet very few
statistical agencies or private firms will surrender a copy of the
confidential data for secure curation to allow research teams to comply
with this requirement. Consequently, only the public portion of this
scientific work can be curated and reused. But all such public data have
been subjected to very substantial SDL, almost all of it in the form of
suppression--none of the original confidential data and very little of
the intermediate work product can be published.
Suppression on this scale leads to potentially massive biases and
very limited data releases. To address this problem, over these same
decades statisticians and computer scientists have worked to produce SDL
methods that permit the publication of more data, including detailed
microdata with large samples and precise model-identifying variables.
Yet only a handful of applied economists are active in the SDL and data
privacy communities. What Arthur Kennickell accomplished by integrating
the editing, imputation, and SDL components of the Survey of Consumer
Finances in 1995 and orchestrating the release of those microdata in a
format that required SDL-aware analysis methods was not accomplished
again until 2007, when the Census Bureau released synthetic microdata
for the Survey of Income and Program Participation. We believe that the
reason economists have been reluctant to explore alternatives to
suppression is that they have not fully understood how pernicious
suppression bias actually is.
Statistical agencies do understand this, and the SDL and
privacy-preserving methods they have adopted are designed to control
suppression bias by introducing some deliberate variance. Economists
tend to argue that the deliberate infusion of unrelated noise is a form
of measurement error that infects all of the analyses. That is true, as
we have shown, but it is an incomplete picture. Suppression too creates
massive amounts of unseen bias--the direct consequence of not being able
to analyze the data that are not released. Economists should recognize
that publishing altered data with more limited suppression, rather than
publishing only the unsuppressed, unaltered data, could be a
technologically superior solution to the SDL problem. We challenge more
economists to become directly involved in the creation and use of SDL
and privacy-preserving methods that are more useful to the discipline
than the ones developed to serve the general user communities of
statistical agencies and Internet companies.
In the meantime, what can productively be done? Economic
researchers who use anything other than the most aggregated data should
become more familiar with the methods used to produce those data:
population frames, sampling, edit, imputation, and publication formulas,
in addition to SDL. This will help reduce the tendency to think of SDL
as the only source of bias and variation. For students, these topics are
usually covered in courses called "Survey Methodology," but
they belong in econometrics and economic measurement courses too.
VII.B. Suggestions for Journals, Editors, and Referees
Journals should insist that authors document the entire production
process for the inputs and output of their analyses. The current
standards are incomplete because they focus on the reproducibility of
the published results from uncurated inputs. Economists do not even have
a standard for citing data. A proper data citation identifies the
provenance of the exact file used as the starting point for the
analysis. Requiring proper citation of curated data inputs provides an
incentive for those who perform such activities, just as proper software
citation has provided an incentive to create and maintain curated
software distribution systems. Discussions of the consequences of frame
definitions, sampling, edit, imputation, publication formulas, and SDL
that were applied to the inputs are also important for any econometric
analysis. If authors cannot cite sources that document each of these
components, they should be required to include the information in an
archival appendix.
We make these points because we also want the journals to require
documentation of the SDL procedures that were applied to the inputs and
outputs of the analyses, although we do not think it is appropriate to
single out SDL for special attention. The other aspects of data
publication we discuss here also have implications for interpreting and
reproducing the published results. If scientific journals added their
voices to the calls for better documentation of all data publication
methods, it would be easier to press statistical agencies to release
more details of their SDL methods.
VII.C. Suggestions for Statistical Agencies and Other Data Providers
We think that the analysis in this paper should be considered a
prima facie case for releasing more information about the actual
parameters used in SDL methods and for favoring SDL methods that are
amenable to SDL-aware statistical analysis. By framing our arguments
using methods already widely adopted to assess the effects of data
quality issues, we hope to show that the users are also entitled to
better information about specific SDL methods. We have also shown that
if certain SDL methods are used, only very basic summary parameters need
to be released. These can even be released as probability distributions,
if desired.
We stress that we are not singling out SDL for special attention.
Very specific information about the sample design is released in the
form of the sampling frames used, detailed stratification structures,
sampling rates, design weights, response rates, cluster information,
replicate weights, and so on. Very specific information is released
about items that have been edited, imputed or otherwise altered to
address data quality concerns. But virtually nothing--nothing
specific--is released about SDL parameters. This imbalance fuels the
view that the SDL methods may have unduly influenced a particular
analysis. In addition, it is critical to know which SDL methods have
been permanently applied to the data, so that they must be considered
even when restricted access is granted to the confidential data files.
Our remarks are not directed exclusively to government statistical
agencies; they apply with equal force to Amazon, Facebook, Google,
Microsoft, Netflix, Yahoo, and other Internet giants as they begin to
release data products like Google Trends for use by the research
community.
VIII. Conclusion
Although SDL is an important component of the data publication
process, it need not be more mysterious or inherently problematic than
other widely used and well understood methods for sampling, editing, and
imputation, all of which affect the quality of analyses that economists
perform on published data. Enough is known about current SDL methods to
permit modeling their consequences for estimation of means, quantiles,
proportions, moments, regression models, instrumental variables models,
regression discontinuity designs, and regression kink models. We have
defined ignorable SDL methods in a model-dependent manner that is
exactly parallel to the way ignorability is defined for missing-data
models. We have shown that an SDL process is ignorable if one can apply
the methods that would be appropriate for the confidential data directly
to the published data and reach the same conclusions.
Most SDL systems are not ignorable. This is hardly surprising,
since the main justification for using SDL is to limit the analyst's
ability to draw conclusions about unusual data elements, such as
re-identifying a respondent or learning a sensitive attribute. The same
tools
that help assess the influence of experimental design and missing data
on model conclusions can be used to make any data analysis SDL-aware.
One such system, the multiple imputation model used for SDL by the
Survey of Consumer Finances, has operated quite successfully for two
decades. Other systems, most notably the synthetic data systems with
feedback loops operated by the Census Bureau, are quite new but permit
fully SDL-aware analyses of important household and business microdata
sources.
Finally, we have shown that the methods we developed here can be
used effectively on real data and that the consequences of SDL for data
analysis are limited, at least for the models we considered here. When
methods that add noise are used, there is less bias than for equivalent
analyses that use data subjected to suppression. The extra variability
that the noise-infusion methods generate is of a manageable magnitude.
We use these findings to press for two actions: (i) publication of
more SDL details by the statistical agencies so that it is easier to
assess whether or not SDL matters in a particular analysis and (ii) less
trepidation by our research colleagues in using data that have been
published with extensive SDL. There is no reason to treat the use of SDL
as significantly more challenging than the analysis of
quasi-experimental data or an analysis with substantial nonignorable
missing data.
ACKNOWLEDGMENTS We acknowledge direct support from the Alfred P.
Sloan Foundation (Grant G-2015-13903) and, of course, the Brookings
Institution. Abowd acknowledges direct support from the National Science
Foundation (NSF Grants BCS-0941226, TC-1012593, and SES-1131848). This
paper was written while Abowd was visiting the Center for Labor
Economics at the University of California, Berkeley. We are grateful for
helpful comments from David Card, Cynthia Dwork, Caroline Hoxby, Tom
Louis, Laura McKenna, Betsey Stevenson, Lars Vilhuber, and the volume
editors.
JOHN M. ABOWD
Cornell University
IAN M. SCHMUTTE
University of Georgia
References
Abowd, John M., and Simon D. Woodcock. 2001. "Disclosure
Limitation in Longitudinal Linked Data." In Confidentiality,
Disclosure, and Data Access: Theory and Practical Applications for
Statistical Agencies, edited by Pat Doyle, Julia Lane, Jules Theeuwes,
and Laura Zayatz. Amsterdam: North Holland.
Alexander, J. Trent, Michael Davern, and Betsey Stevenson. 2010.
"Inaccurate Age and Sex Data in the Census PUMS Files: Evidence and
Implications." Public Opinion Quarterly 74, no. 3: 551-69.
Anderson, Margo, and William Seltzer. 2007. "Challenges to the
Confidentiality of U.S. Federal Statistics, 1910-1965." Journal of
Official Statistics 23, no. 1: 1-34.
--. 2009. "Federal Statistical Confidentiality and Business
Data: Twentieth Century Challenges and Continuing Issues." Journal
of Privacy and Confidentiality 1, no. 1: 7-52.
Benedetto, Gary, and Martha Stinson. 2015. "Disclosure Review
Board Memo: Second Request for Release of SIPP Synthetic Beta Version
6.0." U.S. Census Bureau, Survey Improvement Research Branch,
Social, Economic, and Housing Statistics Division (SEHSD).
http://www.census.gov/content/dam/Census/programs-surveys/sipp/methodology/DRBMemoTablesVersion2SSBv6_0.pdf
Bertrand, Marianne, Emir Kamenica, and Jessica Pan. 2015.
"Gender Identity and Relative Income within Households."
Quarterly Journal of Economics 130, no. 2: 571-614.
Bollinger, Christopher R., and Barry T. Hirsch. 2006. "Match
Bias from Earnings Imputation in the Current Population Survey: The Case
of Imperfect Matching." Journal of Labor Economics 24, no. 3:
483-520.
Burkhauser, Richard V., Shuaizhang Feng, Stephen P. Jenkins, and
Jeff Larrimore. 2012. "Recent Trends in Top Income Shares in the
United States: Reconciling Estimates from March CPS and IRS Tax Return
Data." Review of Economics and Statistics 94, no. 2: 371-88.
Card, David, David Lee, Zhuan Pei, and Andrea Weber. 2012.
"Nonlinear Policy Rules and the Identification and Estimation of
Causal Effects in a Generalized Regression Kink Design." Working
Paper no. 18564. Cambridge, Mass.: National Bureau of Economic Research.
Dalenius, Tore. 1977. "Towards a Methodology for Statistical
Disclosure Control." Statistik Tidskrift 15: 429-44.
Duncan, George T., Mark Elliot, and Juan-Jose Salazar-Gonzalez.
2011. Statistical Confidentiality: Principles and Practice. New York:
Springer.
Duncan, George T., and Stephen E. Fienberg. 1999. "Obtaining
Information While Preserving Privacy: A Markov Perturbation Method for
Tabular Data." Presented at Eurostat conference Statistical Data
Protection '98 (SDP'98). Available at
http://www.heinz.cmu.edu/research/21full.pdf
Duncan, George T., Thomas B. Jabine, and Virginia A. de Wolf, eds.
1993. Private Lives and Public Policies: Confidentiality and
Accessibility of Government Statistics. Washington: National Academies
Press.
Duncan, George T., and Diane Lambert. 1986.
"Disclosure-Limited Data Dissemination." Journal of the
American Statistical Association 81, no. 393: 10-18.
Dwork, Cynthia. 2006. "Differential Privacy." In
Automata, Languages and Programming: 33rd International Colloquium,
Proceedings, Part II, edited by Michele Bugliesi, Bart Preneel,
Vladimiro Sassone, and Ingo Wegener. Berlin and Heidelberg: Springer.
--. 2014. "Differential Privacy: A Cryptographic Approach to
Private Data Analysis." In Privacy, Big Data, and the Public Good:
Frameworks for Engagement, edited by Julia Lane, Victoria Stodden,
Stefan Bender, and Helen Nissenbaum. Cambridge University Press.
Dwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006.
"Calibrating Noise to Sensitivity in Private Data Analysis."
In Theory of Cryptography: Third Theory of Cryptography Conference,
Proceedings, edited by Shai Halevi and Tal Rabin. Berlin and Heidelberg:
Springer.
Dwork, Cynthia, and Aaron Roth. 2014. "The Algorithmic
Foundations of Differential Privacy." Foundations and Trends in
Theoretical Computer Science 9, nos. 3-4: 211-407.
Evans, Timothy, Laura Zayatz, and John Slanta. 1998. "Using
Noise for Disclosure Limitation for Establishment Tabular Data."
Journal of Official Statistics 14, no. 4: 537-51.
Evfimievski, Alexandre, Johannes Gehrke, and Ramakrishnan Srikant.
2003. "Limiting Privacy Breaches in Privacy Preserving Data
Mining." In Proceedings of the Twenty-Second ACM
SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS).
New York: Association for Computing Machinery,
http://www.cs.cornell.edu/johannes/papers/2003/pods03-privacy.pdf
Federal Reserve Board of Governors. 2013. Codebook for 2013 Survey
of Consumer Finances. Washington.
Fellegi, I. P. 1972. "On the Question of Statistical
Confidentiality." Journal of the American Statistical Association
67, no. 337: 7-18.
Goldwasser, Shafi, and Silvio Micali. 1982. "Probabilistic
Encryption and How to Play Mental Poker Keeping Secret All Partial
Information." In Proceedings of the Fourteenth Annual ACM Symposium
on Theory of Computing (STOC). New York: Association for Computing
Machinery, https://www.cs.purdue.edu/homes/ninghui/readings/Qual2/Goldwasser-Micali82.pdf
Hardt, Moritz, Katrina Ligett, and Frank McSherry. 2012. "A
Simple and Practical Algorithm for Differentially Private Data
Release." In Advances in Neural Information Processing Systems 25,
edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger.
Red Hook, N.Y.: Curran Associates.
Harris-Kojetin, Brian A., Wendy L. Alvey, Lynda Carlson, Steven B.
Cohen, and others. 2005. "Report on Statistical Disclosure
Limitation Methodology." Statistical Policy Working Paper no. 22,
Federal Committee on Statistical Methodology.
https://fcsm.sites.usa.gov/files/2014/04/spwp22.pdf
Heffetz, Ori, and Katrina Ligett. 2014. "Privacy and
Data-Based Research." Journal of Economic Perspectives 28, no. 2:
75-98.
Heitjan, Daniel F., and Donald B. Rubin. 1991. "Ignorability
and Coarse Data." Annals of Statistics 19, no. 4: 2244-53.
Hirsch, Barry T., and Edward J. Schumacher. 2004. "Match Bias
in Wage Gap Estimates due to Earnings Imputation." Journal of Labor
Economics 22, no. 3: 689-722.
Holan, Scott H., Daniell Toth, Marco A. R. Ferreira, and Alan F.
Karr. 2010. "Bayesian Multiscale Multiple Imputation with
Implications for Data Confidentiality." Journal of the American
Statistical Association 105, no. 490: 564-77.
Holland, Paul W. 1986. "Statistics and Causal Inference."
Journal of the American Statistical Association 81, no. 396: 945-60.
Imbens, Guido W., and Thomas Lemieux. 2008. "Regression
Discontinuity Designs: A Guide to Practice." Journal of
Econometrics 142, no. 2: 615-35.
Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference for
Statistics, Social and Biomedical Sciences: An Introduction. Cambridge
University Press.
Karr, A. F., C. N. Kohnen, A. Oganian, J. P. Reiter, and A. P.
Sanil. 2006. "A Framework for Evaluating the Utility of Data
Altered to Protect Confidentiality." American Statistician 60, no.
3: 224-32.
Kennickell, Arthur B. 1997. "Multiple Imputation and
Disclosure Protection: The Case of the 1995 Survey of Consumer
Finances." In Record Linkage Techniques, edited by Wendy Alvey and
Bettye Jamerson. Arlington, Va.: Federal Committee on Statistical
Methodology.
Kennickell, Arthur, and Julia Lane. 2006. "Measuring the
Impact of Data Protection Techniques on Data Utility: Evidence from the
Survey of Consumer Finances." In Privacy in Statistical Databases:
CENEX-SDC Project International Conference, Proceedings, edited by Josep
Domingo-Ferrer and Luisa Franconi. Berlin and Heidelberg: Springer.
Kinney, Satkartar K., Jerome P. Reiter, Arnold P. Reznek, Javier
Miranda, and others. 2011. "Towards Unrestricted Public Use
Business Microdata: The Synthetic Longitudinal Business Database."
International Statistical Review 79, no. 3: 362-84.
Larrimore, Jeff, Richard V. Burkhauser, Shuaizhang Feng, and Laura
Zayatz. 2008. "Consistent Cell Means for Topcoded Incomes in the
Public Use March CPS (1976-2007)." Journal of Economic and Social
Measurement 33, no. 2: 89-128.
Lauger, Amy, Billy Wisniewski, and Laura McKenna. 2014.
"Disclosure Avoidance Techniques at the U.S. Census Bureau: Current
Practices and Research." Research Report Series (Disclosure
Avoidance) no. 2014-02. Washington: Center for Disclosure Avoidance
Research, U.S. Census Bureau.
Lee, David S., and David Card. 2008. "Regression Discontinuity
Inference with Specification Error." Journal of Econometrics 142,
no. 2: 655-74.
Little, Roderick J.A. 1993. "Statistical Analysis of Masked
Data." Journal of Official Statistics 9, no. 2: 407-26.
Machanavajjhala, A., D. Kifer, J. Abowd, J. Gehrke, and L.
Vilhuber. 2008. "Privacy: Theory Meets Practice on the Map."
In Proceedings of the 2008 IEEE 24th International Conference on Data
Engineering. Red Hook, N.Y.: Curran Associates.
McSherry, Frank. 2009. "Privacy Integrated Queries: An
Extensible Platform for Privacy Preserving Data Analysis." In
Proceedings of the 2009 ACM SIGMOD International Conference on
Management of Data. New York: Association for Computing Machinery,
http://research.microsoft.com/pubs/80218/sigmod115mcsherry.pdf
Narayanan, Arvind, and Vitaly Shmatikov. 2008. "Robust
De-Anonymization of Large Sparse Datasets." In Proceedings of the
2008 IEEE Symposium on Security and Privacy. Red Hook, N.Y.: Curran
Associates.
Ohm, Paul. 2010. "Broken Promises of Privacy: Responding to the
Surprising Failure of Anonymization." UCLA Law Review 57: 1701.
Piketty, Thomas, and Emmanuel Saez. 2003. "Income Inequality
in the United States, 1913-1998." Quarterly Journal of Economics
118, no. 1: 1-41.
Raghunathan, T. E., J. P. Reiter, and D. B. Rubin. 2003.
"Multiple Imputation for Statistical Disclosure Limitation."
Journal of Official Statistics 19, no. 1: 1-16.
Reiter, Jerome P. 2004. "Simultaneous Use of Multiple
Imputation for Missing Data and Disclosure Limitation." Survey
Methodology 30, no. 2: 235-42.
--. 2005. "Estimating Risks of Identification Disclosure in
Microdata." Journal of the American Statistical Association 100,
no. 472: 1103-12.
Rubin, Donald B. 1974. "Estimating Causal Effects of
Treatments in Randomized and Nonrandomized Studies." Journal of
Educational Psychology 66, no. 5: 688-701.
--. 1993. "Discussion: Statistical Disclosure
Limitation." Journal of Official Statistics 9, no. 2: 461-68.
Skinner, C. J., and D.J. Holmes. 1998. "Estimating the
Re-Identification Risk per Record in Microdata." Journal of
Official Statistics 14, no. 4: 361-72.
Skinner, Chris, and Natalie Shlomo. 2008. "Assessing
Identification Risk in Survey Microdata Using Log-Linear Models."
Journal of the American Statistical Association 103, no. 483: 989-1001.
Sweeney, L. 2000. "Uniqueness of Simple Demographics in the
U.S. Population." Technical report no. LIDAP-WP4. Laboratory for
International Data Privacy, Carnegie Mellon University.
U.S. Census Bureau. 2013a. SIPP Synthetic Beta: Version 6.0
[computer file], Washington; Cornell University, Synthetic Data Server
[distributor], Ithaca, N.Y.
U.S. Census Bureau. 2013b. Synthetic Longitudinal Business
Database: Version 2.0 [computer file], Washington; Cornell University,
Synthetic Data Server [distributor], Ithaca, N.Y.
U.S. Census Bureau. 2015. LEHD Origin-Destination Employment
Statistics (LODES), Washington; U.S. Census Bureau [distributor].
Warner, Stanley L. 1965. "Randomized Response: A Survey
Technique for Eliminating Evasive Answer Bias." Journal of the
American Statistical Association 60, no. 309: 63-69.
Yakowitz, Jane. 2011. "Tragedy of the Data Commons."
Harvard Journal of Law and Technology 25, no. 1.
(1.) See the online appendix, section B.1. Supplemental materials
and online appendices to all papers in this volume may be found at the
Brookings Papers web page, www.brookings.edu/bpea, under "Past
Editions."
(2.) U.S. Code Title 13, Section 9, governing the Census Bureau,
prohibits "any publication whereby the data furnished by any
particular establishment or individual under this title can be
identified" (see https://www.law.comell.edU/uscode/text/13/9,
accessed August 6, 2015). U.S. Code Title 5, Section 552a (part of the
Confidential Information Protection and Statistical Efficiency Act of
2002), which governs all federal statistical agencies, requires them to
"establish appropriate administrative, technical, and physical
safeguards to insure the security and confidentiality of records and to
protect against any anticipated threats or hazards to their security or
integrity which could result in substantial harm, embarrassment,
inconvenience, or unfairness to any individual on whom information is
maintained" (see https://www.law.comell.edU/uscode/text/5/552a,
accessed August 6, 2015).
(3.) Evfimievski, Gehrke, and Srikant (2003) and Dwork (2006) prove
it is impossible to deliver full protection against inferential
disclosures, using different, but related, formalizations of the
posterior probabilities.
(4.) See Reiter (2005), Skinner and Holmes (1998), and Skinner and
Shlomo (2008) for specifics on the risk indexes and Duncan, Elliot, and
Salazar-Gonzalez (2011, p. 114) for a review of historical uses of
swapping.
(5.) See for example U.S. Census Bureau (2013a, 2013b, 2015).
(6.) North American Industry Classification System.
(7.) Federal Information Processing Standard.
(8.) We abstract from the weight that QWI uses to benchmark certain
state-level aggregates. Formulas including weights are in the online
appendix, section H.
(9.) By the construction of the noise-infusion process for QWI, the
design of the random effects is orthogonal to ln N_(k)t.
(10.) Information about the SSB can be found here:
https://www2.vrdc.cornell.edu/news/data/sipp-synthetic-beta-file.
Information about the SynLBD can be found here:
https://www2.vrdc.cornell.edu/news/data/lbd-synthetic-data/
(11.) The Cornell-based server is located here:
http://www2.vrdc.cornell.edu/news/synthetic-data-server/step-1-requesting-access-to-sds/
Comments and Discussion
COMMENT BY
CAROLINE HOXBY I began graduate school in economics in the heyday
of survey data. Nearly all applied microeconomists relied intensely on
data from surveys supported by the federal government. This was an era
in which good researchers knew the Current Population Survey, the Public
Use Microdata Samples of the Census, and other major surveys inside and
out. They routinely discussed apparently obscure points about imputation
of missing data in a particular variable or how changing response rates
biased the time trend in another variable. Little did they know it, but
the era of survey data was already passing to make way for an era in
which administrative data would become dominant. (1) Indeed, as
discussed below, the same researchers who were steeped in survey data
were pushing causal empirical techniques that would eventually induce
more and more scholars to shift to administrative data. I was able to
see this myself by the time I wrote my dissertation: techniques like
differences-in-differences worked much more smoothly with the
administrative data on which I partially relied. Today, many newly
minted Ph.D.s in applied microeconomics have only used administrative or
other data gathered through similar means.
Administrative data are automatically compiled in the course of
administering a program. Examples are tax data; social insurance data
such as unemployment, disability, public pensions, Medicare and
Medicaid; data from patient medical visits; educational records from
schools; criminal justice data from police, courts, and incarceration;
mortgage regulation data, credit agency records; and so on. Although
usually called big data rather than administrative data, the data from
businesses, especially online businesses like Amazon or Facebook, that
can automatically compile information have a similar flavor. No one is
surveyed and data are gathered on a population of users, not a sample.
Researchers who are old enough to have used survey and
administrative data in parallel tend to appreciate the strengths of each
type. For instance, surveys can directly ask the questions to which we
would most like answers: "Are you searching for work (unemployed)
or happily out of the labor force?" Surveys can gather rich
sociodemographic data. They can reach people who do not
"participate"--people who do not use credit, for example.
However, such appreciation for surveys is falling among newly minted
economists. They often look down on survey data as obviously inferior:
the sample sizes seem too small, the responses too prone to reporting
error and missing data, and the sampling too opaque. When asked to drum
up support for a federal survey among young colleagues, I often find
their responses to be muted or even ambivalent. Why support expensive
surveys that they have not used and might never use more than
occasionally to supplement or provide descriptive context for their
analyses based on administrative data?
This is the world into which John Abowd and Ian Schmutte send their
paper on statistical disclosure limitation (SDL). It is an admirable,
thorough, and careful paper, replete with wisdom. It offers us telling
examples and gem-like insights based on them. It will surely become
economists' key reference for SDL. In short, one can learn a great
deal from the paper.
However, the paper is oddly out-of-step with the context into which
it was born: a context in which researchers are abandoning survey data
altogether. While I and the authors would almost certainly agree that
surveys ought to continue and are crucial in many applications, I think
that we predict the likely response to their paper somewhat differently.
The authors hope that it will drive researchers to become sophisticated
about SDL, account for its effects in their research, and document it
when publishing. I believe that their paper will horrify researchers who
currently are unaware of SDL but who are already dubious about survey
data. It will drive them deeper into the administrative-data-only camp.
Moreover, I disagree with the authors on who ought to bear the
burden of lessening the negative impact of SDL on the accuracy of
research. The authors put too much onus on researchers. This seems wrong
not only for practical reasons (discussed below) but also because it
flies in the face of political logic. Federal statistical agencies need
researchers to support and use their data if they are to justify the
expense of surveys. Since these same agencies introduce SDL to data that
would otherwise be free of it, they are in a far better position to
manage its impact than are researchers who are downstream of the SDL
being applied. If these agencies want to keep up their surveys, it is
they who need to take up the burden of lessening the negative impact of
SDL on research.
WHAT WE LEARN FROM THIS PAPER The authors could not be more correct
when they assert that "modern SDL procedures are a black box whose
effect on empirical analysis is not well understood." And they do
indeed "pry open the black box" and describe what they see. At
least, they describe what we are allowed to see, which in some cases is
quite limited.
SDL is intended to protect the confidentiality of survey
respondents when data are released for public use. (2) The authors
provide the example of a male household head from Athens, Georgia, who
has 10 children. He may be the only person in the entire United States
with such characteristics. Thus, if we knew his characteristics and
wanted to learn surreptitiously about his family income, we might scour
the American Community Survey and Current Population Survey in the hope
that he is a participant in one of them. Since the former is a 1 percent
sample and the latter a 0.1 percent sample of the U.S. population, our
effort would be extremely likely to end up producing nothing of interest
even if SDL were not applied. However, SDL is applied to these data and
would probably prevent us from learning his income.
The authors explain all of the SDL methods used to protect the
fecund father from Athens. All of them alter the data so that he cannot
be identified with certainty. Thus, data swapping might cause some of
his data to be swapped with data from another household head in a
different area of the country. Coarsening might make his number of
children "five or more" instead of 10. Noise infusion might
give him 10 children plus or minus three. Synthetic data would destroy
his (and everyone else's) actual data completely but would allow us
to compute certain prespecified statistics on fake data and nevertheless
obtain the correct numbers.
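The following toy sketch, constructed purely for illustration (the parameter choices are invented, not actual Census Bureau practice), shows what coarsening, noise infusion, and swapping might do to such a record.

```python
# Toy illustration of three SDL treatments on the example record.
# All parameter choices are invented for illustration.
import random

random.seed(1)
record = {"location": "Athens, GA", "children": 10, "income": 87_000}
donor = {"location": "Macon, GA", "children": 10, "income": 61_000}

def coarsen_children(k, top=5):
    """Top-code the number of children at a category like 'five or more'."""
    return f"{top}+" if k >= top else str(k)

def noise_children(k, spread=3):
    """Report the number of children plus or minus a few."""
    return k + random.randint(-spread, spread)

def swap_locations(rec_a, rec_b):
    """Swap the geography of two similar records."""
    a, b = dict(rec_a), dict(rec_b)
    a["location"], b["location"] = b["location"], a["location"]
    return a, b

print(coarsen_children(record["children"]))  # "5+"
print(noise_children(record["children"]))    # 10 plus or minus 3
print(swap_locations(record, donor)[0])      # income now paired with Macon
```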
We now see why agencies that apply SDL are unwilling to disclose
their methods with much exactitude. If we knew that the father would
always be swapped with another father of 10 in a neighboring county, we
might try to find all of the "possibles" and learn that his
income took one of only a few values. If we knew that his number of
children would be plus or minus three, we could focus on Athens fathers
with seven or thirteen children. If we had synthetic data and were
allowed to learn which statistics could be computed accurately and how
inaccurate other statistics would be, we might be able to back out the
father's actual data--albeit with an analytic and computational
burden so enormous that it would be more sensible to apply our
formidable skills and nefarious inclinations to more remunerative tasks.
In short, if agencies disclose their SDL methods in too much detail,
data users might be able to undo it. This is why agencies hesitate to
give more than vague descriptions and never disclose exact parameters.
To help us think through the effects of SDL, the authors introduce
the concept of ignorability, well known in statistics but not common
parlance among applied economists. SDL is ignorable if the researcher
can use the SDL-treated data just as though it were the clean,
confidential data and produce estimates and inferences that are the same
as the clean data would produce. The authors' discussion of
ignorability is highly useful in and of itself, even if it does not
change the way people manage SDL. Economists are already comfortable
discussing measurement error, imputation, and biases due to selection
into nonresponse. They need a framework for thinking clearly about SDL.
Using a combination of examples and models, the authors explain
which types of SDL are ignorable under which circumstances. The main
lesson is that SDL is not ignorable unless the researcher uses the data
to construct statistics that the agencies already publish or that the
agencies foresaw researchers would want to construct when setting up
SDL. Fundamentally, the problem is that
statistical agencies are forced to develop a strategy for publishing
data for public use, but in order to have a strategy they must trade off
confidentiality risk (the cost) against data usefulness (the benefit).
But whether the data are "useful" depends on the use, so
agencies are forced to decide in advance what the uses will be in order
to conduct SDL. Unless the researcher's use happens to be a use
they foresaw and took into account, SDL will negatively affect the
accuracy of estimates and inferences. The authors put this point well:
"Any such . . . strategy inherently advantages certain analyses
over others." Moreover, the agencies will not reveal their strategy
to the researcher, so she cannot even know whether her use is one that
they foresaw or one that they did not. It is thus extremely difficult
for even the most diligent researcher to avoid unintentionally
generating misleading analyses.
A researcher is on the safest ground if she is merely publishing
descriptive statistics in a noncausal analysis and those descriptive
statistics are (i) means, (ii) based on large subgroups of the data, and
(iii) fairly similar to statistics that the agencies published
themselves. Similarity to published statistics may make cross-validation
possible. For instance, if means of adjusted gross income were published
and the researcher computed mean tax payments for exactly the same
subgroups, tax law would allow the researcher to check whether the two
sets of means were reasonably compatible.
Unfortunately, most of what researchers do does not fit that
description of safe ground. Modern causal empirical methods, like
regression discontinuity and differences-in-differences (with its many
extensions), make accuracy indispensable, use subgroups that are thin
slices of the population, and compute statistics that are so unlike
those reported in government statistics that cross-validation can
definitively eliminate only outlandishly wrong estimates. Researchers
who are less concerned about causal analysis but who use SDL data in
structural models are also negatively affected: SDL always affects
inference on model parameters because researchers cannot correct for the
uncertainty it introduces.
Concrete examples may be helpful here. One particularly important
application of regression discontinuity is to compulsory schooling
laws, which generate a birthday cutoff for enrolling a child in school.
For instance, in certain school districts, if a child is age five by
September 30 she must be enrolled in kindergarten. Such cutoffs have
been used to estimate the effect of education on earnings, childbearing,
and numerous other outcomes. (3) SDL might mean that all such estimates
are wrong. This is because compulsory schooling laws necessarily
generate a slightly fuzzy discontinuity: some parents of children with a
September 30 birthday will be able to hold off enrollment for a year.
Some parents of children with an October 1 birthday will manage to
enroll their child. If SDL has added noise to birthdays, swapped
birthdays, swapped locations, or constructed synthetic data that do not
exactly foresee this application, the estimates could be highly
inaccurate. True September 30 and October 1 children could be given
August birthdays and August children could be given birthdays near the
cutoff. Children who truly live in districts where the cutoff is
November 30 could be swapped into districts where the cutoff is
September 30. Because some children do actually enroll on the
"wrong" side of the birthday cutoff, the researcher will have
no way to know whether his regression discontinuity results are
consistent or ruined by SDL. I am confident that any researcher who
reads the authors' paper and wants to use compulsory schooling laws
will henceforth flee SDL-treated data in favor of clean administrative
data (from birth certificates, for example). (4)
An important application of differences-in-differences methods is
to the Earned Income Tax Credit (EITC). The EITC has had its generosity
changed at various times, and the changes sometimes apply only to
families with, say, three or more children. Thus, a researcher might
exploit the before-after change in generosity for the families with
exactly three children, and she might use the families with exactly two
children to eliminate time trends that would have affected the
three-child families even if the generosity of the program had not
changed. If SDL changes families' numbers of children even
slightly, this empirical method could generate highly misleading
results. Actually, the situation would likely be worse. Researchers do
not typically compare all three-child families to all two-child families
with a simple differences-in-means. They usually condition on indicators
for state of residence, local area economic conditions, race, ethnicity,
mother's education, child age, and other variables. Thus, the data
are sliced into thin subgroups that could be extremely affected if SDL
has been applied to these other conditioning variables as well as to the
number of children. It is disturbing to think that researchers who
exerted so much effort analyzing the EITC with survey data could have
all their good work undone by SDL and have no way of knowing it. One can
understand why they would flee to administrative data, such as tax data,
for their next project on the topic.
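A stylized simulation with invented numbers shows the mechanism: flipping the recorded number of children for even a modest fraction of families attenuates the difference-in-differences estimate toward zero.

```python
# Stylized difference-in-differences simulation. Invented numbers:
# a $1,000 post-period effect for three-child families only.
import numpy as np

rng = np.random.default_rng(11)
n = 100_000
kids = rng.choice([2, 3], size=n)            # true number of children
effect = 1_000.0

y_pre = rng.normal(20_000, 5_000, size=n)
y_post = y_pre + 500 + effect * (kids == 3)  # common trend + treatment

def did(kids_obs, pre, post):
    """Diff-in-diff of three-child vs. two-child families."""
    t3, t2 = kids_obs == 3, kids_obs == 2
    return (post[t3].mean() - pre[t3].mean()) - (
        post[t2].mean() - pre[t2].mean()
    )

print(did(kids, y_pre, y_post))      # recovers roughly 1,000

# SDL flips the recorded number of children for 20 percent of families.
flip = rng.random(n) < 0.2
kids_sdl = np.where(flip, 5 - kids, kids)   # swaps 2 <-> 3
print(did(kids_sdl, y_pre, y_post))  # attenuated: roughly 600
```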
THE AUTHORS' SUGGESTIONS FOR RESEARCHERS Although the
authors' analysis of the effects of SDL on estimation and inference
is very helpful, their suggestions for researchers are less so.
One suggestion is that researchers attempt to back out some
information on SDL from different sources of data to which different
versions of SDL have been applied (although the researcher will not have
been told about the differences in the versions). The authors cite the
example of J. Trent Alexander, Michael Davern, and Betsey Stevenson
(2010), who demonstrate that different versions of Census and American
Community Survey data that were supposed to be the same actually
produced systematically different results when certain statistics were
computed. From this exercise, they became aware that SDL was making
certain computations unreliable. Now, we could presumably assign
numerous economists to conduct comparisons a la Alexander, Davern, and
Stevenson on a tremendous scale. We could compute numerous statistics
for all oft-used survey data until, as a profession, we derived greater
understanding of where SDL was likely to be nonignorable. This seems to
be the content of the authors' suggestion.
This suggestion does not make sense. True, in some circumstances,
researchers could--with a great deal of effort--deduce enough about SDL
to account for it better in their analyses. However, if agencies want us
to know the parameters of SDL, they ought to give them to us (which they
could do with minimal effort). If agencies do not want us to know about
certain parameters, they should not provide data that allow them to be
inferred through cross-referencing.
The authors suggest that researchers rely more on synthetic data.
But this assumes that agencies will somehow become remarkably prescient
about the research that people will want to conduct in the future with
the synthetic data. I see no evidence of such prescience. Indeed, it is
the nature of original research that it cannot be foreseen.
Realistically, agencies inevitably produce synthetic data that support
the sort of calculations they need to publish to fulfill their
reporting mandates. But since these calculations are already available,
the synthetic data may be of little further use.
Moreover, we do not want agencies to foresee the research that
people will want to conduct for the same reason we do not want agencies
to attempt causal evaluations themselves. The conflict of interest that
arises when agency staff evaluate a program that its leaders champion
(or wish to see eliminated) is enormous. Staff can be pressured to use
favorable but flawed empirical methods, apply SDL to make unfavorable
but better methods impossible to use, use SDL to make unfavorable data
disappear, and so on. If we do not wish to create an environment in
which such pressure might be brought to bear, we ought not to ask
agencies to conduct or foresee the work of outside researchers. Outside
researchers' conflict-of-interest problems that are related to
federal programs are nearly always trivial compared to those that could
arise within agencies. The agencies have more degrees of freedom
(because they collect the data and have the right to alter them) and their
leaders are far more closely identified with particular programs than
are outside researchers.
The authors also suggest data validation. This occurs when
researchers conduct all of their exploratory analysis using fake data
and then send their final code to be run on the actual, clean data. This
is a somewhat useful suggestion because it would at least allow
researchers to avoid unintentionally publishing estimates that are
grossly incorrect because of SDL, about which they were unable to learn.
But data validation is not a fix. Since the researcher is forced to
explore only fake data, she may be unable to recognize crucial patterns
in the data that actually exist. Great empiricists are people who are
superb at recognizing patterns in data and seeing the patterns'
relationship to hypotheses. Data validation makes them operate with
blindfolds on. (5)
The authors suggest that journals should require researchers to
supply details of the SDL applied to their data. They go further, in
fact, and argue that if such requirements were implemented, researchers
would lobby agencies to learn more about how SDL affected their
estimates. These arguments seem to get the incentives wrong. Journals
pressure researchers who pressure agencies to provide information that
the agencies do not want to provide. The careers of agency staff do not
depend on whether the researcher is able to publish his paper. If
anything, agencies sometimes want to pick and choose which papers get
published. Giving them an indirect mechanism for doing this is not a
good idea.
SDL SHOULD CHANGE IN RECOGNITION OF HOW THE WORLD IS CHANGING I
would argue that SDL needs to change in recognition of how the world is
changing: SDL must become nearly always ignorable when the data are used
in the ways that modern economists use data. This means that SDL
treatment must be lightened. Alternatively, survey and other data to
which SDL is normally applied must be made available in a clean form to
qualified researchers in carefully controlled, secure settings. These
changes should be initiated and accomplished by the agencies themselves.
Researchers simply do not have the tools to make these changes occur.
Why must SDL change? There are three reasons. First, as I already
emphasized, SDL can wreak havoc on modern causal empirical methods and
modern methods of inference. The methods in place were devised in an era
when it was supposed that data would be used very differently. If the
data are to be useful, SDL must keep up with methods.
Second, agencies may soon find it impossible to defend their
surveys from budget cuts. Already, support is notably falling among
young researchers. The flight to administrative data will not reverse
itself, because it is a consequence of nonreversible progress made on
methods. This reality raises the costs of any given amount of SDL: by
accelerating the switch to administrative data, it could ultimately
destroy the surveys themselves. (Here, we must differentiate SDL from
missing data imputation and reporting error. The latter problems are
also driving the switch to administrative data, but missing or erroneous
data can be extremely hard to remedy. In contrast, SDL is imposed on
data that could have remained clean.)
Third, the nightmare "database of ruin" scenarios that
the authors present under "What does SDL protect?" are made
less likely by the advent of big data from an increasing number of
Internet and other sources. While it may seem odd that people are so
willing to have their personal information in the hands of Facebook,
Google, Intuit (the company behind TurboTax and Mint), and numerous
other sites and retailers, this is the reality. If any thinking person
wants to compile a database of ruin, it would be inefficient,
unnecessarily difficult, and expensive for him to do it through a
federal survey. Why should one start with Census data in an attempt to
find the father of 10 in Athens, Georgia? It would be far easier to
offer him a small incentive to sign up for an Internet-based service
that would ask his income in return for helpful consumer or tax advice.
Each day, we read prima facie evidence that big data are now more
sought after by those with nefarious ends than are federal survey data.
The authors note that "it has been roughly six decades since the
last reported breach of data privacy within the federal statistical
system. One is hard-pressed to find a report of the American Community
Survey, for example, being 'hacked.'" It has not been six
decades since hackers stole information from credit card providers,
banks, massive stores like Target and Home Depot, and numerous Internet
sites to which people have voluntarily uploaded information. Such data
breaches occur every day. They appear to require far less effort and to
provide far more accurate income and other information than roundabout
methods applied to one-percent samples of the U.S. population. It only
seems sensible to acknowledge that risks are gravitating away from
survey data and toward other data. Surely agency efforts to prevent
confidential information leaks ought to flow in the same direction as
the risk.
REFERENCES FOR THE HOXBY COMMENT
Alexander, J. Trent, Michael Davern, and Betsey Stevenson. 2010.
"Inaccurate Age and Sex Data in the Census PUMS Files: Evidence and
Implications." Public Opinion Quarterly 74, no. 3: 551-69.
Angrist, Joshua D., and Jorn-Steffen Pischke. 2010. "The
Credibility Revolution in Empirical Economics: How Better Research
Design Is Taking the Con out of Econometrics." Journal of Economic
Perspectives 24, no. 2: 3-30.
Dobkin, Carlos, and Fernando Ferreira. 2010. "Do School Entry
Laws Affect Educational Attainment and Labor Market Outcomes?"
Economics of Education Review 29, no. 1: 40-54.
(1.) There are numerous papers on this topic, but a good
introduction is Angrist and Pischke 2010.
(2.) SDL is also applied to certain administrative data that are
released for public use. However, there are often "restricted"
versions of these data to which qualified researchers with a relevant
project can gain access in a strictly controlled environment. The
restricted versions are often free from SDL treatment. Thus, I focus on
survey data, which federal agencies appear always to treat with SDL.
(3.) There are now a large number of such papers based on data from
several countries. It is worth noting that some papers use Census data
subjected to SDL while others use administrative birth records that, as
far as is known, have not been subjected to SDL. See, for instance,
Dobkin and Ferreira 2010.
(4.) The authors point out that SDL has little effect on strict
regression discontinuity because the researcher knows that any person on
the wrong side of a discontinuity must be SDL-affected. However, this
type of strictness is merely a theoretical possibility used for
exposition of regression discontinuity. I was unable to think of a
single applied example where strictness was perfect. Even examples drawn
from authoritarian regimes and the military, which can presumably
enforce cutoffs more stringently than others, exhibit some amount of
fuzziness. Even administrative tax and social insurance data exhibit
slight fuzziness. For instance, a person who earns one dollar more than
the cutoff for a tax credit is often allowed to take the credit if she
claims it. Authorities rarely waste effort on such cases.
(5.) Of course, if the researcher could send every preliminary
result to be validated, she could accomplish pattern recognition, albeit
slowly and probably less well because of the time costs. However,
agencies that wanted to enforce strong SDL would necessarily limit the
number of results that a researcher could validate. Otherwise, the
researcher could back out the SDL parameters that the agencies wanted to
obscure.
COMMENT BY
BETSEY STEVENSON John Abowd and Ian Schmutte have done an important
public service in writing this paper. In my experience, too few researchers
are aware of statistical disclosure limitation (SDL) procedures, and
therefore far too few use the appropriate methods to adjust for
the distortions these procedures introduce. Without such an awareness,
researchers cannot make appropriate modifications or validations. These
issues are therefore not side issues but critical to researchers'
attempts to use data to make valid inferences about the world.
The authors provide a very nice framework for thinking about
ignorable and non-ignorable SDL procedures and their implications for
different types of econometric analysis. They then provide thorough
explanations of the types of SDL techniques commonly used, the way
researchers should adapt given each of these techniques, and the
tell-tale signs to identify whether data have been distorted. They
provide concrete guidance to researchers and journal editors. Without a
doubt, this paper should be required reading for empiricists,
particularly for every graduate student thinking of doing empirical
work.
Given the authors' success at delivering such guidance, in my
comments I want to focus on three big-picture issues that the paper
raises: First, what more can government statistical agencies do to
better balance the value of data against privacy concerns? Second, what
are the responsibilities of researchers and journal editors in helping
to curate our national data? And third, what are the needs for
disclosure avoidance going forward, including in other sources of data?
BALANCING THE VALUE OF DATA AGAINST PRIVACY CONCERNS One challenge
that the U.S. statistical agencies face is that they are seeking to meet
a standard of zero probability of disclosure. But as the paper makes
clear, it is not possible to have data with a zero probability of
inferential disclosure--for that to occur, the data would have to be
"all noise, no signal" and hence useless. So there is an
inherent tension in the system: the standard that those in the
statistical agencies are being held to is incompatible with the goal of
providing data that are useful.
In 2003, the Confidential Information Protection and Statistical
Efficiency Act of 2002 (CIPSEA) was signed into law, tightening
protections for statistical data collected under a pledge of
confidentiality. The goal of CIPSEA was to ensure that those who supply
information under such a pledge to statistical agencies for statistical
purposes will only have their data used for statistical purposes and
will not have that information disclosed to any unauthorized person. The
legislation made it clear that there is no acceptable level of
noncompliance with the CIPSEA pledge. The interpretation of the statute
from the Office of Management and Budget stated that agencies are
required to provide "a uniform high level of protection for all
information gathered by Federal agencies under a pledge of
confidentiality for exclusively statistical purposes" (Wallman and
Harris-Kojetin 2004, p. 1800).
Yet a "high level" is not clearly defined in terms of how
much disclosure risk one should tolerate. As a result, many interpreted
the guidance as suggesting zero tolerance. A natural policy question is
how much disclosure risk we should tolerate in our data. A related
question is whether we owe a higher degree of confidentiality to data
collected for statistical purposes (the current policy is that we do).
Finally, it is also not clear that those within the statistical agencies
are the best situated to determine the level of risk that is acceptable.
For many folks working in the statistical agencies, there is a
perception that the loss function puts infinite weight on the risk of
disclosure. This is not because the risk to any particular individual
from their data being disclosed is infinite, but because the political
risk is great: such a disclosure, even one that did minimal harm, would
lead to sharp congressional action to curtail data collection for
statistical purposes. Moreover, data workers also face great personal
risk if they are deemed responsible for a disclosure incident. A Census
employee is told that he or she could face up to five years in prison
and a $250,000 fine for having released data that allowed a business or
individual to be identified. (1) Notably, the punishment for disclosure
is not a function of whether the disclosure did harm or even whether it
revealed valuable information. Even if a Census worker disclosed
data collected by the government for statistical purposes that are also
publicly available elsewhere, he or she would face the same punishment
as if the disclosure had caused damage. In short, for statistical agency
employees, their risk is not a function of the harm of disclosure.
For example, the Census Bureau had data on which businesses
experienced flooding during Hurricane Katrina, but it could not publish
a map showing all the businesses that were flooded even though anyone
walking down the street could have hand-collected the data and published
such a map. Nor did this policy change just because Google Maps now
enables people anywhere in the world to zoom into street views using
smartphones to see detailed images of flooded businesses.
One has to wonder whether there is a principal-agent problem here,
so that those in charge of disclosure avoidance are less willing to
tolerate risk than the general public would be. Public data are
undoubtedly an essential part of our national infrastructure, but even
if we accept that Congress may take drastic action in the face of a
breach in which privacy was compromised, we may still want those in the
statistical agencies to accept some risk of this occurring. Statistical
agencies have to balance the risk of disclosure, including political
risk, with the usefulness of the data.
While economists might naturally think about how much risk should
be tolerated in terms of costs and benefits, some people object on civil
liberty grounds to the government's compelling respondents to
provide any personal data. These civil libertarians view a breach of
such data as having a greater cost than if the equivalent data collected
by a nongovernmental source were breached, for example through the
hacking of a data set compiled by a private-sector company or even the
hacking of a government administrative data set. Abowd and Schmutte
argue that a key principle of confidentiality is "that individual
information should only be used for the statistical purposes for which
it was collected" and that it "should not be used in a way
that might harm the individual." However, this misses the subtle
yet important distinction that the standard for data collected for
statistical purposes is greater than the standard for administrative
data. It also overlooks the belief held by the government, as
articulated by the Office of Management and Budget in its implementing
guidance on CIPSEA, that the "purposes for which and the conditions
under which the data were collected" (Wallman and Harris-Kojetin
2004, p. 1800) are critical when making decisions about how to protect
confidentiality and provide access to the data.
Respondents are required by law to complete Census surveys, and
they can be jailed or fined for noncompliance. Although no nonresponders
have actually been fined or jailed, some members of Congress have sought
to remove the requirement that the survey be completed so as to
explicitly make survey responses voluntary. These advocates argue that
the government should not force Americans to reveal private information.
Such civil liberty concerns were cited as the primary motivation when
Stephen Harper, then Canada's prime minister, made the Canadian
long-form Census voluntary, a move that has dramatically decreased the
statistical reliability of the data. Harper's stated justification
was that citizens should not "be forced, under threat of fines,
jail, or both, to disclose extensive private and personal
information" (Casselman 2015).
In general, the majority of the U.S. public has been concerned
about government data and privacy risk for some time, although this
concern is not limited to data collected for statistical purposes. Since
the 1980s, about once every 10 years the General Social Survey has asked
respondents whether increased computing power coupled with the federal
government's access to private information presents a threat to
individual privacy. (2) Consistently, as shown in my figure 1, a
majority of adult respondents--about two-thirds in 2006--state that this
access to information presents either a very serious or a fairly serious
threat to privacy.
Beyond thinking about how much disclosure risk we should tolerate,
we should also consider who should be held responsible for a breach.
Currently, it is employees of the statistical agencies who make
decisions about the trade-off between risk and useful data while facing
the threat of legal and financial sanction if the data are misused. If
legal sanctions were instead directed toward those who would misuse the
data, this could provide a level of protection that would allow the
pendulum to shift more toward useful data. For example, making it
illegal to attempt to identify people in purposefully anonymized data
could provide protections that go beyond data collected for statistical
purposes.
Currently, some users of the data do bear personal responsibility
and are rewarded with access to less distorted data--researchers have
access to data with fewer manipulations applied through the Federal
Statistical Research Data Centers (RDCs). According to Census, aside from
some swapping, there are very few adjustments made to these data, so
researchers should know that even when public-use files are available,
the data that they can use in the RDCs may be better suited for their
projects. More generally, this illustrates the benefits to making data
available of having trusted users. Another option is for Census to
develop more licensed data products. Licensing data would allow Census
both to expand the number of trusted users and to employ the threat of
legal sanctions as a substitute for greater disclosure avoidance
methods, without the costs to the agencies and users of RDCs.
Additionally, Census could move in the direction that the authors
suggest and use more synthetic data with validation by the statistical
agency. RDCs could potentially be used for validation studies, although
Census argues that it is not currently set up to do that. The difficulty
is funding. And on a purely practical level, having the right staffing
to create more synthetic data presents a challenge--the application of
SDL techniques takes a different, and lesser, skill set than what is
required for creating synthetic data. That current staffing is not
well-suited to the creation of synthetic data creates a bias toward
nonsynthetic data techniques. So even though synthetic data will surely
be preferable to substitution and suppression in the medium to long
term, researchers and perhaps private sector funders may need to play a
role.
RESEARCHERS' AND JOURNAL EDITORS' RESPONSIBILITY TO HELP
CURATE OUR NATIONAL DATA Given the increasing costs associated with
providing useful data in which privacy is protected, a natural policy
question is whether Census and the statistical agencies should move
toward a fee-for-service model in which they charge for validation or
charge for licensed data. There are few precedents for statistical
agencies collecting money from outside sources, but in an era of budget
cuts this may be one of the only ways to increase access to the data.
Two obstacles stand in the way of such a funding mechanism. The first
is practical: for academic researchers, fees may simply move money
around the government, since researchers would seek government funding
for validation fees if they were imposed, ultimately leaving the
government to fund the full cost of data provision. However, a
fee-for-service model could help
ensure that the most valuable data get created and maintained by
allowing data users to direct funding to data. The second obstacle
relates to civil liberties: Can data that respondents are required by
law to provide be "sold," even when the fee covers only the marginal
cost of, for example, a validation study?
What about private sector funding? With more of the data masked for
disclosure avoidance reasons, demand for Census's internal analysis
of the data will grow. Should Census also operate a consulting arm in
which it provides analysis of data, including program evaluation, for a
fee? Currently, program evaluators may not be able to access restricted
data since there is concern about allowing private sector researchers
into the RDCs to conduct program evaluation. Even when such work is
being done for a government agency and cannot be done with the
public-use files, the current system is not designed to allow paid
contractors access to the data.
Let me turn to the economics profession's responsibility. The
authors deserve enormous praise for this really important work designed
to help researchers understand how to use the data that are produced and
made available by statistical agencies and to understand issues of
privacy that impact all data sets. Our profession has far too few
rewards for academics who contribute to the public good by helping to
curate and improve our national statistics. This incentive structure
leaves too many academics ignorant about the way data are collected and
prepared for use, and it means that, too often, problems in the data go
undiscovered and improvements go unmade.
We in the profession should have standards under which graduate
students, as part of their training, are more actively engaged in
validating both research and data. For instance, one could set up RDCs
so that graduate students used them to validate papers as part of their
graduate training and as a useful assessment of both the authors'
methodology and their data.
Perhaps most importantly, in reviewing empirical research a first
question should be whether an empirical finding can be replicated in
other available data sets. While this will not always resolve or even
illuminate issues related to disclosure limitation procedures, it may
flag them and help to identify problems in the data more generally,
including those related to data masking. Multiple data sets are not
always available, but to the extent that they are, validation across
data sets should be de rigueur. Empirical researchers
should test results across as many data sets as possible, in the same
way researchers run specification tests or other tests for sensitivity
of their results.
The authors discuss problems with Census 2000 and several years of
the American Community Survey (ACS) and the Current Population Survey
(CPS) that stemmed from the misapplication of statistical disclosure
limitation procedures. These problems went undetected for many years and
were discovered in research that I did with J. Trent Alexander and
Michael Davern comparing marriage rates by age across several data sets;
we noticed inconsistent findings around the interaction of marriage and
reaching full retirement age (Alexander, Davern, and Stevenson 2010).
What at first appeared to be an interesting pattern of divorce around
qualification for Social Security turned out to be a spurious result
that reflected the misapplication of disclosure limitation procedures.
That misapplication led to age- and sex-specific population estimates
generated from the original ACS and CPS public-use files that differed
by up to 15 percent from the counts using the full, confidential data.
Although Census did not release corrected versions of the CPS
public-use microdata, it did amend the public-use ACS and Census files.
My figure 2 shows that the perturbed and unperturbed data in the 2009
CPS still have substantial differences in the male-female ratio. The
Census did amend its age perturbation procedures for the CPS to attempt
to reduce these discrepancies, changes that became effective in January
2011. The episode also led the agency to compare income and poverty summary
statistics between the perturbed and unperturbed files; when it did so,
it found that, with a few exceptions of narrow age categories and
race/ethnic groups, poverty rates and average incomes were statistically
similar (at the 10 percent level) across the files. But the places where
differences occur illustrate the types of challenges that Abowd and
Schmutte lay out in their paper. My figure 3 shows that mean earnings by
race for men age 65 and older are similar in the perturbed and
unperturbed data. However, as my figure 4 shows, when the ages are
broken down further--separating those ages 65 to 69 from those 75 and
older--substantial differences can be seen across the perturbed and
unperturbed data.
[FIGURE 2 OMITTED]
[FIGURE 3 OMITTED]
[FIGURE 4 OMITTED]
This instance demonstrates that we can improve on our data
collection and processing capabilities and that the profession plays an
important role in ensuring that the data are reliable and useful. As the
economics field has become more empirical, these considerations have
become increasingly important, not just for statistical agencies but for
individual researchers as well. And because the economics profession is
in the midst of a revolution in which greater empiricism is coupled with
accelerating computing power for collecting and analyzing data, we need
to grapple with the trade-off between transparency and anonymity:
between increasing access to comprehensive data, easing the replication
of empirical results, and providing transparent analysis on the one hand
and, on the other, ensuring that respondents are not identifiable from
the survey information they provide. The replication movement pushed
people to make their data available to other researchers, but in a world
in which publicly used data are held to a replication standard, it may
take more effort to balance transparency and anonymity, so the need for
us to increase the rewards for replication is even greater.
THE NEED FOR DISCLOSURE AVOIDANCE GOING FORWARD While some may
argue that the professional researcher will turn away from government
survey data, shifting instead to administrative or private-sector data,
this offers a false sense of protection for researchers. Researchers are
indeed shifting away from government survey data, something that Raj
Chetty demonstrated at the 2012 National Bureau of Economic Research
Summer Institute by showing that the use of such data in leading
economic journals has steadily fallen since 1980, while papers using
administrative data have become increasingly common (Chetty 2012). His
results are shown in my figures 5 and 6, which track journals through
2010; an examination of more recent years suggests these trends
continued through 2014. At the same time, researchers are increasingly
collecting their own data through field experiments and randomized
controlled trials (List and Rasul 2010). Both of these trends highlight
that researchers today are less limited, in the questions they can ask,
by the data that the federal government makes publicly available.
However, this increasing scope to collect data ourselves comes with
its own privacy concerns. All government-funded research in the United
States is governed by the "Common Rule," a set of ethics
guidelines regarding biomedical and behavioral research. This regulation
governs Institutional Review Boards and provides guidance on disclosure
limitation, requiring that personally identifiable information remain
confidential at all times. However, the guidelines under the Common Rule
are not as clear as the ones that Census must follow, and while data
collected by the federal government for statistical purposes may, if
anything, be overprotected against disclosure, other data sets may
be too vulnerable.
[FIGURE 5 OMITTED]
[FIGURE 6 OMITTED]
For example, the HIPAA Privacy Rule establishes standards to
protect people's health and personal medical information. (3) This
means that insurers, health care providers, and clearinghouses should
ensure that any health information they release is not identifiable, in
part by removing information such as names and record numbers, much as
government agencies do. However, these rules are likely insufficient.
A growing risk common to all personal data is that outside data
sources can be mapped onto published data in an attempt to identify
people. As Abowd and Schmutte discuss, it is now known that
people can be identified using a small number of demographic attributes.
The authors give some examples. Another example was demonstrated
recently by researchers at Harvard, who showed that they could identify
people in the Personal Genome Project with 97 percent success using
participants' ZIP codes, birth dates, and sex by simply matching
these data with voter lists (Sweeney, Abu, and Winn 2013).
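To make the mechanics of such a linkage attack concrete, here is a
minimal Python sketch. All of the records, names, and field names below
are fabricated for illustration; the point is only that once ZIP code,
birth date, and sex survive in a release, a simple join against an
identified roster can re-identify people.

    # All records below are fabricated. A "de-identified" release that
    # keeps ZIP code, birth date, and sex can be joined against a public,
    # identified roster (such as a voter list) on those same fields.
    anonymous_release = [
        {"zip": "30605", "birth_date": "1970-03-12", "sex": "M", "genome_id": "PGP-0417"},
        {"zip": "30605", "birth_date": "1982-11-02", "sex": "F", "genome_id": "PGP-0932"},
    ]
    voter_list = [
        {"name": "J. Doe", "zip": "30605", "birth_date": "1970-03-12", "sex": "M"},
        {"name": "A. Smith", "zip": "30605", "birth_date": "1955-07-30", "sex": "F"},
    ]

    QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

    def key(record):
        # Project a record onto its quasi-identifier fields.
        return tuple(record[q] for q in QUASI_IDENTIFIERS)

    # Index the identified roster by quasi-identifier, then look up each
    # "anonymous" record; a unique match re-identifies it.
    roster = {key(v): v["name"] for v in voter_list}
    for record in anonymous_release:
        name = roster.get(key(record))
        if name is not None:
            print(record["genome_id"], "re-identified as", name)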
While many private-sector providers are limited in the types of
information they are able to make public, other detailed personal
data--including identifying information that government data sets do not
publish--are not subject to these constraints. For example, a simple
Google search of the last name "Anderson" and a common male
first name beginning with a "J" in the state of New York
provided information on one Mr. Anderson's age, phone number, and
current and past addresses. Using these addresses, data from Zillow
provides detailed information on Mr. Anderson's current home,
including the number of rooms, the price he paid for his home when he
bought it in 2009, and how much he has paid in property taxes each year
since.
All of this information is public record, but the fact that all of
this information about Mr. Anderson could be found within 90 seconds
illustrates the ease of access that private-sector actors commonly
provide to information that many Americans would consider--and likely
assume to be--private. And as the researchers showed with the Personal
Genome Project, simply using Mr. Anderson's age, ZIP code, and
gender may be enough to link Mr. Anderson to sensitive information
contained in supposedly anonymous data sets. Similarly, research using
1990 U.S. Census data found that 87 percent of Americans had reported
characteristics that could uniquely identify them (Sweeney 2000). This
expansion in access to very sensitive, privately collected data is
driving some of the need to take greater care with our public-use files.
In this way, what the private sector does to safeguard privacy is
intimately linked to what the government statistical agencies need to
do.
REFERENCES FOR THE STEVENSON COMMENT
Alexander, J. Trent, Michael Davern, and Betsey Stevenson. 2010.
"Inaccurate Age and Sex Data in the Census PUMS Files: Evidence and
Implications." Public Opinion Quarterly 74, no. 3: 551-69.
Casselman, Ben. 2015. "What We Don't Know About Canada
Might Hurt Us." FiveThirtyEight.com (blog), August 11.
Chetty, Raj. 2012. "Time Trends in the Use of Administrative
Data for Empirical Research." Presentation at the National Bureau
of Economic Research Summer Institute,
http://www.rajchetty.com/chettyfiles/admin_data_trends.pdf
List, John A., and Imran Rasul. 2010. "Field Experiments in
Labor Economics." In Handbook of Labor Economics, Vol. 4A, edited
by Orley Ashenfelter and David Card. Amsterdam: North-Holland.
Sweeney, Latanya. 2000. "Simple Demographics Often Identify
People Uniquely." Data Privacy Working Paper no. 3, Carnegie Mellon
University. http://dataprivacylab.org/projects/identifiability/paper1.pdf
Sweeney, Latanya, Akua Abu, and Julia Winn. 2013. "Identifying
Participants in the Personal Genome Project by Name." White Paper
1021-1. Harvard University.
http://dataprivacylab.org/projects/pgp/1021-1.pdf
Wallman, Katherine K., and Brian A. Harris-Kojetin. 2004.
"Implementing the Confidential Information Protection and
Statistical Efficiency Act of 2002." In Proceedings of the Survey
Research Methods Section. Alexandria: American Statistical Association
(revised version published in [2004] Chance 17, no. 3: 21-25).
(1.) 13 U.S.C. 214.
(2.) The complete text of this and other General Social Survey
questions may be viewed in the General Social Survey "1972-2014
Cumulative Codebook" made available online by the National Opinion
Research Center at http://publicdata.norc.org/GSS/DOCUMENTS/BOOK/GSS_Codebook.pdf.
This survey question appears there on p. 2077.
Note that question wording for many questions varies slightly across
years.
(3.) The complete text of the "Health Insurance Portability
and Accountability Act of 1996" (HIPAA) may be viewed at
http://www.gpo.gov/fdsys/pkg/PLAW-104publ191/pdf/PLAW-104publ191.pdf.
Figure 1. Percent of Adults Believing Government Access to Personal
Data Presents a Privacy Threat (a)

1985: 63 percent
1996: 74 percent
2006: 66 percent

Source: General Social Survey; Council of Economic Advisers
calculations.
a. The General Social Survey question states: "The federal
government has a lot of different pieces of information about
people which computers can bring together very quickly. Is this a
very serious threat to individual privacy, a fairly serious threat,
not a serious threat, or not a threat at all to individual
privacy?" This figure shows the percentage of respondents who
answered either "very serious threat" or "fairly serious threat."
GENERAL DISCUSSION John Haltiwanger spoke first to say that like
the discussants, he thought this was a thought-provoking paper. It
opened up the doors so that one could look into the sausage factory, so
to speak, and see that what is inside is not very pretty. For example,
on the survey side, nonresponse rates are incredibly high. So even
before the statistical disclosure limitation (SDL) occurs, an enormous
amount of editing and imputation goes on. Haltiwanger thought the paper
was too optimistic about the potential virtues of using the synthetic
data and validation, because so much work in the statistical agencies
goes into creating the micro-datasets, for example to figure out what
industry a particular establishment belongs to or where it is located.
In cleaning the data, those decisions are not made hard and fast or once
and for all. In fact, the whole process is a moving target. Indeed, he
added, people who work with the confidential data, as he does, spend
most of their time on that data cleaning business.
He also felt it was important to recognize that the statistical
agencies are very intensive users of administrative data, not only for
the tabulations the authors discussed but also for micro-datasets, so
all the SDL issues that they raised also apply to the administrative
datasets. This is the case for County Business Patterns, the Quarterly
Census of Employment and Wages, and a lot else.
Katharine Abraham agreed with discussant Caroline Hoxby that the
main purpose of collecting survey data is to inform policy. Policy
officials at organizations such as the Federal Reserve Board and the
Department of the Treasury care a great deal about having current
information on employment, wages, output, prices and so on. Academic
researchers typically have different needs, but in truth are not the
data users that the economic statistics agencies view as their primary
customers. The need for information on current economic conditions is
unlikely to be satisfied by administrative data.
Abraham took issue with Hoxby's suggestion that the
statistical agencies are overly concerned about the risk of disclosure.
Serious hackers, paparazzi, and advertisers might have little interest
in identifying the individuals or firms that have provided survey
responses, she said, but there are other people who are very concerned
about privacy and believe that the government collects too much
information about its citizens. Such individuals could use any breach of
privacy to embarrass the government and press for cutting back on what
the government collects. From that standpoint, she is sympathetic to the
statistical agencies' concern about needing to protect the
confidentiality of survey respondents. At the same time, Abraham agreed
wholeheartedly with the goal of expanding access to raw data through the
research data centers. Given the legal and budgetary constraints they
face, she believes the statistical agencies have done a commendable job
of finding ways to make micro-data available for research purposes.
Christopher Carroll picked up on a point made by discussant Betsey
Stevenson concerning how decision makers in government agencies often
have personal incentives on the job that are not aligned with benefiting
the public as a whole. He argued that one way to address that problem
might be to systematize professional rewards to academics who actively
participate in the improvement and development of the data resource once
an initial version has been created. That is, one should establish a
permanent link between the research community and the agencies so that
visiting academics have a voice in setting the public release policies,
because in order for their research to be influential (or perhaps to be
published) it is necessary for other scholars to have access to the data
on which it is based. Outside researchers in this kind of role could be
funded by the National Science Foundation, for example, and professional
rewards could take the form of being given the first access to the data
that one has helped to get released.
A more indirect but ultimately more powerful approach, Carroll
thought, would be to further entrench and extend the ethic that
government policymaking should be evidence-based. Those inside a
government agency could more easily push for allowing much more
transparency in the data that are released if there is an expectation
that making the public case for a policy requires transparent evidence.
Justin Wolfers suggested that the authors were not critical enough
of the stupidity of statistical disclosure limitations. As an example,
he mentioned a marital history supplement that he relies on in his work,
where even such an obscure detail as whether a person had been divorced
before 1954 could not be disclosed, even though it could not possibly
reveal to the public who a person was. The Current Population Survey is
hamstrung by many such absurd limitations. It also struck Wolfers as
very significant that, to his knowledge, to date not a single person has
been prosecuted for breaking these disclosure laws.
A third point he wished to make was that historically, the people
one ought to be most worried about violating individuals' privacy
are actually government policy makers. A very serious case of this was
the Census Bureau's exposure of the names and addresses of Japanese
Americans during World War II when the government chose to intern those
people in camps. Congress had passed a law giving the bureau the power
to do this--the problem was not the research community.
Wolfers' fourth and final comment was that those doing the
damage ought to be identified plainly, and in his view there were two
groups to blame: one, the Census Bureau employees who are deeply risk
averse because they are more worried about their jobs and programs being
killed than about the broader benefits to the American people, and two,
the so-called Tea Party. The latter group of individuals has put real
fear into the first group, who worry that as soon as a single example of
a privacy breach occurs it is going to be used as the political weapon
with which to shut down the American Community Survey or even the Census
Bureau itself.
Hoxby spoke up to clarify points she had made in her comment.
First, when she spoke about valuable microdata that could take the place
of survey data, she was referring to what people in applied
microeconomics now increasingly use: tax data, Social Security
Administration data, Health and Human Services data, and so on. These
data are now available to qualified researchers for numerous other
countries. Research increasingly focuses outside the United States as a
result. However, even if one looks only at U.S.-focused studies, young
researchers now rely mainly on administrative data rather than the much
more expensive survey data. This is not because young researchers are
naive about the flaws in administrative data. Rather, despite their
flaws, the administrative data are judged to be superior. The more young
researchers know about statistical disclosure techniques, the more this
view will be reinforced.
Hoxby argued that it is crucial for policy evaluation to be
conducted by researchers outside the government. Yes, the
government's production of descriptive statistics is very valuable.
However, it is naive to think that people in the government, who have
career and other incentives to support certain policies, should be the
only ones with the untampered-with data and administrative data needed
to conduct policy evaluation. When qualified outside researchers have
access to data, evaluations of government policy are more disciplined
and less likely to be propagandistic.
John Abowd responded to the comments first by addressing the topic
of administrative data. In fact, he said, computer scientists have been
discussing this issue for about a decade already. The most relevant
person in this field is Cynthia Dwork, a lead scientist at Microsoft who
has championed methods in computer science that apply the same
statistical disclosure limitation to every look at the data. The method
is known as epsilon differential privacy and is well known to all of the
younger computer science practitioners. It imposes the inferential
disclosure limit he had spoken of in his randomized response example,
which was taken from Dwork and Aaron Roth's published work on
differential privacy.
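For readers unfamiliar with the idea, the following is a minimal Python
sketch of the classic randomized response mechanism as it is usually
presented in the differential privacy literature, not necessarily the
exact parameterization used in Abowd's example. With fair coins it
satisfies epsilon-differential privacy with epsilon = ln 3, yet still
lets an analyst estimate the population proportion of "yes" answers.

    import math
    import random

    def randomized_response(true_answer: bool, rng=random) -> bool:
        # Flip a coin: on heads, report the truth; on tails, flip again
        # and report that second coin. Every reported answer is deniable.
        if rng.random() < 0.5:
            return true_answer
        return rng.random() < 0.5

    # With fair coins, P(report yes | truth yes) = 3/4 and
    # P(report yes | truth no) = 1/4, so the mechanism satisfies
    # epsilon-differential privacy with epsilon = ln(3).
    epsilon = math.log(3)

    # The analyst can still estimate the population share p of true "yes"
    # answers: E[reported yes rate] = p/2 + 1/4, so p_hat = 2*rate - 1/2.
    truth = [random.random() < 0.3 for _ in range(100_000)]  # true p = 0.3
    reports = [randomized_response(t) for t in truth]
    rate = sum(reports) / len(reports)
    print(f"epsilon = {epsilon:.3f}, estimated p = {2 * rate - 0.5:.3f}")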
With this method, a differential privacy filter is placed between
every researcher and the data, something that is already normal at
Microsoft when statistical researchers there look at confidential search
logs. Our public agencies are not similarly constrained, but that is
likely to change, because the people being trained at every leading
university already know how to work this way to safeguard data privacy,
and now they are collaborating with statisticians and economists.
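What such a filter might look like is sketched below, under our own
assumptions: a toy query server that answers counting queries through
the Laplace mechanism and tracks a per-analyst privacy budget. The
class name, interface, and data are illustrative only, not a
description of Microsoft's or any statistical agency's actual system.

    import random

    class DifferentialPrivacyFilter:
        # Toy query server: every count query is answered through the
        # Laplace mechanism, and each analyst draws down a fixed privacy
        # budget. Names and interface are illustrative only.
        def __init__(self, data, total_epsilon):
            self.data = data
            self.budget = total_epsilon

        def count(self, predicate, epsilon):
            if epsilon > self.budget:
                raise RuntimeError("privacy budget exhausted")
            self.budget -= epsilon
            true_count = sum(1 for row in self.data if predicate(row))
            # A count has sensitivity 1 (one person changes it by at most
            # 1), so Laplace noise with scale 1/epsilon gives epsilon-DP.
            # The difference of two Exp(epsilon) draws is Laplace(0, 1/epsilon).
            noise = random.expovariate(epsilon) - random.expovariate(epsilon)
            return true_count + noise

    incomes = [23_000, 47_500, 61_000, 88_000, 120_000]
    server = DifferentialPrivacyFilter(incomes, total_epsilon=1.0)
    print(server.count(lambda x: x > 50_000, epsilon=0.5))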
Abowd added that he and coauthor Ian Schmutte have written another
paper that discusses the technological solutions to the privacy
challenge and addresses the issues that Hoxby and Betsey Stevenson spoke
about. It models the choices involved in handling the incentive problems
that citizens are concerned about in the matter of safeguarding personal
data. He added that trying to solve the problem by using administrative
data as an alternative, although it is a work-around that he and many
colleagues have been using for a full decade, is not a true solution but
only kicks the core problem further down the road.
He agreed with Hoxby and Stevenson that part of the burden to solve
this rests with the researchers and part of the burden rests with the
statistical agencies. In this paper he and Schmutte stressed the
users' obligations simply because the research community can do
something about that.
Responding to Haltiwanger's point about the cleaning and
editing of the raw data, he noted that the principal difference is that
those activities are generally revealed in excruciating detail in the
technical summaries and academic papers, and they are also flagged on
most public-use datasets. In fact, the latest versions of the synthetic
data projects at the Census Bureau flag all imputed variables, which
researchers have the freedom to reverse if they wish. Anything that one
can conceptualize probabilistically can be put into the synthetic data,
so what matters most for researchers is that they be aware of what was
done with the data beforehand, be it editing, imputation, sampling, or
confidentiality protections. The paper tried to shine some light on the
confidentiality protections.
Ian Schmutte had a short comment in response to Haltiwanger's
point as well. He noted that of all the disheartening things they
observed when they peered inside the "sausage factory" of the
statistical agencies, the methods associated with statistical disclosure
limitations were much less concerning than other problems, such as the
high nonresponse rates. He mentioned the work of Barry Hirsch, which has
demonstrated how missing data and the imputations used to address them
create very large problems in analyses. His impression is that the
biases resulting from statistical disclosure limitations are much
smaller, although the suppression rate associated with this is, in fact,
still unknown.
Table 1. Estimated Variance of QWI Establishment Noise Factor ([delta]) (a)

                            County-sector (b)         County                 State
                          Employment   Payroll   Employment  Payroll   Employment  Payroll
                           (ln) (1)   (ln) (2)    (ln) (3)   (ln) (4)   (ln) (5)   (ln) (6)
Number of                    -0.281     -0.211      -0.209     -0.155     -0.144      0.205
  establishments (ln)       (.0016)    (.0017)     (.0061)    (.0062)    (.0679)    (.0987)
Constant                     -1.527     -1.610      -2.027     -1.962     -4.679     -6.747
                            (.0153)    (.0070)     (.0408)    (.0422)    (.7885)    (1.159)
No. of observations         228,770    236,925      18,000     18,057        282        282
R-squared                    0.1246     0.0582      0.0604     0.0324     0.0138     0.0196
Var. fuzz (V[delta])          0.046      0.040       0.017      0.020     0.0001      0.000

Source: QCEW and QWI data for Q1 for years 2006-11.
(a.) Each column reports estimates of a bivariate regression of
the log coefficient of variation between QCEW and QWI employment
(payroll) onto the natural logarithm of the number of
establishments (reported in QCEW). The variance of the QWI noise
factor is estimated as V[delta] = exp(2 x Constant).
(b.) Estimates from data disaggregated by county and NAICS major
sector.
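As a quick arithmetic check, the Var. fuzz row can be reproduced from
the estimated constants via the formula in note a. The short Python
sketch below (our own, for illustration) does so for the employment
columns; small discrepancies reflect rounding of the published
coefficients.

    import math

    # Reproduce the Var. fuzz row of Table 1 from the estimated constants
    # via note a's formula, V[delta] = exp(2 x Constant), using the
    # published employment-column constants.
    constants = {
        "county-sector": -1.527,  # column 1
        "county": -2.027,         # column 3
        "state": -4.679,          # column 5
    }
    for level, c in constants.items():
        print(level, round(math.exp(2 * c), 4))
    # Prints roughly 0.047, 0.017, 0.0001, matching the reported
    # 0.046, 0.017, and 0.0001 up to rounding of the constants.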
Table 2. Estimated Variance of QWI Establishment Noise Factor
([delta])--Mixed Models (a)

                             County-sector            County                 State
                          Employment   Payroll   Employment  Payroll   Employment  Payroll
                           (ln) (1)   (ln) (2)    (ln) (3)   (ln) (4)   (ln) (5)   (ln) (6)
Number of                    -0.321     -0.219      -0.224     -0.153     -0.164      0.254
  establishments (ln)       (.0035)    (.0030)     (.0166)    (.0117)    (.1467)    (.1688)
Constant                     -1.405     -1.578      -1.971     -1.979     -4.455     -7.270
                            (.0126)    (.0119)     (.1179)    (.0786)    (1.692)    (1.961)
No. of observations         228,770    236,925      18,000     18,057        282        282
Var. fuzz (V[delta])          0.060      0.043       0.019      0.019     0.0001      0.000

Source: QCEW and QWI data for Q1 for years 2006-11.
(a.) Each column reports estimates of a mixed-effects model of
the log coefficient of variation between QCEW and QWI employment
(payroll) that includes fixed effects for the natural logarithm
of the number of establishments (reported in QCEW) and random
slopes and intercepts at the county-sector, county, and state
level, respectively.