Journal: Bulletin of the Technical Committee on Data Engineering
Publication year: 2016
Volume: 39
Issue: 2
Page: 63
Publisher: IEEE Computer Society
Abstract: Traditionally, data cleaning has been performed as a pre-processing task: after all data are selected for a study (or application), they are cleaned and loaded into a database or data warehouse. In this paper, we argue that data cleaning should be an integral part of data exploration. Especially for complex, spatio-temporal data, it is only by exploring a dataset that one can discover which constraints should be checked. In addition, in many instances, seemingly erroneous data may actually reflect interesting features. Distinguishing a feature from a data quality issue requires detailed analyses, which often include bringing in new datasets. We present a series of case studies using the NYC taxi data that illustrate data cleaning challenges that arise for spatial-temporal urban data and suggest methodologies to address these challenges.