Journal: Bulletin of the Technical Committee on Data Engineering
Publication year: 2016
Volume: 39
Issue: 2
Page: 38
Publisher: IEEE Computer Society
Abstract: Enterprises have been acquiring large amounts of data from a variety of sources to build their own “Data Lakes”, with the goal of enriching their data assets and enabling richer and more informed analytics. The pace of the acquisition and the variety of the data sources make it impossible to clean this data as it arrives. This new reality has made data cleaning a continuous process and a part of day-to-day data processing activities. The large body of data cleaning algorithms and techniques is strong evidence of how complex the problem is, yet it has had little success in being adopted in real-world data cleaning applications. In this article we examine how the community has been evaluating the effectiveness of data cleaning algorithms, and whether current data cleaning proposals are solving the right problems to enable the development of deployable and effective solutions.