Automatic Data Repair: Are We Ready to Deploy?

  • Wei Ni
  • Xiaoye Miao*
  • Xiangyu Zhao
  • Yangyang Wu
  • Shuwei Liang
  • Jianwei Yin
  • *Corresponding author for this work

Research output: Contribution to journal, Conference article, peer-review

Abstract

Data quality is paramount in today’s data-driven world, especially in the era of generative AI. Dirty data with errors and inconsistencies usually leads to flawed insights, unreliable decision-making, and biased or low-quality outputs from generative models. The study of repairing erroneous data has gained significant importance. Existing data repair algorithms differ in information utilization and problem settings, and are tested in limited scenarios. In this paper, we compare and summarize these algorithms with a driven information-based taxonomy. We conduct a comprehensive evaluation of 12 mainstream data repair algorithms on 12 datasets under various data error rates, error types, and 4 downstream analysis tasks, assessing their error reduction performance with a novel but practical metric. We develop an effective and unified repair optimization strategy that substantially benefits the state of the art. We conclude that data repair is always worthwhile. Clean data does not determine the upper bound of data analysis performance. We provide valuable guidelines, challenges, and promising directions in the data repair domain. We anticipate that this paper will enable researchers and users to better understand and deploy data repair algorithms in practice.

Original language: English
Pages (from-to): 2617-2630
Number of pages: 14
Journal: Proceedings of the VLDB Endowment
Volume: 17
Issue number: 10
State: Published - 2024
Externally published: Yes
Event: 50th International Conference on Very Large Data Bases, VLDB 2024 - Guangzhou, China
Duration: 24 Aug 2024 to 29 Aug 2024
