IterClean: An Iterative Data Cleaning Framework with Large Language Models

  • Wei Ni
  • Kaihang Zhang
  • Xiaoye Miao*
  • Xiangyu Zhao
  • Yangyang Wu
  • Jianwei Yin

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In the era of generative artificial intelligence, the accuracy of data is paramount: erroneous data often leads to faulty outcomes and economic losses. Previous cleaning methods follow a sequential detect-repair paradigm, which leaves over half of the errors unresolved in real-world scenarios. We introduce IterClean, an iterative data cleaning framework built on large language models (LLMs). The framework proceeds in two steps: data labeling and iterative data cleaning. Starting from only a few labeled tuples, IterClean iterates over an error detector, an error verifier, and an error repairer to substantially improve cleaning performance. Extensive experiments across four datasets demonstrate that IterClean achieves an F1 score up to three times higher than that of the best state-of-the-art approaches, while requiring only five labeled tuples.
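The record itself gives no implementation details, but the detect-verify-repair loop the abstract describes can be sketched in a few lines. The sketch below is illustrative only, under assumed interfaces: the table representation and the detect, verify, and repair callables are hypothetical stand-ins, and in IterClean each of those roles would be backed by LLM prompting seeded with the few labeled tuples, not by the toy rules shown here.

    from typing import Callable

    Row = dict[str, str]     # one tuple of the table, as attribute -> value
    Cell = tuple[int, str]   # (row index, attribute name)

    def iterative_clean(
        table: list[Row],
        detect: Callable[[list[Row]], set[Cell]],   # flags suspicious cells
        verify: Callable[[list[Row], Cell], bool],  # filters out false positives
        repair: Callable[[list[Row], Cell], str],   # proposes a corrected value
        max_rounds: int = 5,
    ) -> list[Row]:
        """Repeat detect -> verify -> repair until no verified errors remain."""
        for _ in range(max_rounds):
            flagged = detect(table)
            errors = {cell for cell in flagged if verify(table, cell)}
            if not errors:
                break  # converged: nothing left to repair
            for row_idx, attr in errors:
                table[row_idx][attr] = repair(table, (row_idx, attr))
        return table

    # Toy rule-based stand-ins for the three roles.
    def detect(table):
        return {(i, "age") for i, row in enumerate(table) if not row["age"].isdigit()}

    def verify(table, cell):
        i, attr = cell
        return not table[i][attr].isdigit()  # confirm the flag is a real error

    def repair(table, cell):
        i, attr = cell
        return table[i][attr].replace("o", "0").replace("O", "0")  # e.g. an OCR typo

    rows = [{"age": "23"}, {"age": "4o"}]
    print(iterative_clean(rows, detect, verify, repair))  # -> [{'age': '23'}, {'age': '40'}]

The point of iterating is that a repair can expose new inconsistencies; re-running detection on the repaired table each round is what separates this loop from the one-pass detect-repair paradigm the abstract critiques.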

Original language: English
Title of host publication: Proceedings of ACM Turing Award Celebration Conference - CHINA 2024, TURC 2024
Publisher: Association for Computing Machinery
Pages: 100-105
Number of pages: 6
ISBN (Electronic): 9798400710117
DOIs
State: Published - 5 Jul 2024
Externally published: Yes
Event: 2024 ACM Turing Award Celebration Conference China, TURC 2024 - Changsha, China
Duration: 5 Jul 2024 - 7 Jul 2024

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 2024 ACM Turing Award Celebration Conference China, TURC 2024
Country/Territory: China
City: Changsha
Period: 5/07/24 - 7/07/24

Keywords

  • Data cleaning
  • error detection
  • error repair
  • large language models
