
IterClean: An Iterative Data Cleaning Framework with Large Language Models

  • Wei Ni
  • Kaihang Zhang
  • Xiaoye Miao*
  • Xiangyu Zhao
  • Yangyang Wu
  • Jianwei Yin
  • *Corresponding author for this work
  • Zhejiang University
  • City University of Hong Kong

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In the era of generative artificial intelligence, the accuracy of data is paramount. Erroneous data often leads to faulty outcomes and economic losses. Previous cleaning methods employ a sequential detect-repair paradigm, leaving over half of the errors unresolved in real scenarios. We introduce IterClean, an iterative data cleaning framework leveraging large language models (LLMs). The framework employs a two-step process: data labeling and iterative data cleaning. With only a few labeled tuples, IterClean runs an iterative cleaning process involving an error detector, an error verifier, and an error repairer to significantly enhance cleaning performance. Extensive experiments across four datasets demonstrate that IterClean achieves an F1 score up to three times higher than that of the best state-of-the-art approaches while requiring only 5 labeled tuples.
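
The abstract names three components (an error detector, an error verifier, and an error repairer) that run in an iterative loop, but this record gives no implementation details. The following is only a minimal sketch of what such a detect-verify-repair loop could look like, not the authors' method. Every name here (`iterative_clean`, `toy_detect`, `toy_verify`, `toy_repair`, `max_rounds`) is hypothetical, and the LLM-backed components are replaced by plain Python callables so the example runs on its own.

```python
from collections import Counter
from typing import Callable, Dict, List, Set, Tuple

Cell = Tuple[int, str]            # (row index, column name)
Table = List[Dict[str, str]]      # one dict per tuple

def iterative_clean(
    table: Table,
    detect: Callable[[Table], Set[Cell]],
    verify: Callable[[Table, Cell], bool],
    repair: Callable[[Table, Cell], str],
    max_rounds: int = 5,          # assumed stopping bound, not from the source
) -> Table:
    """Repeat detect -> verify -> repair until nothing changes."""
    cleaned = [dict(row) for row in table]   # work on a copy
    for _ in range(max_rounds):
        suspects = detect(cleaned)
        # Keep only the suspects that the verifier confirms as errors.
        confirmed = [c for c in sorted(suspects) if verify(cleaned, c)]
        if not confirmed:
            break
        changed = False
        for row, col in confirmed:
            new_value = repair(cleaned, (row, col))
            if new_value != cleaned[row][col]:
                cleaned[row][col] = new_value
                changed = True
        if not changed:
            break
    return cleaned

# Toy stand-ins for the LLM-backed components (assumptions for illustration).
MISSING = {"", "N/A"}

def toy_detect(table: Table) -> Set[Cell]:
    return {(i, col) for i, row in enumerate(table)
            for col, value in row.items() if value in MISSING}

def toy_verify(table: Table, cell: Cell) -> bool:
    row, col = cell
    return table[row][col] in MISSING

def toy_repair(table: Table, cell: Cell) -> str:
    row, col = cell
    values = [r[col] for r in table if r[col] not in MISSING]
    # Fall back to the current value when no clean value is available.
    return Counter(values).most_common(1)[0][0] if values else table[row][col]
```

In this sketch, calling `iterative_clean(table, toy_detect, toy_verify, toy_repair)` on a small table fills placeholder cells with the most frequent value in the same column and leaves the input table unmodified. A real system would presumably swap the toy callables for prompts to an LLM that are conditioned on the few labeled tuples.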

Original language: English
Title of host publication: Proceedings of ACM Turing Award Celebration Conference - CHINA 2024, TURC 2024
Publisher: Association for Computing Machinery
Pages: 100-105
Number of pages: 6
ISBN (Electronic): 9798400710117
DOI
Publication status: Published - 5 Jul 2024
Externally published
Event: 2024 ACM Turing Award Celebration Conference China, TURC 2024 - Changsha, China
Duration: 5 Jul 2024 to 7 Jul 2024

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 2024 ACM Turing Award Celebration Conference China, TURC 2024
Country/Territory: China
City: Changsha
Period: 5/07/24 to 7/07/24
