TY - GEN
T1 - ZeroED
T2 - 41st IEEE International Conference on Data Engineering, ICDE 2025
AU - Ni, Wei
AU - Zhang, Kaihang
AU - Miao, Xiaoye
AU - Zhao, Xiangyu
AU - Wu, Yangyang
AU - Wang, Yaoshu
AU - Yin, Jianwei
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Error detection (ED) in tabular data is crucial yet challenging due to diverse error types and the need for contextual understanding. Traditional ED methods often rely heavily on manual criteria and labels, making them labor-intensive. Large language models (LLMs) can minimize human effort but struggle with errors requiring a comprehensive understanding of data context. In this paper, we propose ZeroED, a novel hybrid error detection framework, which combines LLM reasoning ability with the machine learning pipeline via zero-shot prompting. ZeroED operates in four steps, i.e., feature representation, error labeling, training data construction, and detector training. Initially, to enhance error distinction, ZeroED generates rich data representations using LLM-driven error reason-aware binary features, pre-trained embeddings, and statistical features. Then, ZeroED employs LLMs to holistically label errors through in-context learning, guided by a two-step LLM reasoning process for detailed ED guidelines. To reduce token costs, LLMs are applied only to representative data selected via clustering-based sampling. High-quality training data is constructed through in-cluster label propagation and LLM augmentation with verification. Finally, a classifier is trained to detect all errors. Extensive experiments on seven datasets demonstrate that ZeroED outperforms state-of-the-art methods by up to a 30% improvement in F1 score and up to 90% token cost reduction.
AB - Error detection (ED) in tabular data is crucial yet challenging due to diverse error types and the need for contextual understanding. Traditional ED methods often rely heavily on manual criteria and labels, making them labor-intensive. Large language models (LLMs) can minimize human effort but struggle with errors requiring a comprehensive understanding of data context. In this paper, we propose ZeroED, a novel hybrid error detection framework, which combines LLM reasoning ability with the machine learning pipeline via zero-shot prompting. ZeroED operates in four steps, i.e., feature representation, error labeling, training data construction, and detector training. Initially, to enhance error distinction, ZeroED generates rich data representations using LLM-driven error reason-aware binary features, pre-trained embeddings, and statistical features. Then, ZeroED employs LLMs to holistically label errors through in-context learning, guided by a two-step LLM reasoning process for detailed ED guidelines. To reduce token costs, LLMs are applied only to representative data selected via clustering-based sampling. High-quality training data is constructed through in-cluster label propagation and LLM augmentation with verification. Finally, a classifier is trained to detect all errors. Extensive experiments on seven datasets demonstrate that ZeroED outperforms state-of-the-art methods by up to a 30% improvement in F1 score and up to 90% token cost reduction.
KW - Data cleaning
KW - error detection
KW - large language model
UR - https://www.scopus.com/pages/publications/105015487948
U2 - 10.1109/ICDE65448.2025.00234
DO - 10.1109/ICDE65448.2025.00234
M3 - Conference contribution
AN - SCOPUS:105015487948
T3 - Proceedings - International Conference on Data Engineering
SP - 3126
EP - 3139
BT - Proceedings - 2025 IEEE 41st International Conference on Data Engineering, ICDE 2025
PB - IEEE Computer Society
Y2 - 19 May 2025 through 23 May 2025
ER -