
NEZHA: A Zero-sacrifice and Hyperspeed Decoding Architecture for Generative Recommendations

  • Yejing Wang
  • Shengyu Zhou
  • Jinyu Lu
  • Ziwei Liu
  • Langming Liu
  • Maolin Wang
  • Wenlin Zhang
  • Feng Li
  • Wenbo Su
  • Pengjie Wang
  • Jian Xu
  • Xiangyu Zhao*
  • *Corresponding author for this work
  • City University of Hong Kong
  • Alibaba Group Holding Ltd.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Generative Recommendation (GR), powered by Large Language Models (LLMs), represents a promising new paradigm for industrial recommender systems. However, the practical application of GR systems is severely hindered by high inference latency, making them infeasible for high-throughput, real-time services and limiting their overall business impact. While Speculative Decoding (SD) has been proposed to accelerate the autoregressive generation process, existing implementations introduce new bottlenecks: they typically rely on separate draft models and model-based verifiers, which require additional training and add latency overhead. In this paper, we address these challenges with NEZHA, a novel architecture that achieves hyperspeed decoding for GR systems without sacrificing recommendation quality. Specifically, NEZHA integrates a nimble autoregressive draft head directly into the primary model, enabling efficient self-drafting. This design, combined with a specialized input prompt structure, preserves the integrity of sequence-to-sequence generation. Furthermore, to tackle the critical problem of hallucination, a major source of performance degradation, we introduce an efficient, model-free verifier based on a hash set. We demonstrate the effectiveness of NEZHA through extensive experiments on public datasets. The system has been deployed on Taobao since October 2025, achieving a 1.2% business improvement, translating to billion-level advertising revenue and serving hundreds of millions of daily active users. The code is available at https://github.com/Applied-Machine-Learning-Lab/WWW2026-NEZHA.
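
The model-free verifier mentioned in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: it assumes each item is identified by a fixed-length tuple of integer tokens, and the names `build_valid_set` and `verify_drafts` are hypothetical. The only idea it illustrates is checking drafted candidates against a hash set of valid item IDs instead of running a verifier model.

```python
from typing import Iterable, List, Sequence, Set, Tuple

# Assumption for illustration: an item ID is a fixed-length tuple of tokens.
ItemId = Tuple[int, ...]


def build_valid_set(catalog: Iterable[Sequence[int]]) -> Set[ItemId]:
    """Store every valid item ID in a hash set (average O(1) membership test)."""
    return {tuple(item) for item in catalog}


def verify_drafts(drafts: Iterable[Sequence[int]], valid: Set[ItemId]) -> List[ItemId]:
    """Keep only drafted candidates that correspond to a real catalog item.

    Candidates absent from the set are treated as hallucinated and dropped.
    The original draft order is preserved.
    """
    return [tuple(d) for d in drafts if tuple(d) in valid]


# Toy catalog and toy drafts; all values are made up for the example.
catalog = [(3, 17, 42), (3, 17, 8), (5, 1, 9)]
valid_ids = build_valid_set(catalog)

drafts = [(3, 17, 42), (3, 99, 1), (5, 1, 9)]
accepted = verify_drafts(drafts, valid_ids)
print(accepted)  # [(3, 17, 42), (5, 1, 9)]
```

In this toy run the candidate `(3, 99, 1)` is rejected because no catalog item has that ID. The check needs no additional model or training, which is the property the abstract attributes to a hash-set verifier.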

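Original language: English
Title of host publication: WWW 2026 - Proceedings of the ACM Web Conference 2026
Publisher: Association for Computing Machinery, Inc
Pages: 8073-8082
Number of pages: 10
ISBN (electronic): 9798400723070
DOI
Publication status: Published - 12 Apr 2026
Externally published
Event: 35th ACM Web Conference, WWW 2026 - Dubai, United Arab Emirates
Duration: 29 Jun 2026 → 3 Jul 2026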

Publication series

Name: WWW 2026 - Proceedings of the ACM Web Conference 2026

Conference

Conference: 35th ACM Web Conference, WWW 2026
Country/Territory: United Arab Emirates
City: Dubai
Period: 29/06/26 → 3/07/26
