TY - GEN
T1 - NEZHA
T2 - 35th ACM Web Conference, WWW 2026
AU - Wang, Yejing
AU - Zhou, Shengyu
AU - Lu, Jinyu
AU - Liu, Ziwei
AU - Liu, Langming
AU - Wang, Maolin
AU - Zhang, Wenlin
AU - Li, Feng
AU - Su, Wenbo
AU - Wang, Pengjie
AU - Xu, Jian
AU - Zhao, Xiangyu
N1 - Publisher Copyright: © 2026 Owner/Author.
PY - 2026/4/12
Y1 - 2026/4/12
N2 - Generative Recommendation (GR), powered by Large Language Models (LLMs), represents a promising new paradigm for industrial recommender systems. However, its practical application is severely hindered by high inference latency, making it infeasible for high-throughput, real-time services and limiting its overall business impact. While Speculative Decoding (SD) has been proposed to accelerate the autoregressive generation process, existing implementations introduce new bottlenecks: they typically rely on separate draft models and model-based verifiers, which require additional training and increase latency overhead. In this paper, we address these challenges with NEZHA, a novel architecture that achieves hyperspeed decoding for GR systems without sacrificing recommendation quality. Specifically, NEZHA integrates a nimble autoregressive draft head directly into the primary model, enabling efficient self-drafting. This design, combined with a specialized input prompt structure, preserves the integrity of sequence-to-sequence generation. Furthermore, to tackle the critical problem of hallucination, a major source of performance degradation, we introduce an efficient, model-free verifier based on a hash set. We demonstrate the effectiveness of NEZHA through extensive experiments on public datasets and have successfully deployed the system on Taobao since October 2025, achieving a 1.2% business improvement, which translates to billion-level advertising revenue, while serving hundreds of millions of daily active users. The code is available at https://github.com/Applied-Machine-Learning-Lab/WWW2026-NEZHA.
AB - Generative Recommendation (GR), powered by Large Language Models (LLMs), represents a promising new paradigm for industrial recommender systems. However, its practical application is severely hindered by high inference latency, making it infeasible for high-throughput, real-time services and limiting its overall business impact. While Speculative Decoding (SD) has been proposed to accelerate the autoregressive generation process, existing implementations introduce new bottlenecks: they typically rely on separate draft models and model-based verifiers, which require additional training and increase latency overhead. In this paper, we address these challenges with NEZHA, a novel architecture that achieves hyperspeed decoding for GR systems without sacrificing recommendation quality. Specifically, NEZHA integrates a nimble autoregressive draft head directly into the primary model, enabling efficient self-drafting. This design, combined with a specialized input prompt structure, preserves the integrity of sequence-to-sequence generation. Furthermore, to tackle the critical problem of hallucination, a major source of performance degradation, we introduce an efficient, model-free verifier based on a hash set. We demonstrate the effectiveness of NEZHA through extensive experiments on public datasets and have successfully deployed the system on Taobao since October 2025, achieving a 1.2% business improvement, which translates to billion-level advertising revenue, while serving hundreds of millions of daily active users. The code is available at https://github.com/Applied-Machine-Learning-Lab/WWW2026-NEZHA.
KW - generative recommendations
KW - speculative decoding
UR - https://www.scopus.com/pages/publications/105038572565
U2 - 10.1145/3774904.3792797
DO - 10.1145/3774904.3792797
M3 - Conference contribution
AN - SCOPUS:105038572565
T3 - WWW 2026 - Proceedings of the ACM Web Conference 2026
SP - 8073
EP - 8082
BT - WWW 2026 - Proceedings of the ACM Web Conference 2026
PB - Association for Computing Machinery, Inc
Y2 - 29 June 2026 through 3 July 2026
ER -