Knowledge-enhanced medical image classification via descriptive priors from large language models

  • Yuhang Zhang
  • Yiming Xu
  • Peilin Chen
  • Shiqi Wang
  • Qi Song*
  • Lei Yu
  • Wei Cai

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Medical image classification aims to categorise clinically significant imaging patterns, thereby facilitating accurate and timely diagnosis. However, existing approaches predominantly rely on visual features extracted from raw pixel data, often overlooking fine-grained diagnostic cues grounded in medical expertise. To address this limitation, we propose a novel knowledge-enhanced model, called KEM, that leverages medical large vision-language models (Medical LVLMs) as domain experts to generate descriptive priors, which are used to guide and support clinical decision-making. Specifically, we prompt Medical LVLMs to generate rich, multi-dimensional clinical descriptions tailored to each input image, capturing nuanced semantics. These descriptive priors are then encoded and fused with visual features through a dual cross-attention module, which enables bidirectional interaction and alignment between modalities. This design allows the model to dynamically attend to both textual and visual cues, thereby enhancing its ability to recognise subtle disease patterns. Comprehensive experiments on four benchmark datasets demonstrate that the proposed method significantly outperforms state-of-the-art vision-only models and exhibits strong generalisation across varied clinical settings.
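The abstract describes a dual cross-attention module in which visual features and encoded textual priors attend to each other bidirectionally before fusion. The paper's exact formulation is not given here; the following is a minimal NumPy sketch under the assumption of plain scaled dot-product cross-attention with residual connections and mean-pooled fusion. All function names, shapes, and the pooling choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    # Scaled dot-product attention: each query vector attends over
    # the context vectors and returns a weighted sum of them.
    d = query.shape[1]
    scores = query @ context.T / np.sqrt(d)          # (Nq, Nc)
    return softmax(scores, axis=-1) @ context        # (Nq, d)

def dual_cross_attention(visual, textual):
    # visual:  (Nv, d) image patch features
    # textual: (Nt, d) encoded descriptive-prior embeddings
    # Bidirectional interaction: image attends to text, text attends
    # to image, each with a residual connection (an assumption here).
    v_enh = visual + cross_attention(visual, textual)
    t_enh = textual + cross_attention(textual, visual)
    # Fuse both enhanced streams by mean pooling and concatenation
    # (one plausible fusion choice; the paper's may differ).
    return np.concatenate([v_enh.mean(axis=0), t_enh.mean(axis=0)])  # (2d,)

rng = np.random.default_rng(0)
visual = rng.normal(size=(49, 64))   # e.g. 7x7 patch grid, 64-dim features
textual = rng.normal(size=(8, 64))   # e.g. 8 description embeddings
fused = dual_cross_attention(visual, textual)
```

The fused vector would then feed a standard classification head; the symmetric structure is what lets textual cues reweight visual patches and vice versa.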

Original language: English
Article number: 61
Journal: Health Information Science and Systems
Volume: 13
Issue number: 1
DOIs
State: Published - Dec 2025
Externally published: Yes

Keywords

  • Descriptive Priors
  • Dual Cross-Attention Module
  • Knowledge-Enhanced Model
  • Medical Image Classification
  • Medical Large Language Models
