Developing Knowledge Driven Multimodal Framework for 3D MRI Analysis and Interpretation: Grounding Visual Language Models for Diagnostic Reasoning | Faculty Development Scheme (FDS), Research Grants Council (RGC) Competitive Research Funding Schemes for the Local Self-financing Degree Sector 2025/26 | Project Reference No.: UGC/FDS24/E20/25 | HK$ 654,945

Project Details

Description

3D Magnetic Resonance Imaging (MRI) plays a key role in disease diagnosis, treatment planning,
and monitoring. Advances in deep learning (DL) techniques have shown significant potential in
automating MRI-based diagnoses. However, the interpretability and generalization of DL models
in the MRI domain remain challenging. Convolutional neural network (CNN)-based DL models are
often black boxes that lack reasoning capabilities and are difficult to interpret. Clinicians and
radiologists need interpretability to trust and collaborate effectively with AI systems. Regarding
generalization, a major issue is domain shift, which occurs when DL models are deployed across
different clinical settings or patient populations. Models trained on one dataset often degrade
on another owing to variations in imaging protocols, pulse sequences, equipment, demographics,
and disease characteristics.
Vision Language Models (VLMs) offer a promising route to interpretation. However,
general-purpose VLMs struggle to reason over multi-sequence MRI data because they cannot
extract the crucial fine details that span 3D slices. We argue that effective analysis of such
data requires localized cues, known as grounding, that capture anatomical and clinical detail;
grounding can be modelled through robust knowledge-distillation segmentation methods coupled
with explicitly prepared radiomics. For instance, 3D knee osteoarthritis (OA) MRIs contain thin
tissues such as cartilage and the meniscus, which demand advanced insight. Radiologists rely
on cartilage continuity across slices, joint spacing, and synthesized tissue surface area to
diagnose OA severity. Current VLMs cannot process such radiomics-based features or render 3D
thickness maps, which undermines the reasoning behind their answers and limits their clinical
applicability.
This proposal presents a generalized multimodal framework to address domain variation and
interpretation in MRI understanding. We primarily focus on knee OA MRI analysis and
interpretation, while also demonstrating the applicability of our approach to liver MRI
analysis, showcasing its ability to handle varied anatomical structures. Our framework
comprises three key units: a Knowledge Distillation Unit (KDU), a Grounding Unit (GU), and a
Vision Language Understanding and Reasoning Unit (VLURU).
The KDU introduces a novel semi-supervised segmentation approach, Successive Eigen
Noise-assisted Mean Teacher Knowledge Distillation (SEN-MTKD). This method addresses domain
shift and label shortages while improving segmentation accuracy for thin tissues such as
cartilage and the meniscus. In addition, unsupervised approaches that derive attention maps
from MRI images are presented to facilitate region detection.
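To make the mean-teacher backbone of the KDU concrete, the following is a minimal Python/PyTorch sketch of one semi-supervised training step. The `student` and `teacher` networks, the optimizer, and the decay and noise values are illustrative assumptions, and simple Gaussian noise stands in for the successive eigen-noise perturbation that SEN-MTKD actually proposes.

```python
import torch
import torch.nn.functional as F

def mean_teacher_step(student, teacher, labeled, labels, unlabeled,
                      optimizer, ema_decay=0.99, noise_std=0.1):
    # Supervised loss on the small labeled set.
    sup_loss = F.cross_entropy(student(labeled), labels)

    # Consistency loss: the student sees a perturbed view of the
    # unlabeled scan, the teacher a clean one. (Gaussian noise here is
    # a placeholder for SEN-MTKD's eigen-noise perturbation.)
    noisy = unlabeled + noise_std * torch.randn_like(unlabeled)
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(unlabeled), dim=1)
    student_probs = F.softmax(student(noisy), dim=1)
    cons_loss = F.mse_loss(student_probs, teacher_probs)

    loss = sup_loss + cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The teacher tracks the student via an exponential moving average,
    # so it distills a smoothed, more stable set of weights.
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(ema_decay).add_(s, alpha=1.0 - ema_decay)
    return loss.item()
```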
The GU is designed to provide strong grounding for the VLM. We propose anatomical and
radiomics-aware alignment in VLMs through grounding, which involves precise segmentation labels
for key anatomical structures, such as the liver in abdominal imaging and knee joint tissues,
including the meniscus and cartilage. These segmentation outputs enable the extraction of
clinically relevant radiomics features, such as surface areas, tissue volumes, inter-tissue
gaps, and a detailed set of metrics that characterize tissue geometry and morphology.
Furthermore, 2D thickness maps derived from the 3D segmented tissues offer a richer spatial
representation and a visual cue to tissue structure.
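As a rough illustration of the GU features described above, the sketch below computes tissue volume, surface area, an inter-tissue gap, and a projected 2D thickness map from binary segmentation masks using standard numpy/scipy/scikit-image routines; the voxel spacing, projection axis, and function names are assumptions, not the project's actual pipeline.

```python
import numpy as np
from scipy import ndimage
from skimage import measure

def radiomics_features(mask, spacing=(0.3, 0.3, 0.7)):
    """mask: boolean 3D array; spacing: voxel size in mm (assumed)."""
    volume_mm3 = mask.sum() * np.prod(spacing)
    # Surface area of a triangulated isosurface (marching cubes).
    verts, faces, _, _ = measure.marching_cubes(
        mask.astype(np.float32), level=0.5, spacing=spacing)
    surface_mm2 = measure.mesh_surface_area(verts, faces)
    return {"volume_mm3": volume_mm3, "surface_mm2": surface_mm2}

def thickness_map_2d(mask, spacing=(0.3, 0.3, 0.7), axis=2):
    # Distance from each interior voxel to the tissue boundary;
    # projecting the maximum along one axis and doubling it gives a
    # coarse 2D thickness map (diameter ~ 2 * inscribed radius).
    dist = ndimage.distance_transform_edt(mask, sampling=spacing)
    return 2.0 * dist.max(axis=axis)

def inter_tissue_gap(mask_a, mask_b, spacing=(0.3, 0.3, 0.7)):
    # Minimum distance from tissue B to tissue A, e.g. joint spacing
    # between femoral and tibial cartilage.
    dist_to_a = ndimage.distance_transform_edt(~mask_a, sampling=spacing)
    return dist_to_a[mask_b].min()
```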
The VLURU comprises a two-step VLM pipeline combining structured knowledge graphs with
domain-specific logical reasoning. A causal reasoning module, initially trained on large-scale
medical datasets, generates knowledge graphs from visual observations and integrates them with
logical rules to produce accurate, explainable answers to medical queries. The VLURU uses
GU-derived cues to fine-tune the VLM, enabling it to generate clinically relevant and
contextually accurate responses to queries about MRI images. This multimodal approach bridges
the gap between pixel-level imaging data and high-level language-driven reasoning, creating a
coherent framework for automated medical image interpretation.
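To give a flavour of how GU-derived cues might feed the VLURU, the sketch below serializes quantitative findings into knowledge-graph triples and assembles a grounded prompt; every field name, the triple schema, and the prompt template are hypothetical and stand in for the actual fine-tuning data format.

```python
def cues_to_triples(features):
    # features, e.g. {"femoral_cartilage": {"volume_mm3": 1820.4, ...}}
    triples = []
    for tissue, metrics in features.items():
        for metric, value in metrics.items():
            triples.append((tissue, f"has_{metric}", round(value, 1)))
    return triples

def build_grounded_prompt(question, triples):
    # Expose quantitative grounding to the VLM as explicit facts.
    facts = "\n".join(f"- {s} {p} {o}" for s, p, o in triples)
    return ("You are assisting with knee MRI interpretation.\n"
            f"Quantitative findings:\n{facts}\n"
            f"Question: {question}\nAnswer with explicit reasoning:")

cues = {"femoral_cartilage": {"volume_mm3": 1820.4,
                              "min_thickness_mm": 1.2}}
print(build_grounded_prompt("What is the likely OA severity grade?",
                            cues_to_triples(cues)))
```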
Status: Not started
Effective start/end date: 1/01/26 – 31/12/28

Keywords

  • MRI Processing
  • Diagnostic Reasoning
  • Deep Learning
  • Knowledge Distillation
  • Vision Language Models
