Align Your Query: Representation Alignment for Multimodality Medical Object Detection

Ara Seo1*, Bryan Sangwoo Kim1*, Hyungjin Chung2, Jong Chul Ye1
1 KAIST AI  ·  2 EverEx
* : Equal Contribution
Teaser figure: input images and detection outputs for six prompts across imaging modalities ("Aortic enlargement in CXR", "Brain tumor in MRI", "Right heart ventricle in cardiac MRI", "Lymphocyte in Pathology (H&E stain)", "COVID-19 infection in lung CT", "Neoplastic polyp in colon endoscope"), with our predictions shown alongside ground truth.

Abstract

Medical object detection suffers when a single detector is trained on mixed medical modalities (e.g., CXR, CT, MRI) due to heterogeneous statistics and disjoint representation spaces. To address this challenge, we turn to representation alignment, an approach that has proven effective for bringing features from different sources into a shared space. Specifically, we target the representations of DETR-style object queries and propose a simple, detector-agnostic framework to align them with modality context. First, we define modality tokens: compact, text-derived embeddings that encode the imaging modality, are lightweight, and require no extra annotations. We integrate the modality tokens into the detection process via Multimodality Context Attention (MoCA), which mixes object-query representations via self-attention to propagate modality context within the query set. This preserves DETR-style architectures and adds negligible latency while injecting modality cues into object queries. We further introduce QueryREPA, a short pretraining stage that aligns query representations to their modality tokens using a task-specific contrastive objective with modality-balanced batches. Together, MoCA and QueryREPA produce modality-aware, class-faithful queries that transfer effectively to downstream training. Trained jointly across diverse modalities, the proposed approach consistently improves AP with minimal overhead and no architectural modifications, offering a practical path toward robust multimodality medical object detection.

Method

Main Method figure

(a) Contrastive Representation Alignment of Object Queries (QueryREPA): During a pretraining stage, a text-derived modality token is selected for each image in a modality-balanced batch (e.g., a CXR and an MRI). Object query representations are aligned with the modality token through a contrastive alignment loss to produce queries aware of modality context.
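For intuition, here is a minimal PyTorch sketch of what such a contrastive alignment objective could look like, assuming pooled object-query representations and one text-derived modality token per image. The function name `queryrepa_loss`, mean pooling over queries, the symmetric InfoNCE form, and the temperature value are illustrative assumptions, not the paper's exact formulation; the sketch also assumes each modality appears once per batch, so positives lie on the diagonal.

```python
import torch
import torch.nn.functional as F

def queryrepa_loss(queries, modality_tokens, temperature=0.07):
    """Contrastive alignment of pooled object queries to modality tokens.

    queries:          (B, N, D) decoder object-query representations
    modality_tokens:  (B, D)    text-derived modality embedding per image

    Each image's pooled query is pulled toward its own modality token and
    pushed away from the tokens of the other modalities in the batch
    (assumes a modality-balanced batch with one image per modality).
    """
    q = F.normalize(queries.mean(dim=1), dim=-1)         # (B, D) pooled, L2-normalized
    t = F.normalize(modality_tokens, dim=-1)              # (B, D)
    logits = q @ t.t() / temperature                       # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)    # positives on the diagonal
    # Symmetric InfoNCE: query-to-token and token-to-query directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```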

(b) Multimodality Context Attention (MoCA): For a given image, a modality token is selected and concatenated with the set of object queries to form an augmented query set. Information fusion occurs in the self-attention layer of the decoder, allowing each object query to attend to the modality token to explicitly attain modality-specific context.
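A minimal PyTorch sketch of this fusion step follows, assuming the modality token is appended to the query set before the decoder's self-attention and dropped afterwards so the rest of the DETR-style layer is unchanged. The class name `MoCASelfAttention`, the head count, and the exact placement within the decoder layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MoCASelfAttention(nn.Module):
    """Self-attention over object queries augmented with one modality token.

    The modality token is concatenated to the object queries, every query can
    attend to it during self-attention, and the token is removed afterwards
    so downstream cross-attention and prediction heads stay untouched.
    """
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, queries, modality_token):
        # queries: (B, N, D), modality_token: (B, D)
        aug = torch.cat([queries, modality_token.unsqueeze(1)], dim=1)  # (B, N+1, D)
        out, _ = self.attn(aug, aug, aug)
        return out[:, :-1, :]  # keep only the N object queries
```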


Comparison with other detectors

Comparison of text/modality integration in DETR-style detectors.
(a) DINO (baseline): image-only encoder–decoder with query selection.
(b) Grounding DINO: language-guided query selection and cross-attention in the decoder (object queries attend to a sequence of text tokens).
(c) Ours (DINO+MoCA): append compact modality tokens to the object queries and fuse by self-attention in the decoder; the tokens act as semantic anchors that refine queries in a modality-aware way with minimal overhead.

Experimental Results

Qualitative comparison results

Qualitative Comparison. Comparison results between various state-of-the-art detection methods and the proposed method are shown above. Our method effectively leverages modality context to significantly enhance anomaly localization (red) compared to the baseline results (blue). Ground-truth bounding boxes are shown in green. For cases where the bounding boxes are small, insets show an enlarged view of the highlighted yellow region.

Gallery