PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

Weifu Fu1,*,†, Jinyang Li2,*, Bin-Bin Gao1, Jialin Li3
Yuhuan Lin1, Hanqiu Deng1, Wenbing Tao2, Yong Liu1, Chengjie Wang1
1YouTu Lab, Tencent    2Huazhong University of Science and Technology    3Kling Team, Kuaishou Technology
CVPR 2026
*Equal contribution    †Corresponding author

Abstract

Open-Set Object Detection (OSOD) enables recognition of novel categories beyond a fixed class set, but it struggles to align text representations with complex visual concepts and suffers from the scarcity of image-text paired samples for rare categories. This results in suboptimal performance in specialized domains or on complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimization, extending the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal object detector that supports both text and visual prompts. Our Alignment-Friendly Visual Prompt Generation (AFVPG) module builds upon an advanced text-prompted detector, addressing the limitations of text-representation guidance while shortening the development cycle. We further introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level. These strategies model multiple prompt routes simultaneously, aligning training with diverse real-world usage scenarios. Comprehensive experiments demonstrate that PET-DINO achieves competitive zero-shot object detection across various prompt-based detection protocols. These strengths are attributable to our inheritance-based design philosophy and prompt-enriched training strategies, which play a critical role in building an effective generic object detector.
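The two training strategies can be illustrated with a minimal sketch. This is not the paper's implementation: `run_route` is a hypothetical stand-in for a full forward-plus-loss pass, and the memory bank's capacity and uniform sampling policy are assumptions made for illustration. The sketch only shows the control flow, i.e., both prompt routes (text and visual) are exercised on the same batch (IBP), and visual prompts accumulated in earlier iterations are re-sampled later (DMD).

```python
import random
from collections import defaultdict, deque


class DynamicPromptMemory:
    """Toy memory bank for Dynamic Memory-Driven (DMD) prompting.

    Stores recent visual-prompt embeddings per category and serves them
    as extra prompts in later iterations. Capacity and random sampling
    are illustrative assumptions, not the paper's exact design.
    """

    def __init__(self, capacity_per_class=10, seed=0):
        self.bank = defaultdict(lambda: deque(maxlen=capacity_per_class))
        self.rng = random.Random(seed)

    def update(self, class_id, embedding):
        self.bank[class_id].append(embedding)

    def sample(self, class_id):
        items = self.bank.get(class_id)
        return self.rng.choice(list(items)) if items else None


def run_route(batch, prompt):
    # Hypothetical stand-in for a real forward pass + loss computation.
    return 1.0


def ibp_training_step(batch, memory):
    """Intra-Batch Parallel (IBP) prompting: run the text route and the
    visual route on the same batch, then combine the route losses."""
    loss_text = run_route(batch, prompt="text")
    # Prefer a memorized visual prompt; fall back to this batch's boxes.
    visual_prompt = memory.sample(batch["class_id"]) or batch["boxes"]
    loss_visual = run_route(batch, prompt=visual_prompt)
    # Feed the current batch's visual cue back into the memory bank.
    memory.update(batch["class_id"], batch["boxes"])
    return loss_text + loss_visual
```

With the dummy per-route loss of 1.0, one step returns 2.0 and leaves the batch's boxes in the memory bank for later sampling.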

Method

Method Overview
Overall architecture of PET-DINO. Input coordinates pass through the Visual Prompt Generation process, interacting with enhanced image features to produce visual prompts. The text encoder produces text embeddings, which interact with image features in the Feature Enhancer module to produce text prompts. Both types of prompts guide the Query Selection Module, which provides location priors for the initial queries. These queries are then refined through the decoder layers to predict object locations and classes.
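The query-selection step described above can be sketched as a similarity-based top-k selection. All shapes, the dot-product scoring, and the max-over-prompts reduction are assumptions made for illustration; the actual module operates on multi-scale transformer features.

```python
import numpy as np

rng = np.random.default_rng(0)


def select_queries(image_feats, prompts, num_queries=4):
    """Toy Query Selection: score each image-feature location against
    all prompt embeddings and keep the top-k locations as initial
    queries (a stand-in for the location priors in the figure)."""
    scores = image_feats @ prompts.T                 # (locations, prompts)
    best_per_location = scores.max(axis=1)           # best prompt match per location
    top = np.argsort(best_per_location)[::-1][:num_queries]
    return image_feats[top], top


# Illustrative stand-ins for the two prompt routes feeding the module.
image_feats = rng.standard_normal((100, 16))   # flattened image features
text_prompts = rng.standard_normal((2, 16))    # from the text encoder
visual_prompts = rng.standard_normal((3, 16))  # from AFVPG, e.g. box coordinates
prompts = np.concatenate([text_prompts, visual_prompts])

queries, idx = select_queries(image_feats, prompts)
```

In this sketch both prompt types are simply concatenated before scoring, so either route (or both) can steer which locations become decoder queries.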
AFVPG Feature Analysis
Feature correlation analysis between visual prompts and instance-level image features showing the impact of AFVPG.
IBP DMD t-SNE
t-SNE visualization of visual prompt features showing the impact of IBP and DMD.

Visualization

Cross-Image Visualization
Results with a single visual prompt and with multiple visual prompts.

BibTeX

@misc{fu2026petdinounifyingvisualcues,
      title={PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training}, 
      author={Weifu Fu and Jinyang Li and Bin-Bin Gao and Jialin Li and Yuhuan Lin and Hanqiu Deng and Wenbing Tao and Yong Liu and Chengjie Wang},
      year={2026},
      eprint={2604.00503},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.00503}, 
}