zhengmh
OmniVTG-7B
📖 Introduction
OmniVTG is a multimodal model designed to tackle Video Temporal Grounding (VTG). It aims to accurately localize specific video segments within untrimmed videos based on natural language queries.
Extending VTG to open-world applications has historically been challenging due to the limited scale and semantic diversity of existing datasets. To address this, we introduce the OmniVTG Dataset (featuring over 2,000 hours of rich, diverse video content) and a novel Self-Correction Chain-of-Thought (CoT) training paradigm. This combination unleashes the grounding capabilities of Multimodal Large Language Models (MLLMs).
This repository contains the official model weights for OmniVTG-7B, accepted at CVPR 2026.
To access the OmniVTG-Dataset, please visit https://huggingface.co/datasets/zhengmh/OmniVTG-Dataset.
✨ Highlights
- Open-World Readiness: Powered by a large-scale dataset featuring over 2,000 hours of video content with rich semantic diversity.
- Strong Zero-Shot Performance: Achieves robust zero-shot localization performance across four major VTG benchmarks (Charades-STA, ActivityNet Captions, QVHighlights, and TVGBench).
- Novel Training Paradigm: Trained via an advanced pipeline consisting of Supervised Fine-Tuning (SFT), Self-Correction CoT Tuning, and Reinforcement Learning (RL).
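To make the localization task concrete, here is a minimal sketch of temporal Intersection-over-Union (IoU), a measure commonly used to compare a predicted video segment against a ground-truth segment in VTG benchmarks. This is an illustration only and is not taken from the OmniVTG evaluation code; the names `temporal_iou` and `recall_at_iou`, the 0.5 threshold, and the example timestamps are all assumptions.

```python
from typing import Sequence, Tuple

# A segment is (start_seconds, end_seconds) within an untrimmed video.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Return the intersection-over-union of two time intervals, in [0, 1]."""
    (ps, pe), (gs, ge) = pred, gt
    if pe < ps or ge < gs:
        raise ValueError("segment end must not precede its start")
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(
    preds: Sequence[Segment], gts: Sequence[Segment], threshold: float = 0.5
) -> float:
    """Fraction of queries whose single prediction reaches the IoU threshold."""
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Hypothetical query: "the person opens the door"
# Predicted segment 12.0 s to 18.5 s, annotated segment 13.0 s to 19.0 s.
example_iou = temporal_iou((12.0, 18.5), (13.0, 19.0))
print(round(example_iou, 3))  # 0.786
```

A prediction that overlaps the annotation by 5.5 s out of a combined span of 7.0 s scores about 0.786, so it would count as a hit at a 0.5 or 0.7 threshold.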
🚀 Quick Start
To use OmniVTG-7B, please refer to our official codebase for full installation and inference instructions.
- Clone the repository and install dependencies:
```bash
git clone https://github.com/oceanflowlab/OmniVTG
cd OmniVTG
```
- Download this checkpoint and launch the interactive demo:
```bash
python demo.py --model /path/to/OmniVTG-7B
```
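The download-and-launch step can also be scripted. The sketch below assumes the checkpoint is hosted on the Hugging Face Hub under the repository id `zhengmh/OmniVTG-7B` (inferred from the model provider and model name, not confirmed by this README), that the `huggingface_hub` package is installed, and that it runs from the root of the cloned OmniVTG repository. The helper names and the local directory are hypothetical; only `demo.py` and its `--model` flag come from the command above.

```python
import subprocess
import sys
from typing import List

# Assumption: the Hub repository id is not stated in this README.
ASSUMED_REPO_ID = "zhengmh/OmniVTG-7B"


def download_checkpoint(
    repo_id: str = ASSUMED_REPO_ID,
    local_dir: str = "checkpoints/OmniVTG-7B",  # hypothetical target directory
) -> str:
    """Download all files of the model repository and return the local path."""
    # Imported lazily so the rest of this module works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


def demo_command(model_dir: str) -> List[str]:
    """Build the demo invocation shown above, using the current interpreter."""
    return [sys.executable, "demo.py", "--model", model_dir]


def main() -> None:
    """Fetch the checkpoint, then launch the interactive demo."""
    model_dir = download_checkpoint()
    subprocess.run(demo_command(model_dir), check=True)
```

Calling `main()` from the repository root downloads the weights and then starts the demo. If the checkpoint is already on disk, `demo_command("/path/to/OmniVTG-7B")` gives the same command as the shell line above.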
For complete details on evaluation, the evaluation datasets, and the full training pipeline (SFT, CoT, RL), please visit our GitHub repository at https://github.com/oceanflowlab/OmniVTG.
📝 Citation
If you find our work or model helpful for your research, please consider citing our paper:
```bibtex
@InProceedings{Zheng_2026_CVPR,
  author    = {Zheng, Minghang and Yin, Zihao and Yang, Yi and Peng, Yuxin and Liu, Yang},
  title     = {OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026},
  pages     = {24620-24629}
}
```
Modalities
- Input: Text, Image
- Output: Text