zhengmh

OmniVTG-7B


📖 Introduction

OmniVTG is a multimodal model designed to tackle Video Temporal Grounding (VTG). It aims to accurately localize specific video segments within untrimmed videos based on natural language queries.

Extending VTG to open-world applications has historically been challenging due to the limited scale and semantic diversity of existing datasets. To address this, we introduce the OmniVTG Dataset (featuring over 2,000 hours of rich, diverse video content) and a novel Self-Correction Chain-of-Thought (CoT) training paradigm. This combination unleashes the grounding capabilities of Multimodal Large Language Models (MLLMs).

This repository contains the official model weights for OmniVTG-7B. The accompanying paper was accepted at CVPR 2026.

To access the OmniVTG-Dataset, please visit https://huggingface.co/datasets/zhengmh/OmniVTG-Dataset.

✨ Highlights

  • Open-World Readiness: Powered by a large-scale dataset featuring over 2,000 hours of video content with rich semantic diversity.
  • Strong Zero-Shot Performance: Achieves robust zero-shot localization performance across four major VTG benchmarks (Charades-STA, ActivityNet Captions, QVHighlights, and TVGBench).
  • Novel Training Paradigm: Trained via an advanced pipeline consisting of Supervised Fine-Tuning (SFT), Self-Correction CoT Tuning, and Reinforcement Learning (RL).
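Temporal grounding predictions are conventionally compared with annotated segments using temporal intersection-over-union (tIoU), often summarized as recall at a fixed tIoU threshold. The sketch below illustrates that general metric only. It is not taken from the OmniVTG codebase, the names `temporal_iou` and `recall_at_iou` are hypothetical, and the exact evaluation protocol behind the reported results is defined in the official repository.

```python
from typing import Sequence, Tuple

# A segment is (start_sec, end_sec) within an untrimmed video.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Return the intersection-over-union of two time intervals, in [0, 1]."""
    (ps, pe), (gs, ge) = pred, gt
    if pe < ps or ge < gs:
        raise ValueError("segment end must not precede its start")
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(
    preds: Sequence[Segment], gts: Sequence[Segment], threshold: float = 0.5
) -> float:
    """Fraction of queries whose top prediction reaches the tIoU threshold."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and the same length")
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

For example, a prediction of (10 s, 20 s) against an annotation of (15 s, 25 s) overlaps for 5 s out of a 15 s union, giving a tIoU of 1/3, which would count as a miss at a 0.5 threshold.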

🚀 Quick Start

To use OmniVTG-7B, please refer to our official codebase (https://github.com/oceanflowlab/OmniVTG) for full installation and inference instructions.

  1. Clone the repository and install dependencies:

bash

git clone https://github.com/oceanflowlab/OmniVTG
cd OmniVTG

  2. Download this checkpoint and launch the interactive demo:

bash

python demo.py --model /path/to/OmniVTG-7B
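The download step can also be scripted. The following is a minimal sketch, assuming the checkpoint is hosted on the Hugging Face Hub under the id `zhengmh/OmniVTG-7B` (inferred from the provider and model name, not confirmed here) and that the `huggingface_hub` package is installed. The helper names `download_checkpoint` and `demo_command` are hypothetical; only `demo.py --model` comes from the instructions above.

```python
import sys
from typing import List

# Assumption: Hub id inferred from the provider and model name on this page.
ASSUMED_REPO_ID = "zhengmh/OmniVTG-7B"


def download_checkpoint(local_dir: str, repo_id: str = ASSUMED_REPO_ID) -> str:
    """Download all files of the model repo and return the local folder path."""
    # Imported lazily so the module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


def demo_command(model_dir: str) -> List[str]:
    """Build the demo invocation from step 2 for the given checkpoint folder."""
    return [sys.executable, "demo.py", "--model", model_dir]
```

Calling `download_checkpoint("checkpoints/OmniVTG-7B")` and passing the returned path to `demo_command` yields an argument list that can be handed to `subprocess.run` from inside the cloned repository.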

For complete details on evaluation (including the evaluation datasets) and the full training pipeline (SFT, Self-Correction CoT, RL), please visit our GitHub repository: https://github.com/oceanflowlab/OmniVTG.

📝 Citation

If you find our work or model helpful for your research, please consider citing our paper:

bibtex

@InProceedings{Zheng_2026_CVPR,
    author    = {Zheng, Minghang and Yin, Zihao and Yang, Yi and Peng, Yuxin and Liu, Yang},
    title     = {OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {24620--24629}
}

Model provider

zhengmh

Modalities

Input: Text, Image

Output: Text
