zhengmh

OmniVTG-7B

📖 Introduction

OmniVTG is a multimodal model designed to tackle Video Temporal Grounding (VTG). It aims to accurately localize specific video segments within untrimmed videos based on natural language queries.
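To make the task concrete, the sketch below scores a predicted segment against a ground-truth segment using temporal intersection-over-union (IoU), a measure commonly used to evaluate VTG. It is an illustration only: the `(start, end)`-in-seconds representation and the function names are assumptions, not part of the OmniVTG codebase, and the exact metrics reported for OmniVTG should be taken from the paper.

```python
from typing import Sequence, Tuple

# Hypothetical representation: a segment is (start_sec, end_sec).
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    (ps, pe), (gs, ge) = pred, gt
    if pe < ps or ge < gs:
        raise ValueError("segment end must not precede its start")
    # Overlap is zero when the intervals are disjoint or merely touch.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(
    preds: Sequence[Segment], gts: Sequence[Segment], threshold: float = 0.5
) -> float:
    """Fraction of queries whose single prediction reaches the IoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

For instance, a prediction of 10 s to 20 s against a ground truth of 15 s to 25 s overlaps for 5 s out of a 15 s union, giving an IoU of 1/3.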

Extending VTG to open-world applications has historically been challenging due to the limited scale and semantic diversity of existing datasets. To address this, we introduce the OmniVTG Dataset (featuring over 2,000 hours of rich, diverse video content) and a novel Self-Correction Chain-of-Thought (CoT) training paradigm. This combination unleashes the grounding capabilities of Multimodal Large Language Models (MLLMs).

This repository contains the official model weights for OmniVTG-7B. The accompanying paper was accepted at CVPR 2026.

To access the OmniVTG-Dataset, please visit https://huggingface.co/datasets/zhengmh/OmniVTG-Dataset.

✨ Highlights

Open-World Readiness: Powered by a large-scale dataset featuring over 2,000 hours of video content with rich semantic diversity.
Strong Zero-Shot Performance: Achieves robust zero-shot localization performance across four major VTG benchmarks (Charades-STA, ActivityNet Captions, QVHighlights, and TVGBench).
Novel Training Paradigm: Trained via a three-stage pipeline of Supervised Fine-Tuning (SFT), Self-Correction CoT Tuning, and Reinforcement Learning (RL).

🚀 Quick Start

To use OmniVTG-7B, please refer to our official codebase for full installation and inference instructions.

Clone the repository, then install the dependencies by following the instructions in the codebase:

```bash
git clone https://github.com/oceanflowlab/OmniVTG
cd OmniVTG
```

Download this checkpoint and launch the interactive demo:

```bash
python demo.py --model /path/to/OmniVTG-7B
```
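One possible way to fetch the checkpoint is the `snapshot_download` function from the `huggingface_hub` package, sketched below. The repository id `zhengmh/OmniVTG-7B` is an assumption inferred from the dataset's naming (`zhengmh/OmniVTG-Dataset`), and the helper names are hypothetical; check the model page for the actual id before relying on it.

```python
from typing import List

# Assumption: the weights are hosted under this Hub repo id. It is inferred
# from the dataset id zhengmh/OmniVTG-Dataset and is not confirmed here.
ASSUMED_REPO_ID = "zhengmh/OmniVTG-7B"


def download_checkpoint(local_dir: str, repo_id: str = ASSUMED_REPO_ID) -> str:
    """Download every file of a Hub model repo into local_dir; return the path."""
    # Imported lazily so the helper below works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


def demo_command(model_dir: str) -> List[str]:
    """Build the documented `python demo.py --model <dir>` invocation."""
    return ["python", "demo.py", "--model", model_dir]


# Example usage, run from inside the cloned OmniVTG directory (needs network):
#   import subprocess
#   model_dir = download_checkpoint("./OmniVTG-7B")
#   subprocess.run(demo_command(model_dir), check=True)
```

Keeping the download and the launch in separate helpers lets the demo point at a checkpoint directory that was obtained some other way.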

For complete details on evaluation, the evaluation datasets, and the full training pipeline (SFT, CoT, RL), please visit our GitHub repository: https://github.com/oceanflowlab/OmniVTG.

📝 Citation

If you find our work or model helpful for your research, please consider citing our paper:

```bibtex
@InProceedings{Zheng_2026_CVPR,
    author    = {Zheng, Minghang and Yin, Zihao and Yang, Yi and Peng, Yuxin and Liu, Yang},
    title     = {OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {24620-24629}
}
```
