nvidia
EGM-4B-SFT
License: apache-2.0
Model Summary
EGM-Qwen3-VL-4B-SFT is the supervised fine-tuning (SFT) checkpoint from the first stage of the EGM (Efficient Visual Grounding Language Models) training pipeline. It is built on top of Qwen3-VL-4B-Thinking.
This is an intermediate checkpoint intended for further reinforcement learning training. For the final model with best performance, see nvidia/EGM-4B.
Training Details
SFT Stage
In the SFT stage, a proprietary VLM generates detailed chain-of-thought reasoning steps for visual grounding training data. The base Qwen3-VL-4B-Thinking model is then fine-tuned on this reasoning-augmented data to learn structured visual grounding with explicit reasoning.
This SFT checkpoint serves as the initialization for the subsequent RL stage (GRPO), which yields the final EGM-4B model.
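As a rough illustration of what the RL stage optimizes, GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization, in plain Python. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for illustration; the reward definition and the actual training code live in the EGM repository and are not reproduced here.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper: each reward is centered on the group mean and
    scaled by the group standard deviation, so responses that beat their
    siblings get a positive advantage and the rest get a negative one.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # assumption: population std; implementations vary
    # eps keeps the division finite when every response gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one grounding prompt, with made-up 0/1 rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When all responses in a group receive the same reward, every advantage is zero, so that group contributes no learning signal.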
How to Use for RL Training
bash
pip install -U huggingface_hub
huggingface-cli download nvidia/EGM-4B-SFT --local-dir ./models/EGM-4B-SFT
Then follow the installation and RL training instructions in the EGM repository.
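If you prefer to fetch the checkpoint from a Python script rather than the CLI, the same download can be sketched with `huggingface_hub.snapshot_download`. The helper names `default_local_dir` and `download_checkpoint` are hypothetical, and the `./models/<name>` layout simply mirrors the CLI command above.

```python
from pathlib import Path

REPO_ID = "nvidia/EGM-4B-SFT"  # repository named in this card


def default_local_dir(repo_id: str, root: str = "./models") -> Path:
    """Mirror the CLI layout: <root>/<model name without the org prefix>."""
    return Path(root) / repo_id.split("/")[-1]


def download_checkpoint(repo_id: str = REPO_ID, root: str = "./models") -> Path:
    """Download all files of the repo and return the local directory."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = default_local_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Usage (needs network access and `pip install -U huggingface_hub`):
#   checkpoint_dir = download_checkpoint()
```

The returned directory can then be passed to the RL training scripts as the initial checkpoint path, following the EGM repository instructions.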
Model Architecture
| Component | Details |
|---|---|
| Architecture | Qwen3VLForConditionalGeneration |
| Precision | bfloat16 |
| Text Hidden Size | 2560 |
| Text Layers | 36 |
| Attention Heads | 32 (8 KV heads) |
| Text Intermediate Size | 9728 |
| Vision Hidden Size | 1024 |
| Vision Layers | 24 |
| Patch Size | 16 x 16 |
| Max Position Embeddings | 262,144 |
| Vocabulary Size | 151,936 |
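A few quantities follow directly from the table, and the sketch below works them out. It computes the grouped-query attention ratio, a back-of-the-envelope KV-cache size per token, and the patch grid for an image. The per-head dimension is not listed in the table, so `head_dim = 128` is an assumption to check against the checkpoint's `config.json`. The patch-grid helper also ignores any resizing or patch merging that the image processor may apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextConfig:
    layers: int = 36           # Text Layers
    attention_heads: int = 32  # query heads
    kv_heads: int = 8          # KV heads
    head_dim: int = 128        # ASSUMPTION: not in the table, check config.json
    bytes_per_value: int = 2   # bfloat16 stores 2 bytes per value


def queries_per_kv_head(cfg: TextConfig) -> int:
    """Number of query heads that share each key/value head (GQA)."""
    return cfg.attention_heads // cfg.kv_heads


def kv_cache_bytes_per_token(cfg: TextConfig) -> int:
    """Keys plus values, across all text layers, for one token."""
    return 2 * cfg.layers * cfg.kv_heads * cfg.head_dim * cfg.bytes_per_value


def patch_grid(height: int, width: int, patch: int = 16) -> tuple[int, int]:
    """Rows and columns of 16 x 16 patches for an image of the given size."""
    return height // patch, width // patch


cfg = TextConfig()
print(queries_per_kv_head(cfg))       # 4 query heads per KV head
print(kv_cache_bytes_per_token(cfg))  # 147456 bytes (144 KiB) under the assumed head_dim
print(patch_grid(1024, 768))          # (64, 48)
```

Under these assumptions, a full 262,144-token context would need about 36 GiB of KV cache in bfloat16, which is worth keeping in mind when sizing rollouts for RL training.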
Related Models
| Model | Description |
|---|---|
| nvidia/EGM-4B | Final RL-trained model (best performance) |
| nvidia/EGM-8B-SFT | SFT checkpoint for the 8B variant |
| nvidia/EGM-8B | Final RL-trained 8B model |
Citation
bibtex
@article{zhan2026EGM,
  author = {Zhan, Guanqi and Li, Changye and Liu, Zhijian and Lu, Yao and Wu, Yi and Han, Song and Zhu, Ligeng},
  title = {EGM: Efficient Visual Grounding Language Models},
  booktitle = {arXiv},
  year = {2026}
}
Acknowledgment
This repository benefits from Qwen3-VL, InternVL, verl and verl-internvl.
Model provider: nvidia
Model tree: base Qwen/Qwen3-VL-4B-Thinking; fine-tuned: this model
Modalities: input Text, Image; output Text