EGM-4B-SFT

License: apache-2.0

Model Summary

EGM-Qwen3-VL-4B-SFT is the supervised fine-tuning (SFT) checkpoint from the first stage of the EGM (Efficient Visual Grounding Language Models) training pipeline. It is built on top of Qwen3-VL-4B-Thinking.

This is an intermediate checkpoint intended for further reinforcement learning (RL) training. For the final model with the best performance, see nvidia/EGM-4B.

Training Details

SFT Stage

In the SFT stage, a proprietary VLM generates detailed chain-of-thought reasoning steps for visual grounding training data. The base Qwen3-VL-4B-Thinking model is then fine-tuned on this reasoning-augmented data to learn structured visual grounding with explicit reasoning.

This SFT checkpoint serves as the initialization for the subsequent RL stage (GRPO), which yields the final EGM-4B model.
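GRPO (Group Relative Policy Optimization) scores each sampled response relative to the other responses drawn for the same prompt. The sketch below illustrates only that general idea; it is not the EGM or verl implementation, the function name `group_relative_advantages` is hypothetical, and the reward values in the usage note are made up for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled responses within the group.

    Each advantage is (reward - group mean) / (group std + eps). The population
    standard deviation is an arbitrary choice for this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, calling `group_relative_advantages([0.2, 0.8, 0.5, 0.5])` gives a negative advantage to the first response, a positive one to the second, and values near zero to the two responses that match the group mean.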

How to Use for RL Training

```bash
pip install -U huggingface_hub
huggingface-cli download nvidia/EGM-4B-SFT --local-dir ./models/EGM-4B-SFT
```

Then follow the installation and RL training instructions in the EGM repository.
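If a Python script is more convenient than the CLI, the same download can likely be done with `snapshot_download` from `huggingface_hub`. This is a minimal sketch, assuming the same repository ID and target directory as the command above; the helper name `download_checkpoint` is hypothetical.

```python
from pathlib import Path

REPO_ID = "nvidia/EGM-4B-SFT"            # repository named in the CLI command
LOCAL_DIR = Path("./models/EGM-4B-SFT")  # same target directory as the CLI command


def download_checkpoint(repo_id: str = REPO_ID, local_dir: Path = LOCAL_DIR) -> Path:
    """Download all files of the repository into local_dir and return that path."""
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dir.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir
```

Calling `download_checkpoint()` requires network access and fetches the full checkpoint, so expect it to take a while on the first run.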

Model Architecture

| Component | Details |
| --- | --- |
| Architecture | Qwen3VLForConditionalGeneration |
| Precision | bfloat16 |
| Text Hidden Size | 2560 |
| Text Layers | 36 |
| Attention Heads | 32 (8 KV heads) |
| Text Intermediate Size | 9728 |
| Vision Hidden Size | 1024 |
| Vision Layers | 24 |
| Patch Size | 16 x 16 |
| Max Position Embeddings | 262,144 |
| Vocabulary Size | 151,936 |
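A few quantities follow directly from the figures in the table above. The sketch below works them out: how many query heads share each KV head, the size of the token embedding table in bfloat16, and the raw patch grid for an image. The function names and the 448 x 640 example image are hypothetical, and the patch grid is the plain count of 16 x 16 patches; any later merging of patches into fewer tokens is not described in the table and is not modeled here.

```python
# Values copied from the architecture table.
CONFIG = {
    "text_hidden_size": 2560,
    "attention_heads": 32,
    "kv_heads": 8,
    "patch_size": 16,
    "vocab_size": 151_936,
}
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def queries_per_kv_head(cfg: dict) -> int:
    """Number of query heads that share one key/value head."""
    return cfg["attention_heads"] // cfg["kv_heads"]


def embedding_table_bytes(cfg: dict) -> int:
    """Size of a vocab_size x hidden_size embedding table stored in bfloat16."""
    return cfg["vocab_size"] * cfg["text_hidden_size"] * BYTES_PER_BF16


def patch_grid(height: int, width: int, patch: int) -> tuple:
    """Rows and columns of non-overlapping patches that fit in the image."""
    return height // patch, width // patch


print(queries_per_kv_head(CONFIG))                # 4
print(embedding_table_bytes(CONFIG))              # 777912320 bytes, about 0.72 GiB
print(patch_grid(448, 640, CONFIG["patch_size"])) # (28, 40), i.e. 1120 patches
```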
Related Models

| Model | Description |
| --- | --- |
| nvidia/EGM-4B | Final RL-trained model (best performance) |
| nvidia/EGM-8B-SFT | SFT checkpoint for the 8B variant |
| nvidia/EGM-8B | Final RL-trained 8B model |

Citation

```bibtex
@article{zhan2026EGM,
  author  = {Zhan, Guanqi and Li, Changye and Liu, Zhijian and Lu, Yao and Wu, Yi and Han, Song and Zhu, Ligeng},
  title   = {EGM: Efficient Visual Grounding Language Models},
  journal = {arXiv},
  year    = {2026}
}
```

Acknowledgment

This repository benefits from Qwen3-VL, InternVL, verl and verl-internvl.

Model provider: nvidia

Base model: Qwen/Qwen3-VL-4B-Thinking (this model is fine-tuned from it)

Modalities: text and image input; text output
