nvidia

EGM-8B

README

License: apache-2.0

Model Summary

EGM-Qwen3-VL-8B is the flagship model of the EGM (Efficient Visual Grounding Language Models) family. It is built on top of Qwen3-VL-8B-Thinking and trained with a two-stage pipeline: supervised fine-tuning (SFT) followed by reinforcement learning (RL) using GRPO (Group Relative Policy Optimization).

EGM demonstrates that by increasing test-time computation, small vision-language models can outperform much larger models in visual grounding tasks while being significantly faster at inference.

Key Results

  • 91.4 average IoU on the RefCOCO benchmark (vs. 87.8 for the base Qwen3-VL-8B-Thinking)
  • +3.6 IoU improvement over the base model
  • Outperforms Qwen3-VL-235B-A22B-Instruct (88.2 avg IoU) and Qwen3-VL-235B-A22B-Thinking (90.7 avg IoU)
  • 5.9x faster inference than Qwen3-VL-235B (737 ms vs. 4,320 ms average latency)
  • 18.9x faster than Qwen3-VL-235B-Thinking
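
The headline figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check of the quoted numbers; the variable names are illustrative and not part of any EGM tooling.

```python
# Figures quoted in the Key Results list.
egm_iou, base_iou = 91.4, 87.8                 # average IoU
egm_latency_ms, large_latency_ms = 737, 4320   # average latency in ms

# Absolute IoU gain of EGM over its base model.
iou_gain = round(egm_iou - base_iou, 1)

# Speedup is the ratio of the larger model's latency to EGM's latency.
speedup = round(large_latency_ms / egm_latency_ms, 1)

print(f"IoU gain over base: +{iou_gain}")
print(f"Speedup vs. Qwen3-VL-235B: {speedup}x")
```

Both values agree with the list: a gain of 3.6 IoU and a speedup of roughly 5.9x.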

RefCOCO Benchmark Results

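| Model | RefCOCO val | RefCOCO test-A | RefCOCO test-B | RefCOCO+ val | RefCOCO+ test-A | RefCOCO+ test-B | RefCOCOg val | RefCOCOg test | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B-Thinking | 91.0 | 92.5 | 86.6 | 86.2 | 91.2 | 80.5 | 87.8 | 88.6 | 87.8 |
| EGM-Qwen3-VL-8B | 93.9 | 95.6 | 91.2 | 90.5 | 93.5 | 86.3 | 90.8 | 91.4 | 91.4 |
| Qwen3-VL-235B-A22B-Instruct | 90.4 | 94.6 | 82.2 | 86.4 | 92.1 | 78.5 | 90.5 | 90.5 | 88.2 |
| Qwen3-VL-235B-A22B-Thinking | 93.4 | 94.1 | 90.6 | 89.5 | 91.4 | 85.2 | 90.4 | 90.5 | 90.7 |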
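
To see where the improvement over the base model is concentrated, the per-split scores from the benchmark results can be compared directly. The sketch below copies the two 8B rows by hand; the list and dictionary names are illustrative only.

```python
splits = [
    "RefCOCO val", "RefCOCO test-A", "RefCOCO test-B",
    "RefCOCO+ val", "RefCOCO+ test-A", "RefCOCO+ test-B",
    "RefCOCOg val", "RefCOCOg test",
]
# Per-split scores copied from the benchmark results (Avg column omitted).
base = [91.0, 92.5, 86.6, 86.2, 91.2, 80.5, 87.8, 88.6]  # Qwen3-VL-8B-Thinking
egm = [93.9, 95.6, 91.2, 90.5, 93.5, 86.3, 90.8, 91.4]   # EGM-Qwen3-VL-8B

# Gain of EGM over the base model on each split, rounded to one decimal.
gains = {name: round(e - b, 1) for name, b, e in zip(splits, base, egm)}
best_split = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name:16s} +{gain}")
print("Largest gain:", best_split)
```

With these numbers, the largest gains fall on the two test-B splits (RefCOCO+ test-B and RefCOCO test-B).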

How It Works

VLMs of different sizes often share the same visual encoder. Small models fall behind large models primarily because of a gap in text understanding: 62.8% of small-model errors stem from complex prompts with multiple relational descriptions. EGM narrows this gap by having the small model generate many inexpensive, mid-quality tokens at test time, matching the performance of large VLMs that produce fewer but more expensive tokens.

Training Pipeline

  1. SFT Stage: A proprietary VLM generates detailed chain-of-thought reasoning steps for visual grounding training data. The base model is fine-tuned on this data. The SFT checkpoint is available as nvidia/EGM-8B-SFT.
  2. RL Stage: GRPO is applied with a reward function combining IoU and task success metrics, further improving grounding accuracy.
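
The RL stage can be illustrated with a minimal sketch of an IoU-based reward and the group-relative advantage that gives GRPO its name. The exact reward used to train EGM is not specified here, so the blend below is an assumption: `success_threshold`, `success_weight`, and all function names are hypothetical, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
from statistics import mean, pstdev

def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred, target, success_threshold=0.5, success_weight=0.5):
    """Hypothetical reward: blend of raw IoU and a binary success term."""
    iou = box_iou(pred, target)
    success = 1.0 if iou >= success_threshold else 0.0
    return (1.0 - success_weight) * iou + success_weight * success

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three sampled predictions for one prompt, scored against one target box.
target = (10, 10, 50, 50)
samples = [(10, 10, 50, 50), (20, 20, 60, 60), (100, 100, 120, 120)]
rewards = [grounding_reward(s, target) for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below the mean are penalized, without needing a separate value model.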

Quickstart

Download

bash

pip install -U huggingface_hub
huggingface-cli download nvidia/EGM-8B --local-dir ./models/EGM-8B

Inference with SGLang

Launch the server:

bash

pip install "sglang[all]>=0.5.5"
python -m sglang.launch_server \
    --model-path nvidia/EGM-8B \
    --chat-template=qwen3-vl \
    --port 30000

Send a visual grounding request:

python

import base64

import openai

# Point the OpenAI-compatible client at the local SGLang server.
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Load a local image as base64
with open("example.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="nvidia/EGM-8B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}},
                {"type": "text", "text": "Please provide the bounding box coordinate of the region this sentence describes: the person on the left."},
            ],
        }
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
)
print(response.choices[0].message.content)
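
The response is free-form text, so the bounding box has to be extracted before it can be used. The sketch below is one possible post-processing step and rests on several assumptions that should be checked against real outputs: that any reasoning ends with a `</think>` tag, that the answer contains a bracketed list of four numbers, and that coordinates lie on a normalized 0 to 1000 grid. The function name `extract_box` and its parameters are hypothetical.

```python
import re

NUMBER = r"(-?\d+(?:\.\d+)?)"
# Four comma-separated numbers inside square brackets, e.g. [100, 200, 500, 800].
BOX_PATTERN = re.compile(r"\[\s*" + r"\s*,\s*".join([NUMBER] * 4) + r"\s*\]")

def extract_box(text, image_size=None, grid=1000):
    """Return (x1, y1, x2, y2) from a model response, or None if absent.

    If image_size=(width, height) is given, coordinates are rescaled from an
    assumed 0..grid normalized range to pixels.
    """
    # Keep only the text after the reasoning block, if there is one.
    answer = text.rsplit("</think>", 1)[-1]
    match = BOX_PATTERN.search(answer)
    if match is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in match.groups())
    if image_size is not None:
        width, height = image_size
        x1, x2 = x1 * width / grid, x2 * width / grid
        y1, y2 = y1 * height / grid, y2 * height / grid
    return (x1, y1, x2, y2)

# Made-up response string, used only to exercise the parser.
sample = '<think>The person on the left is near [1, 2].</think>{"bbox_2d": [100, 200, 500, 800]}'
print(extract_box(sample))
print(extract_box(sample, image_size=(640, 480)))
```

Returning `None` when no box is found lets the caller decide whether to retry the request or skip the sample.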

Model Architecture

| Component | Details |
|---|---|
| Architecture | Qwen3VLForConditionalGeneration |
| Text Hidden Size | 4096 |
| Text Layers | 36 |
| Attention Heads | 32 (8 KV heads) |
| Text Intermediate Size | 12,288 |
| Vision Hidden Size | 1152 |
| Vision Layers | 27 |
| Patch Size | 16 x 16 |
| Max Position Embeddings | 262,144 |
| Vocabulary Size | 151,936 |
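
These figures allow a rough estimate of the KV-cache memory needed per generated token, which matters because the model relies on long reasoning traces. The sketch assumes the attention head counts refer to the text decoder, that the head dimension equals hidden size divided by attention heads (the actual config may set it explicitly), and that the cache is stored in 16-bit precision.

```python
# Values taken from the architecture table.
text_hidden_size = 4096
text_layers = 36
attention_heads = 32
kv_heads = 8

# Assumptions (not stated in the table): derived head size, 2-byte cache dtype.
head_dim = text_hidden_size // attention_heads
bytes_per_value = 2

# With grouped-query attention, several query heads share one KV head.
queries_per_kv_head = attention_heads // kv_heads

# Each token stores one key and one value vector per KV head per layer.
kv_bytes_per_token = 2 * text_layers * kv_heads * head_dim * bytes_per_value

def kv_cache_gib(num_tokens):
    """Approximate KV-cache size in GiB for a single sequence."""
    return num_tokens * kv_bytes_per_token / 2**30

print(f"head_dim={head_dim}, queries per KV head={queries_per_kv_head}")
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache for 8192 tokens: {kv_cache_gib(8192):.3f} GiB")
```

Under these assumptions, a single sequence at the Quickstart's `max_tokens=8192` needs a little over 1 GiB of cache, a quarter of what 32 full KV heads would require.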

Citation

bibtex

@article{zhan2026EGM,
  author  = {Zhan, Guanqi and Li, Changye and Liu, Zhijian and Lu, Yao and Wu, Yi and Han, Song and Zhu, Ligeng},
  title   = {EGM: Efficient Visual Grounding Language Models},
  journal = {arXiv},
  year    = {2026}
}

Acknowledgment

This repository benefits from Qwen3-VL, InternVL, verl and verl-internvl.

Model provider

nvidia


Model tree

Base

Qwen/Qwen3-VL-8B-Thinking

Fine-tuned

this model

Modalities

Input

Text, Image

Output

Text
