README

License: apache-2.0

About LantErn

LantErn extends Qwen2.5-VL-3B-Instruct with Latent Visual Reasoning (LVR) tokens. Instead of always verbalizing what it sees, the model can emit compressed visual embeddings (<|lvr_start|>…<|lvr_end|>) during its chain of thought, so non-verbalized visual reasoning is interleaved with the text.
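The interleaving described above can be pictured as a decoder that switches between a text mode and a latent mode. The sketch below is a toy, pure-Python illustration under stated assumptions: the callables next_token and next_latent are hypothetical stand-ins for the model, each latent block is assumed to hold a fixed 8 vectors (the count listed in the special-token table), and <|lvr_end|> is assumed to be forced once the block is full. It is not taken from the LantErn codebase, whose real decoding loop is lantern_generate.

```python
from typing import Callable, Iterator, List, Union

# Token spellings follow the model description; N_LATENT = 8 is an assumption
# taken from the special-token table (8 compressed embeddings per block).
LVR_START = "<|lvr_start|>"
LVR_END = "<|lvr_end|>"
EOS = "<eos>"  # hypothetical end-of-sequence marker for this toy
N_LATENT = 8

Vector = List[float]
Step = Union[str, Vector]


def interleaved_decode(
    next_token: Callable[[List[Step]], str],
    next_latent: Callable[[List[Step]], Vector],
    max_steps: int,
    n_latent: int = N_LATENT,
) -> List[Step]:
    """Toy decoder that alternates between emitting text tokens and raw vectors.

    Hypothetical sketch: after LVR_START, the next n_latent steps append
    continuous vectors instead of tokens, then LVR_END is appended to close
    the block. Decoding stops at EOS or after max_steps loop iterations.
    """
    trace: List[Step] = []
    latent_left = 0
    for _ in range(max_steps):
        if latent_left > 0:
            trace.append(next_latent(trace))  # latent mode: no token is sampled
            latent_left -= 1
            if latent_left == 0:
                trace.append(LVR_END)  # block is full, close it
            continue
        tok = next_token(trace)  # text mode: ordinary next-token step
        trace.append(tok)
        if tok == LVR_START:
            latent_left = n_latent
        elif tok == EOS:
            break
    return trace


# Usage with stand-in "models": a fixed token script and a dummy latent generator.
script: Iterator[str] = iter(["The", "piece", LVR_START, "fits", EOS])
trace = interleaved_decode(
    next_token=lambda t: next(script),
    next_latent=lambda t: [float(len(t))] * 4,  # 4-dim dummy vector
    max_steps=50,
)
print([s if isinstance(s, str) else "<vec>" for s in trace])
```

The trace shows text tokens, the opening marker, eight placeholder vectors, the closing marker, and then text again, which is the interleaving pattern the description refers to.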

Special tokens:

Token            Role
<|lvr_start|>    Begins a latent visual reasoning block
<|lvr_sep|>      Placeholder replaced by compressed visual embeddings (8 tokens)
<|lvr_end|>      Ends a latent visual reasoning block
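The placeholder role can be illustrated the way vision-language models such as Qwen2.5-VL commonly splice image features into their input: every placeholder position in the embedding sequence is overwritten, in order, by one compressed visual embedding. This is a hypothetical pure-Python sketch using plain lists instead of tensors; lvr_block and fill_lvr_placeholders are illustrative names, the <|lvr_sep|> spelling assumes the same pipe-delimited pattern as <|lvr_start|>, and the real LantErn implementation may differ.

```python
from typing import List, Sequence

# Assumed token spellings and block size (8 placeholders per LVR block).
LVR_START, LVR_SEP, LVR_END = "<|lvr_start|>", "<|lvr_sep|>", "<|lvr_end|>"
N_LATENT = 8


def lvr_block(n_latent: int = N_LATENT) -> List[str]:
    """One LVR block: start marker, n_latent placeholders, end marker."""
    return [LVR_START] + [LVR_SEP] * n_latent + [LVR_END]


def fill_lvr_placeholders(
    tokens: Sequence[str],
    token_embeds: Sequence[Sequence[float]],
    latents: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Copy token_embeds, replacing each LVR_SEP row (in order) with the next latent."""
    positions = [i for i, tok in enumerate(tokens) if tok == LVR_SEP]
    if len(positions) != len(latents):
        raise ValueError(f"{len(positions)} placeholders but {len(latents)} latents")
    out = [list(row) for row in token_embeds]  # leave the input untouched
    for pos, vec in zip(positions, latents):
        if len(vec) != len(out[pos]):
            raise ValueError("latent and token embeddings must share a hidden size")
        out[pos] = list(vec)
    return out


# Usage: a 12-position sequence with one block, using 2-dim toy embeddings.
tokens = ["Look", *lvr_block(), "answer"]
token_embeds = [[float(i), float(i)] for i in range(len(tokens))]
latents = [[-1.0, float(k)] for k in range(N_LATENT)]
filled = fill_lvr_placeholders(tokens, token_embeds, latents)
```

Only the eight placeholder rows change; the start and end markers keep their ordinary token embeddings, so the model can still see where a latent block begins and ends.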

Usage

Codebase: github.com/GuilhermeViveiros/LantErn

bash

git clone https://github.com/GuilhermeViveiros/LantErn.git
cd LantErn
pip install -r requirements.txt
pip install -e .

python

import torch
from PIL import Image
from qwen_vl_utils import process_vision_info
from src.lantern_generate.generate import generate as lantern_generate
from src.models import load_model
# ── 1. Load model + processor ─────────────────────────────────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
model, processor = load_model("AGViveiros/LanteRn-3B-Tetris", compute_dtype=torch.bfloat16, use_cache=True)
model.eval().to(device)
processor.tokenizer.padding_side = "left"
# ── 2. Build inputs ───────────────────────────────────────────────────────────
image = Image.open("path/to/image.jpg").convert("RGB")
question = "Your question here"
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": question},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, return_tensors="pt").to(device)
prompt_len = inputs["input_ids"].shape[1]
# ── 3. Generate with latent visual reasoning ──────────────────────────────────
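output = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,  # greedy decoding
    custom_generate=lantern_generate,  # custom decoding loop from the LantErn repo
    use_cache=True,
    return_dict_in_generate=True,
)
# Keep only the newly generated tokens; skip_special_tokens=False keeps the LVR markers visible
generated = output.sequences[0][prompt_len:]
print(processor.decode(generated, skip_special_tokens=False))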
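To inspect the decoded output, the LVR blocks can be separated from the verbalized text. This is a hedged sketch that assumes the printed string contains the LVR markers literally (spelled <|lvr_start|>, <|lvr_sep|>, <|lvr_end|>) and that each latent slot decodes as <|lvr_sep|>; split_lvr is a hypothetical helper, not part of the LantErn codebase.

```python
import re
from typing import List, Tuple

# Assumed marker spellings; adjust if the tokenizer decodes them differently.
LVR_BLOCK = re.compile(r"<\|lvr_start\|>(.*?)<\|lvr_end\|>", re.DOTALL)
LVR_SEP = "<|lvr_sep|>"


def split_lvr(decoded: str) -> Tuple[str, List[int]]:
    """Return (text with LVR blocks removed, placeholder count per block)."""
    counts = [m.group(1).count(LVR_SEP) for m in LVR_BLOCK.finditer(decoded)]
    text = LVR_BLOCK.sub(" ", decoded)        # drop each latent block
    text = re.sub(r"\s+", " ", text).strip()  # tidy leftover whitespace
    return text, counts


# Example on a hand-written string (not real model output):
sample = "The piece<|lvr_start|>" + LVR_SEP * 8 + "<|lvr_end|>fits. Answer: B"
text, counts = split_lvr(sample)
print(text, counts)  # The piece fits. Answer: B [8]
```

With the model loaded as above, the same helper could be applied as split_lvr(processor.decode(generated, skip_special_tokens=False)); other special tokens such as the chat end-of-turn marker would remain in the text part.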

Citation

bibtex

@article{viveiros2026holding,
  title={What's Holding Back Latent Visual Reasoning?},
  author={Viveiros, Andr{\'e} G and Gon{\c{c}}alves, Nuno and Martins, Andr{\'e} FT and Lindemann, Matthias},
  journal={arXiv preprint arXiv:2605.18445},
  year={2026}
}

Model provider: AGViveiros

Model tree
Base: Qwen/Qwen2.5-VL-3B-Instruct
Fine-tuned: this model

Modalities
Input: Text, Image
Output: Text
