README
License: apache-2.0
About LantErn
LantErn extends Qwen2.5-VL-3B-Instruct with
Latent Visual Reasoning (LVR) tokens. Instead of always verbalizing what it sees, the model can emit
compressed visual embeddings (`<|lvr_start|>…<|lvr_end|>`) during its chain-of-thought, enabling
non-verbalized visual reasoning interleaved with text.
Special tokens:
| Token | Role |
|---|---|
| `<lvr_start>` | Begin a latent visual reasoning block |
| `<lvr_sep>` | Placeholder replaced by compressed visual embeddings (8 tokens) |
| `<lvr_end>` | End a latent visual reasoning block |
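
To make the token roles above concrete, here is a minimal sketch of splitting a decoded output string into verbal text and latent blocks. The helper name `split_latent_blocks` is hypothetical and not part of the LantErn codebase. The delimiter strings are arguments because this card spells them both with and without pipes (`<|lvr_start|>` and `<lvr_start>`); the defaults are an assumption, so check the tokenizer vocabulary for the exact form.

```python
import re


def split_latent_blocks(decoded, start="<|lvr_start|>", end="<|lvr_end|>"):
    """Split decoded text into ("text", ...) and ("latent", ...) segments.

    Hypothetical helper, not part of the LantErn codebase. The default
    delimiter spellings are an assumption; pass the strings your tokenizer uses.
    """
    # Non-greedy match so that each start token pairs with the nearest end token.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    segments = []
    cursor = 0
    for match in pattern.finditer(decoded):
        if match.start() > cursor:
            segments.append(("text", decoded[cursor:match.start()]))
        segments.append(("latent", match.group(1)))
        cursor = match.end()
    if cursor < len(decoded):
        segments.append(("text", decoded[cursor:]))
    return segments


# Illustrative string only, not real model output.
example = "The piece is L-shaped. <|lvr_start|><|lvr_sep|><|lvr_end|> So it fits."
for kind, content in split_latent_blocks(example):
    print(kind, repr(content))
```

Keeping the latent segments separate, rather than stripping them, makes it easy to see where in the chain-of-thought the model chose to reason visually.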
Usage
Codebase: github.com/GuilhermeViveiros/LantErn
```bash
git clone https://github.com/GuilhermeViveiros/LantErn.git
cd LantErn
pip install -r requirements.txt
pip install -e .
```
```python
import torch
from PIL import Image
from qwen_vl_utils import process_vision_info

from src.lantern_generate.generate import generate as lantern_generate
from src.models import load_model

# ── 1. Load model + processor ─────────────────────────────────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
model, processor = load_model(
    "AGViveiros/LanteRn-3B-Tetris",
    compute_dtype=torch.bfloat16,
    use_cache=True,
)
model.eval().to(device)
processor.tokenizer.padding_side = "left"

# ── 2. Build inputs ───────────────────────────────────────────────────────────
image = Image.open("path/to/image.jpg").convert("RGB")
question = "Your question here"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, return_tensors="pt").to(device)
prompt_len = inputs["input_ids"].shape[1]

# ── 3. Generate with latent visual reasoning ──────────────────────────────────
output = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,
    custom_generate=lantern_generate,
    use_cache=True,
    return_dict_in_generate=True,
)
generated = output.sequences[0][prompt_len:]
print(processor.decode(generated, skip_special_tokens=False))
```
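
After generation it can be useful to know how much of the output was latent reasoning rather than text. The following is a minimal sketch, assuming latent blocks show up in the generated ID sequence as runs delimited by the start and end token IDs; that is an assumption about what `lantern_generate` returns, not something this card states. `latent_mask` is a hypothetical helper, and the IDs used here are toy values.

```python
def latent_mask(token_ids, start_id, end_id):
    """Return a list of booleans, True where a token lies inside a latent block.

    Hypothetical helper. The delimiter tokens themselves are marked True.
    An unclosed block is treated as running to the end of the sequence.
    """
    mask = []
    inside = False
    for tok in token_ids:
        if tok == start_id:
            inside = True
        mask.append(inside)
        if tok == end_id:
            inside = False
    return mask


# Toy IDs for illustration only: 90 = start, 91 = separator, 92 = end.
ids = [5, 6, 90, 91, 91, 92, 7]
mask = latent_mask(ids, start_id=90, end_id=92)
print(sum(mask), "latent tokens,", len(mask) - sum(mask), "verbal tokens")
```

With the real model, the inputs would plausibly come from `generated.tolist()` and `processor.tokenizer.convert_tokens_to_ids(...)` applied to the start and end token strings, provided the spelling passed in matches the tokenizer vocabulary.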
Citation
```bibtex
@article{viveiros2026holding,
  title={What's Holding Back Latent Visual Reasoning?},
  author={Viveiros, Andr{\'e} G and Gon{\c{c}}alves, Nuno and Martins, Andr{\'e} FT and Lindemann, Matthias},
  journal={arXiv preprint arXiv:2605.18445},
  year={2026}
}
```
Model provider: AGViveiros
Base model: Qwen/Qwen2.5-VL-3B-Instruct (this model is a fine-tune of it)
Input modalities: Text, Image
Output modalities: Text