Recipe
- Stage: 1 (NTP SFT; no alignment, no latent slots)
- Base model:
Qwen/Qwen2.5-VL-7B-Instruct
- Init checkpoint:
(none)
- Dataset:
ohjoonhee/visual-cot-50k-poc (Monet-SFT-125K Visual_CoT subset, eval-200 excluded)
- Hardware: 4× H100 80GB, DeepSpeed ZeRO-2 + CPU optim offload, bf16
- (no config available)
Notes
Pure NTP SFT — no Monet Stage 2 alignment loss, no latent-mode forward.
The Monet special tokens (<observation>, <abs_vis_token>, etc.) ARE
registered in the tokenizer and embedded so the model learns to produce
them, but the architectural latent-slot mechanism is unused at this stage.
This revision (step-1500)
No training log row available.
Notes
Faithful upstream Monet Stage 3 reproduction (lambda_reg=0). Init: Monet-SFT-7B/stage1. Teacher: upstream-precomputed (124K latents). Trained ~1942 step target, walltime-cut at step ~1728 (epoch 1.77). Final: loss=0.19 alignment_loss=0.032 obs_acc=0.97 — collapse signature.
Other revisions: see the revisions dropdown on this page.
How to load
from transformers import AutoModelForVision2Seq, AutoProcessor
m = AutoModelForVision2Seq.from_pretrained(
"ohjoonhee/vlatents-qwen25vl7b-stage3-upstream-baseline-v1", revision="step-1500", torch_dtype="bfloat16")
p = AutoProcessor.from_pretrained("ohjoonhee/vlatents-qwen25vl7b-stage3-upstream-baseline-v1", revision="step-1500")
Limitations
Research checkpoint, eval-only. Mid-training step (1500/?).
Not for production.
Card generated 2026-06-01 from training_log.jsonl + the run's training config.