Recipe
- Stage: 2 (post-SFT, alignment + emphasized-CE objective)
- Base model: Qwen/Qwen2.5-VL-7B-Instruct
- Init checkpoint: /data/joonhee/visual-latents/cluster_phase3/stage1_sft/checkpoint
- Dataset: ohjoonhee/visual-cot-50k-poc (Monet-SFT-125K Visual_CoT subset, eval-200 excluded)
- Hardware: 4× H100 80GB, DeepSpeed ZeRO-2 + CPU optim offload, bf16
latent_size: 8
alignment_weight: 2.0
ce_emphasize_factor: 4.0
alignment_layer: all_layers
use_attn_mask_4d: True
lr: 1e-05
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
grad_accum_steps: 32
max_pixels: 1568000
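Two quantities can be derived from the values above. This is a small arithmetic sketch, not part of the training code. It assumes Qwen2.5-VL's usual accounting of one visual token per 28x28 pixel block (14-pixel patches merged 2x2). The per-device batch size is not stated in this card, so `per_device_batch` is a hypothetical parameter and the value 1 is only an illustration.

```python
# Values copied from the recipe above.
MAX_PIXELS = 1_568_000
GRAD_ACCUM_STEPS = 32
NUM_GPUS = 4
MAX_STEPS = 2000
WARMUP_STEPS = 100

# Assumption: one visual token covers a 28x28 pixel block
# (14x14 patches, merged 2x2), as in Qwen2.5-VL's image processor.
PIXELS_PER_VISUAL_TOKEN = 28 * 28


def max_visual_tokens(max_pixels, pixels_per_token=PIXELS_PER_VISUAL_TOKEN):
    """Upper bound on visual tokens per image implied by max_pixels."""
    return max_pixels // pixels_per_token


def sequences_per_optimizer_step(per_device_batch,
                                 grad_accum=GRAD_ACCUM_STEPS,
                                 num_gpus=NUM_GPUS):
    """Samples consumed per optimizer step under data parallelism."""
    return per_device_batch * grad_accum * num_gpus


print(max_visual_tokens(MAX_PIXELS))        # 2000
print(sequences_per_optimizer_step(1))      # 128 (hypothetical batch of 1 per GPU)
print(WARMUP_STEPS / MAX_STEPS)             # 0.05, i.e. warmup covers 5% of max_steps
```

Under that assumption, max_pixels corresponds to a cap of 2000 visual tokens per image.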
Fidelity to the Monet paper
- Latent-only backprop: paper-faithful (Job C). emphasize_latent_weight uses a verbatim port of upstream compute_latents_only_loss: the alignment loss is computed in the CE forward (where ce_patch_vec is spliced into inputs_embeds) and backpropagated ONLY through the latent embeddings, i.e.
  total = emphasize_latent_weight * compute_latents_only_loss(ce_patch_vec, alignment_weight*align) + ce
  (mirrors upstream src/trainer.py:152-224). The earlier plain-scalar-add approximation (see the *-repro-v1 repo) is NOT used here.
- attention_mask_4d is hand-rolled in mask_utils.build_monet_4d_attn with latent_cross_isolate=True. It was verified equivalent to upstream on the tested cases (see phase1_5b_attn/MASK_VALIDATION.md) but is not byte-identical to upstream.
- Inline teacher forward (not offline-precomputed). Functionally equivalent if the teacher checkpoint is the same; saves precompute storage.
This revision (step-1500)
Last logged training row (the run's final log entry, 300 steps after this checkpoint): step=1800, ce_loss=1.0302, align_loss=0.0416, total_loss=1.1133, elapsed=85939s
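The logged row can be sanity-checked with a few lines of arithmetic. The sketch below assumes training_log.jsonl stores one JSON object per line with the key names shown in the row above, and that elapsed counts seconds from the start of this run. The recomputed total is only an observation about the numbers; it is not a statement of how the trainer builds total_loss.

```python
import json

# The last logged row, rewritten as one JSON line.
# Assumption: training_log.jsonl uses these key names.
LAST_ROW = (
    '{"step": 1800, "ce_loss": 1.0302, "align_loss": 0.0416, '
    '"total_loss": 1.1133, "elapsed": 85939}'
)

ALIGNMENT_WEIGHT = 2.0  # alignment_weight from the recipe


def summarize(row, alignment_weight=ALIGNMENT_WEIGHT):
    """Derive a few readable quantities from one log row."""
    recomputed = row["ce_loss"] + alignment_weight * row["align_loss"]
    return {
        "recomputed_total": recomputed,
        "gap_to_logged_total": abs(recomputed - row["total_loss"]),
        "hours": row["elapsed"] / 3600,
        "sec_per_step": row["elapsed"] / row["step"],
    }


summary = summarize(json.loads(LAST_ROW))
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```

With these inputs, ce_loss + alignment_weight * align_loss matches the logged total_loss to within the rounding of the logged values, and the elapsed time works out to just under 24 hours, or roughly 48 seconds per step.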
Notes
Job D, Stage 3 BASELINE (lambda_reg=0), step-1500 checkpoint. The run was walltime-killed at step 1800/2000. Losses: ce ≈ 1.0, align ≈ 0.04, vicreg = 0. The pairwise-cosine collapse signature is pending an internal probe.
Other revisions: see the revisions dropdown on this page.
How to load
from transformers import AutoModelForVision2Seq, AutoProcessor

# Pin the revision so the step-1500 weights are loaded rather than the default branch.
m = AutoModelForVision2Seq.from_pretrained(
    "ohjoonhee/vlatents-qwen25vl7b-stage3-baseline-v1", revision="step-1500", torch_dtype="bfloat16")
# Load the processor from the same revision.
p = AutoProcessor.from_pretrained("ohjoonhee/vlatents-qwen25vl7b-stage3-baseline-v1", revision="step-1500")
Limitations
Research checkpoint, eval-only. Mid-training step (1500/2000).
Not for production.
Card generated 2026-05-29 from training_log.jsonl + the run's training config.