README

License: apache-2.0

Recipe

  • Stage: 2 (post-SFT, alignment + emphasized-CE objective)
  • Base model: Qwen/Qwen2.5-VL-7B-Instruct
  • Init checkpoint: /data/joonhee/visual-latents/cluster_phase3/stage1_sft/checkpoint
  • Dataset: ohjoonhee/visual-cot-50k-poc (Monet-SFT-125K Visual_CoT subset, eval-200 excluded)
  • Hardware: 4× H100 80GB, DeepSpeed ZeRO-2 + CPU optim offload, bf16
  • latent_size: 8
  • alignment_weight: 2.0
  • ce_emphasize_factor: 4.0
  • alignment_layer: all_layers
  • use_attn_mask_4d: True
  • lr: 1e-05
  • weight_decay: 0.01
  • warmup_steps: 100
  • max_steps: 2000
  • grad_accum_steps: 32
  • max_pixels: 1568000
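
Two of the recipe values are easier to read after a little arithmetic. The sketch below assumes the Qwen2.5-VL vision settings of 14-pixel patches merged 2×2 (so one visual token covers a 28×28 pixel area) and a hypothetical per-device batch size of 1, which the recipe does not state. Treat the batch figure as illustrative only.

```python
# Values copied from the recipe above.
MAX_PIXELS = 1_568_000
NUM_GPUS = 4
GRAD_ACCUM_STEPS = 32

# Assumption: Qwen2.5-VL vision defaults (patch size 14, 2x2 spatial merge).
PATCH_SIZE = 14
SPATIAL_MERGE = 2
pixels_per_token = (PATCH_SIZE * SPATIAL_MERGE) ** 2  # 28 * 28 = 784

# Upper bound on visual tokens contributed by a single image.
max_visual_tokens = MAX_PIXELS // pixels_per_token  # 2000

# Assumption: per-device batch size is NOT given in the recipe; 1 is a placeholder.
PER_DEVICE_BATCH = 1
samples_per_optimizer_step = NUM_GPUS * GRAD_ACCUM_STEPS * PER_DEVICE_BATCH

print(f"max visual tokens per image: {max_visual_tokens}")
print(f"samples per optimizer step (assumed batch=1): {samples_per_optimizer_step}")
```

Under these assumptions, max_pixels caps each image at 2000 visual tokens, and one optimizer step covers 128 samples.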

Fidelity to the Monet paper

  1. Latent-only backprop: paper-faithful (Job C). The emphasize_latent_weight term uses a verbatim port of upstream compute_latents_only_loss. The alignment loss is computed in the CE forward pass (where ce_patch_vec is spliced into inputs_embeds) and backpropagated ONLY through the latent embeddings, i.e.

    markdown

    total = emphasize_latent_weight * compute_latents_only_loss(ce_patch_vec, alignment_weight*align) + ce
    (mirrors upstream src/trainer.py:152-224). The earlier plain-scalar-add approximation (see the *-repro-v1 repo) is NOT used here.
  2. attention_mask_4d is hand-rolled in mask_utils.build_monet_4d_attn with latent_cross_isolate=True. Verified equivalent on tested cases (see phase1_5b_attn/MASK_VALIDATION.md) but not byte-identical to upstream.
  3. The teacher forward runs inline (not precomputed offline). This is functionally equivalent as long as the teacher checkpoint is the same, and it saves the storage an offline precompute would need.
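
To illustrate what "backpropped only through the latent embeddings" in item 1 can mean in practice, here is a minimal PyTorch sketch. It is not the upstream compute_latents_only_loss and makes no claim about how that function is written. It shows one common way to get the effect: take the gradient of the loss with respect to the latents, then build a surrogate scalar that carries exactly that gradient and nothing else. The function name latents_only_surrogate and the toy tensors are hypothetical.

```python
import torch

def latents_only_surrogate(latents: torch.Tensor, loss: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: return a scalar whose backward pass delivers
    d(loss)/d(latents) through `latents` and drops every other gradient path."""
    (grad,) = torch.autograd.grad(loss, latents, retain_graph=True)
    # grad is detached, so the only differentiable input here is `latents`.
    return (latents * grad.detach()).sum()

torch.manual_seed(0)
w_latent = torch.randn(3, requires_grad=True)  # toy params that produce the latents
w_other = torch.randn(3, requires_grad=True)   # toy params on a non-latent path
latents = w_latent * 2.0

# Toy alignment loss that depends on both the latents and the other path.
align = ((latents - 1.0) ** 2).sum() + (w_other ** 2).sum()
alignment_weight = 2.0  # value from the recipe

surrogate = latents_only_surrogate(latents, alignment_weight * align)
surrogate.backward()

# Gradient reached the latent path but not the other path.
print(w_latent.grad)
print(w_other.grad)  # None: no gradient flowed here
```

In this toy setup w_other receives no gradient even though the alignment loss depends on it, which is the property the card describes for the non-latent parts of the forward pass.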

This revision (step-1500)

Last logged training row (from step 1800, i.e. later than this step-1500 checkpoint): step=1800, ce_loss=1.0302, align_loss=0.0416, total_loss=1.1133, elapsed=85939s (about 23.9 h)
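
As a quick sanity check on the logged row, the sketch below derives the average time per step and compares total_loss with ce_loss + alignment_weight * align_loss. It assumes elapsed is cumulative wall-clock time from step 0, which the log does not state. The comparison is only an observation about the printed numbers, not a statement about how the trainer computes total_loss.

```python
# Values copied from the logged row and the recipe.
row = {
    "step": 1800,
    "ce_loss": 1.0302,
    "align_loss": 0.0416,
    "total_loss": 1.1133,
    "elapsed": 85939,  # seconds; assumed cumulative since step 0
}
alignment_weight = 2.0

sec_per_step = row["elapsed"] / row["step"]
hours = row["elapsed"] / 3600

# Recombine the two logged components and compare with the logged total.
recombined = row["ce_loss"] + alignment_weight * row["align_loss"]
gap = abs(recombined - row["total_loss"])

print(f"{sec_per_step:.1f} s/step over {hours:.1f} h")
print(f"ce + {alignment_weight} * align = {recombined:.4f} (gap to logged total: {gap:.4f})")
```

The gap is on the order of 1e-4, which is within what rounding the logged values to four decimals would produce.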

Notes

Job D, Stage 3 BASELINE (lambda_reg=0), step-1500 checkpoint. The run was walltime-killed at step 1800/2000. At the last logged row: ce ≈ 1.0, align ≈ 0.04, vicreg = 0. Pairwise-cos collapse signature: pending internal probe.

Other revisions: see the revisions dropdown on this page.

How to load

python

from transformers import AutoModelForVision2Seq, AutoProcessor

repo = "ohjoonhee/vlatents-qwen25vl7b-stage3-baseline-v1"
# Pin the same revision for both so the weights and the processor match.
model = AutoModelForVision2Seq.from_pretrained(repo, revision="step-1500", torch_dtype="bfloat16")
processor = AutoProcessor.from_pretrained(repo, revision="step-1500")

Limitations

Research checkpoint, eval-only. Mid-training step (1500/2000). Not for production.


Card generated 2026-05-29 from training_log.jsonl + the run's training config.

Model provider

ohjoonhee

Model tree

Base

Qwen/Qwen2.5-VL-7B-Instruct

Fine-tuned

this model

Modalities

Input

Text, Image

Output

Text
