ohjoonhee

vlatents-qwen25vl7b-stage3-baseline-v1

README

License: apache-2.0

Recipe

Stage: 2 (post-SFT, alignment + emphasized-CE objective)
Base model: Qwen/Qwen2.5-VL-7B-Instruct
Init checkpoint: /data/joonhee/visual-latents/cluster_phase3/stage1_sft/checkpoint
Dataset: ohjoonhee/visual-cot-50k-poc (Monet-SFT-125K Visual_CoT subset, eval-200 excluded)
Hardware: 4× H100 80GB, DeepSpeed ZeRO-2 + CPU optim offload, bf16
latent_size: 8
alignment_weight: 2.0
ce_emphasize_factor: 4.0
alignment_layer: all_layers
use_attn_mask_4d: True
lr: 1e-05
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
grad_accum_steps: 32
max_pixels: 1568000
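
As a reading aid for the max_pixels value: the base model's image preprocessing is documented as producing one visual token per 28×28-pixel region (14-pixel patches merged 2×2), so a pixel budget implies a per-image ceiling on visual tokens. The sketch below works that out. The 28-pixel constant is taken from the base model's documentation rather than from this run's config, and the helper name is illustrative only.

```python
# Side length in pixels covered by one visual token.
# Assumption: 14 px patches merged 2x2, as documented for the Qwen2.5-VL base model.
PIXELS_PER_TOKEN_SIDE = 28


def max_visual_tokens(max_pixels: int, side: int = PIXELS_PER_TOKEN_SIDE) -> int:
    """Upper bound on visual tokens per image for a given pixel budget."""
    return max_pixels // (side * side)


# max_pixels value from the recipe above.
print(max_visual_tokens(1_568_000))  # 2000
```

Under that assumption the recipe's budget works out to exactly 2000 visual tokens per image.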

Fidelity to the Monet paper

Latent-only backprop is paper-faithful (Job C). emphasize_latent_weight uses a verbatim port of upstream compute_latents_only_loss: the alignment loss is computed in the CE forward pass (where ce_patch_vec is spliced into inputs_embeds) and backpropagated only through the latent embeddings, i.e.
```
total = emphasize_latent_weight * compute_latents_only_loss(ce_patch_vec, alignment_weight*align) + ce
```
(mirrors upstream src/trainer.py:152-224). The earlier plain-scalar-add approximation (see the *-repro-v1 repo) is NOT used here.
attention_mask_4d is hand-rolled in mask_utils.build_monet_4d_attn with latent_cross_isolate=True. It was verified equivalent to upstream on the tested cases (see phase1_5b_attn/MASK_VALIDATION.md) but is not byte-identical to upstream.
The teacher forward runs inline (not offline-precomputed). This is functionally equivalent if the teacher checkpoint is the same, and it saves precompute storage.
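
To make "backpropped only through the latent embeddings" concrete, here is a minimal PyTorch sketch of one common way to get that behavior: take the gradient of the loss at the latent tensor, then build a surrogate scalar whose backward pass delivers exactly that gradient to the latents and to nothing else. This is an illustration under stated assumptions, not the upstream compute_latents_only_loss or this repo's port of it. The helper name latents_only_surrogate and the toy tensors are hypothetical.

```python
import torch


def latents_only_surrogate(latents: torch.Tensor, loss: torch.Tensor) -> torch.Tensor:
    """Scalar whose backward pass delivers d(loss)/d(latents) to `latents` only.

    Hypothetical helper for illustration; not the upstream implementation.
    """
    # Gradient of the loss at the latent tensor; nothing is written to .grad here.
    (grad,) = torch.autograd.grad(loss, latents, retain_graph=True)
    # grad is a constant, so backward on this product flows only into `latents`
    # (and whatever produced it), skipping every other path to the loss.
    return (latents * grad.detach()).sum()


# Toy check: `w` stands in for model weights, `z` for the source of the latents.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
z = torch.tensor([0.5, 0.5, 0.5], requires_grad=True)
latents = 2.0 * z
align = (w * latents).sum()

latents_only_surrogate(latents, align).backward()
print(z.grad, w.grad)
```

In the toy check, z receives a gradient while w.grad stays None, which is the isolation property the section describes. The surrogate's value differs from the loss value, so in a sketch like this the loss itself would be logged separately.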

This revision (`step-1500`)

Last logged training row (from the end of the run at step 1800, not from this revision's step 1500): step=1800, ce_loss=1.0302, align_loss=0.0416, total_loss=1.1133, elapsed=85939s
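
A quick arithmetic check on that row, using only the logged numbers and the alignment_weight from the recipe. It suggests the logged total is consistent with ce_loss + alignment_weight × align_loss to within rounding; that is an observation about the values, not a claim about how the trainer assembles the loss. The timing figures assume elapsed counts seconds from step 0 of a single uninterrupted run.

```python
# Last logged row, copied from the card.
row = {"step": 1800, "ce_loss": 1.0302, "align_loss": 0.0416,
       "total_loss": 1.1133, "elapsed": 85939}
alignment_weight = 2.0  # from the recipe

# Recombine the logged components and compare with the logged total.
recombined = row["ce_loss"] + alignment_weight * row["align_loss"]
gap = abs(recombined - row["total_loss"])

# Wall-clock figures (assumes `elapsed` is seconds since step 0).
hours = row["elapsed"] / 3600
sec_per_step = row["elapsed"] / row["step"]

print(f"recombined={recombined:.4f} gap={gap:.4f} "
      f"hours={hours:.2f} sec_per_step={sec_per_step:.1f}")
```

The recombined value lands within about 1e-4 of the logged total, and the elapsed time is a little under 24 hours, roughly 48 seconds per optimizer step.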

Notes

Job D Stage 3 BASELINE (lambda_reg=0), step-1500. Walltime-killed at step 1800/2000. ce≈1.0, align≈0.04, vicreg=0. Pairwise-cos collapse signature pending internal probe.
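
The card does not define the pairwise-cos probe. One plausible reading, assumed here, is the mean cosine similarity over all pairs of latent vectors, where values near 1 indicate that the latents have collapsed toward a single direction. The sketch below shows that metric on toy data in plain Python. The function name and the example vectors are hypothetical, and this is not the internal probe mentioned above.

```python
import math
from itertools import combinations


def mean_pairwise_cosine(vectors):
    """Mean cosine similarity over all unordered pairs of vectors."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    pairs = list(combinations(vectors, 2))
    return sum(cos(a, b) for a, b in pairs) / len(pairs)


# Toy latents: nearly parallel vectors versus mutually orthogonal ones.
collapsed = [[1.0, 2.0, 3.0], [1.1, 2.0, 2.9], [0.9, 2.1, 3.0]]
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(mean_pairwise_cosine(collapsed), mean_pairwise_cosine(spread))
```

The nearly parallel set scores close to 1 and the orthogonal set scores 0, which is the contrast a collapse check of this kind would look for.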

Other revisions: see the revisions dropdown on this page.

How to load

```python
from transformers import AutoModelForVision2Seq, AutoProcessor

# Pin the revision so the step-1500 weights are loaded rather than the default branch.
m = AutoModelForVision2Seq.from_pretrained(
    "ohjoonhee/vlatents-qwen25vl7b-stage3-baseline-v1", revision="step-1500", torch_dtype="bfloat16")
p = AutoProcessor.from_pretrained("ohjoonhee/vlatents-qwen25vl7b-stage3-baseline-v1", revision="step-1500")
```

Limitations

Research checkpoint, eval-only. Mid-training step (1500/2000). Not for production.

Card generated 2026-05-29 from training_log.jsonl + the run's training config.

