hhllzz/robocasa-qwen3vl4b-completion-judge-seen-nav-p30-1to2

License: other

Intended Use

Use this adapter for offline evaluation or as a lightweight VLM completion checker in RoboCasa long-horizon task pipelines.

Input:

- A 64-frame video clip from the robot observation stream.
- A subtask instruction, such as "Place the straw inside the glass cup."
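How the 64 frames are picked from a longer observation stream is not stated here. The following is a minimal sketch that assumes uniform sampling across the clip, which may differ from what the project scripts do; `sample_frame_indices` is a hypothetical helper, not part of the released code.

```python
def sample_frame_indices(num_total_frames: int, num_samples: int = 64) -> list[int]:
    """Return `num_samples` evenly spaced frame indices (ASSUMED uniform sampling)."""
    if num_total_frames <= 0:
        raise ValueError("clip has no frames")
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    last = num_total_frames - 1
    if num_samples == 1:
        # With a single sample, keep the final frame: the question asks
        # about the final moment shown in the video.
        return [last]
    # Always include the first and the last frame; clips shorter than
    # `num_samples` simply repeat some indices.
    return [round(i * last / (num_samples - 1)) for i in range(num_samples)]
```

Keeping the last frame in the sample is a deliberate choice in this sketch, since the prompt asks about the final moment of the clip.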

Recommended prompt:

```text
Instruction: <subtask instruction>
Question: At the final moment shown in the video, is the instruction already completed?
Answer exactly one label: not complete or complete.
```

Do not include simulator privileged state in the inference prompt. The model should not see success frames, object poses, rewards, oracle predicates, or environment metadata.
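As a small sketch, the recommended prompt can be filled in programmatically so that only the instruction text reaches the model. `build_prompt` and `CANDIDATE_LABELS` are illustrative names, not identifiers from the released scripts.

```python
# The three prompt lines are copied from the recommended prompt above.
PROMPT_TEMPLATE = (
    "Instruction: {instruction}\n"
    "Question: At the final moment shown in the video, "
    "is the instruction already completed?\n"
    "Answer exactly one label: not complete or complete."
)

# The two fixed candidate answers that get scored.
CANDIDATE_LABELS = ("not complete", "complete")


def build_prompt(instruction: str) -> str:
    """Fill the template with the subtask instruction only (no privileged state)."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


prompt = build_prompt("Place the straw inside the glass cup.")
```

The helper takes nothing but the instruction string, which makes it harder to leak rewards, object poses, or other oracle fields into the prompt by accident.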

Best Inference Rule

The most reliable inference mode tested so far is right + left view mean normalized NLL:

1. Run the same instruction and sample window once with `robot0_agentview_right`.
2. Run it again with `robot0_agentview_left`.
3. For each view, score both candidate answers, `not complete` and `complete`.
4. Normalize each candidate score by its answer-token count.
5. Average the normalized NLL scores from the right and left views.
6. Choose the label with the lower averaged score.

Formula:

```text
score_final(label)
  = 0.5 * NLL_right(label) / num_answer_tokens(label)
  + 0.5 * NLL_left(label)  / num_answer_tokens(label)

prediction = argmin_label score_final(label)
```

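A lower score means the model prefers that answer. Normalizing by answer-token count matters because `not complete` is longer than `complete`: its raw NLL sums over more tokens, so an unnormalized comparison would penalize it for length alone.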
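The formula can be sketched in a few lines of Python. The function name `ensemble_predict` and the numbers in the example are made up for illustration. In particular, the token counts (2 and 1) are placeholders, since the real counts depend on the tokenizer.

```python
LABELS = ("not complete", "complete")


def ensemble_predict(nll_right, nll_left, num_answer_tokens):
    """Right + left mean normalized NLL over the two fixed labels.

    nll_right / nll_left: summed answer-token NLL per label for each view.
    num_answer_tokens: answer-token count per label.
    """
    scores = {}
    for label in LABELS:
        n = num_answer_tokens[label]
        scores[label] = 0.5 * nll_right[label] / n + 0.5 * nll_left[label] / n
    # Lower score means the model prefers that answer.
    prediction = min(scores, key=scores.get)
    return prediction, scores


# Illustrative numbers only (not measured values).
prediction, scores = ensemble_predict(
    nll_right={"not complete": 0.9, "complete": 0.3},
    nll_left={"not complete": 0.6, "complete": 0.5},
    num_answer_tokens={"not complete": 2, "complete": 1},
)
print(prediction, scores)
```

With these made-up inputs the normalized scores are about 0.375 for `not complete` and 0.40 for `complete`, so the prediction is `not complete`. Summing the raw NLLs instead (1.5 versus 0.8) would have flipped the decision, which is the length effect the normalization is meant to remove.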

In a 16-sample complete-heavy multi-view smoke test:

| Method | Accuracy |
| --- | --- |
| right view only | 87.50% |
| left view only | 68.75% |
| eye-in-hand only | 31.25% |
| three-view majority vote | 75.00% |
| three-view mean NLL | 75.00% |
| right + left mean normalized NLL | 93.75% |

The eye-in-hand view was not used in the recommended ensemble because it was strongly biased toward not complete without additional fine-tuning.
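To make the difference between the ensemble rules in the table concrete, here is a hedged sketch of majority voting and mean-NLL selection over per-view scores. The per-view numbers are invented, the key `eye_in_hand` is a placeholder rather than a confirmed camera name, and the functions are illustrative, not the project's implementation.

```python
from collections import Counter

LABELS = ("not complete", "complete")


def view_prediction(scores):
    """Pick the label with the lower normalized NLL for one view."""
    return min(LABELS, key=lambda label: scores[label])


def majority_vote(per_view_scores):
    """Each view votes for its own best label; the most common label wins.

    With an odd number of views and two labels there are no ties.
    """
    votes = Counter(view_prediction(s) for s in per_view_scores.values())
    return votes.most_common(1)[0][0]


def mean_nll(per_view_scores, views=None):
    """Average normalized NLL over the chosen views, then take the argmin."""
    views = list(views) if views is not None else list(per_view_scores)
    mean = {
        label: sum(per_view_scores[v][label] for v in views) / len(views)
        for label in LABELS
    }
    return min(LABELS, key=lambda label: mean[label])


# Invented scores: the third view is confidently biased toward "not complete".
per_view = {
    "robot0_agentview_right": {"not complete": 0.9, "complete": 0.2},
    "robot0_agentview_left": {"not complete": 0.5, "complete": 0.4},
    "eye_in_hand": {"not complete": 0.1, "complete": 1.5},
}
print(majority_vote(per_view))
print(mean_nll(per_view))
print(mean_nll(per_view, ["robot0_agentview_right", "robot0_agentview_left"]))
```

In this toy case a single strongly biased view is enough to flip the three-view mean to `not complete`, while the right + left mean stays at `complete`. This shows one way a biased view can hurt an averaged score; it is not a claim about what happened in the smoke test.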

Training Data Summary

This adapter was trained starting from a previous RoboCasa completion-judge adapter on a seen-navigation long-horizon subtask dataset. Positive windows use completion_frame + 30 when possible, and the dataset also includes the DeliverStraw task.

Training dataset root on the original 5090 machine:

```text
/home/zhengqingao/projects/robocasa_completion_judge/data/seen_navigation_p30_completion_full_with_deliverstraw_1to2
```

Dataset size:

| Split | Samples |
| --- | --- |
| train | 28,416 |
| val | 3,786 |
| test | 3,648 |
| total | 35,850 |

Label mix:

| Label | Samples |
| --- | --- |
| not complete | 23,900 |
| complete | 11,950 |

Phase mix:

| Phase | Samples |
| --- | --- |
| pre_completion_negative | 17,925 |
| stable_complete_30 | 8,754 |
| hard_negative | 5,975 |
| terminal_complete_partial | 3,196 |

The training target was a two-choice answer over not complete / complete. Oracle information was used only for offline labeling and quality control, not as model input.
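The three tables can be cross-checked with a short script. The mapping from phases to labels below is inferred from the phase names and from the sums matching; it is not stated explicitly in this card.

```python
splits = {"train": 28_416, "val": 3_786, "test": 3_648}
labels = {"not complete": 23_900, "complete": 11_950}
phases = {
    "pre_completion_negative": 17_925,
    "stable_complete_30": 8_754,
    "hard_negative": 5_975,
    "terminal_complete_partial": 3_196,
}
TOTAL = 35_850

# Assumed phase -> label grouping (inferred from the phase names).
negative_phases = ("pre_completion_negative", "hard_negative")
positive_phases = ("stable_complete_30", "terminal_complete_partial")

checks = {
    "splits sum to total": sum(splits.values()) == TOTAL,
    "labels sum to total": sum(labels.values()) == TOTAL,
    "phases sum to total": sum(phases.values()) == TOTAL,
    "negative phases match label": sum(phases[p] for p in negative_phases)
    == labels["not complete"],
    "positive phases match label": sum(phases[p] for p in positive_phases)
    == labels["complete"],
}
# Ratio of "not complete" to "complete" samples.
label_ratio = labels["not complete"] / labels["complete"]

for name, ok in checks.items():
    print(f"{name}: {ok}")
print(f"not complete : complete = {label_ratio:.1f} : 1")
```

All five checks pass with the numbers above, and the label mix comes out at exactly two `not complete` samples per `complete` sample.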

Evaluation Summary

Final training quick eval at optimizer step 3552:

| Metric | Value |
| --- | --- |
| eval accuracy | 87.50% |
| eval loss | 0.0812 |
| eval samples | 96 |

Small complete-heavy right-view test using this final adapter:

| Metric | Value |
| --- | --- |
| accuracy | 87.50% |
| complete accuracy | 83.33% |
| not complete accuracy | 100.00% |

Small complete-heavy right+left mean-NLL ensemble:

| Metric | Value |
| --- | --- |
| accuracy | 93.75% |
| complete accuracy | 91.67% |
| not complete accuracy | 100.00% |

These numbers are useful smoke-test diagnostics, not a final broad benchmark.
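For a sense of how coarse these percentages are, the sketch below reproduces them from raw counts. It assumes that both small complete-heavy tests are the 16-sample smoke test reported earlier, which this card does not state outright. Under that assumption the reported percentages are consistent with 12 `complete` and 4 `not complete` clips. The counts are an inference from the percentages, not logged values.

```python
def summarize(per_class):
    """per_class maps label -> (num_correct, num_total); returns percentages."""
    correct = sum(c for c, _ in per_class.values())
    total = sum(t for _, t in per_class.values())
    report = {"accuracy": round(100.0 * correct / total, 2)}
    for label, (c, t) in per_class.items():
        report[f"{label} accuracy"] = round(100.0 * c / t, 2)
    return report


# Inferred counts (assumption): 12 complete clips and 4 not complete clips.
right_only = summarize({"complete": (10, 12), "not complete": (4, 4)})
right_left = summarize({"complete": (11, 12), "not complete": (4, 4)})

print(right_only)
print(right_left)
```

If that reading is right, the gap between 87.50% and 93.75% is a single clip, which is why these figures are best treated as smoke-test diagnostics.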

Loading The Adapter

Install dependencies:

```bash
pip install "transformers>=4.57.0" peft accelerate bitsandbytes qwen-vl-utils decord safetensors
```

Minimal single-view usage:

```python
import torch
from peft import PeftModel
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen3VLForConditionalGeneration

base_model = "Qwen/Qwen3-VL-4B-Instruct"
adapter_id = "hhllzz/robocasa-qwen3vl4b-completion-judge-seen-nav-p30-1to2"

# The processor handles both tokenization and video preprocessing.
processor = AutoProcessor.from_pretrained(base_model, trust_remote_code=True)

# Load the base model in 4-bit NF4 with bfloat16 compute.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0},  # place the whole model on GPU 0
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Attach the adapter weights for inference only.
model = PeftModel.from_pretrained(model, adapter_id, is_trainable=False)
model.eval()
```

For practical inference, use the included inference_right_left_nll.py script rather than free generation.

Example Inference Commands

Single-view scoring:

```bash
python inference_right_left_nll.py \
  --base-model Qwen/Qwen3-VL-4B-Instruct \
  --adapter-id hhllzz/robocasa-qwen3vl4b-completion-judge-seen-nav-p30-1to2 \
  --instruction "Place the straw inside the glass cup." \
  --right-video /path/to/robot0_agentview_right_clip.mp4
```

Right + left mean normalized NLL:

```bash
python inference_right_left_nll.py \
  --base-model Qwen/Qwen3-VL-4B-Instruct \
  --adapter-id hhllzz/robocasa-qwen3vl4b-completion-judge-seen-nav-p30-1to2 \
  --instruction "Place the straw inside the glass cup." \
  --right-video /path/to/robot0_agentview_right_clip.mp4 \
  --left-video /path/to/robot0_agentview_left_clip.mp4
```

Example JSON output:

```json
{
  "prediction": "complete",
  "ensemble": "right_left_mean_normalized_nll",
  "scores": {
    "not complete": 0.2324,
    "complete": 0.1846
  }
}
```
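A downstream pipeline might parse this output and sanity-check it before acting on it. The sketch below assumes the JSON arrives as a string in exactly the shape shown above. `parse_judge_output` and the score margin are illustrative additions, not part of the script.

```python
import json

raw_output = """
{
  "prediction": "complete",
  "ensemble": "right_left_mean_normalized_nll",
  "scores": {"not complete": 0.2324, "complete": 0.1846}
}
"""


def parse_judge_output(raw: str):
    """Return (prediction, margin); margin is the gap between the two scores."""
    result = json.loads(raw)
    scores = result["scores"]
    # The prediction should be the label with the lower normalized NLL.
    best = min(scores, key=scores.get)
    if best != result["prediction"]:
        raise ValueError(f"prediction {result['prediction']!r} is not the argmin {best!r}")
    ordered = sorted(scores.values())
    return best, ordered[1] - ordered[0]


prediction, margin = parse_judge_output(raw_output)
print(prediction, round(margin, 4))
```

A small margin suggests the two labels were scored almost equally, so a caller could treat such cases as uncertain. Any threshold for that would have to be chosen empirically.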

Limitations

- The adapter is specialized for RoboCasa/RoboCasa365 robot observation videos and short subtask instructions.
- It is not a general-purpose video model.
- It was primarily trained on external agent views; eye-in-hand inference may be unreliable unless further tuned.
- The recommended inference path uses candidate scoring over the two fixed labels, not open-ended generation.
- Performance can drop when the relevant object or completion evidence is occluded in both external views.

Reproducibility Notes

- Video clips use 64 sampled frames.
- The training/eval implementation used 224x224 video pixels.
- Candidate answer scoring uses normalized negative log likelihood over answer tokens.
- Seed used in the project scripts: 5090.
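The arithmetic behind normalized negative log likelihood over answer tokens can be sketched without any model dependency. The function below assumes the logit vectors for the answer positions have already been extracted (in a real run they would come from a forward pass over video, prompt, and candidate answer, shifted by one position). The names and the toy vocabulary are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def normalized_answer_nll(answer_logits, answer_token_ids):
    """Mean NLL per answer token.

    answer_logits[i] is the logit vector that predicts answer token i;
    answer_token_ids[i] is the id of that token.
    """
    if not answer_token_ids or len(answer_logits) != len(answer_token_ids):
        raise ValueError("need one logit vector per answer token")
    total = 0.0
    for logits, token_id in zip(answer_logits, answer_token_ids):
        total -= log_softmax(logits)[token_id]
    # Dividing by the token count makes answers of different lengths comparable.
    return total / len(answer_token_ids)


# Toy example with a 4-token vocabulary and a 2-token answer.
toy_logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
print(normalized_answer_nll(toy_logits, [0, 2]))
```

Running this once per candidate label and per view gives the `NLL(label) / num_answer_tokens(label)` terms that the ensemble averages.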
