hhllzz

robocasa-qwen3vl4b-completion-judge-seen-nav-p30-1to2

README

License: other

Intended Use

Use this adapter for offline evaluation or as a lightweight VLM completion checker in RoboCasa long-horizon task pipelines.

Input:

  1. A 64-frame video clip from the robot observation stream.
  2. A subtask instruction, such as "Place the straw inside the glass cup."
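As a rough illustration of how a 64-frame clip could be drawn from a longer observation stream, the sketch below picks evenly spaced frame indices. Uniform spacing is an assumption; the source only states that clips use 64 sampled frames. The function name `sample_frame_indices` is hypothetical.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 64) -> list[int]:
    """Return num_samples frame indices spread evenly over [0, total_frames - 1].

    Assumption: uniform spacing. The sampling scheme actually used for
    training is not described in this README.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    last = total_frames - 1
    if num_samples == 1:
        # Keep the final frame, since the prompt asks about the final moment.
        return [last]
    # Clips shorter than num_samples repeat some indices instead of failing.
    return [round(i * last / (num_samples - 1)) for i in range(num_samples)]


indices = sample_frame_indices(total_frames=200)
```

The last index always points at the final frame of the window, which matters because the question is asked about the final moment shown in the video.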

Recommended prompt:

text

Instruction: <subtask instruction>
Question: At the final moment shown in the video, is the instruction already completed?
Answer exactly one label: not complete or complete.

Do not include simulator privileged state in the inference prompt. The model should not see success frames, object poses, rewards, oracle predicates, or environment metadata.
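A minimal sketch of filling the recommended prompt from a subtask instruction. The template text is copied from the prompt above; the helper name `build_prompt` and the whitespace cleanup are illustrative additions, not part of the released scripts.

```python
PROMPT_TEMPLATE = (
    "Instruction: {instruction}\n"
    "Question: At the final moment shown in the video, "
    "is the instruction already completed?\n"
    "Answer exactly one label: not complete or complete."
)


def build_prompt(instruction: str) -> str:
    """Fill the recommended prompt with a single subtask instruction."""
    # Collapse stray whitespace and newlines so the prompt stays on three lines.
    cleaned = " ".join(instruction.split())
    if not cleaned:
        raise ValueError("instruction must not be empty")
    return PROMPT_TEMPLATE.format(instruction=cleaned)


prompt = build_prompt("Place the straw inside the glass cup.")
```

Only the instruction text goes into the prompt, so no simulator state can leak in through this path.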

Best Inference Rule

The most reliable inference mode tested so far is a right + left view ensemble that averages length-normalized negative log-likelihood (NLL) scores:

  1. Run the instruction and sample window once with robot0_agentview_right.
  2. Run the same instruction and sample window again with robot0_agentview_left.
  3. For each view, score both candidate answers, "not complete" and "complete".
  4. Normalize each candidate score by its answer-token count.
  5. Average the normalized NLL scores from the right and left views.
  6. Choose the label with the lower averaged score.

Formula:

text

score_final(label)
= 0.5 * NLL_right(label) / num_answer_tokens(label)
+ 0.5 * NLL_left(label) / num_answer_tokens(label)
prediction = argmin_label score_final(label)

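A lower score means the model assigns higher probability to that answer. Normalizing by answer-token count matters because "not complete" is longer than "complete", so its unnormalized NLL sums over more tokens and would be penalized for length alone.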
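The formula above can be sketched in a few lines of Python. The per-view NLL values and the token counts below are made-up illustrative numbers, not measurements; real token counts depend on the tokenizer, and the function name `ensemble_predict` is hypothetical.

```python
LABELS = ("not complete", "complete")


def ensemble_predict(nll_right, nll_left, num_answer_tokens):
    """Average length-normalized NLL over two views and pick the lower score.

    Each argument maps a label to a number: the summed answer-token NLL for
    that view, or the number of answer tokens for that label.
    """
    scores = {}
    for label in LABELS:
        n = num_answer_tokens[label]
        scores[label] = 0.5 * nll_right[label] / n + 0.5 * nll_left[label] / n
    prediction = min(scores, key=scores.get)  # argmin over labels
    return prediction, scores


# Hypothetical inputs for illustration only.
example_tokens = {"not complete": 2, "complete": 1}
example_right = {"not complete": 0.50, "complete": 0.20}
example_left = {"not complete": 0.40, "complete": 0.15}

prediction, scores = ensemble_predict(example_right, example_left, example_tokens)
print(prediction, scores)
```

With these illustrative inputs the averaged scores are about 0.225 for "not complete" and 0.175 for "complete", so the sketch picks "complete".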

In a 16-sample complete-heavy multi-view smoke test:

| Method | Accuracy |
| --- | --- |
| right view only | 87.50% |
| left view only | 68.75% |
| eye-in-hand only | 31.25% |
| three-view majority vote | 75.00% |
| three-view mean NLL | 75.00% |
| right + left mean normalized NLL | 93.75% |

The eye-in-hand view was not used in the recommended ensemble because it was strongly biased toward not complete without additional fine-tuning.

Training Data Summary

This adapter was fine-tuned from a previous RoboCasa completion-judge adapter on a seen-navigation long-horizon subtask dataset. Positive windows use completion_frame + 30 when possible, and DeliverStraw data is included.

Training dataset root on the original 5090 machine:

text

/home/zhengqingao/projects/robocasa_completion_judge/data/seen_navigation_p30_completion_full_with_deliverstraw_1to2

Dataset size:

| Split | Samples |
| --- | --- |
| train | 28,416 |
| val | 3,786 |
| test | 3,648 |
| total | 35,850 |

Label mix:

| Label | Samples |
| --- | --- |
| not complete | 23,900 |
| complete | 11,950 |

Phase mix:

| Phase | Samples |
| --- | --- |
| pre_completion_negative | 17,925 |
| stable_complete_30 | 8,754 |
| hard_negative | 5,975 |
| terminal_complete_partial | 3,196 |

The training target was a two-choice answer over not complete / complete. Oracle information was used only for offline labeling and quality control, not as model input.
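A quick arithmetic check of the dataset tables, using only the counts reported above. The phase row is read as stable_complete_30 with 8,754 samples, which is the reading under which the phases sum to the stated total. Grouping phases into positive and negative by their names is an inference, not something the source states.

```python
TOTAL = 35_850

splits = {"train": 28_416, "val": 3_786, "test": 3_648}
labels = {"not complete": 23_900, "complete": 11_950}
phases = {
    "pre_completion_negative": 17_925,
    "stable_complete_30": 8_754,
    "hard_negative": 5_975,
    "terminal_complete_partial": 3_196,
}

# Each breakdown should account for every sample exactly once.
assert sum(splits.values()) == TOTAL
assert sum(labels.values()) == TOTAL
assert sum(phases.values()) == TOTAL

# Assumption: phases whose names contain "negative" carry the "not complete" label.
negatives = sum(n for name, n in phases.items() if "negative" in name)
positives = TOTAL - negatives
assert negatives == labels["not complete"]
assert positives == labels["complete"]

# Ratio of complete to not complete samples.
ratio = labels["not complete"] / labels["complete"]
print(f"complete : not complete = 1 : {ratio:g}")
```

The label counts work out to one "complete" sample for every two "not complete" samples, which may be what the 1to2 suffix in the adapter name refers to.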

Evaluation Summary

Final training quick eval at optimizer step 3552:

| Metric | Value |
| --- | --- |
| eval accuracy | 87.50% |
| eval loss | 0.0812 |
| eval samples | 96 |

Small complete-heavy right-view test using this final adapter:

| Metric | Value |
| --- | --- |
| accuracy | 87.50% |
| complete accuracy | 83.33% |
| not complete accuracy | 100.00% |

Small complete-heavy right+left mean-NLL ensemble:

| Metric | Value |
| --- | --- |
| accuracy | 93.75% |
| complete accuracy | 91.67% |
| not complete accuracy | 100.00% |

These numbers are useful smoke-test diagnostics, not a final broad benchmark.
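For readers reproducing these diagnostics, the sketch below computes overall and per-class accuracy from label lists. The example data is illustrative: it assumes the right-view test is the 16-sample test mentioned earlier, with 12 "complete" and 4 "not complete" clips, a split inferred from the reported percentages and not stated in the source.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return accuracy for each true label plus an 'overall' entry."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    report = {label: correct[label] / total[label] for label in total}
    report["overall"] = sum(correct.values()) / len(y_true)
    return report


# Inferred, illustrative split: 12 complete clips (10 judged correctly)
# and 4 not complete clips (all judged correctly).
y_true = ["complete"] * 12 + ["not complete"] * 4
y_pred = ["complete"] * 10 + ["not complete"] * 2 + ["not complete"] * 4

report = per_class_accuracy(y_true, y_pred)
print({k: f"{v:.2%}" for k, v in report.items()})
```

With so few "not complete" clips, a single error would move that class's accuracy by 25 points under this inferred split, which is one reason to treat the tables as smoke tests only.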

Loading The Adapter

Install dependencies:

bash

pip install "transformers>=4.57.0" peft accelerate bitsandbytes qwen-vl-utils decord safetensors

Minimal single-view usage:

python

import torch
from peft import PeftModel
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen3VLForConditionalGeneration

base_model = "Qwen/Qwen3-VL-4B-Instruct"
adapter_id = "hhllzz/robocasa-qwen3vl4b-completion-judge-seen-nav-p30-1to2"

processor = AutoProcessor.from_pretrained(base_model, trust_remote_code=True)

# Load the base model in 4-bit NF4 with bfloat16 compute.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0},  # place the whole model on GPU 0
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Attach the adapter weights for inference only.
model = PeftModel.from_pretrained(model, adapter_id, is_trainable=False)
model.eval()

For practical inference, use the included inference_right_left_nll.py script rather than free generation.

Example Inference Commands

Single-view scoring:

bash

python inference_right_left_nll.py \
    --base-model Qwen/Qwen3-VL-4B-Instruct \
    --adapter-id hhllzz/robocasa-qwen3vl4b-completion-judge-seen-nav-p30-1to2 \
    --instruction "Place the straw inside the glass cup." \
    --right-video /path/to/robot0_agentview_right_clip.mp4

Right + left mean normalized NLL:

bash

python inference_right_left_nll.py \
    --base-model Qwen/Qwen3-VL-4B-Instruct \
    --adapter-id hhllzz/robocasa-qwen3vl4b-completion-judge-seen-nav-p30-1to2 \
    --instruction "Place the straw inside the glass cup." \
    --right-video /path/to/robot0_agentview_right_clip.mp4 \
    --left-video /path/to/robot0_agentview_left_clip.mp4
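When judging many clips from a pipeline, the same command can be assembled in Python. This is a sketch that uses only the flags shown in the examples above; the helper name `build_judge_command` is hypothetical, and any other options the script may accept are not covered here.

```python
BASE_MODEL = "Qwen/Qwen3-VL-4B-Instruct"
ADAPTER_ID = "hhllzz/robocasa-qwen3vl4b-completion-judge-seen-nav-p30-1to2"


def build_judge_command(instruction, right_video, left_video=None,
                        script="inference_right_left_nll.py"):
    """Build the argv list for one call to the bundled inference script."""
    cmd = [
        "python", script,
        "--base-model", BASE_MODEL,
        "--adapter-id", ADAPTER_ID,
        "--instruction", instruction,
        "--right-video", right_video,
    ]
    if left_video is not None:
        # Adding the left view selects the right + left ensemble shown above.
        cmd += ["--left-video", left_video]
    return cmd


cmd = build_judge_command(
    "Place the straw inside the glass cup.",
    "/path/to/robot0_agentview_right_clip.mp4",
    "/path/to/robot0_agentview_left_clip.mp4",
)
# To execute: subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Passing the arguments as a list avoids shell quoting problems with instructions that contain spaces or punctuation.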

Example JSON output:

json

{
  "prediction": "complete",
  "ensemble": "right_left_mean_normalized_nll",
  "scores": {
    "not complete": 0.2324,
    "complete": 0.1846
  }
}
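A small sketch of consuming this output downstream. It assumes the script emits exactly the fields shown in the example; the score margin it returns is an illustrative addition and is not produced by the script itself.

```python
import json


def parse_judge_output(raw: str):
    """Return (prediction, margin) after checking prediction against scores."""
    result = json.loads(raw)
    scores = result["scores"]
    best = min(scores, key=scores.get)  # lower normalized NLL wins
    if result["prediction"] != best:
        raise ValueError("prediction does not match the lowest score")
    ordered = sorted(scores.values())
    margin = ordered[1] - ordered[0]  # gap between the two candidate scores
    return best, margin


raw = """
{
  "prediction": "complete",
  "ensemble": "right_left_mean_normalized_nll",
  "scores": {"not complete": 0.2324, "complete": 0.1846}
}
"""
label, margin = parse_judge_output(raw)
print(label, round(margin, 4))
```

A small margin means the two candidate answers scored almost equally, so a pipeline might treat such cases with more caution than a confident call.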

Limitations

  • The adapter is specialized for RoboCasa/RoboCasa365 robot observation videos and short subtask instructions.
  • It is not a general-purpose video model.
  • It was primarily trained on external agent views; eye-in-hand inference may be unreliable unless further tuned.
  • The recommended inference path uses candidate scoring over the two fixed labels, not open-ended generation.
  • Performance can drop when the relevant object or completion evidence is occluded in both external views.

Reproducibility Notes

  • Video clips use 64 sampled frames.
  • The training/eval implementation used 224x224 video pixels.
  • Candidate answer scoring uses normalized negative log likelihood over answer tokens.
  • Seed used in the project scripts: 5090.
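To make the candidate-scoring note concrete, here is a dependency-free sketch of normalized NLL over answer tokens. It assumes the per-step logits come from a forward pass with the candidate answer appended to the prompt; the model call itself is omitted, and the function names are hypothetical, not taken from the project scripts.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def normalized_answer_nll(step_logits, answer_token_ids):
    """Mean negative log-likelihood of the answer tokens.

    step_logits[i] holds the logits that predict answer_token_ids[i].
    """
    if not answer_token_ids or len(step_logits) != len(answer_token_ids):
        raise ValueError("need one logits row per answer token")
    nll = 0.0
    for logits, token_id in zip(step_logits, answer_token_ids):
        nll -= log_softmax(logits)[token_id]
    return nll / len(answer_token_ids)  # normalize by answer-token count


# Toy check: uniform logits over a 4-token vocabulary give log(4) per token.
toy = normalized_answer_nll([[0.0] * 4, [0.0] * 4], [1, 3])
```

Dividing by the number of answer tokens is what keeps a multi-token label comparable with a single-token one.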

Model provider

hhllzz

Model tree

Base

Qwen/Qwen3-VL-4B-Instruct

Adapter

this model

Modalities

Input

Text, Image

Output

Text
