hhllzz
robocasa-qwen3vl4b-completion-judge-seen-nav-p30-1to2
License: other
Intended Use
Use this adapter for offline evaluation or as a lightweight VLM completion checker in RoboCasa long-horizon task pipelines.
Input:
- A 64-frame video clip from the robot observation stream.
- A subtask instruction, such as `Place the straw inside the glass cup.`
Recommended prompt:
```text
Instruction: <subtask instruction>
Question: At the final moment shown in the video, is the instruction already completed?
Answer exactly one label: not complete or complete.
```
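The recommended prompt can be filled in programmatically. The following is a minimal sketch, not the project's own code: `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the line breaks between the three prompt parts are an assumption.

```python
# Hypothetical helper for filling the recommended prompt.
# Assumption: the three parts of the prompt are separated by newlines.
PROMPT_TEMPLATE = (
    "Instruction: {instruction}\n"
    "Question: At the final moment shown in the video, "
    "is the instruction already completed?\n"
    "Answer exactly one label: not complete or complete."
)


def build_prompt(instruction: str) -> str:
    """Insert a subtask instruction into the recommended prompt."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


prompt = build_prompt("Place the straw inside the glass cup.")
print(prompt)
```

Only the instruction text goes into the template, which keeps simulator state out of the prompt by construction.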
Do not include simulator privileged state in the inference prompt. The model should not see success frames, object poses, rewards, oracle predicates, or environment metadata.
Best Inference Rule
The most reliable inference mode tested so far is right + left view mean normalized NLL:
- Run the same instruction and sample window once with `robot0_agentview_right`.
- Run it again with `robot0_agentview_left`.
- For each view, score both candidate answers, `not complete` and `complete`.
- Normalize each candidate score by its answer-token count.
- Average the normalized NLL scores from right and left views.
- Choose the label with the lower averaged score.
Formula:
```text
score_final(label)
  = 0.5 * NLL_right(label) / num_answer_tokens(label)
  + 0.5 * NLL_left(label) / num_answer_tokens(label)

prediction = argmin_label score_final(label)
```
Lower score means the model prefers that answer. Normalizing by answer-token count matters because `not complete` is longer than `complete`.
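The rule can be sketched in a few lines of Python. This is an illustration only, not the contents of `inference_right_left_nll.py`: the function name `ensemble_predict`, the summed NLL values, and the token counts (2 and 1) are all hypothetical, and real token counts depend on the tokenizer.

```python
LABELS = ("not complete", "complete")


def ensemble_predict(nll_right, nll_left, num_answer_tokens):
    """Average per-token NLL over the right and left views, then pick the lowest."""
    scores = {}
    for label in LABELS:
        n = num_answer_tokens[label]
        scores[label] = 0.5 * nll_right[label] / n + 0.5 * nll_left[label] / n
    prediction = min(scores, key=scores.get)
    return prediction, scores


# Hypothetical summed NLLs per view and hypothetical answer-token counts.
nll_right = {"not complete": 0.5, "complete": 0.3}
nll_left = {"not complete": 0.5, "complete": 0.4}
num_answer_tokens = {"not complete": 2, "complete": 1}

prediction, scores = ensemble_predict(nll_right, nll_left, num_answer_tokens)
print(prediction, scores)
```

With these made-up numbers the unnormalized averages would be 0.5 for `not complete` and 0.35 for `complete`, so skipping the normalization would flip the prediction to `complete`.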
In a 16-sample complete-heavy multi-view smoke test:
| Method | Accuracy |
|---|---|
| right view only | 87.50% |
| left view only | 68.75% |
| eye-in-hand only | 31.25% |
| three-view majority vote | 75.00% |
| three-view mean NLL | 75.00% |
| right + left mean normalized NLL | 93.75% |
The eye-in-hand view was not used in the recommended ensemble because it was strongly biased toward `not complete` without additional fine-tuning.
Training Data Summary
This adapter was trained from a previous RoboCasa completion-judge adapter on a seen-navigation long-horizon subtask dataset, using `completion_frame + 30` positive windows when possible, plus DeliverStraw.
Training dataset root on the original 5090 machine:
```text
/home/zhengqingao/projects/robocasa_completion_judge/data/seen_navigation_p30_completion_full_with_deliverstraw_1to2
```
Dataset size:
| Split | Samples |
|---|---|
| train | 28,416 |
| val | 3,786 |
| test | 3,648 |
| total | 35,850 |
Label mix:
| Label | Samples |
|---|---|
| not complete | 23,900 |
| complete | 11,950 |
Phase mix:
| Phase | Samples |
|---|---|
| pre_completion_negative | 17,925 |
| stable_complete_30 | 8,754 |
| hard_negative | 5,975 |
| terminal_complete_partial | 3,196 |
The training target was a two-choice answer over `not complete` / `complete`. Oracle information was used only for offline labeling and quality control, not as model input.
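The three tables above are mutually consistent, which a quick check confirms. In this sketch the mapping of phases to labels (`pre_completion_negative` and `hard_negative` as `not complete`, the two `*_complete_*` phases as `complete`) is an assumption inferred from the phase names.

```python
splits = {"train": 28_416, "val": 3_786, "test": 3_648}
labels = {"not complete": 23_900, "complete": 11_950}
phases = {
    "pre_completion_negative": 17_925,
    "stable_complete_30": 8_754,
    "hard_negative": 5_975,
    "terminal_complete_partial": 3_196,
}
total = 35_850

# Each breakdown should account for every sample exactly once.
assert sum(splits.values()) == total
assert sum(labels.values()) == total
assert sum(phases.values()) == total

# Assumed phase-to-label mapping, inferred from the phase names.
negatives = phases["pre_completion_negative"] + phases["hard_negative"]
positives = phases["stable_complete_30"] + phases["terminal_complete_partial"]
assert negatives == labels["not complete"]
assert positives == labels["complete"]

# Negative-to-positive ratio of the label mix.
ratio = labels["not complete"] / labels["complete"]
print(ratio)
```

The resulting ratio of two `not complete` samples per `complete` sample matches the `1to2` suffix in the dataset and adapter names.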
Evaluation Summary
Final training quick eval at optimizer step 3552:
| Metric | Value |
|---|---|
| eval accuracy | 87.50% |
| eval loss | 0.0812 |
| eval samples | 96 |
Small complete-heavy right-view test using this final adapter:
| Metric | Value |
|---|---|
| accuracy | 87.50% |
| complete accuracy | 83.33% |
| not complete accuracy | 100.00% |
Small complete-heavy right+left mean-NLL ensemble:
| Metric | Value |
|---|---|
| accuracy | 93.75% |
| complete accuracy | 91.67% |
| not complete accuracy | 100.00% |
These numbers are useful smoke-test diagnostics, not a final broad benchmark.
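To make the small sample size concrete, the reported percentages can be turned back into counts. The sketch below assumes both small tests use the same 16 samples as the multi-view smoke test (the README does not state this explicitly) and searches for class splits that reproduce the reported figures; `consistent_splits` is a hypothetical helper.

```python
TOTAL = 16  # assumption: same sample count as the multi-view smoke test


def consistent_splits(overall_pct, complete_pct, not_complete_pct, total=TOTAL):
    """Return (n_complete, n_not_complete) splits that reproduce the percentages."""
    matches = []
    for n_complete in range(1, total):
        n_not = total - n_complete
        for hit_c in range(n_complete + 1):
            for hit_n in range(n_not + 1):
                if (
                    round(100 * hit_c / n_complete, 2) == complete_pct
                    and round(100 * hit_n / n_not, 2) == not_complete_pct
                    and round(100 * (hit_c + hit_n) / total, 2) == overall_pct
                ):
                    matches.append((n_complete, n_not))
    return matches


right_only = consistent_splits(87.50, 83.33, 100.00)
ensemble = consistent_splits(93.75, 91.67, 100.00)
print(right_only, ensemble)
```

Under that assumption, the only consistent split is 12 `complete` and 4 `not complete` samples, so a single prediction moves overall accuracy by 6.25 points.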
Loading The Adapter
Install dependencies:
```bash
pip install "transformers>=4.57.0" peft accelerate bitsandbytes qwen-vl-utils decord safetensors
```
Minimal single-view usage:
```python
from peft import PeftModel
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen3VLForConditionalGeneration
import torch

base_model = "Qwen/Qwen3-VL-4B-Instruct"
adapter_id = "hhllzz/robocasa-qwen3vl4b-completion-judge-seen-nav-p30-1to2"

processor = AutoProcessor.from_pretrained(base_model, trust_remote_code=True)

# 4-bit NF4 quantization with bfloat16 compute.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the quantized base model onto GPU 0.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0},
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Attach the adapter for inference only.
model = PeftModel.from_pretrained(model, adapter_id, is_trainable=False)
model.eval()
```
For practical inference, use the included `inference_right_left_nll.py` script rather than free generation.
Example Inference Commands
Single-view scoring:
```bash
python inference_right_left_nll.py \
  --base-model Qwen/Qwen3-VL-4B-Instruct \
  --adapter-id hhllzz/robocasa-qwen3vl4b-completion-judge-seen-nav-p30-1to2 \
  --instruction "Place the straw inside the glass cup." \
  --right-video /path/to/robot0_agentview_right_clip.mp4
```
Right + left mean normalized NLL:
```bash
python inference_right_left_nll.py \
  --base-model Qwen/Qwen3-VL-4B-Instruct \
  --adapter-id hhllzz/robocasa-qwen3vl4b-completion-judge-seen-nav-p30-1to2 \
  --instruction "Place the straw inside the glass cup." \
  --right-video /path/to/robot0_agentview_right_clip.mp4 \
  --left-video /path/to/robot0_agentview_left_clip.mp4
```
Example JSON output:
```json
{
  "prediction": "complete",
  "ensemble": "right_left_mean_normalized_nll",
  "scores": {
    "not complete": 0.2324,
    "complete": 0.1846
  }
}
```
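A downstream pipeline can parse this output and check that the prediction agrees with the scores. The sketch below uses the example values shown above; treating the score gap as a rough confidence signal (`margin`) is a suggestion of this sketch, not something the script is documented to report.

```python
import json

# Example output copied from the README.
raw = """
{"prediction": "complete",
 "ensemble": "right_left_mean_normalized_nll",
 "scores": {"not complete": 0.2324, "complete": 0.1846}}
"""

result = json.loads(raw)
scores = result["scores"]

# The predicted label should be the one with the lower normalized NLL.
best = min(scores, key=scores.get)

# Hypothetical confidence proxy: gap between the two candidate scores.
margin = abs(scores["not complete"] - scores["complete"])
print(best, round(margin, 4))
```

A small margin suggests the two labels were nearly tied, which may be worth flagging before acting on the verdict in a long-horizon pipeline.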
Limitations
- The adapter is specialized for RoboCasa/RoboCasa365 robot observation videos and short subtask instructions.
- It is not a general-purpose video model.
- It was primarily trained on external agent views; eye-in-hand inference may be unreliable unless further tuned.
- The recommended inference path uses candidate scoring over the two fixed labels, not open-ended generation.
- Performance can drop when the relevant object or completion evidence is occluded in both external views.
Reproducibility Notes
- Video clips use 64 sampled frames.
- The training/eval implementation used 224x224-pixel video frames.
- Candidate answer scoring uses normalized negative log likelihood over answer tokens.
- Seed used in the project scripts: `5090`.
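For readers unfamiliar with candidate scoring, normalized negative log likelihood over answer tokens can be sketched in pure Python as follows. This is a toy illustration under stated assumptions, not the project's implementation: `normalized_answer_nll` is a hypothetical helper, and `step_logits[i]` is assumed to be the vocabulary logit vector the model produces for predicting answer token `i`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary logit vector."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def normalized_answer_nll(step_logits, answer_token_ids):
    """Mean negative log likelihood of the answer tokens (lower is better)."""
    if len(step_logits) != len(answer_token_ids):
        raise ValueError("need one logit vector per answer token")
    total = 0.0
    for logits, token_id in zip(step_logits, answer_token_ids):
        total -= log_softmax(logits)[token_id]
    return total / len(answer_token_ids)


# Toy 3-token vocabulary with uniform logits: each token has probability 1/3.
uniform = normalized_answer_nll([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [1, 2])
print(uniform)
```

Dividing by the number of answer tokens is what makes a multi-token label and a single-token label comparable.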