AIcell

guava-05-22

README

License: apache-2.0

⚠ Loading: use the multimodal auto-class

`Qwen/Qwen3.5-4B` is a vision-language model. Load it with `AutoModelForImageTextToText` (or `Qwen3_5ForConditionalGeneration` directly), not `AutoModelForCausalLM`. The latter returns the text-only variant without `language_model` and will fail at generation.

Training hyperparameters

| Setting | Value |
|---|---|
| Base model | `Qwen/Qwen3.5-4B` |
| Dtype | bfloat16 |
| Tuner | Full fine-tune (LM trained, ViT + aligner frozen) |
| Epochs | 3.0 |
| LR / schedule | 1e-05 / cosine, 0.05 warmup |
| Per-device batch / grad accum | 2 / 2 |
| Max length | 10240 |
| Final loss @ step 228 | 0.2162 |
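
The batch settings and step count above imply a rough training-set size. The following is a back-of-the-envelope sketch, not a figure reported by the authors. It assumes that step 228 is the last optimizer step of the 3-epoch run and that `world_size` (the number of GPUs) is 1. The GPU count is not stated in this README, so treat it as a placeholder.

```python
def effective_batch_size(per_device_batch: int, grad_accum: int, world_size: int) -> int:
    """Samples consumed per optimizer step across all devices."""
    return per_device_batch * grad_accum * world_size


def steps_per_epoch(total_steps: int, epochs: float) -> float:
    """Average optimizer steps per epoch, assuming total_steps covers the whole run."""
    return total_steps / epochs


if __name__ == "__main__":
    world_size = 1  # ASSUMPTION: GPU count is not given in the README
    batch = effective_batch_size(per_device_batch=2, grad_accum=2, world_size=world_size)
    per_epoch = steps_per_epoch(total_steps=228, epochs=3.0)
    # Rough estimate only: ignores partial final batches and sequence packing.
    print(f"effective batch: {batch}, steps/epoch: {per_epoch}, "
          f"approx. samples/epoch: {per_epoch * batch}")
```

Scaling `world_size` scales the estimate linearly, so the result is only as good as that assumption.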

System prompt

This model was trained against a specific system prompt; see `system_prompt.txt`. Use that exact content as the system message. Any other prompt shifts inputs away from the training distribution and may degrade output.

Usage (transformers, no PEFT)

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load with the multimodal auto-class (see the loading note above).
model = AutoModelForImageTextToText.from_pretrained(
    "AIcell/guava-05-22", torch_dtype=torch.bfloat16, device_map="cuda",
)
proc = AutoProcessor.from_pretrained("AIcell/guava-05-22")

# Use the exact training-time system prompt.
with open("system_prompt.txt", encoding="utf-8") as f:
    system_prompt = f.read().strip()
scene_img = Image.open("scene.png").convert("RGB")

messages = [
    {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
    {"role": "user", "content": [
        {"type": "image", "image": scene_img},
        {"type": "text", "text":
            "Task: <your task description>.\n\n"
            "Gripper is at [...] rotation [...] width X%."},
    ]},
]
# Tokenize the chat (text + image) and move the tensors to the model's device.
inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to(model.device)
# Greedy decoding; no gradients are needed for inference.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(proc.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```

Per-turn assistant output: a `<think>…</think>` block followed by exactly one `<tool_call>{"name": "<tool>", "arguments": {…}}</tool_call>` (or `Task complete.` / `Task failed.` to terminate).
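
The per-turn format described above can be split into its parts with a small parser. The following is a minimal sketch, not the project's own parsing code. `Turn`, `parse_turn`, and the `move_gripper` tool name are hypothetical names introduced for illustration. It also assumes the opening `<think>` tag may be missing from the decoded text (some chat templates emit it as part of the prompt), so it splits on `</think>` only.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
TERMINALS = ("Task complete.", "Task failed.")


@dataclass
class Turn:
    thought: str                      # contents of the <think> block, if any
    tool_name: Optional[str] = None   # set when the turn is a tool call
    arguments: Optional[dict] = None  # tool-call arguments as a dict
    terminal: Optional[str] = None    # "Task complete." or "Task failed."


def parse_turn(text: str) -> Turn:
    """Parse one assistant turn into reasoning plus a tool call or a terminal marker."""
    head, sep, rest = text.partition("</think>")
    if sep:
        # Tolerate a missing opening tag; drop it when present.
        thought = head.replace("<think>", "", 1).strip()
    else:
        thought, rest = "", text

    calls = TOOL_CALL_RE.findall(rest)
    if len(calls) == 1:
        payload = json.loads(calls[0])
        return Turn(thought, payload["name"], payload.get("arguments", {}))
    if not calls and rest.strip() in TERMINALS:
        return Turn(thought, terminal=rest.strip())
    raise ValueError("expected exactly one <tool_call> or a terminal marker")


if __name__ == "__main__":
    # "move_gripper" is a made-up tool name used only for this example.
    sample = ('<think>Open the gripper first.</think>\n'
              '<tool_call>{"name": "move_gripper", "arguments": {"width": 50}}</tool_call>')
    print(parse_turn(sample))
```

Raising on anything other than exactly one tool call or a terminal marker makes malformed turns surface early instead of being silently skipped by a control loop.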

vLLM serving (no LoRA flags needed)

```bash
vllm serve AIcell/guava-05-22 \
    --port 8000 --max-model-len 24576 \
    --reasoning-parser qwen3 --tool-call-parser qwen3_coder \
    --enable-auto-tool-choice \
    --limit-mm-per-prompt '{"image": 20}'
```
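
Once the server is running, it can be queried through vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. The following is a minimal standard-library sketch, assuming the server listens on `localhost:8000` as in the command above and that the scene image is a PNG. `build_chat_payload` and `post_chat` are hypothetical helper names, not part of vLLM or this repository.

```python
import base64
import json
import urllib.request


def build_chat_payload(system_prompt: str, user_text: str, image_bytes: bytes,
                       model: str = "AIcell/guava-05-22", max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat request with one inline PNG image."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": user_text},
            ]},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding, mirroring do_sample=False
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the chat-completions endpoint and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires the server above to be running):
#   with open("system_prompt.txt", encoding="utf-8") as f:
#       system_prompt = f.read().strip()
#   with open("scene.png", "rb") as f:
#       payload = build_chat_payload(system_prompt, "Task: <your task description>.", f.read())
#   print(post_chat(payload)["choices"][0]["message"])
```

To my understanding, vLLM returns structured `tool_calls` only when the request also carries a `tools` list. The tool schema is not given in this README, so it is left out here and the call would come back as plain text in the message content.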

Source

Training script, eval harness, and upload tooling: https://github.com/hdacnw/guava

Model Details

Model Provider

AIcell

Model Tree

Base

Qwen/Qwen3.5-4B

Fine-tuned

this model

Input Modalities

Text, Image, Video

Output Modalities

Text

Supported Functionality