AIcell
guava-05-22
README
License: apache-2.0
⚠ Loading: use the multimodal auto-class
Qwen/Qwen3.5-4B is a vision-language model. Load with
AutoModelForImageTextToText (or Qwen3_5ForConditionalGeneration
directly), NOT AutoModelForCausalLM — the latter returns the
text-only variant without language_model and will fail at generation.
Training hyperparameters
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3.5-4B |
| Dtype | bfloat16 |
| Tuner | Full fine-tune (LM trained, ViT + aligner frozen) |
| Epochs | 3.0 |
| LR / schedule | 1e-05 / cosine, 0.05 warmup |
| Per-device batch / grad accum | 2 / 2 |
| Max length | 10240 |
| Final loss @ step 228 | 0.2162 |
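For readers who want to see what the "1e-05 / cosine, 0.05 warmup" row implies, here is a minimal sketch of such a schedule. It assumes a linear warmup, a decay to zero, and that step 228 is the last optimizer step; none of these is confirmed by this card, and the function name `lr_at_step` is hypothetical.

```python
import math


def lr_at_step(step, total_steps=228, peak_lr=1e-5, warmup_ratio=0.05):
    """Learning rate at an optimizer step under an assumed schedule.

    Assumptions (not confirmed by the model card): linear warmup over
    ceil(total_steps * warmup_ratio) steps, then cosine decay to zero.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 6, 12, 120, 228):
        print(s, lr_at_step(s))
```

The batch row works the same way: 2 sequences per device times 2 accumulation steps gives 4 sequences per device per optimizer step, and the global batch depends on the device count, which the card does not state.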
System prompt
This model was trained against a specific prompt — see
system_prompt.txt. Use that exact content as
the system message; any other prompt produces a distribution shift.
Usage (transformers, no PEFT)
```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load with the multimodal auto-class (not AutoModelForCausalLM).
model = AutoModelForImageTextToText.from_pretrained(
    "AIcell/guava-05-22",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
proc = AutoProcessor.from_pretrained("AIcell/guava-05-22")

# Use the exact training-time system prompt.
system_prompt = open("system_prompt.txt").read().strip()
scene_img = Image.open("scene.png").convert("RGB")

messages = [
    {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
    {"role": "user", "content": [
        {"type": "image", "image": scene_img},
        {"type": "text", "text":
            "Task: <your task description>.\n\n"
            "Gripper is at [...] rotation [...] width X%."},
    ]},
]

inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to("cuda")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(proc.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```
Per-turn assistant output: a <think>…</think> block followed by
exactly one <tool_call>{"name": "<tool>", "arguments": {…}}</tool_call>
(or Task complete. / Task failed. to terminate).
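The output format described above can be split into its parts with a small parser. This is a minimal sketch, not the project's own tooling: the `Turn` dataclass, the `parse_turn` helper, and the tool name `demo_tool` are hypothetical, and it assumes the tags appear literally in the decoded text.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
TERMINAL = {"Task complete.": "complete", "Task failed.": "failed"}


@dataclass
class Turn:
    thinking: str
    tool_name: Optional[str] = None
    arguments: Optional[dict] = None
    status: str = "tool_call"  # "tool_call", "complete", or "failed"


def parse_turn(text: str) -> Turn:
    """Split one assistant turn into its reasoning and its action."""
    head, sep, tail = text.partition("</think>")
    if sep:
        # Drop the opening tag if present; keep only the reasoning text.
        thinking = head.replace("<think>", "", 1).strip()
        rest = tail.strip()
    else:
        thinking, rest = "", text.strip()

    if rest in TERMINAL:
        return Turn(thinking, status=TERMINAL[rest])

    calls = TOOL_CALL_RE.findall(rest)
    if len(calls) != 1:
        raise ValueError(f"expected exactly one tool call, found {len(calls)}")
    payload = json.loads(calls[0].strip())
    return Turn(thinking, payload["name"], payload["arguments"])


if __name__ == "__main__":
    demo = (
        "<think>Move closer first.</think>\n"
        '<tool_call>{"name": "demo_tool", "arguments": {"x": 1}}</tool_call>'
    )
    print(parse_turn(demo))
```

Raising on zero or multiple tool calls keeps malformed turns visible instead of silently executing the first match.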
vLLM serving (no LoRA flags needed)
```bash
vllm serve AIcell/guava-05-22 \
  --port 8000 --max-model-len 24576 \
  --reasoning-parser qwen3 --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --limit-mm-per-prompt '{"image": 20}'
```
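Once the server is up, it can be queried through vLLM's OpenAI-compatible chat completions route. The sketch below uses only the standard library; the helper names `build_chat_request` and `post_chat`, the `localhost` base URL, and the PNG data-URL encoding are assumptions for illustration rather than part of this repository.

```python
import base64
import json
import urllib.request


def build_chat_request(system_prompt, image_bytes, user_text,
                       model="AIcell/guava-05-22", max_tokens=512):
    """Build a chat-completions payload with one inline PNG image."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": user_text},
            ]},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding, mirroring do_sample=False
    }


def post_chat(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to the server (assumed to listen on port 8000)."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    with open("system_prompt.txt") as f:
        prompt = f.read().strip()
    with open("scene.png", "rb") as f:
        payload = build_chat_request(prompt, f.read(), "Task: <your task description>.")
    print(json.dumps(post_chat(payload), indent=2))
```

Because the server is started with a reasoning parser and a tool-call parser, the response message may carry the reasoning and the tool call in separate structured fields rather than as raw tags; inspect the returned JSON before relying on a particular field.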
Source
Training script, eval harness, and upload tooling: https://github.com/hdacnw/guava
Model provider: AIcell
Model tree: base Qwen/Qwen3.5-4B; fine-tuned: this model
Modalities: input Video, Text, Image; output Text