junwatu
ono-gemma-4-12b-fable5-agent
License: other
Training
| Item | Value |
|---|---|
| Dataset | tool_use rows only (~3,600), CoT capped at 1,200 chars |
| Train / val split | 95% / 5% (seed=42) |
| Epochs | 3 |
| Learning rate | 1e-5 (cosine, 3% warmup) |
| Effective batch size | 16 (batch 1 × grad accum 16) |
| Max sequence length | 3,072 tokens |
| Loss masking | User + CoT masked → train only on call JSON |
| Optimizer | AdamW 8-bit |
| GPU | NVIDIA H200 on Modal |
| Train loss | 0.937 |
| Eval loss | 0.400 |
| Training time | ~3h 48m |
Vision and audio towers are present in the unified Gemma 4 checkpoint but were frozen during text-only training.
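As a rough sanity check on the training table, the step counts implied by those settings can be worked out as below. This is only a sketch: it assumes exactly 3,600 rows (the card says ~3,600), counts a final partial accumulation batch as a full optimizer step, and rounds warmup up. The actual trainer may round differently.

```python
import math

total_rows = 3_600        # assumption: the card gives "~3,600"
train_fraction = 0.95     # 95% / 5% train / val split
epochs = 3
effective_batch = 1 * 16  # batch 1 x gradient accumulation 16
warmup_ratio = 0.03       # 3% warmup

train_rows = round(total_rows * train_fraction)
val_rows = total_rows - train_rows
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(train_rows, val_rows, steps_per_epoch, total_steps, warmup_steps)
# 3420 180 214 642 20
```

Under these assumptions the run is on the order of 640 optimizer steps, of which about 20 are warmup.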
Evaluation
Batch evaluation on 50 held-out Fable-5 samples (seed=42, max_new_tokens=1024, temperature=0.2):
| Metric | Result |
|---|---|
| Tool name accuracy | 56% |
| call block emitted | 96% |
| Parseable tool JSON | 94% |
These numbers are indicative only and do not meet production reliability thresholds.
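On 50 samples, the reported percentages correspond to 28, 48, and 47 samples respectively. The card does not show its evaluation script, so the following is only a sketch of how such rates could be computed from per-sample records. The function name `summarize` and the record fields (`has_call`, `parsed_ok`, `predicted_tool`, `expected_tool`) are hypothetical. It also assumes that an unparseable call counts as a wrong tool name, which may not match the author's definition.

```python
def summarize(results):
    """Compute percentage metrics from per-sample evaluation records.

    Each record is a dict with hypothetical fields:
      has_call       (bool) - a call block appeared in the output
      parsed_ok      (bool) - the call payload could be parsed
      predicted_tool (str or None), expected_tool (str)
    """
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    emitted = sum(r["has_call"] for r in results)
    parseable = sum(r["parsed_ok"] for r in results)
    # Assumption: a sample only counts as correct if it parsed at all.
    correct = sum(
        r["parsed_ok"] and r["predicted_tool"] == r["expected_tool"]
        for r in results
    )
    return {
        "call_emitted_pct": 100 * emitted / n,
        "parseable_pct": 100 * parseable / n,
        "tool_name_accuracy_pct": 100 * correct / n,
    }

# Toy records, not real evaluation data.
toy = [
    {"has_call": True, "parsed_ok": True, "predicted_tool": "Bash", "expected_tool": "Bash"},
    {"has_call": True, "parsed_ok": True, "predicted_tool": "Read", "expected_tool": "Grep"},
    {"has_call": True, "parsed_ok": False, "predicted_tool": None, "expected_tool": "Edit"},
    {"has_call": False, "parsed_ok": False, "predicted_tool": None, "expected_tool": "Write"},
]
metrics = summarize(toy)
print(metrics)
# {'call_emitted_pct': 75.0, 'parseable_pct': 50.0, 'tool_name_accuracy_pct': 25.0}
```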
Recommended inference settings:
| Parameter | Value |
|---|---|
| max_new_tokens | 1024 |
| temperature | 0.2 |
| do_sample | true (or false, i.e. greedy decoding, for maximum consistency) |
Prompt format
Each turn uses the Gemma chat-template tokens with an explicit thought → call structure:
```markdown
<start_of_turn>user
{agent context: tool defs, history, task}<end_of_turn>
<start_of_turn>model
thought
{chain-of-thought reasoning}
call
{'tool': 'Edit', 'input': {'file_path': '...', 'old_string': '...', 'new_string': '...'}}<end_of_turn>
```
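The loss-masking row in the training table (user + CoT masked, loss only on the call JSON) can be pictured against this turn layout. The sketch below is not the author's training code. It uses toy integer IDs instead of real tokenizer output, the helper name `build_labels` is hypothetical, and it assumes the index where the call payload starts is already known. `-100` is the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(token_ids, call_start):
    """Return labels that mask every position before the call payload.

    token_ids  : list of ints for the full turn (user + thought + call)
    call_start : index of the first call-payload token (assumed known)
    """
    if not 0 <= call_start <= len(token_ids):
        raise ValueError("call_start out of range")
    # Masked positions contribute nothing to the loss; the rest are targets.
    return [IGNORE_INDEX] * call_start + list(token_ids[call_start:])

# Toy example: pretend ids 1-6 are the user turn and thought, 7-9 the call JSON.
token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = build_labels(token_ids, call_start=6)
print(labels)
# [-100, -100, -100, -100, -100, -100, 7, 8, 9]
```

Whether the call marker itself was included in the loss is not stated in the card. Here it is treated as part of the masked prefix.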
At inference, start the model turn and let it generate from thought:
```python
# `context` holds the agent context (tool definitions, history, task).
prompt = (
    f"<start_of_turn>user\n{context}<end_of_turn>\n"
    f"<start_of_turn>model\nthought\n"
)
```
Quick start
```python
import torch
from transformers import AutoModelForMultimodalLM, AutoTokenizer

model_id = "junwatu/ono-gemma-4-12b-fable5-agent"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

context = "You are a coding agent. List all Python files in the current directory."
prompt = (
    f"<start_of_turn>user\n{context}<end_of_turn>\n"
    f"<start_of_turn>model\nthought\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
# Text-only input: both type-id tensors are all zeros.
inputs["token_type_ids"] = torch.zeros_like(inputs["input_ids"])
inputs["mm_token_type_ids"] = torch.zeros_like(inputs["input_ids"])
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.2,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, keeping special tokens visible.
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=False,
)
print(response)
```
Important: Gemma 4 unified models require token_type_ids and mm_token_type_ids (all zeros for text-only input) even when vision and audio are not used.
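The decoded response still has to be split into its reasoning and its tool call. The card does not ship a parser, so the following is a sketch under stated assumptions: the function name `parse_response` is hypothetical, the `call` marker is assumed to be followed (after optional whitespace) by a single dict-like payload, and the payload may use Python-style single quotes as in the prompt-format example. For that reason it tries `json.loads` first and falls back to `ast.literal_eval`.

```python
import ast
import json
import re

# "call", optional whitespace, then a {...} payload running to the end of the text.
CALL_PATTERN = re.compile(r"call\s*(\{.*\})\s*\Z", re.DOTALL)

def parse_response(response):
    """Split generated text into (thought, call_dict).

    call_dict is None when no call block is found or it cannot be parsed.
    """
    # Drop the end-of-turn token and anything after it.
    text = response.split("<end_of_turn>", 1)[0]
    match = CALL_PATTERN.search(text)
    if match is None:
        return text.strip(), None
    thought = text[:match.start()].strip()
    payload = match.group(1)
    try:
        call = json.loads(payload)
    except ValueError:
        try:
            # Fallback for single-quoted, Python-literal style payloads.
            call = ast.literal_eval(payload)
        except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
            return thought, None
    if not isinstance(call, dict) or "tool" not in call:
        return thought, None
    return thought, call

# Hand-written sample output, not an actual model generation.
sample = (
    "The user wants Python files, so I will call Bash.\n"
    "call\n"
    "{'tool': 'Bash', 'input': {'command': 'ls *.py'}}<end_of_turn>"
)
thought, call = parse_response(sample)
print(call["tool"], call["input"])
# Bash {'command': 'ls *.py'}
```

`ast.literal_eval` only evaluates literals, so it is a safer fallback than `eval` for untrusted model output.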
Supported tools (from training data)
Common tool names seen in Fable-5 traces include Bash, Edit, Read, Write, Grep, WebSearch, TaskUpdate, PowerShell, and MCP-prefixed tools. Accuracy varies by tool type.
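Because accuracy varies by tool, a caller may want to check the emitted tool name before acting on it. The sketch below uses the tool names listed above. The `mcp__` prefix is an assumption (the card only says "MCP-prefixed"), and the helper name `is_known_tool` is hypothetical.

```python
KNOWN_TOOLS = {
    "Bash", "Edit", "Read", "Write", "Grep",
    "WebSearch", "TaskUpdate", "PowerShell",
}
MCP_PREFIX = "mcp__"  # assumption: the exact prefix is not given in the card

def is_known_tool(name, extra_tools=()):
    """Return True if name is a listed tool, an extra tool, or MCP-prefixed."""
    if not isinstance(name, str) or not name:
        return False
    return (
        name in KNOWN_TOOLS
        or name in set(extra_tools)
        or name.startswith(MCP_PREFIX)
    )

print(is_known_tool("Edit"))            # True
print(is_known_tool("edit"))            # False (names are case-sensitive here)
print(is_known_tool("mcp__files__ls"))  # True under the assumed prefix
```

In a real agent, the allowed set would come from the tool definitions passed in the prompt context, not from a fixed list.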
Limitations
- Not for production: experimental checkpoint with 56% tool name accuracy on a 50-sample eval set; unsuitable for live agent deployment without further work.
- Training sequences were truncated to 3,072 tokens, so the model was not trained on longer contexts.
- Sampling matters: low temperature (0.2) and a sufficient max_new_tokens (1024) are important for reliable call block generation.
- Multimodal weights are included but unused; only text LM weights were fine-tuned.
- Trained on a single agent trace style (Fable-5); may not generalize to other tool schemas without further fine-tuning.
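Given that a parseable call is not guaranteed (94% on the eval set), one way to experiment with this checkpoint is to wrap generation in a bounded retry loop. This is a sketch, not part of the model's tooling. `generate_with_retry`, `generate_fn`, and `parse_fn` are hypothetical names, and both callables are supplied by the caller.

```python
def generate_with_retry(generate_fn, parse_fn, prompt, max_attempts=3):
    """Call generate_fn(prompt) until parse_fn returns a non-None result.

    Returns (parsed, attempts_used); parsed is None if every attempt failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        parsed = parse_fn(generate_fn(prompt))
        if parsed is not None:
            return parsed, attempt
    return None, max_attempts

# Stubs standing in for the model and the parser: first output fails, second parses.
outputs = iter(["garbled text", "{'tool': 'Read'}"])

def fake_generate(prompt):
    return next(outputs)

def fake_parse(text):
    return {"tool": "Read"} if text.startswith("{") else None

result, attempts = generate_with_retry(fake_generate, fake_parse, "prompt")
print(result, attempts)
# {'tool': 'Read'} 2
```

Retrying only helps when sampling is enabled (do_sample=True); greedy decoding returns the same output for the same prompt.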
License
Built on google/gemma-4-12B-it. Use is subject to the Gemma license terms. Fable-5 dataset: Glint-Research/Fable-5-traces.