junwatu/ono-gemma-4-12b-fable5-agent API & Inference Endpoint

Training

Item	Value
Dataset	`tool_use` rows only (~3,600), CoT capped at 1,200 chars
Train / val split	95% / 5% (seed=42)
Epochs	3
Learning rate	1e-5 (cosine, 3% warmup)
Effective batch size	16 (batch 1 × grad accum 16)
Max sequence length	3,072 tokens
Loss masking	User + CoT masked → train only on `call` JSON
Optimizer	AdamW 8-bit
GPU	NVIDIA H200 on Modal
Train loss	0.937
Eval loss	0.400
Training time	~3h 48m
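
The schedule implied by the table can be worked out by hand. The sketch below assumes exactly 3,600 rows (the card only says "~3,600"), rounds step counts up, and decays the learning rate to zero; the training script may differ on all three points, so treat the numbers as approximate.

```python
import math

ROWS = 3600              # approximate dataset size from the table
TRAIN_FRACTION = 0.95    # 95% / 5% train / val split
EPOCHS = 3
EFFECTIVE_BATCH = 16     # batch 1 x grad accum 16
PEAK_LR = 1e-5
WARMUP_FRACTION = 0.03   # 3% warmup

train_rows = round(ROWS * TRAIN_FRACTION)                   # 3420
steps_per_epoch = math.ceil(train_rows / EFFECTIVE_BATCH)   # 214
total_steps = steps_per_epoch * EPOCHS                      # 642
warmup_steps = math.ceil(total_steps * WARMUP_FRACTION)     # 20


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed floor)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(train_rows, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the run has roughly 640 optimizer steps, of which about 20 are warmup.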

Vision and audio towers are present in the unified Gemma 4 checkpoint but were frozen during text-only training.
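
The "Loss masking" row in the table above can be illustrated with a minimal sketch. It is not the training code: it uses a toy character-level tokenizer so it runs without model files, it assumes masking extends through the `call` marker line, and the filled-in `Edit` arguments are illustrative values only.

```python
IGNORE_INDEX = -100       # default ignore_index of PyTorch's cross-entropy loss
CALL_MARKER = "\ncall\n"  # assumption: everything up to and including this line is masked


def build_labels(input_ids, call_start):
    """Return labels equal to input_ids from call_start on, IGNORE_INDEX before it."""
    if not 0 <= call_start <= len(input_ids):
        raise ValueError("call_start must lie inside the sequence")
    return [IGNORE_INDEX] * call_start + list(input_ids[call_start:])


# Toy example turn; one "token" per character stands in for a real tokenizer.
example = (
    "<start_of_turn>user\nRename x to y in a.py<end_of_turn>\n"
    "<start_of_turn>model\nthought\nA single Edit call is enough.\ncall\n"
    "{'tool': 'Edit', 'input': {'file_path': 'a.py', 'old_string': 'x', 'new_string': 'y'}}"
    "<end_of_turn>"
)
input_ids = [ord(ch) for ch in example]
call_start = example.rfind(CALL_MARKER) + len(CALL_MARKER)
labels = build_labels(input_ids, call_start)

# Only the positions that keep a real label contribute to the loss.
supervised = "".join(chr(t) for t in labels if t != IGNORE_INDEX)
print(supervised)
```

With a real tokenizer, the character offset would have to be mapped to a token index, for example through the tokenizer's offset mapping.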

Evaluation

Batch evaluation on 50 held-out Fable-5 samples (seed=42, max_new_tokens=1024, temperature=0.2):

Metric	Result
Tool name accuracy	56%
`call` block emitted	96%
Parseable tool JSON	94%

These numbers are indicative only and do not meet production reliability thresholds.
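
As a rough illustration of how the three metrics could be computed, the sketch below scores raw model responses against expected tool names. The function names, the choice of the full sample count as the denominator for every metric, and the fallback from `json.loads` to `ast.literal_eval` (to accept the single-quoted payloads shown under "Prompt format") are all assumptions, not the author's evaluation script. The `command` key in the sample data is hypothetical.

```python
import ast
import json


def extract_call(response):
    """Return (emitted, payload): whether a call block exists, and its parsed dict or None."""
    body = "\n" + response.split("<end_of_turn>", 1)[0]
    _, marker, tail = body.rpartition("\ncall\n")
    if not marker:
        return False, None
    raw = tail.strip()
    for parser in (json.loads, ast.literal_eval):
        try:
            payload = parser(raw)
        except (ValueError, SyntaxError, TypeError, RecursionError):
            continue
        if isinstance(payload, dict):
            return True, payload
    return True, None


def score(samples):
    """samples: list of (response_text, expected_tool_name) pairs."""
    if not samples:
        raise ValueError("need at least one sample")
    emitted = parseable = correct = 0
    for response, expected in samples:
        has_call, payload = extract_call(response)
        emitted += has_call
        if payload is not None:
            parseable += 1
            correct += payload.get("tool") == expected
    n = len(samples)
    return {
        "tool_name_accuracy": correct / n,
        "call_block_emitted": emitted / n,
        "parseable_tool_json": parseable / n,
    }


samples = [
    ("List files.\ncall\n{'tool': 'Bash', 'input': {'command': 'ls'}}<end_of_turn>", "Bash"),
    ('Read it.\ncall\n{"tool": "Read", "input": {}}<end_of_turn>', "Edit"),
    ("Hmm.\ncall\n{'tool': 'Edit', 'input': <end_of_turn>", "Edit"),
    ("I am not sure what to do.", "Grep"),
]
result = score(samples)
print(result)
```

With 50 samples, each sample moves a metric by 2 percentage points, which is one reason the reported figures are only indicative.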

Recommended inference settings:

Parameter	Value
`max_new_tokens`	1024
`temperature`	0.2
`do_sample`	true (or false for greedy decoding when maximum consistency matters)
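
One way to keep these settings in a single place is a small helper that returns keyword arguments for `model.generate`. The helper name is hypothetical. It leaves out `temperature` in greedy mode because Hugging Face Transformers only applies temperature when `do_sample=True`.

```python
def generation_kwargs(greedy: bool = False) -> dict:
    """Build generate() keyword arguments from the recommended settings."""
    kwargs = {"max_new_tokens": 1024}
    if greedy:
        # Greedy decoding: deterministic, and temperature does not apply.
        kwargs["do_sample"] = False
    else:
        kwargs["do_sample"] = True
        kwargs["temperature"] = 0.2
    return kwargs


print(generation_kwargs())
print(generation_kwargs(greedy=True))
```

The result can be unpacked into a call such as `model.generate(**inputs, **generation_kwargs())`.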

Prompt format

Each turn is wrapped in Gemma chat tokens, and the model turn follows an explicit thought → call structure:

markdown
<start_of_turn>user
{agent context: tool defs, history, task}<end_of_turn>
<start_of_turn>model
thought
{chain-of-thought reasoning}
call
{'tool': 'Edit', 'input': {'file_path': '...', 'old_string': '...', 'new_string': '...'}}<end_of_turn>

At inference, open the model turn with the `thought` marker and let the model generate from there:

python
prompt = (
    f"<start_of_turn>user\n{context}<end_of_turn>\n"
    f"<start_of_turn>model\nthought\n"
)

Quick start

python
import torch
from transformers import AutoModelForMultimodalLM, AutoTokenizer

model_id = "junwatu/ono-gemma-4-12b-fable5-agent"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

context = "You are a coding agent. List all Python files in the current directory."
prompt = (
    f"<start_of_turn>user\n{context}<end_of_turn>\n"
    f"<start_of_turn>model\nthought\n"
)

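inputs = tokenizer(prompt, return_tensors="pt")
# Gemma 4 unified checkpoints expect both type-id tensors; all zeros marks text-only input.
inputs["token_type_ids"] = torch.zeros_like(inputs["input_ids"])
inputs["mm_token_type_ids"] = torch.zeros_like(inputs["input_ids"])
# Move every tensor to the device the model was placed on.
inputs = {k: v.to(model.device) for k, v in inputs.items()}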

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.2,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens; special tokens such as <end_of_turn> stay visible.
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=False,
)
print(response)

Important: Gemma 4 unified models require `token_type_ids` and `mm_token_type_ids` (all zeros for text-only input) even when vision and audio are not used.

Supported tools (from training data)

Common tool names seen in Fable-5 traces include Bash, Edit, Read, Write, Grep, WebSearch, TaskUpdate, PowerShell, and MCP-prefixed tools. Accuracy varies by tool type.
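
Because tool name accuracy is limited, it may help to check a parsed call against the names listed above before acting on it. The sketch below is one possible guard. The `mcp__` prefix is an assumption (the card does not state the exact prefix used in the traces), and the function name is hypothetical.

```python
KNOWN_TOOLS = {
    "Bash", "Edit", "Read", "Write", "Grep",
    "WebSearch", "TaskUpdate", "PowerShell",
}
MCP_PREFIX = "mcp__"  # assumption: adjust to the prefix your tool definitions use


def is_known_tool(call) -> bool:
    """True if `call` is a dict with a recognised tool name and a dict-valued input."""
    if not isinstance(call, dict):
        return False
    name = call.get("tool")
    if not isinstance(name, str) or not isinstance(call.get("input"), dict):
        return False
    return name in KNOWN_TOOLS or name.startswith(MCP_PREFIX)


print(is_known_tool({"tool": "Edit", "input": {"file_path": "a.py"}}))
print(is_known_tool({"tool": "edit", "input": {}}))
```

The check is case-sensitive on purpose, so a near-miss such as `edit` is rejected rather than silently dispatched.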

Limitations

Not for production: this is an experimental checkpoint with 56% tool name accuracy on a 50-sample eval set, and it is unsuitable for live agent deployment without further work.
Training sequences were truncated to 3,072 tokens, so longer agent contexts were never seen in full during training.
Sampling matters: a low temperature (0.2) and a sufficient `max_new_tokens` (1024) are important for reliable `call` block generation.
Multimodal weights are included but unused; only text LM weights were fine-tuned.
Trained on a single agent trace style (Fable-5); may not generalize to other tool schemas without further fine-tuning.
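
Given the 3,072-token training limit, a simple pre-flight check on prompt length may be useful. The sketch below is a heuristic and not a documented requirement of the model: it assumes prompt and generated tokens should together stay within the training window, and both function names are hypothetical.

```python
TRAIN_MAX_TOKENS = 3072  # max sequence length used during training


def prompt_budget(max_new_tokens: int = 1024) -> int:
    """Prompt tokens left once room for generation is reserved."""
    return max(0, TRAIN_MAX_TOKENS - max_new_tokens)


def fits_training_window(prompt_token_count: int, max_new_tokens: int = 1024) -> bool:
    """True if prompt plus generation stays within the training sequence length."""
    return prompt_token_count + max_new_tokens <= TRAIN_MAX_TOKENS


print(prompt_budget())
print(fits_training_window(2500))
```

With the quick-start code, `prompt_token_count` would be `inputs["input_ids"].shape[1]`.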

License

Built on google/gemma-4-12B-it. Use is subject to the Gemma license terms. Fable-5 dataset: Glint-Research/Fable-5-traces.
