License: apache-2.0

🎯 What This Model Does
This model generates structured tool calls in a compact format when given a user query and available tool definitions.
Output format:
```
call:function_name{param1:value1,param2:value2}
```
Example:
```
Input: "What's the weather in Tokyo?"
Output: call:get_weather{city:Tokyo}
```
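For building test fixtures or reference targets, the compact format can be produced from a name and an argument dict. This is a minimal sketch: the helper name `format_tool_call` is hypothetical, and the serialization rules (no quoting, no spaces, comma-separated `key:value` pairs) are inferred from the examples in this card rather than from a published spec.

```python
def format_tool_call(name, arguments):
    """Serialize a call as call:name{key:value,...}.

    Assumption: values are written verbatim with no quoting or escaping,
    as in the examples shown in this card.
    """
    body = ",".join(f"{key}:{value}" for key, value in arguments.items())
    return f"call:{name}{{{body}}}"


print(format_tool_call("get_weather", {"city": "Tokyo"}))
```

Because values are not escaped, a value that itself contains `,` or `}` would not round-trip through a simple parser.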
📊 Evaluation Results (v0.1)
Evaluated on a held-out validation set (200 samples):
| Metric | Score |
|---|---|
| Tool Selection Accuracy | 64.2% |
| Full Match (name + args) | 28.4% |
| No-Call Accuracy (avoids hallucination) | 69.9% |
| Missed Tool Call Rate | 35.8% |
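The card does not spell out how each metric is computed. The sketch below shows one plausible reading, in which tool selection, full match, and missed-call rate are measured over samples that expect a tool call, and no-call accuracy over samples that expect none. The function name `score` and the sample layout are assumptions made for illustration, not the author's evaluation code.

```python
def score(samples):
    """Compute the four metrics from (expected, predicted) pairs.

    Each side of a pair is either a {"name": ..., "arguments": {...}} dict
    or None, meaning "no tool call". Metric definitions are assumptions.
    """
    call_cases = [(e, p) for e, p in samples if e is not None]
    no_call_cases = [(e, p) for e, p in samples if e is None]

    def rate(hits, total):
        return hits / total if total else 0.0

    return {
        # Right tool name, regardless of arguments.
        "tool_selection": rate(
            sum(p is not None and p["name"] == e["name"] for e, p in call_cases),
            len(call_cases),
        ),
        # Right tool name and identical arguments.
        "full_match": rate(sum(p == e for e, p in call_cases), len(call_cases)),
        # No call emitted when none was expected.
        "no_call": rate(sum(p is None for _, p in no_call_cases), len(no_call_cases)),
        # Plain text returned when a call was expected.
        "missed_call": rate(sum(p is None for _, p in call_cases), len(call_cases)),
    }


weather = {"name": "get_weather", "arguments": {"city": "Tokyo"}}
wrong_city = {"name": "get_weather", "arguments": {"city": "Delhi"}}
search = {"name": "search_web", "arguments": {"query": "AI"}}

samples = [
    (weather, weather),     # full match
    (weather, wrong_city),  # right tool, wrong argument
    (search, None),         # missed call
    (search, weather),      # wrong tool
    (None, None),           # correctly stayed silent
    (None, search),         # spurious call
]
metrics = score(samples)
```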
Strengths:
- ✅ Learned when NOT to call tools (~70% no-call accuracy)
- ✅ Generates structured tool calls (not free-form text)
- ✅ Selects correct tool ~64% of the time from multiple options
Known Limitations:
- ⚠️ Uses compact format (`call:name{args}`) rather than standard JSON
- ⚠️ Misses tool calls ~36% of the time (responds with text instead)
- ⚠️ Argument extraction needs improvement (28% full match)
- ⚠️ v0.1 — not production-ready, experimental release
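Since the compact format is not standard JSON, downstream code that expects JSON tool calls needs a small adapter. The sketch below converts the compact string into a dict loosely modeled on the OpenAI-style function-call shape, with `arguments` as a JSON string. The function name `compact_to_json` and the target shape are assumptions; adjust them to whatever your client library expects.

```python
import json
import re


def compact_to_json(text):
    """Convert 'call:name{k:v,...}' into a JSON-friendly dict, or None if no call is found."""
    match = re.search(r"call:(\w+)\{(.*?)\}", text)
    if match is None:
        return None
    name, body = match.groups()
    # All values are kept as strings; the compact format carries no type information.
    arguments = dict(re.findall(r"(\w+):([^,}]+)", body))
    return {
        "type": "function",
        "function": {"name": name, "arguments": json.dumps(arguments)},
    }


converted = compact_to_json("call:get_weather{city:Tokyo}")
print(json.dumps(converted, indent=2))
```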
🚀 Quick Start
Option 1: Merged Model (recommended)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch, json

# Load merged model (no adapter needed)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "roshangrewal/gemma4-e4b-toolcall-v01",
    quantization_config=bnb,
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("roshangrewal/gemma4-e4b-toolcall-v01")

# Define tools
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [
    {"role": "system", "content": f"You have access to these tools:\n{json.dumps(tools)}\nCall the appropriate function when needed."},
    {"role": "user", "content": "What's the weather in Mumbai?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False, pad_token_id=tokenizer.pad_token_id)

print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# Output: call:get_weather{city:Mumbai}
```
Option 2: Adapter (lightweight, 51MB)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch, json

# Load base + adapter
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E4B-it",
    quantization_config=bnb,
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-E4B-it")
model = PeftModel.from_pretrained(base, "roshangrewal/gemma4-e4b-toolcall-v01-lora")
model.eval()
```
Deployment Options
| Method | VRAM Required | Notes |
|---|---|---|
| 4-bit quantized (above) | ~10 GB | Good for T4/4090 |
| fp16 (unquantized) | ~16 GB | Best quality, needs A10+ |
| GGUF via llama.cpp/Ollama | ~6 GB | CPU + GPU hybrid |
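As a rough sanity check on the table, the memory needed for the weights alone can be estimated from the parameter count and the bits per parameter. This is a back-of-the-envelope sketch, not a measurement. It uses the 8B total parameter count quoted for the base model in this card, and the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 8e9  # total parameter count quoted for the base model in this card

fp16_gb = weight_memory_gb(PARAMS, 16)  # 2 bytes per parameter
int4_gb = weight_memory_gb(PARAMS, 4)   # half a byte per parameter

print(f"fp16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
```

The fp16 estimate lines up with the ~16 GB row. The weights-only 4-bit estimate is a lower bound; the table's larger figure presumably also covers layers left unquantized, activations, and the KV cache.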
💬 Prompt Examples
Single tool, simple query
```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

messages = [
    {"role": "system", "content": f"You have access to these tools:\n{json.dumps(tools)}\nCall the appropriate function when needed. When no tool is needed, respond directly."},
    {"role": "user", "content": "What's the weather in Tokyo?"},
]
# Output: call:get_weather{city:Tokyo}
```
Multiple tools — model selects the right one
```python
tools = [
    {"type": "function", "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
    }},
    {"type": "function", "function": {
        "name": "search_web",
        "description": "Search the web",
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
    }},
    {"type": "function", "function": {
        "name": "send_email",
        "description": "Send an email",
        "parameters": {
            "type": "object",
            "properties": {"to": {"type": "string"}, "subject": {"type": "string"}, "body": {"type": "string"}},
            "required": ["to", "subject", "body"],
        },
    }},
]

messages = [
    {"role": "system", "content": f"You have access to these tools:\n{json.dumps(tools)}\nCall the appropriate function when needed. When no tool is needed, respond directly."},
    {"role": "user", "content": "Search for latest news about AI startups in India"},
]
# Output: call:search_web{query:latest news AI startups India}
```
No tool needed — model responds directly
```python
messages = [
    {"role": "system", "content": f"You have access to these tools:\n{json.dumps(tools)}\nCall the appropriate function when needed. When no tool is needed, respond directly."},
    {"role": "user", "content": "What is 2 + 2?"},
]
# Output: 4 (no tool call generated)
```
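Because the model may answer with either a tool call or plain text, calling code has to branch on the output. The sketch below shows one way to do that. The registry `TOOL_REGISTRY`, the stub `get_weather`, and the function `handle_output` are all hypothetical names introduced for illustration.

```python
import re


def get_weather(city):
    # Hypothetical stand-in for a real weather lookup.
    return f"Sunny in {city}"


# Hypothetical map from tool name to the Python callable that implements it.
TOOL_REGISTRY = {"get_weather": get_weather}


def handle_output(text):
    """Run the named tool if the output is a call; otherwise pass the text through."""
    match = re.search(r"call:(\w+)\{(.*?)\}", text)
    if match is None:
        return {"type": "text", "content": text.strip()}
    name, body = match.groups()
    if name not in TOOL_REGISTRY:
        return {"type": "error", "content": f"unknown tool: {name}"}
    kwargs = {key: value.strip() for key, value in re.findall(r"(\w+):([^,}]+)", body)}
    return {"type": "tool_result", "content": TOOL_REGISTRY[name](**kwargs)}


print(handle_output("call:get_weather{city:Tokyo}"))
print(handle_output("4"))
```

Arguments that do not match the callable's signature raise a `TypeError` here; given the reported full-match rate, production code would want to catch that and fall back gracefully.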
Prompt structure (for custom integrations)
```
System: You have access to these tools:
[tool definitions as JSON array]
Call the appropriate function when needed. When no tool is needed, respond directly.

User: <query>
```

## 🏗️ Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | google/gemma-4-E4B-it (8B params, 4.5B effective) |
| Method | QLoRA (4-bit NF4, double quantization) |
| LoRA Rank | 16 |
| LoRA Alpha | 16 |
| Target Modules | q_proj.linear, k_proj.linear, v_proj.linear, o_proj.linear, up_proj.linear, down_proj.linear |
| Learning Rate | 1e-4 (linear decay, 10% warmup) |
| Effective Batch Size | 16 |
| Max Length | 1024 |
| Steps | 10,000 (~84% of 1 epoch) |
| Training Time | 56 hours |
| GPU | NVIDIA Tesla T4 (16GB) |
| Cost | ~$0 (own hardware) |

## 📚 Training Data

174,853 function-calling examples from:

| Dataset | Examples |
|---------|----------|
| [Glaive Function Calling v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2) | 112,960 |
| [Salesforce xLAM 60K](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | 60,000 |
| [Hermes Function Calling v1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1) | 1,893 |

## 🗺️ Roadmap

- **v0.1** (current): Initial fine-tune, compact call format
- **v0.2** (planned): Align with Gemma 4's native tool-calling template, target 80%+ accuracy
- **v1.0** (planned): Production-ready with BFCL leaderboard submission

## 💡 Parsing the Output

```python
import re

def parse_tool_call(text):
    m = re.findall(r'call:(\w+)\{(.+?)\}', text)
    if m:
        name = m[0][0]
        args = dict(re.findall(r'(\w+):([^,}]+)', m[0][1]))
        return {"name": name, "arguments": args}
    return None

result = parse_tool_call("call:get_weather{city:Tokyo}")
# {'name': 'get_weather', 'arguments': {'city': 'Tokyo'}}
```
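`parse_tool_call` returns every argument value as a string, and it does not check the result against the tool definition. A possible follow-up step is to cast values to the types declared in the tool's JSON Schema and report missing required parameters. This is a sketch under stated assumptions: the helper `coerce_arguments` and the `days` parameter are hypothetical, and only flat schemas with primitive types are handled.

```python
def coerce_arguments(arguments, schema):
    """Cast string values to the declared JSON Schema types.

    Returns (typed_arguments, missing_required_names). Only primitive,
    non-nested types are handled; unknown types are left as strings.
    """
    casters = {
        "integer": int,
        "number": float,
        "string": str,
        "boolean": lambda value: value.strip().lower() == "true",
    }
    properties = schema.get("properties", {})
    typed = {}
    for key, raw in arguments.items():
        declared = properties.get(key, {}).get("type", "string")
        typed[key] = casters.get(declared, str)(raw)
    missing = [key for key in schema.get("required", []) if key not in typed]
    return typed, missing


# Hypothetical schema with an integer parameter, for illustration only.
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
    "required": ["city", "days"],
}

typed, missing = coerce_arguments({"city": "Tokyo", "days": "3"}, schema)
partial, partial_missing = coerce_arguments({"city": "Tokyo"}, schema)
```

A value that cannot be cast (for example `days:soon`) raises `ValueError`, which is a reasonable signal to reject the call rather than execute it.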
🔗 Links
- Training Code: github.com/roshangrewal/f-for-finetuning
- Base Model: google/gemma-4-E4B-it
📝 Citation
```bibtex
@misc{grewal2026gemma4toolcall,
  title={Gemma 4 E4B Tool-Calling Fine-Tune v0.1},
  author={Roshan Grewal},
  year={2026},
  url={https://huggingface.co/roshangrewal/gemma4-e4b-toolcall-v01}
}
```
**Tips:**

- Always include tool definitions in the system message as a JSON array
- The system message must contain the instruction "Call the appropriate function when needed"
- Model outputs `call:function_name{param:value}` format when it decides to use a tool
- Model responds with plain text when no tool is appropriate
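To keep the system message consistent across calls, the prompt layout used in this card can be wrapped in a small helper. This is a minimal sketch: `SYSTEM_TEMPLATE` and `build_messages` are hypothetical names, and the wording of the template is copied from the prompt examples in this card.

```python
import json

# Wording taken from the prompt examples in this card.
SYSTEM_TEMPLATE = (
    "You have access to these tools:\n{tools}\n"
    "Call the appropriate function when needed. "
    "When no tool is needed, respond directly."
)


def build_messages(tools, query):
    """Return chat messages with the tool list embedded as a JSON array."""
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE.format(tools=json.dumps(tools))},
        {"role": "user", "content": query},
    ]


example_tools = [{"type": "function", "function": {"name": "get_weather"}}]
messages = build_messages(example_tools, "What's the weather in Tokyo?")
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` as in the Quick Start.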
Model provider: roshangrewal
Model tree: base google/gemma-4-E4B-it, fine-tuned: this model
Modalities: input Text, Image; output Text