Ayush0110
toolforge-qwen7b-r64
License: apache-2.0

What it does
Routes a query to one (or several) of these tools, or to a direct answer:
web_search, calculator, weather, wikipedia, datetime,
dictionary, translate, unit_converter, web_reader
plus the no_tool (answer directly) and multi_tool (chained calls) cases.
Output format:
```markdown
<tool_calls>[{"name": "weather", "arguments": {"location": "Tokyo"}}]</tool_calls>
```
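A caller has to pull the JSON list out of the `<tool_calls>` wrapper before dispatching anything. The sketch below is a minimal parser under two assumptions this card does not spell out: a reply with no `<tool_calls>` block is a direct (no_tool) answer, and a multi_tool reply is a list with more than one entry. The helper name `parse_tool_calls` is hypothetical.

```python
import json
import re

# The nine tools this adapter was trained to route to.
KNOWN_TOOLS = {
    "web_search", "calculator", "weather", "wikipedia", "datetime",
    "dictionary", "translate", "unit_converter", "web_reader",
}

# Non-greedy match of the first <tool_calls>...</tool_calls> block; DOTALL lets it span newlines.
TOOL_CALLS_RE = re.compile(r"<tool_calls>(.*?)</tool_calls>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Return the tool calls in a model reply (hypothetical helper).

    An empty list means no <tool_calls> block was found, which this sketch
    treats as a direct (no_tool) answer. Raises ValueError on malformed JSON,
    a non-list payload, or an unknown tool name.
    """
    match = TOOL_CALLS_RE.search(text)
    if match is None:
        return []
    calls = json.loads(match.group(1))  # json.JSONDecodeError subclasses ValueError
    if not isinstance(calls, list):
        raise ValueError("expected a JSON list of tool calls")
    for call in calls:
        if not isinstance(call, dict):
            raise ValueError(f"expected an object, got {call!r}")
        name = call.get("name")
        if name not in KNOWN_TOOLS:
            raise ValueError(f"unknown tool: {name!r}")
        if not isinstance(call.get("arguments", {}), dict):
            raise ValueError(f"arguments for {name!r} must be an object")
    return calls


reply = '<tool_calls>[{"name": "weather", "arguments": {"location": "Tokyo"}}]</tool_calls>'
print(parse_tool_calls(reply))
# [{'name': 'weather', 'arguments': {'location': 'Tokyo'}}]
```

Rejecting unknown names up front matches the fixed-tool-set design: anything outside the nine trained tools is treated as a model error rather than passed through.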
Evaluation (honest, non-circular)
Measured on a hand-written, non-circular test set: 36 realistic, indirectly phrased queries, hand-labeled with no teacher model involved. The base model and this adapter were compared on identical inputs. Grading is format-agnostic: a prediction counts if the correct tool is identified in any recognizable format, so the base model isn't penalized for not using the trained format.
| Model | Routing accuracy (format-agnostic) | Strict-format accuracy |
|---|---|---|
| Base Qwen2.5-7B-Instruct | 75.0% | 75.0% |
| ToolForge (this adapter) | 83.3% | 83.3% |
| Gain from fine-tuning | +8.3 pp | +8.3 pp |
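Because the test set has 36 queries, each query is worth 1/36 ≈ 2.8 pp, and the reported percentages map to whole counts. The counts below are implied by the table rather than reported separately:

$$
\frac{27}{36} = 75.0\%, \qquad \frac{30}{36} \approx 83.3\%, \qquad \frac{30 - 27}{36} = \frac{3}{36} \approx 8.3\ \text{pp}
$$

So the +8.3 pp gain corresponds to three more correctly routed queries out of 36.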
Key point: strict and lenient scores are identical for both models. Base Qwen already emits parseable tool-call formats, so the improvement comes from better routing decisions, not from output formatting. Gains concentrate on disambiguating web_search vs. wikipedia and unit_converter vs. calculator, and on multi-tool selection.
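The card does not publish its grader. As a hedged illustration of how the two columns can differ, the sketch below scores a reply strictly (a parseable `<tool_calls>` list whose tool names equal the label) and leniently (the known tool names mentioned anywhere in the reply, in any format, equal the label). The function names, matching rules, and sample replies are assumptions for illustration, not the author's evaluation code.

```python
import json
import re

# The nine tool names from the list at the top of this card.
KNOWN_TOOLS = ["web_search", "calculator", "weather", "wikipedia", "datetime",
               "dictionary", "translate", "unit_converter", "web_reader"]
TOOL_CALLS_RE = re.compile(r"<tool_calls>(.*?)</tool_calls>", re.DOTALL)


def strict_correct(reply: str, gold: set[str]) -> bool:
    """Hypothetical strict grader: requires a parseable <tool_calls> JSON list
    whose tool names equal the gold set (empty set = no_tool). Ignores order."""
    match = TOOL_CALLS_RE.search(reply)
    if match is None:
        return not gold  # no block counts as a direct answer
    try:
        names = {call["name"] for call in json.loads(match.group(1))}
    except (ValueError, KeyError, TypeError):
        return False  # malformed JSON or wrong shape
    return names == gold


def lenient_correct(reply: str, gold: set[str]) -> bool:
    """Hypothetical format-agnostic grader: the set of known tool names that
    appear as whole words anywhere in the reply must equal the gold set.
    Naive: a prose mention like "the weather is nice" also counts as a call."""
    mentioned = {t for t in KNOWN_TOOLS if re.search(rf"\b{t}\b", reply)}
    return mentioned == gold


def accuracy(grader, replies: list[str], golds: list[set[str]]) -> float:
    # Fraction of replies the given grader marks correct.
    return sum(grader(r, g) for r, g in zip(replies, golds)) / len(golds)


# Made-up replies: one in the trained format, one in free text.
replies = [
    '<tool_calls>[{"name": "weather", "arguments": {"location": "Copenhagen"}}]</tool_calls>',
    "I would call unit_converter to turn 5 miles into km.",
]
golds = [{"weather"}, {"unit_converter"}]
print(accuracy(strict_correct, replies, golds))   # 0.5
print(accuracy(lenient_correct, replies, golds))  # 1.0
```

A model that routes correctly but ignores the trained format scores lower under the strict rule only; identical strict and lenient scores, as in the table, mean no correctly routed reply was lost to formatting.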
A separate ablation on a held-out split of the (teacher-labeled) synthetic data reports ~86%, but that number is partly circular and is best read as an internal hyperparameter comparison. The table above is the unbiased estimate.
Limitations
- Fixed tool set. This is a specialist router for the 9 tools above. It does not generalize to arbitrary, prompt-supplied function schemas the way a general function-calling model does. Adding a tool requires retraining. The tradeoff is intentional: a small, cheap, self-hostable router for a known tool set, instead of a large general model on every call.
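- Over-triggering on chit-chat. Fine-tuning slightly increases the tendency to call a tool on conversational queries that need none (e.g. "what is 2 plus 2"). This is a precision/recall tradeoff: a router that leans toward calling tools misses fewer genuine tool queries but makes more unnecessary calls.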
- Trained on synthetic data (template-generated + Gemini-distilled), English only.
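A toy calculation of the chit-chat tradeoff, treating "call some tool" as the positive class: unnecessary calls on chit-chat are false positives and lower precision, while missed tool queries are false negatives and lower recall. The labels are made up for illustration and are not measurements of this adapter; the function name is hypothetical.

```python
def tool_call_precision_recall(gold_needs_tool: list[bool],
                               pred_calls_tool: list[bool]) -> tuple[float, float]:
    """Precision and recall of the binary decision "call a tool" (positive)
    versus "answer directly" (negative, i.e. no_tool). Hypothetical helper."""
    pairs = list(zip(gold_needs_tool, pred_calls_tool))
    tp = sum(g and p for g, p in pairs)          # tool needed, tool called
    fp = sum((not g) and p for g, p in pairs)    # chit-chat, tool called anyway
    fn = sum(g and (not p) for g, p in pairs)    # tool needed, answered directly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels: 4 queries need a tool, 2 are chit-chat.
gold = [True, True, True, True, False, False]
cautious = [True, True, True, False, False, False]  # misses one tool query
eager = [True, True, True, True, True, False]       # calls a tool on one chit-chat query
print(tool_call_precision_recall(gold, cautious))  # (1.0, 0.75)
print(tool_call_precision_recall(gold, eager))     # (0.8, 1.0)
```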
How to use
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "Qwen/Qwen2.5-7B-Instruct"

# Load the tokenizer and base model, then attach the LoRA adapter on top.
tok = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "Ayush0110/toolforge-qwen7b-r64")
model.eval()

# System prompt listing the nine tools and the expected output format.
SYS = (
    "You are a tool-routing assistant. Given a user query, decide which tool(s) "
    "to call and with what arguments. If no tool is needed, respond directly. "
    "You have access to: web_search, calculator, weather, wikipedia, datetime, "
    "dictionary, translate, unit_converter, web_reader. "
    'Output tool calls as: <tool_calls>[{"name": "tool", "arguments": {...}}]</tool_calls>'
)

msgs = [
    {"role": "system", "content": SYS},
    {"role": "user", "content": "is it jacket weather in Copenhagen right now"},
]

# Render the chat template and decode greedily (do_sample=False).
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# -> <tool_calls>[{"name": "weather", "arguments": {"location": "Copenhagen"}}]</tool_calls>
```
Training details
| Setting | Value |
|---|---|
| Base | Qwen/Qwen2.5-7B-Instruct |
| Quantization | 4-bit NF4 + double quant |
| LoRA | r=64, α=128, dropout=0.05, targets: q,k,v,o,gate,up,down |
| Optimizer / LR | AdamW, 2e-4 cosine, 10% warmup |
| Batch | 4 × 4 grad-accum = 16 effective |
| Epochs | 3 (best at eval_loss ≈ 0.14) |
| Data | 1,173 examples (template-generated + Gemini-2.5-flash distilled) |
| Hardware | single T4 (16GB), ~2.4 h |
| Tracking | Weights & Biases |
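For anyone trying to reproduce the run, the table maps onto a QLoRA-style configuration with `peft` and `transformers` roughly as follows. This is a hedged reconstruction, not the author's training script: the module names are the standard Qwen2 projection names that the short target list appears to refer to, and the fp16 compute dtype (T4 GPUs lack native bf16), `optim="adamw_torch"`, and `output_dir` are assumptions. Constructing these objects requires `bitsandbytes` to be installed, and `fp16=True` expects a CUDA device.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit NF4 quantization with double quantization for the frozen base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16 because a T4 lacks native bf16
)

# LoRA r=64, alpha=128, dropout=0.05 on all attention and MLP projections.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Batch 4 x 4 gradient accumulation = 16 effective; cosine schedule, 10% warmup.
training_args = TrainingArguments(
    output_dir="toolforge-qwen7b-r64",  # hypothetical output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=3,
    optim="adamw_torch",  # assumption: the card says only "AdamW"
    fp16=True,
    report_to="wandb",
)
```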
License
Apache-2.0 (inherits from the Qwen2.5 base model).