Ayush0110/toolforge-qwen7b-r64 API & Inference Endpoint

What it does

Routes a query to one (or several) of these tools, or to a direct answer:

web_search, calculator, weather, wikipedia, datetime, dictionary, translate, unit_converter, web_reader — plus no_tool (answer directly) and multi_tool (chained calls).

Output format:

markdown
<tool_calls>[{"name": "weather", "arguments": {"location": "Tokyo"}}]</tool_calls>
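
A consumer of this output usually needs to turn the tagged string back into structured calls. The following is a minimal sketch of one way to do that, assuming the tag wraps a JSON list exactly as in the sample above. `parse_tool_calls` and `KNOWN_TOOLS` are hypothetical names, not part of the adapter, and the choice to treat malformed output as "no call" is an assumption.

```python
import json
import re

# The nine tool names listed above; calls to anything else are dropped.
KNOWN_TOOLS = {
    "web_search", "calculator", "weather", "wikipedia", "datetime",
    "dictionary", "translate", "unit_converter", "web_reader",
}

TOOL_CALLS_RE = re.compile(r"<tool_calls>(.*?)</tool_calls>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts, or [] for a direct answer.

    Hypothetical helper, not shipped with the adapter.
    """
    match = TOOL_CALLS_RE.search(text)
    if match is None:
        return []  # no tag: treat the output as a direct (no_tool) answer
    try:
        calls = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []  # assumed fallback: malformed JSON counts as no call
    if not isinstance(calls, list):
        return []
    return [
        c for c in calls
        if isinstance(c, dict)
        and isinstance(c.get("name"), str)
        and c["name"] in KNOWN_TOOLS
        and isinstance(c.get("arguments"), dict)
    ]


sample = '<tool_calls>[{"name": "weather", "arguments": {"location": "Tokyo"}}]</tool_calls>'
print(parse_tool_calls(sample))
```

Because the payload is a list, a multi_tool response parses into several entries in call order.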

Evaluation (honest, non-circular)

Measured on a hand-written, non-circular test set (36 realistic, indirectly phrased queries, hand-labeled — no teacher model involved), comparing the base model against this adapter on identical inputs. Grading is format-agnostic: a prediction counts if the correct tool is identified in any recognizable format, so the base model isn't penalized for not using the trained format.
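
The grading script itself is not shown here, so the following is only a sketch of what format-agnostic grading could look like. It assumes "recognizable format" means either a JSON-style `"name"` field or a call-like `tool(` pattern, and that a prediction is correct when the set of recognized tools equals the labeled set (empty for no_tool). All function names are hypothetical.

```python
import re

KNOWN_TOOLS = {
    "web_search", "calculator", "weather", "wikipedia", "datetime",
    "dictionary", "translate", "unit_converter", "web_reader",
}

# Assumed "recognizable formats": a JSON-style "name": "<tool>" field,
# or a call-like "<tool>(" with the parenthesis directly after the name.
NAME_FIELD_RE = re.compile(r'"name"\s*:\s*"([a-z_]+)"')
CALL_STYLE_RE = re.compile(r"\b([a-z_]+)\(")


def extract_tools(output):
    """Collect every known tool name the output appears to call."""
    found = set(NAME_FIELD_RE.findall(output)) | set(CALL_STYLE_RE.findall(output))
    return found & KNOWN_TOOLS


def is_correct(output, expected_tools):
    """Lenient match: compare tool names only, ignoring format and arguments."""
    return extract_tools(output) == set(expected_tools)


def routing_accuracy(outputs, labels):
    hits = sum(is_correct(o, l) for o, l in zip(outputs, labels))
    return hits / len(labels)


outputs = [
    '<tool_calls>[{"name": "weather", "arguments": {"location": "Oslo"}}]</tool_calls>',
    'I would call calculator(expression="2+2")',
    "Paris is the capital of France.",
]
labels = [["weather"], ["unit_converter"], []]
print(routing_accuracy(outputs, labels))
```

Matching on tool names alone is what lets a base model score without having learned the `<tool_calls>` wrapper.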

Model	Routing accuracy	Strict-format accuracy
Base Qwen2.5-7B-Instruct	75.0%	75.0%
ToolForge (this adapter)	83.3%	83.3%
Gain from fine-tuning	+8.3 pp	+8.3 pp
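
For scale, the percentages can be converted back into counts on the 36-query set. The sketch below does that arithmetic and also computes an approximate 95% Wilson score interval for each accuracy. The interval is an addition for illustration and is not something the evaluation reports.

```python
import math

N = 36  # test-set size stated above


def implied_correct(accuracy_pct, n=N):
    """Number of correct answers implied by a reported percentage."""
    return round(accuracy_pct / 100 * n)


def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


base_correct = implied_correct(75.0)   # 27 of 36
tuned_correct = implied_correct(83.3)  # 30 of 36
gain_pp = (tuned_correct - base_correct) / N * 100

print(base_correct, tuned_correct, round(gain_pp, 1))
print(wilson_interval(base_correct, N))
print(wilson_interval(tuned_correct, N))
```

Under these numbers the +8.3 pp gain corresponds to three additional queries routed correctly.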

Key point: the format-agnostic (routing) and strict-format scores are identical for both models. Base Qwen already emits parseable tool-call formats, so the improvement comes from better routing decisions, not output formatting. Gains concentrate on disambiguating web_search vs wikipedia, unit_converter vs calculator, and multi-tool selection.

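A separate ablation on a held-out split of the (teacher-labeled) synthetic data reports ~86%. That number is partly circular, because the held-out labels come from the same teacher-labeled data used for training, and is best read as an internal hyperparameter comparison. The table above, scored against hand-written labels, is the independent estimate.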

Limitations

Fixed tool set. This is a specialist router for the 9 tools above. It does not generalize to arbitrary, prompt-supplied function schemas the way a general function-calling model does. Adding a tool requires retraining. The tradeoff is intentional: a small, cheap, self-hostable router for a known tool set, instead of a large general model on every call.
Over-triggering on chit-chat. Fine-tuning slightly increases the tendency to call a tool on conversational queries that should be answered directly (e.g. "what is 2 plus 2"). This is a precision/recall tradeoff: fewer missed tool calls, at the cost of some unnecessary ones.
Trained on synthetic data (template-generated + Gemini-distilled), English only.

How to use

python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "Qwen/Qwen2.5-7B-Instruct"
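tok = AutoTokenizer.from_pretrained(base_id)
# Load the base model in bf16; device_map="auto" places it on the available device(s).
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
# Attach the LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(base, "Ayush0110/toolforge-qwen7b-r64")
model.eval()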

SYS = ("You are a tool-routing assistant. Given a user query, decide which tool(s) "
       "to call and with what arguments. If no tool is needed, respond directly. "
       "You have access to: web_search, calculator, weather, wikipedia, datetime, "
       "dictionary, translate, unit_converter, web_reader. "
       'Output tool calls as: <tool_calls>[{"name": "tool", "arguments": {...}}]</tool_calls>')

msgs = [{"role": "system", "content": SYS},
        {"role": "user", "content": "is it jacket weather in Copenhagen right now"}]
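# Render the messages into the model's chat format, ending with the assistant turn opener.
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
# Greedy decoding keeps the routing decision deterministic.
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))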
# -> <tool_calls>[{"name": "weather", "arguments": {"location": "Copenhagen"}}]</tool_calls>
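
Once the decoded string is available, it still has to be acted on. The sketch below shows one possible dispatch loop, assuming well-formed JSON inside the tag. `fake_weather`, `HANDLERS`, and `dispatch` are hypothetical stand-ins; the adapter only produces the routing decision and ships no tool implementations. Running the calls independently and in order is also an assumption about how chained calls are meant to be executed.

```python
import json
import re

TOOL_CALLS_RE = re.compile(r"<tool_calls>(.*?)</tool_calls>", re.DOTALL)


def fake_weather(location):
    # Hypothetical stand-in; a real handler would query a weather service.
    return f"weather lookup for {location}"


# Hypothetical registry: tool name -> callable taking the call's arguments as keywords.
HANDLERS = {"weather": fake_weather}


def dispatch(model_output):
    """Run every tool call in the output, in order; pass direct answers through."""
    match = TOOL_CALLS_RE.search(model_output)
    if match is None:
        return [("direct_answer", model_output)]
    results = []
    for call in json.loads(match.group(1)):
        handler = HANDLERS.get(call["name"])
        if handler is None:
            results.append((call["name"], "no handler registered"))
            continue
        try:
            results.append((call["name"], handler(**call["arguments"])))
        except TypeError as exc:
            # Argument names did not match the handler signature.
            results.append((call["name"], f"bad arguments: {exc}"))
    return results


decoded = '<tool_calls>[{"name": "weather", "arguments": {"location": "Copenhagen"}}]</tool_calls>'
print(dispatch(decoded))
```

Keeping the registry separate from the router means a handler can be swapped without touching the model, although adding a new tool name still requires retraining the adapter.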

Training details

Base	Qwen/Qwen2.5-7B-Instruct
Quantization	4-bit NF4 + double quant
LoRA	r=64, α=128, dropout=0.05, targets: q,k,v,o,gate,up,down
Optimizer / LR	AdamW, 2e-4 cosine, 10% warmup
Batch	4 × 4 grad-accum = 16 effective
Epochs	3 (best at eval_loss ≈ 0.14)
Data	1,173 examples (template-generated + Gemini-2.5-flash distilled)
Hardware	single T4 (16GB), ~2.4 h
Tracking	Weights & Biases
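
The table can be read as a set of keyword arguments. The sketch below writes them as plain dictionaries whose keys follow `transformers.BitsAndBytesConfig`, `peft.LoraConfig`, and `transformers.TrainingArguments`. The actual training script is not shown, so this is a reconstruction: the `_proj` suffixes expand the shorthand target names to Qwen2.5's module names, the 4 × 4 split is read as per-device batch size times accumulation steps, and anything the table does not state (compute dtype, exact optimizer variant) is left out.

```python
# 4-bit NF4 quantization with double quantization.
bnb_kwargs = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}

# LoRA on all attention and MLP projections.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",
}

# Optimization schedule; argument names may differ across library versions.
train_kwargs = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 3,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "report_to": "wandb",
}

# Derived quantities implied by the table.
effective_batch = (
    train_kwargs["per_device_train_batch_size"]
    * train_kwargs["gradient_accumulation_steps"]
)
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]  # standard LoRA scale alpha / r

print(effective_batch, lora_scaling)

# With the libraries installed, these would be passed as, for example:
#   BitsAndBytesConfig(**bnb_kwargs), LoraConfig(**lora_kwargs),
#   TrainingArguments(output_dir="out", **train_kwargs)
```

With α = 128 and r = 64 the adapter update is scaled by 2.0 under standard LoRA scaling.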

License

Apache-2.0 (inherits from the Qwen2.5 base model).
