Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Summary
This is an experimental PEFT-format conversion of the public MLX LoRA adapter
edithatogo/qwen3-4b-hermes-lora. It is intended to make the Qwen3 v4 strict
Hermes tool-call adapter usable from CUDA/Hugging Face tooling such as
transformers, peft, and lm-evaluation-harness.
The PEFT base model is:
text
Qwen/Qwen3-4B
Source MLX adapter repo:
text
https://huggingface.co/edithatogo/qwen3-4b-hermes-lora
Converted PEFT adapter repo:
text
https://huggingface.co/edithatogo/qwen3-4b-hermes-lora-peft-converted
The adapter is intended for local evaluation and agent-runtime packaging. It requires the recorded runtime prompt condition:
- first user turn prefixed with
/no_think - assistant prefill:
<think>\n\n</think>\n\n
Without the assistant prefill, the model still emits an empty leading thinking wrapper and does not satisfy the strict raw-output gate.
Base Model
- PEFT base:
Qwen/Qwen3-4B - Source adapter base:
Qwen/Qwen3-4B-MLX-4bit - Base license: Apache-2.0, checked via Hugging Face API on 2026-05-25
Conversion
- Source adapter:
gemma4/experiments/qwen3-4b-strict-toolcall-v4-targeted/lora_adapter - Conversion script:
scripts/convert_mlx_lora_to_peft.py - Conversion report:
reports/cloud/qwen3-v4-mlx-to-peft-conversion-20260613.md - Source tensors: 112
- Converted PEFT tensors: 112
- LoRA rank: 8
- LoRA alpha: 16.0
- Layers: 28-35
- Target modules:
q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj
Training
- Training config:
gemma4/scripts/train_config.qwen3-4b.strict-toolcall-v4-targeted.yaml - Data:
gemma4/data/strict_tool_call/expanded_splits_v4_targeted - Adapter:
gemma4/experiments/qwen3-4b-strict-toolcall-v4-targeted/lora_adapter - Training tokens: 37,936
- Dataset token audit:
reports/publication/qwen3-4b-strict-toolcall-v4-targeted/dataset-token-audit.json - Dataset overlap audit:
reports/publication/qwen3-4b-strict-toolcall-v4-targeted/dataset-overlap-audit.json - Peak memory: 3.785 GB
Evaluation
PEFT conversion checks:
| Check | Status |
|---|---|
| Static PEFT config load | pass |
| Colab T4 4-bit PEFT load smoke | pass |
Colab T4 lm_eval[hf] selected task route, limit 5 | pass |
Full no-limit lm_eval scorecard | blocked by Colab session pruning |
Bounded lm_eval route pilot on Colab T4:
| Task | Metric | Value | Samples |
|---|---|---|---|
arc_challenge | acc_norm | 0.2000 | 5 |
hellaswag | acc_norm | 0.6000 | 5 |
truthfulqa_mc2 | acc | 0.5166 | 5 |
gsm8k | exact_match,strict-match | 0.8000 | 5 |
winogrande | acc | 0.4000 | 5 |
These are route-pilot scores only and must not be used as no-limit benchmark claims.
Held-out strict local tool-call gate:
| Suite | Pass | JSON valid | Arguments | Invalid tool | Multi-turn |
|---|---|---|---|---|---|
benchmarks/tool_call_local/heldout_suite.json | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
Mirrored regression:
| Suite | Pass |
|---|---|
benchmarks/tool_call_local/suite.json | 1.000 |
Repo-native pilot benchmarks:
| Pilot | Pass | Notes |
|---|---|---|
| BFCL-style pilot | 0.667 | local pilot only, not official BFCL |
| IFEval-style pilot | 0.667 | local pilot only, not official IFEval |
| Coding sanity pilot | 1.000 | local pilot only, not HumanEval/MBPP |
Exact held-out command:
bash
source scripts/env.shPYTHONPATH=scripts ./.venv/bin/python scripts/run_tool_call_benchmark.py \--model Qwen/Qwen3-4B-MLX-4bit \--adapter gemma4/experiments/qwen3-4b-strict-toolcall-v4-targeted/lora_adapter \--suite benchmarks/tool_call_local/heldout_suite.json \--user-prefix /no_think \--assistant-prefill $'<think>\n\n</think>\n\n' \--run-id qwen3-4b-strict-toolcall-v4-targeted-heldout-prefill-20260525 \--max-tokens 256
Raw local artifact:
text
/Volumes/PortableSSD/hermes-evals/tool-call-benchmark/qwen3-4b-strict-toolcall-v4-targeted-heldout-prefill-20260525
The reusable runtime prompt contract is recorded in
RUNTIME_PROMPT_PROFILES.yaml as qwen3-no-think-assistant-prefill.
Limitations
- This is an experimental conversion from MLX LoRA tensor orientation to PEFT tensor orientation. Use the original MLX adapter repo for the canonical MLX release.
- This is a small local strict-format benchmark, not broad BFCL or production tool-use evidence.
- The PEFT route has a successful Colab T4 load smoke and bounded
lm_evalpilot, but no full no-limitlm_evalscorecard yet. - The release does not include official BFCL, HumanEval, MBPP, EvalPlus, BigCodeBench, LiveCodeBench, safety/refusal, or RULER long-context scores.
- The selected
lm_evalendpoint route was attempted separately, but the current local MLX endpoint is not loglikelihood-compatible for those tasks. A direct MLX adapter has scored bounded selected-task limit-10 and limit-25 runs; treat those as pilot evidence only, not as full officiallm_evalor leaderboard scores. - The adapter is sensitive to runtime prompt formatting.
- The V4 training data has no held-out user-prompt overlap in the recorded
audit, but it shares one generic held-out tool name,
notify_care_team. - Dataset/source redistribution review is complete for adapter-release purposes
with caveats. The separately approved cleaned synthetic-only dataset has been
published at
https://huggingface.co/datasets/edithatogo/qwen3-hermes-strict-toolcall-synthetic-v4. - Public release approval is recorded in
release-decision.md; the publication bundle is expected to pass.markdown
scripts/validate_publication_bundle.py --require-ready
Model provider
edithatogo
Model tree
Base
Qwen/Qwen3-4B
Adapter
this model
Modalities
Input
Text
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information