Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Summary

This is an experimental PEFT-format conversion of the public MLX LoRA adapter edithatogo/qwen3-4b-hermes-lora. It is intended to make the Qwen3 v4 strict Hermes tool-call adapter usable from CUDA/Hugging Face tooling such as transformers, peft, and lm-evaluation-harness.

The PEFT base model is:

text

Qwen/Qwen3-4B

Source MLX adapter repo:

text

https://huggingface.co/edithatogo/qwen3-4b-hermes-lora

Converted PEFT adapter repo:

text

https://huggingface.co/edithatogo/qwen3-4b-hermes-lora-peft-converted

The adapter is intended for local evaluation and agent-runtime packaging. It requires the recorded runtime prompt condition:

  • first user turn prefixed with /no_think
  • assistant prefill: <think>\n\n</think>\n\n

Without the assistant prefill, the model still emits an empty leading thinking wrapper and does not satisfy the strict raw-output gate.

Base Model

  • PEFT base: Qwen/Qwen3-4B
  • Source adapter base: Qwen/Qwen3-4B-MLX-4bit
  • Base license: Apache-2.0, checked via Hugging Face API on 2026-05-25

Conversion

  • Source adapter: gemma4/experiments/qwen3-4b-strict-toolcall-v4-targeted/lora_adapter
  • Conversion script: scripts/convert_mlx_lora_to_peft.py
  • Conversion report: reports/cloud/qwen3-v4-mlx-to-peft-conversion-20260613.md
  • Source tensors: 112
  • Converted PEFT tensors: 112
  • LoRA rank: 8
  • LoRA alpha: 16.0
  • Layers: 28-35
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

Training

  • Training config: gemma4/scripts/train_config.qwen3-4b.strict-toolcall-v4-targeted.yaml
  • Data: gemma4/data/strict_tool_call/expanded_splits_v4_targeted
  • Adapter: gemma4/experiments/qwen3-4b-strict-toolcall-v4-targeted/lora_adapter
  • Training tokens: 37,936
  • Dataset token audit: reports/publication/qwen3-4b-strict-toolcall-v4-targeted/dataset-token-audit.json
  • Dataset overlap audit: reports/publication/qwen3-4b-strict-toolcall-v4-targeted/dataset-overlap-audit.json
  • Peak memory: 3.785 GB

Evaluation

PEFT conversion checks:

CheckStatus
Static PEFT config loadpass
Colab T4 4-bit PEFT load smokepass
Colab T4 lm_eval[hf] selected task route, limit 5pass
Full no-limit lm_eval scorecardblocked by Colab session pruning

Bounded lm_eval route pilot on Colab T4:

TaskMetricValueSamples
arc_challengeacc_norm0.20005
hellaswagacc_norm0.60005
truthfulqa_mc2acc0.51665
gsm8kexact_match,strict-match0.80005
winograndeacc0.40005

These are route-pilot scores only and must not be used as no-limit benchmark claims.

Held-out strict local tool-call gate:

SuitePassJSON validArgumentsInvalid toolMulti-turn
benchmarks/tool_call_local/heldout_suite.json1.0001.0001.0001.0001.000

Mirrored regression:

SuitePass
benchmarks/tool_call_local/suite.json1.000

Repo-native pilot benchmarks:

PilotPassNotes
BFCL-style pilot0.667local pilot only, not official BFCL
IFEval-style pilot0.667local pilot only, not official IFEval
Coding sanity pilot1.000local pilot only, not HumanEval/MBPP

Exact held-out command:

bash

source scripts/env.sh
PYTHONPATH=scripts ./.venv/bin/python scripts/run_tool_call_benchmark.py \
--model Qwen/Qwen3-4B-MLX-4bit \
--adapter gemma4/experiments/qwen3-4b-strict-toolcall-v4-targeted/lora_adapter \
--suite benchmarks/tool_call_local/heldout_suite.json \
--user-prefix /no_think \
--assistant-prefill $'<think>\n\n</think>\n\n' \
--run-id qwen3-4b-strict-toolcall-v4-targeted-heldout-prefill-20260525 \
--max-tokens 256

Raw local artifact:

text

/Volumes/PortableSSD/hermes-evals/tool-call-benchmark/qwen3-4b-strict-toolcall-v4-targeted-heldout-prefill-20260525

The reusable runtime prompt contract is recorded in RUNTIME_PROMPT_PROFILES.yaml as qwen3-no-think-assistant-prefill.

Limitations

  • This is an experimental conversion from MLX LoRA tensor orientation to PEFT tensor orientation. Use the original MLX adapter repo for the canonical MLX release.
  • This is a small local strict-format benchmark, not broad BFCL or production tool-use evidence.
  • The PEFT route has a successful Colab T4 load smoke and bounded lm_eval pilot, but no full no-limit lm_eval scorecard yet.
  • The release does not include official BFCL, HumanEval, MBPP, EvalPlus, BigCodeBench, LiveCodeBench, safety/refusal, or RULER long-context scores.
  • The selected lm_eval endpoint route was attempted separately, but the current local MLX endpoint is not loglikelihood-compatible for those tasks. A direct MLX adapter has scored bounded selected-task limit-10 and limit-25 runs; treat those as pilot evidence only, not as full official lm_eval or leaderboard scores.
  • The adapter is sensitive to runtime prompt formatting.
  • The V4 training data has no held-out user-prompt overlap in the recorded audit, but it shares one generic held-out tool name, notify_care_team.
  • Dataset/source redistribution review is complete for adapter-release purposes with caveats. The separately approved cleaned synthetic-only dataset has been published at https://huggingface.co/datasets/edithatogo/qwen3-hermes-strict-toolcall-synthetic-v4.
  • Public release approval is recorded in release-decision.md; the publication bundle is expected to pass

    markdown

    scripts/validate_publication_bundle.py --require-ready
    .

Model provider

edithatogo

Model tree

Base

Qwen/Qwen3-4B

Adapter

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today