hexagoneAI

extraction-slm-reduced-v1-adapter

README

License: apache-2.0

What's inside

Table with columns: File, Size, For
File	Size	For
`adapter_v1.gguf`	51 MB	llama.cpp runtime
`adapter_model.safetensors`	51 MB	HF PEFT / vLLM / transformers
`adapter_config.json`	<1 KB	PEFT config
`extraction_v1.gbnf`	~2 KB	Optional JSON grammar for output enforcement
`extraction_diagnostics.json`	<1 KB	SVD reconstruction error stats

Task

Extract every named person, company, contract, financial account, identifier and document-level fact from a document, output as JSON. Multilingual content preserved in original language; field names in English.

Output schema (enforced by extraction_v1.gbnf if grammar is used):

json
{
  "entities": [
    {
      "type": "Person",
      "client_or_service_provider": "CLIENT",
      "value": {"name": "John Smith", "email": "john@techcorp.com", "...": "..."}
    }
  ]
}

Provenance

This adapter was produced by:

Training: DoRA fine-tune (use_dora=True, r=32, α=64, targets q/k/v/o/gate/up/down_proj) on Qwen/Qwen3.5-0.8B over ~162k extraction examples with the reduced (no pre_scan / no role / no relationships) schema and a shortened ~314-token system prompt.
Merge: PEFT merge_and_unload() baked the DoRA into a copy of the base.
SVD extraction: per-module weight delta (merged − base) decomposed to a rank-64 plain LoRA via SVD. This step exists because convert_lora_to_gguf.py in llama.cpp does not currently support DoRA's lora_magnitude_vector; extracting a plain LoRA is the standard workaround. Per-module Frobenius reconstruction error: mean 13.5%, p95 17%, max 20%.
GGUF conversion: convert_lora_to_gguf.py from llama.cpp b0df4c0.

Despite the lossy SVD step, per-prompt outputs are byte-identical between the original DoRA adapter and the rank-64 plain LoRA on the sanity-test prompts we checked.

Quality vs original DoRA baseline

Internal 100-test extraction benchmark (extraction_test_a.py):

Table with columns: Setup, Pass, Avg composite
Setup	Pass	Avg composite
Original DoRA + vLLM bf16 (baseline)	67/100	0.9283
This plain LoRA + GGUF Q4_K_M (no MTP)	63/100	0.9181
This plain LoRA + GGUF Q4_K_M (MTP on)	65/100	0.9184

Net cost of the GGUF + adapter-extraction pipeline: −4 passes, −0.010 composite.

⚠️ MTP gotcha (read before enabling speculative decoding)

The base model has a Multi-Token Prediction head (blk.24.nextn.*) that ships in the base GGUF. Enabling --spec-type draft-mtp with this fine-tuned adapter is a 2× slowdown — not a speedup. The LoRA shifts hidden states enough that the untrained MTP head's drafts are mostly wrong, and the draft overhead exceeds the savings.

Measured on H100, 100-test suite:

Without MTP: 302 tok/s, 186 s wall
With MTP: 164 tok/s, 330 s wall, 14.3% acceptance rate

Recommended runtime: omit --spec-type. Until/unless the MTP head is retrained on extraction-task hidden states, ship without MTP.

Usage (llama.cpp)

bash
# 1. Start llama-server
llama-server \
    --model base_v1.Q4_K_M_mtpQ8.gguf \
    --lora adapter_v1.gguf \
    --jinja \
    --ctx-size 8192 \
    --host 127.0.0.1 --port 8089

# 2. Optional: enforce JSON shape with the included grammar
llama-server \
    --model base_v1.Q4_K_M_mtpQ8.gguf \
    --lora adapter_v1.gguf \
    --jinja \
    --grammar-file extraction_v1.gbnf \
    --ctx-size 8192 \
    --host 127.0.0.1 --port 8089

Usage (PEFT / HF transformers)

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-0.8B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "hexagoneAI/extraction-slm-reduced-v1-adapter")

# Use with the system prompt and chat_template_kwargs={"enable_thinking": False}

Usage (vLLM with LoRA)

python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="Qwen/Qwen3.5-0.8B", enable_lora=True, max_lora_rank=64,
          trust_remote_code=True, dtype="bfloat16")
sp = SamplingParams(temperature=0, max_tokens=4096)
outputs = llm.generate(
    [prompt],
    sp,
    lora_request=LoRARequest("extraction", 1, "hexagoneAI/extraction-slm-reduced-v1-adapter"),
)

Sampling recommendations

temperature: 0 (deterministic JSON output)
max_tokens: 4096 (a typical extraction is 100-500 tokens; long documents can exceed)
enable_thinking: false (skip the Qwen3.5 reasoning trace)

License

Apache-2.0 (matches upstream Qwen/Qwen3.5-0.8B).

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

hexagoneAI

Model Tree

Base

Qwen/Qwen3.5-0.8B

Adapter

this model

Input Modalities

TextImageVideo

Output Modalities

Text

Supported Functionality