hexagoneAI
extraction-slm-reduced-v1-adapter
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0What's inside
| File | Size | For |
|---|---|---|
adapter_v1.gguf | 51 MB | llama.cpp runtime |
adapter_model.safetensors | 51 MB | HF PEFT / vLLM / transformers |
adapter_config.json | <1 KB | PEFT config |
extraction_v1.gbnf | ~2 KB | Optional JSON grammar for output enforcement |
extraction_diagnostics.json | <1 KB | SVD reconstruction error stats |
Task
Extract every named person, company, contract, financial account, identifier and document-level fact from a document, output as JSON. Multilingual content preserved in original language; field names in English.
Output schema (enforced by extraction_v1.gbnf if grammar is used):
json
{"entities": [{"type": "Person","client_or_service_provider": "CLIENT","value": {"name": "John Smith", "email": "john@techcorp.com", "...": "..."}}]}
Provenance
This adapter was produced by:
- Training: DoRA fine-tune (
use_dora=True, r=32, α=64, targets q/k/v/o/gate/up/down_proj) onQwen/Qwen3.5-0.8Bover ~162k extraction examples with the reduced (no pre_scan / no role / no relationships) schema and a shortened ~314-token system prompt. - Merge: PEFT
merge_and_unload()baked the DoRA into a copy of the base. - SVD extraction: per-module weight delta (merged − base) decomposed to a rank-64 plain LoRA via SVD. This step exists because
convert_lora_to_gguf.pyin llama.cpp does not currently support DoRA'slora_magnitude_vector; extracting a plain LoRA is the standard workaround. Per-module Frobenius reconstruction error: mean 13.5%, p95 17%, max 20%. - GGUF conversion:
convert_lora_to_gguf.pyfrom llama.cppb0df4c0.
Despite the lossy SVD step, per-prompt outputs are byte-identical between the original DoRA adapter and the rank-64 plain LoRA on the sanity-test prompts we checked.
Quality vs original DoRA baseline
Internal 100-test extraction benchmark (extraction_test_a.py):
| Setup | Pass | Avg composite |
|---|---|---|
| Original DoRA + vLLM bf16 (baseline) | 67/100 | 0.9283 |
| This plain LoRA + GGUF Q4_K_M (no MTP) | 63/100 | 0.9181 |
| This plain LoRA + GGUF Q4_K_M (MTP on) | 65/100 | 0.9184 |
Net cost of the GGUF + adapter-extraction pipeline: −4 passes, −0.010 composite.
⚠️ MTP gotcha (read before enabling speculative decoding)
The base model has a Multi-Token Prediction head (blk.24.nextn.*) that ships
in the base GGUF. Enabling --spec-type draft-mtp with this fine-tuned
adapter is a 2× slowdown — not a speedup. The LoRA shifts hidden states
enough that the untrained MTP head's drafts are mostly wrong, and the draft
overhead exceeds the savings.
Measured on H100, 100-test suite:
- Without MTP: 302 tok/s, 186 s wall
- With MTP: 164 tok/s, 330 s wall, 14.3% acceptance rate
Recommended runtime: omit --spec-type. Until/unless the MTP head is
retrained on extraction-task hidden states, ship without MTP.
Usage (llama.cpp)
bash
# 1. Start llama-serverllama-server \--model base_v1.Q4_K_M_mtpQ8.gguf \--lora adapter_v1.gguf \--jinja \--ctx-size 8192 \--host 127.0.0.1 --port 8089# 2. Optional: enforce JSON shape with the included grammarllama-server \--model base_v1.Q4_K_M_mtpQ8.gguf \--lora adapter_v1.gguf \--jinja \--grammar-file extraction_v1.gbnf \--ctx-size 8192 \--host 127.0.0.1 --port 8089
Usage (PEFT / HF transformers)
python
from transformers import AutoModelForCausalLM, AutoTokenizerfrom peft import PeftModelimport torchbase = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B",torch_dtype=torch.bfloat16,trust_remote_code=True,device_map="cuda",)tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B", trust_remote_code=True)model = PeftModel.from_pretrained(base, "hexagoneAI/extraction-slm-reduced-v1-adapter")# Use with the system prompt and chat_template_kwargs={"enable_thinking": False}
Usage (vLLM with LoRA)
python
from vllm import LLM, SamplingParamsfrom vllm.lora.request import LoRARequestllm = LLM(model="Qwen/Qwen3.5-0.8B", enable_lora=True, max_lora_rank=64,trust_remote_code=True, dtype="bfloat16")sp = SamplingParams(temperature=0, max_tokens=4096)outputs = llm.generate([prompt],sp,lora_request=LoRARequest("extraction", 1, "hexagoneAI/extraction-slm-reduced-v1-adapter"),)
Sampling recommendations
temperature: 0(deterministic JSON output)max_tokens: 4096(a typical extraction is 100-500 tokens; long documents can exceed)enable_thinking: false(skip the Qwen3.5 reasoning trace)
License
Apache-2.0 (matches upstream Qwen/Qwen3.5-0.8B).
Model provider
hexagoneAI
Model tree
Base
Qwen/Qwen3.5-0.8B
Adapter
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information