hexagoneAI

extraction-slm-reduced-v1-adapter

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

What's inside

Table
FileSizeFor
adapter_v1.gguf51 MBllama.cpp runtime
adapter_model.safetensors51 MBHF PEFT / vLLM / transformers
adapter_config.json<1 KBPEFT config
extraction_v1.gbnf~2 KBOptional JSON grammar for output enforcement
extraction_diagnostics.json<1 KBSVD reconstruction error stats

Task

Extract every named person, company, contract, financial account, identifier and document-level fact from a document, output as JSON. Multilingual content preserved in original language; field names in English.

Output schema (enforced by extraction_v1.gbnf if grammar is used):

json

{
"entities": [
{
"type": "Person",
"client_or_service_provider": "CLIENT",
"value": {"name": "John Smith", "email": "john@techcorp.com", "...": "..."}
}
]
}

Provenance

This adapter was produced by:

  1. Training: DoRA fine-tune (use_dora=True, r=32, α=64, targets q/k/v/o/gate/up/down_proj) on Qwen/Qwen3.5-0.8B over ~162k extraction examples with the reduced (no pre_scan / no role / no relationships) schema and a shortened ~314-token system prompt.
  2. Merge: PEFT merge_and_unload() baked the DoRA into a copy of the base.
  3. SVD extraction: per-module weight delta (merged − base) decomposed to a rank-64 plain LoRA via SVD. This step exists because convert_lora_to_gguf.py in llama.cpp does not currently support DoRA's lora_magnitude_vector; extracting a plain LoRA is the standard workaround. Per-module Frobenius reconstruction error: mean 13.5%, p95 17%, max 20%.
  4. GGUF conversion: convert_lora_to_gguf.py from llama.cpp b0df4c0.

Despite the lossy SVD step, per-prompt outputs are byte-identical between the original DoRA adapter and the rank-64 plain LoRA on the sanity-test prompts we checked.

Quality vs original DoRA baseline

Internal 100-test extraction benchmark (extraction_test_a.py):

Table
SetupPassAvg composite
Original DoRA + vLLM bf16 (baseline)67/1000.9283
This plain LoRA + GGUF Q4_K_M (no MTP)63/1000.9181
This plain LoRA + GGUF Q4_K_M (MTP on)65/1000.9184

Net cost of the GGUF + adapter-extraction pipeline: −4 passes, −0.010 composite.

⚠️ MTP gotcha (read before enabling speculative decoding)

The base model has a Multi-Token Prediction head (blk.24.nextn.*) that ships in the base GGUF. Enabling --spec-type draft-mtp with this fine-tuned adapter is a 2× slowdown — not a speedup. The LoRA shifts hidden states enough that the untrained MTP head's drafts are mostly wrong, and the draft overhead exceeds the savings.

Measured on H100, 100-test suite:

  • Without MTP: 302 tok/s, 186 s wall
  • With MTP: 164 tok/s, 330 s wall, 14.3% acceptance rate

Recommended runtime: omit --spec-type. Until/unless the MTP head is retrained on extraction-task hidden states, ship without MTP.

Usage (llama.cpp)

bash

# 1. Start llama-server
llama-server \
--model base_v1.Q4_K_M_mtpQ8.gguf \
--lora adapter_v1.gguf \
--jinja \
--ctx-size 8192 \
--host 127.0.0.1 --port 8089
# 2. Optional: enforce JSON shape with the included grammar
llama-server \
--model base_v1.Q4_K_M_mtpQ8.gguf \
--lora adapter_v1.gguf \
--jinja \
--grammar-file extraction_v1.gbnf \
--ctx-size 8192 \
--host 127.0.0.1 --port 8089

Usage (PEFT / HF transformers)

python

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen3.5-0.8B",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "hexagoneAI/extraction-slm-reduced-v1-adapter")
# Use with the system prompt and chat_template_kwargs={"enable_thinking": False}

Usage (vLLM with LoRA)

python

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(model="Qwen/Qwen3.5-0.8B", enable_lora=True, max_lora_rank=64,
trust_remote_code=True, dtype="bfloat16")
sp = SamplingParams(temperature=0, max_tokens=4096)
outputs = llm.generate(
[prompt],
sp,
lora_request=LoRARequest("extraction", 1, "hexagoneAI/extraction-slm-reduced-v1-adapter"),
)

Sampling recommendations

  • temperature: 0 (deterministic JSON output)
  • max_tokens: 4096 (a typical extraction is 100-500 tokens; long documents can exceed)
  • enable_thinking: false (skip the Qwen3.5 reasoning trace)

License

Apache-2.0 (matches upstream Qwen/Qwen3.5-0.8B).

Model provider

hexagoneAI

Model tree

Base

Qwen/Qwen3.5-0.8B

Adapter

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today