mozarcik/Bielik-11B-v3.0-medadapt-awq API & Inference Endpoint

Opis modelu (PL)

Czterobitowa kwantyzacja (AWQ W4A16, format compressed-tensors) modelu Bielik-11B-v3.0-medadapt autorstwa jmajkutewicz — medycznej adaptacji domenowej (w języku polskim) modelu speakleash/Bielik-11B-v3.0-Instruct (architektura Llama, 11,2 mld parametrów). Celem kwantyzacji jest serwowanie w vLLM na pojedynczej karcie: footprint ~5,8 GiB pozwala uruchomić model w budżecie < 8 GiB VRAM (zweryfikowane w budżecie 7,8 GiB przy kontekście 2048 tokenów, pojedyncze GPU), podczas gdy bazowy model FP16 zajmuje ~22 GiB.

Kalibracja została przeprowadzona na wewnątrzdziedzinowym polskim korpusie klinicznym SmPC (Charakterystyka Produktu Leczniczego) — zob. sekcję Calibration corpus.

Co dało nowego: rodzina medadapt jest publikowana wyłącznie jako GGUF (llama.cpp / Ollama); nie istnieje wariant AWQ / serwowalny w vLLM. Według wiedzy autora jest to pierwsza kwantyzacja AWQ (vLLM-native) tej polskiej klinicznej rodziny modeli. GGUF i vLLM to różne środowiska uruchomieniowe: GGUF celuje w inferencję desktopową dla pojedynczego użytkownika, vLLM zapewnia wsadowe, wysokoprzepustowe serwowanie używane w pipeline'ach produkcyjnych / badawczych. Ten artefakt wypełnia tę lukę środowiskową.

Pełni także funkcję metodologiczną w projekcie NaviMed-UMB: ten sam polski kliniczny korpus kalibracyjny SmPC służy do kwantyzacji zarówno ogólnych polskich LLM (PLLuM), jak i tego klinicznie dostrojonego Bielika, dając kontrolowany korpusem punkt porównawczy.

Model description (EN)

4-bit AWQ W4A16 quantization (compressed-tensors format) of Bielik-11B-v3.0-medadapt by jmajkutewicz — a Polish-language medical domain-adaptation of speakleash/Bielik-11B-v3.0-Instruct (Llama architecture, 11.2 B parameters). Quantized for single-GPU vLLM serving: a ~5.8 GiB footprint lets the model run within a < 8 GiB VRAM budget (validated within a 7.8 GiB budget at 2048-token context, single GPU), versus ~22 GiB for the FP16 base.

Calibration was performed on an in-domain Polish clinical SmPC corpus (Charakterystyka Produktu Leczniczego / Summary of Product Characteristics) — see Calibration corpus.

What's new: until now the medadapt family has been published only as GGUF (llama.cpp / Ollama), with no AWQ / vLLM-servable build. To the author's knowledge this is the first AWQ (vLLM-native) build of this Polish clinical model family. GGUF and vLLM are different runtimes: GGUF targets single-user desktop inference, while vLLM provides batched, high-throughput serving used in production / research pipelines. This artifact fills that runtime gap.

It also serves a methodological purpose in the NaviMed-UMB project: the same Polish clinical SmPC calibration corpus is used to quantize both general Polish LLMs (PLLuM) and this clinically-tuned Bielik, giving a corpus-controlled comparison point.

Quantization details

Method: AWQ (Activation-aware Weight Quantization), scheme W4A16, via llm-compressor.
Recipe: AWQModifier — targets: Linear, ignore: [lm_head], duo_scaling: true, n_grid: 20, group-symmetric 4-bit weights.
Output format: compressed-tensors v0.14.x (quant_method: compressed-tensors, quantization_status: compressed).
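
To make "group-symmetric 4-bit weights" concrete, here is a minimal pure-Python sketch of the storage format only. It is not llm-compressor's implementation, and it leaves out the activation-aware scale search that makes AWQ what it is. The function names and the tiny group size of 4 are illustrative assumptions; the group size actually used for this model is not stated above.

```python
from typing import List, Tuple


def quantize_group_symmetric(
    weights: List[float], group_size: int = 4, bits: int = 4
) -> Tuple[List[int], List[float]]:
    """Quantize weights group by group, one symmetric scale per group.

    group_size=4 is an illustrative assumption, not this model's setting.
    """
    qmax = 2 ** (bits - 1) - 1   # largest signed code (7 for 4 bits)
    qmin = -(2 ** (bits - 1))    # smallest signed code (-8 for 4 bits)
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Symmetric scheme: zero maps to code 0, so no zero-point is stored.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            codes.append(max(qmin, min(qmax, round(w / scale))))
    return codes, scales


def dequantize(codes: List[int], scales: List[float], group_size: int = 4) -> List[float]:
    """Rebuild approximate weights from integer codes and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


# Two groups with very different magnitudes: each gets its own scale.
weights = [0.7, -0.35, 0.1, 0.0, 0.02, -0.01, 0.014, 0.005]
codes, scales = quantize_group_symmetric(weights)
restored = dequantize(codes, scales)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scales, max_err)
```

Because each group carries its own scale, the small-magnitude second group is not crushed to zero by the large values in the first group. This is the reason for quantizing per group rather than per tensor.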

Calibration corpus

Corpus: clinical-pl-smpc-awq-calibration — Polish-language clinical text derived from EMA Polish SmPC (Charakterystyka Produktu Leczniczego).
Samples: 418 chunks, all used (num_calibration_samples = 418), max_seq_length = 512.
Integrity: sha256(corpus.jsonl) = f8af734d8326e7bedb274fed14abeabb0a13439db22c9d12b0b6425e4321e1a0.
Rationale: calibrating on the in-domain distribution (Polish clinical / drug-label text) minimizes quantization error precisely where the model is intended to operate.

Stack: llm-compressor 0.10.x, transformers 4.57.x, compressed-tensors 0.14.x. Quantization performed on an AMD MI300X (ROCm 7.0).
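
The integrity digest listed above can be checked before a calibration run with a few lines of standard-library Python. This is a minimal sketch: the helper names and the local path `corpus.jsonl` are assumptions, and only the expected digest is taken from this card.

```python
import hashlib

# Digest published in the "Calibration corpus" section of this card.
EXPECTED_SHA256 = "f8af734d8326e7bedb274fed14abeabb0a13439db22c9d12b0b6425e4321e1a0"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def corpus_matches(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file at `path` has the expected digest."""
    return sha256_of_file(path) == expected.lower()
```

Calling `corpus_matches("corpus.jsonl")` on a local copy should return `True` only if the file is byte-identical to the corpus used for calibration.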

Usage (vLLM)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mozarcik/Bielik-11B-v3.0-medadapt-awq",
    dtype="auto",                 # compressed-tensors auto-detected
    tensor_parallel_size=1,
    max_model_len=2048,
    enforce_eager=True,           # see hardware note below
    gpu_memory_utilization=0.90,
)
out = llm.chat(
    [[{"role": "user", "content": "Co oznacza skrót SmPC?"}]],
    SamplingParams(max_tokens=128, temperature=0.3),
)
print(out[0].outputs[0].text)
```

Hardware note (AMD RDNA / gfx1201): quantization is compressed-tensors, not classic AWQ — let vLLM auto-detect it; forcing quantization="awq_marlin" raises a validation error. CUDA graphs were disabled (enforce_eager=True) because the graph path can segfault on gfx1201; on NVIDIA you may omit it.

Uwaga sprzętowa (AMD RDNA / gfx1201): kwantyzacja to compressed-tensors, a nie klasyczny AWQ — pozwól vLLM wykryć ją automatycznie; wymuszenie quantization="awq_marlin" zgłasza błąd walidacji. Grafy CUDA wyłączono (enforce_eager=True), bo ścieżka grafu może powodować segfault na gfx1201; na NVIDIA można to pominąć.

VRAM footprint (measured)

Measured on a single AMD Radeon AI PRO R9700 (32 GiB), vLLM 0.19.0, ROCm:

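Weights ~5.9 GiB. With vLLM's memory budget capped at 7.8 GiB (set through gpu_memory_utilization, which is a fraction of total device memory), the engine loads and serves a 2048-token context (KV headroom ~3.5 k tokens). At a 7.5 GiB budget the max context tightens to ~1.8 k tokens.
Conclusion: the quantized model serves within a budget below 8 GiB on a single GPU, which the FP16 base (~22 GiB) cannot. These figures come from capping the budget on the 32 GiB card above, not from a run on a physical 8 GiB consumer GPU.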
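
vLLM's `gpu_memory_utilization` is a fraction of the device's total memory, so the GiB budgets quoted above have to be converted before they can be passed to the engine. The sketch below shows that arithmetic for the 32 GiB card. The helper name is hypothetical, and using the nominal 32 GiB as the total is an assumption, since the memory a driver reports can differ slightly.

```python
def utilization_for_budget(budget_gib: float, total_gib: float) -> float:
    """Convert an absolute VRAM budget into a fraction of total device memory."""
    if not 0 < budget_gib <= total_gib:
        raise ValueError("budget must be positive and no larger than total memory")
    return budget_gib / total_gib


TOTAL_GIB = 32.0  # nominal capacity of the card used for the measurement

frac_78 = utilization_for_budget(7.8, TOTAL_GIB)  # roughly 0.244
frac_75 = utilization_for_budget(7.5, TOTAL_GIB)  # roughly 0.234
print(f"7.8 GiB -> {frac_78:.3f}, 7.5 GiB -> {frac_75:.3f}")
```

The resulting value would be passed as `gpu_memory_utilization=...` in the `LLM(...)` constructor in place of the `0.90` shown in the usage example.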

Sanity check (Gate-1)

Load + coherence probe (vLLM, llm.chat() so the chat template is applied):

Load: 19.8 s, TP=1, enforce_eager=True.
Result: 5/5 Polish clinical prompts answered, fluent and on-topic.

Gate-1 verifies only that the model loads and produces fluent, on-topic Polish — it does not validate clinical correctness.
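
A probe of this kind can be sketched as a small harness that sends each prompt through a generation callable and counts non-empty answers. This is a hypothetical illustration rather than the script used for Gate-1: `run_gate1` and `stub_generate` are invented names, and fluency and topicality still have to be judged by a human reader.

```python
import time
from typing import Callable, Dict, List


def run_gate1(generate: Callable[[str], str], prompts: List[str]) -> Dict[str, object]:
    """Send each prompt through `generate` and count the non-empty answers."""
    started = time.perf_counter()
    answers = [generate(prompt).strip() for prompt in prompts]
    elapsed = time.perf_counter() - started
    return {
        "answered": sum(1 for answer in answers if answer),
        "total": len(prompts),
        "seconds": elapsed,
        "answers": answers,  # kept for manual review of fluency and topic
    }


def stub_generate(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs without a GPU."""
    return f"(stub answer to: {prompt})"


report = run_gate1(stub_generate, ["Co oznacza skrót SmPC?"])
print(f"{report['answered']}/{report['total']} prompts answered")
```

With a loaded engine, `generate` could wrap `llm.chat(...)` from the usage example so that the chat template is applied to every prompt.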

Przeznaczenie i ograniczenia / Intended use & limitations

Nieprzeznaczony do podejmowania decyzji klinicznych. / Not for clinical decision-making.

Przeznaczenie i ograniczenia (PL)

Wyłącznie do celów badawczych / inżynierskich. Nie jest wyrobem medycznym; nie służy do diagnozy, dawkowania ani jakiegokolwiek podejmowania decyzji klinicznych.
Dokładność kliniczna nie została tu zwalidowana. W próbnym sanity-teście model bazowy generował klinicznie nieprecyzyjne odpowiedzi (np. sposób klasyfikacji ciężkości w spirometrii, interwały dawkowania). Są to właściwości modelu bazowego medadapt, a nie kwantyzacji. Formalna ewaluacja jakości AWQ-vs-FP16 to odrębny, oczekujący krok.
Skąpa proweniencja modelu bazowego. Fine-tune medadapt jest opublikowany z minimalną dokumentacją; dane treningowe i metoda nie są ujawnione przez pierwotnego autora. Traktuj twierdzenia domenowe odpowiednio ostrożnie.
Efekty kwantyzacji. W4A16 wprowadza pewną degradację względem FP16. AWQ + kalibracja wewnątrzdziedzinowa powinny utrzymać ją na niskim poziomie na rozkładzie kalibracyjnym, ale jej wielkość nie została zmierzona dla tego modelu.

Intended use & limitations (EN)

Research / engineering only. Not a medical device; not for diagnosis, dosing, or any clinical decision-making.
Clinical accuracy is unvalidated here. In sanity probing the base model produced clinically imprecise answers (e.g. on how spirometry severity is classified and on dosing intervals). These are properties of the base medadapt model, not of quantization. A formal AWQ-vs-FP16 quality evaluation is a separate, pending step.
Thin base-model provenance. The base medadapt fine-tune is published with minimal documentation; its training data and method are not disclosed by the original author. Treat its domain claims with corresponding caution.
Quantization effects. W4A16 introduces some degradation vs FP16. AWQ + in-domain calibration are expected to keep it small on the calibration distribution, but the magnitude has not been measured for this model.

No clinical guarantee. Not a medical device. / Brak gwarancji klinicznej. To nie jest wyrób medyczny.

Provenance & license

Base model: jmajkutewicz/Bielik-11B-v3.0-medadapt ← speakleash/Bielik-11B-v3.0-Instruct.
License: gated. The base instruct model is permissively licensed, but the medadapt fine-tuning corpus license is unconfirmed by its author. This artifact therefore stays private until that license is clarified. Do not redistribute.
Project: NaviMed-UMB (Polish clinical LLM serving / quantization), Medical University of Białystok.

Proweniencja i licencja (PL): model bazowy jmajkutewicz/Bielik-11B-v3.0-medadapt ← speakleash/Bielik-11B-v3.0-Instruct. Licencja: gated. Bazowy model instruct ma licencję permisywną, ale licencja korpusu fine-tuningowego medadapt jest niepotwierdzona przez jego autora. Z tego powodu artefakt pozostaje prywatny do czasu wyjaśnienia tej licencji. Nie redystrybuować. Projekt: NaviMed-UMB (serwowanie / kwantyzacja polskich klinicznych LLM), Uniwersytet Medyczny w Białymstoku.

Citation

```bibtex
@misc{minarowski_bielik_medadapt_awq_2026,
  author = {Minarowski, Łukasz},
  title  = {Bielik-11B-v3.0-medadapt — AWQ W4A16 (clinical-corpus calibrated)},
  year   = {2026},
  note   = {AWQ W4A16 quantization of jmajkutewicz/Bielik-11B-v3.0-medadapt,
            calibrated on a Polish clinical SmPC corpus; NaviMed-UMB project},
  howpublished = {Hugging Face},
}
```
