nakue/SmolLM2-1.7B-W4A16-wiki API & Inference Endpoint

Model Details

Property	Value
Base model	HuggingFaceTB/SmolLM2-1.7B-Instruct
Architecture	LlamaForCausalLM
Parameters	~1.7B
Quantization scheme	W4A16 — INT4 weights, BF16 activations
Excluded layers	`lm_head` (kept in BF16)
Format	`compressed-tensors` (Safetensors)
Calibration dataset	Wikipedia / `wikitext2` (512 samples, max_seq_length 2048)
Quantization tool	llm-compressor

Why Calibration Dataset Matters

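In one-shot post-training quantization, calibration samples are passed through the model while the quantizer chooses a scale and zero-point for each quantized weight tensor. With W4A16 only the weights are quantized and activations stay in BF16, so the calibration data influences the result only to the extent that the quantization algorithm uses the observed activations when deciding how to round the weights.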
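To make the scale and zero-point idea concrete, here is a minimal sketch of asymmetric 4-bit quantization for a single group of weights. It assumes a plain min/max range estimate, which is only one possible choice and is not necessarily what llm-compressor does for this checkpoint; a symmetric scheme would fix the zero-point instead of computing it. The function names are hypothetical.

```python
def quantize_group_int4(weights):
    """Asymmetric 4-bit quantization of one group of float weights.

    Returns (codes, scale, zero_point); every code is an int in [0, 15].
    Assumption: the range comes from a simple min/max over the group.
    """
    qmin, qmax = 0, 15
    # Include 0.0 in the range so that zero is exactly representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    # Fall back to 1.0 when the whole group is zero to avoid dividing by zero.
    scale = (w_max - w_min) / (qmax - qmin) or 1.0
    zero_point = round(qmin - w_min / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    codes = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map 4-bit codes back to floats, as a W4A16 kernel does before the matmul."""
    return [(code - zero_point) * scale for code in codes]
```

Calling `quantize_group_int4` on a short list of floats and passing the result to `dequantize_group` shows the rounding error directly: each restored value differs from the original by at most one quantization step (`scale`).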

Calibration data	Strengths	Weaknesses
Wikipedia (this model)	Factual recall, long-form prose, knowledge-dense text	Conversational tasks, instruction following

The working hypothesis is that this model does better on knowledge-intensive or document-style tasks (summarization, factual Q&A, retrieval-augmented generation), while the ultrachat-calibrated variant may be stronger on conversational and instruction-following use cases.

Empirical comparison pending — see the Evaluation section.

How to Use

Option 1 — vLLM (recommended for serving)

bash
pip install vllm

python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "nakue/SmolLM2-1.7B-W4A16-wiki"

llm = LLM(
    model=model_id,
    quantization="compressed-tensors",
    dtype="bfloat16",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the history of the transformer architecture."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
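For serving over HTTP rather than in-process generation, vLLM also provides an OpenAI-compatible server. The sketch below assumes such a server was started locally with `vllm serve nakue/SmolLM2-1.7B-W4A16-wiki` and is listening on port 8000; the base URL and the helper names `build_chat_request` and `chat` are illustrative assumptions, not part of this repository.

```python
import json
import urllib.request

MODEL_ID = "nakue/SmolLM2-1.7B-W4A16-wiki"
# Assumption: a local server started with `vllm serve <MODEL_ID>` on port 8000.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(user_text, max_tokens=256, temperature=0.7):
    """Build the JSON payload for the /chat/completions endpoint."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(user_text, base_url=BASE_URL):
    """POST one chat request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(user_text)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `chat("Summarize the history of the transformer architecture.")` returns the generated text. The server applies the chat template itself, so no tokenizer is needed on the client side.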

Option 2 — Transformers + compressed-tensors (no vLLM)

bash
# accelerate is required for device_map="auto"
pip install compressed-tensors transformers accelerate

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "nakue/SmolLM2-1.7B-W4A16-wiki"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# compressed-tensors backend auto-detected from quantization_config in config.json
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the Pythagorean theorem?"},
]

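# return_dict=True yields input_ids plus attention_mask, so generate() gets both
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))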

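Note: with W4A16, the INT4 weights are dequantized to BF16 at runtime and all compute stays in BF16. Storage for the quantized Linear weights shrinks by roughly 4x (16 bits to 4 bits per weight, before scale overhead), while `lm_head` and any other unquantized tensors stay at full BF16 size, so the whole-checkpoint reduction is smaller. Throughput gains depend on whether your workload is memory-bandwidth-bound.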
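The memory arithmetic behind that note can be sketched as follows. The group size of 128 and the 16-bit scale per group are assumptions about the quantization layout, and the parameter counts passed in are whatever you measure for your own checkpoint; the helper names are hypothetical.

```python
def weight_memory_bytes(n_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate storage for n_params weights, plus one scale per group if grouped."""
    total_bits = n_params * bits_per_weight
    if group_size:
        # Assumption: each group of `group_size` weights stores one extra scale.
        total_bits += (n_params / group_size) * scale_bits
    return total_bits / 8


def compression_ratio(n_quantized, n_kept_bf16, group_size=128):
    """Ratio of all-BF16 weight storage to mixed INT4/BF16 weight storage."""
    before = weight_memory_bytes(n_quantized + n_kept_bf16, 16)
    after = (
        weight_memory_bytes(n_quantized, 4, group_size=group_size)
        + weight_memory_bytes(n_kept_bf16, 16)
    )
    return before / after
```

Under these assumptions, a tensor that is fully quantized with group size 128 compresses by about 3.88x rather than 4x, and every parameter kept in BF16 pulls the overall ratio down further.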

Option 3 — llmcompressor

bash
pip install llmcompressor

python
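# Note: SparseAutoModelForCausalLM ships only with older llmcompressor releases;
# on newer releases, load with transformers' AutoModelForCausalLM instead.
from llmcompressor.transformers import SparseAutoModelForCausalLM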
from transformers import AutoTokenizer

model_id = "nakue/SmolLM2-1.7B-W4A16-wiki"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

inputs = tokenizer("The transformer architecture was introduced in", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Option 4 — Dequantize to BF16

bash
pip install llmcompressor

python
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "nakue/SmolLM2-1.7B-W4A16-wiki"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = SparseAutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.save_pretrained("smollm2-bf16-dequantized")
tokenizer.save_pretrained("smollm2-bf16-dequantized")

Load `smollm2-bf16-dequantized` with plain `AutoModelForCausalLM`. The saved checkpoint is intended to need no quantization libraries at inference time, but the memory savings are lost.

Quantization Recipe

python
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = SparseAutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
)

oneshot(
    model=model,
    dataset="wikitext2",          # ← Wikipedia calibration
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("SmolLM2-1.7B-W4A16-wiki")
tokenizer.save_pretrained("SmolLM2-1.7B-W4A16-wiki")
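After saving, it can be useful to confirm what the recipe actually wrote into `config.json`. The sketch below reads the `quantization_config` entry and lists, per config group, the targets and bit widths. The nested key names (`config_groups`, `weights`, `input_activations`, `num_bits`, `ignore`) are assumptions about the compressed-tensors layout and may differ between versions; the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_config(checkpoint_dir):
    """Read config.json from a saved checkpoint directory."""
    return json.loads((Path(checkpoint_dir) / "config.json").read_text())


def summarize_quant_config(config):
    """Return (ignored_modules, [(targets, weight_bits, activation_bits), ...]).

    Assumption: key names follow the compressed-tensors layout described above.
    """
    qcfg = config.get("quantization_config", {})
    groups = []
    for group in qcfg.get("config_groups", {}).values():
        weights = group.get("weights") or {}
        activations = group.get("input_activations") or {}
        groups.append(
            (group.get("targets"), weights.get("num_bits"), activations.get("num_bits"))
        )
    return qcfg.get("ignore", []), groups
```

For a weight-only W4A16 checkpoint, one would expect `lm_head` among the ignored modules, a weight bit width of 4, and no activation bit width; `summarize_quant_config(load_config("SmolLM2-1.7B-W4A16-wiki"))` is one way to check.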

Evaluation

⚠️ Evaluation pending. Accuracy vs. the BF16 base and the ultrachat-calibrated variant has not yet been formally benchmarked. Results will be added once lm-evaluation-harness evals complete.

Planned evaluation — run all three variants for a direct calibration comparison:

bash
# BF16 baseline
lm_eval --model hf \
  --model_args "pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,dtype=bfloat16" \
  --tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
  --num_fewshot 0 --batch_size 32 --output_path results/baseline

# W4A16 ultrachat calibration
lm_eval --model hf \
  --model_args "pretrained=nakue/SmolLM2-1.7B-W4A16-instruct,dtype=bfloat16" \
  --tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
  --num_fewshot 0 --batch_size 32 --output_path results/w4a16_ultrachat

# W4A16 Wikipedia calibration (this model)
lm_eval --model hf \
  --model_args "pretrained=nakue/SmolLM2-1.7B-W4A16-wiki,dtype=bfloat16" \
  --tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
  --num_fewshot 0 --batch_size 32 --output_path results/w4a16_wiki

Results table (to be filled):

Task	BF16 Base	W4A16 ultrachat	W4A16 wiki (this)
HellaSwag (acc_norm)	—	—	—
WinoGrande (acc)	—	—	—
ARC-Easy (acc_norm)	—	—	—
ARC-Challenge (acc_norm)	—	—	—
PIQA (acc_norm)	—	—	—
WikiText-2 PPL ↓	—	—	—
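Once the three runs finish, the table can be filled from the JSON files that lm-evaluation-harness writes under each `--output_path`. This is a sketch under two assumptions: results are stored in files matching `results*.json` with a top-level `results` mapping, and metrics use the `<metric>,none` key naming. Both may vary across harness versions, and the helper names are hypothetical.

```python
import json
from pathlib import Path

# (row label, lm_eval task name, metric key); metric keys are assumptions.
ROWS = [
    ("HellaSwag (acc_norm)", "hellaswag", "acc_norm,none"),
    ("WinoGrande (acc)", "winogrande", "acc,none"),
    ("ARC-Easy (acc_norm)", "arc_easy", "acc_norm,none"),
    ("ARC-Challenge (acc_norm)", "arc_challenge", "acc_norm,none"),
    ("PIQA (acc_norm)", "piqa", "acc_norm,none"),
    ("WikiText-2 PPL ↓", "wikitext", "word_perplexity,none"),
]


def load_results(output_dir):
    """Load the newest results*.json found under output_dir; {} if none exists."""
    files = sorted(
        Path(output_dir).rglob("results*.json"), key=lambda p: p.stat().st_mtime
    )
    if not files:
        return {}
    return json.loads(files[-1].read_text()).get("results", {})


def format_table(variants, missing="-"):
    """variants maps a column title to a results dict; returns a tab-separated table."""
    lines = ["\t".join(["Task", *variants])]
    for label, task, metric in ROWS:
        cells = []
        for results in variants.values():
            value = results.get(task, {}).get(metric)
            cells.append(missing if value is None else f"{value:.4f}")
        lines.append("\t".join([label, *cells]))
    return "\n".join(lines)
```

A typical call would be `format_table({"BF16 Base": load_results("results/baseline"), "W4A16 ultrachat": load_results("results/w4a16_ultrachat"), "W4A16 wiki (this)": load_results("results/w4a16_wiki")})`. Cells for runs that have not completed stay as the placeholder.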

Limitations

Weight-only quantization — activations remain in BF16; no INT4 tensor core dispatch at runtime.
Wikipedia calibration bias — may underperform the ultrachat variant on conversational and instruction-following tasks.
lm_head excluded — output projection kept in BF16.
Evaluation pending — formal benchmarks not yet run.

Full Model Series

Model	Scheme	Calibration	Link
SmolLM2-1.7B-Instruct	BF16 (base)	—	HuggingFaceTB/SmolLM2-1.7B-Instruct
SmolLM2-1.7B-W8A8-Instruct	W8A8 INT8	ultrachat	nakue/SmolLM2-1.7B-W8A8-instruct
SmolLM2-1.7B-W4A16-Instruct	W4A16 INT4	ultrachat	nakue/SmolLM2-1.7B-W4A16-instruct
SmolLM2-1.7B-W4A16-Wiki	W4A16 INT4	Wikipedia	This model

License

Apache 2.0 — inherited from the base model. See LICENSE.

Citation

bibtex
@misc{smollm2,
  title  = {SmolLM2: When Smol Goes Big},
  author = {HuggingFaceTB},
  year   = {2024},
  url    = {https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct}
}

Quantized by nakue · Portfolio · Part of a calibration dataset ablation study on W4A16 quantization.
