License: apache-2.0
Model Details
| Property | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| Architecture | LlamaForCausalLM |
| Parameters | ~1.7B |
| Quantization scheme | W4A16 — INT4 weights, BF16 activations |
| Excluded layers | lm_head (kept in BF16) |
| Format | compressed-tensors (Safetensors) |
| Calibration dataset | Wikipedia / wikitext2 (512 samples, max_seq_length 2048) |
| Quantization tool | llm-compressor |
Why Calibration Dataset Matters
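In static post-training quantization, calibration data is used to choose the quantization parameters (scale and zero-point). The calibration samples are run through the model in a forward pass, and the statistics recorded along the way determine how the INT4 ranges are assigned. How strongly the dataset influences the result depends on the algorithm: methods that use activation statistics to adjust weights (for example GPTQ) are sensitive to it, while plain round-to-nearest derives weight scales from the weights themselves.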
| Calibration data | Strengths | Weaknesses |
|---|---|---|
| Wikipedia (this model) | Factual recall, long-form prose, knowledge-dense text | Conversational tasks, instruction following |
This model is expected to perform better on knowledge-intensive or document-style tasks (summarization, factual Q&A, retrieval-augmented generation), while the ultrachat-calibrated variant may be stronger for conversational and instruction-following use cases.
Empirical comparison pending — see the Evaluation section.
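To make the scale and zero-point idea concrete, here is a minimal pure-Python sketch of asymmetric min-max quantization to 4 bits. It is an illustration only: the helper names (`minmax_qparams`, `quantize`, `dequantize`) and the sample values are hypothetical, and the observer llm-compressor actually uses for W4A16 (typically group-wise) may differ.

```python
def minmax_qparams(values, num_bits=4):
    """Return (scale, zero_point) for asymmetric min-max quantization."""
    qmin, qmax = 0, 2 ** num_bits - 1            # unsigned 4-bit codes: 0..15
    lo = min(min(values), 0.0)                   # the range must contain 0.0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0     # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)        # integer code that represents 0.0
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, num_bits=4):
    """Map floats to integer codes, clamping to the representable range."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate floats."""
    return [(q - zero_point) * scale for q in codes]


# Hypothetical weight values, chosen only for illustration.
weights = [-0.42, -0.10, 0.03, 0.27, 0.61]
scale, zero_point = minmax_qparams(weights)
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.4f} zero_point={zero_point} codes={codes} max_err={max_err:.4f}")
```

As long as no value is clipped, the round-trip error of each value is bounded by half the scale, which is why a tighter range (smaller scale) preserves more precision.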
How to Use
Option 1 — vLLM (recommended for serving)
```bash
pip install vllm
```

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "nakue/SmolLM2-1.7B-W4A16-wiki"

llm = LLM(
    model=model_id,
    quantization="compressed-tensors",
    dtype="bfloat16",
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Format the conversation with the model's chat template before generation.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the history of the transformer architecture."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
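For serving rather than offline generation, vLLM can also expose an OpenAI-compatible HTTP server (started with `vllm serve nakue/SmolLM2-1.7B-W4A16-wiki`). The sketch below builds a chat-completions request using only the standard library. It assumes the server is reachable at `http://localhost:8000`; the helpers `build_chat_request` and `send` are hypothetical names introduced for illustration.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, max_tokens=256, temperature=0.7):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Assumed local server address; adjust to wherever the server is running.
request = build_chat_request(
    "http://localhost:8000",
    "nakue/SmolLM2-1.7B-W4A16-wiki",
    [{"role": "user", "content": "Summarize the history of the transformer architecture."}],
)
# print(send(request))  # uncomment once the server is up
```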
Option 2 — Transformers + compressed-tensors (no vLLM)
```bash
pip install compressed-tensors transformers
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "nakue/SmolLM2-1.7B-W4A16-wiki"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# compressed-tensors backend auto-detected from quantization_config in config.json
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the Pythagorean theorem?"},
]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
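Note: W4A16 dequantizes weights to BF16 at runtime, so compute stays in BF16. The quantized Linear weights shrink by roughly 4x relative to BF16; the whole-model saving is somewhat smaller because lm_head and the embeddings stay in BF16 and the quantization scales add overhead. Throughput gains depend on whether your workload is memory-bandwidth-bound.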
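As a rough check on the memory figure in the note above, the following back-of-the-envelope sketch estimates weight storage for BF16 versus W4A16. Every number in it is an assumption rather than a measurement: 1.7B total parameters, an unquantized embedding/lm_head block of 49,152 x 2,048 parameters, a group size of 128, and one 2-byte scale per group. The helper `weight_bytes` is a hypothetical name.

```python
def weight_bytes(total_params, unquantized_params, weight_bits=4,
                 group_size=128, scale_bytes=2):
    """Estimate weight storage in bytes for a weight-only quantized model."""
    quantized = total_params - unquantized_params
    packed = quantized * weight_bits // 8              # packed low-bit weights
    scales = (quantized // group_size) * scale_bytes   # one scale per group
    kept = unquantized_params * 2                      # layers left in BF16
    return packed + scales + kept


TOTAL_PARAMS = 1_700_000_000      # assumed from "~1.7B"
UNQUANTIZED = 49_152 * 2_048      # assumed vocab size x hidden size

bf16_bytes = TOTAL_PARAMS * 2
w4a16_bytes = weight_bytes(TOTAL_PARAMS, UNQUANTIZED)
ratio = bf16_bytes / w4a16_bytes
print(f"BF16: {bf16_bytes / 1e9:.2f} GB, W4A16: {w4a16_bytes / 1e9:.2f} GB, "
      f"ratio: {ratio:.2f}x")
```

Under these assumptions the estimate lands between 3x and 4x. Activation and KV-cache memory are not included, so the end-to-end saving at inference time will be smaller still.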
Option 3 — llmcompressor
```bash
pip install llmcompressor
```

```python
# Note: recent llmcompressor releases deprecate SparseAutoModelForCausalLM.
# If this import fails, transformers.AutoModelForCausalLM should load the
# checkpoint the same way (see Option 2).
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "nakue/SmolLM2-1.7B-W4A16-wiki"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

inputs = tokenizer(
    "The transformer architecture was introduced in", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Option 4 — Dequantize to BF16
```bash
pip install llmcompressor
```

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "nakue/SmolLM2-1.7B-W4A16-wiki"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = SparseAutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Save model and tokenizer side by side so the directory loads on its own.
model.save_pretrained("smollm2-bf16-dequantized")
tokenizer.save_pretrained("smollm2-bf16-dequantized")
```
Load smollm2-bf16-dequantized with plain AutoModelForCausalLM. This removes the compressed-tensors runtime dependency, but the memory savings are lost.
Quantization Recipe
```python
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = SparseAutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Quantize every Linear layer to W4A16, leaving lm_head in BF16.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
)

oneshot(
    model=model,
    dataset="wikitext2",  # ← Wikipedia calibration
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("SmolLM2-1.7B-W4A16-wiki")
tokenizer.save_pretrained("SmolLM2-1.7B-W4A16-wiki")
```
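After saving, it can be useful to confirm that the checkpoint's config.json records the intended scheme. The sketch below summarizes a `quantization_config` entry. The key layout it reads (`config_groups`, `weights`, `input_activations`, `ignore`) is an assumption about the compressed-tensors format, the sample dictionary is hand-written for illustration rather than copied from this model, and `summarize_quant_config` is a hypothetical helper.

```python
def summarize_quant_config(config):
    """Summarize weight/activation settings from a parsed config.json dict."""
    qcfg = config.get("quantization_config") or {}
    groups = []
    for name, group in qcfg.get("config_groups", {}).items():
        weights = group.get("weights") or {}
        groups.append({
            "group": name,
            "weight_bits": weights.get("num_bits"),
            "group_size": weights.get("group_size"),
            # Weight-only schemes such as W4A16 leave activations unquantized.
            "activations_quantized": group.get("input_activations") is not None,
        })
    return {"ignore": qcfg.get("ignore", []), "groups": groups}


# Hand-written sample in the assumed layout (not read from the real model).
sample_config = {
    "quantization_config": {
        "quant_method": "compressed-tensors",
        "ignore": ["lm_head"],
        "config_groups": {
            "group_0": {
                "targets": ["Linear"],
                "weights": {"num_bits": 4, "group_size": 128},
                "input_activations": None,
            }
        },
    }
}

summary = summarize_quant_config(sample_config)
print(summary)
```

To inspect a real checkpoint, load the file first, for example with `json.load(open("SmolLM2-1.7B-W4A16-wiki/config.json"))`, and pass the result to the helper.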
Evaluation
⚠️ Evaluation pending. Accuracy vs. the BF16 base and the ultrachat-calibrated variant has not yet been formally benchmarked. Results will be added once lm-evaluation-harness evals complete.
Planned evaluation — run all three variants for a direct calibration comparison:
```bash
# BF16 baseline
lm_eval --model hf \
  --model_args "pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,dtype=bfloat16" \
  --tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
  --num_fewshot 0 --batch_size 32 --output_path results/baseline

# W4A16 ultrachat calibration
lm_eval --model hf \
  --model_args "pretrained=nakue/SmolLM2-1.7B-W4A16-instruct,dtype=bfloat16" \
  --tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
  --num_fewshot 0 --batch_size 32 --output_path results/w4a16_ultrachat

# W4A16 Wikipedia calibration (this model)
lm_eval --model hf \
  --model_args "pretrained=nakue/SmolLM2-1.7B-W4A16-wiki,dtype=bfloat16" \
  --tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
  --num_fewshot 0 --batch_size 32 --output_path results/w4a16_wiki
```
Results table (to be filled):
| Task | BF16 Base | W4A16 ultrachat | W4A16 wiki (this) |
|---|---|---|---|
| HellaSwag (acc_norm) | — | — | — |
| WinoGrande (acc) | — | — | — |
| ARC-Easy (acc_norm) | — | — | — |
| ARC-Challenge (acc_norm) | — | — | — |
| PIQA (acc_norm) | — | — | — |
| WikiText-2 PPL ↓ | — | — | — |
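Once the runs finish, the table rows can be generated from the harness output instead of being copied by hand. The sketch below assumes each results file contains a top-level `results` mapping keyed by task, with metric keys of the form `acc_norm,none`, and that the perplexity of interest is `word_perplexity`. These key names are assumptions about the lm-evaluation-harness output, `format_rows` is a hypothetical helper, and the number in the demo is a placeholder, not a measured score.

```python
# (table label, task name, assumed metric key)
METRICS = [
    ("HellaSwag (acc_norm)", "hellaswag", "acc_norm,none"),
    ("WinoGrande (acc)", "winogrande", "acc,none"),
    ("ARC-Easy (acc_norm)", "arc_easy", "acc_norm,none"),
    ("ARC-Challenge (acc_norm)", "arc_challenge", "acc_norm,none"),
    ("PIQA (acc_norm)", "piqa", "acc_norm,none"),
    ("WikiText-2 PPL ↓", "wikitext", "word_perplexity,none"),
]


def format_rows(runs):
    """Build markdown rows from per-variant 'results' dicts, in column order."""
    rows = []
    for label, task, metric in METRICS:
        cells = []
        for results in runs:
            value = results.get(task, {}).get(metric)
            cells.append("n/a" if value is None else f"{value:.4f}")
        rows.append(f"| {label} | " + " | ".join(cells) + " |")
    return rows


# Placeholder input: one variant with a dummy value, one with no results yet.
demo_runs = [{"hellaswag": {"acc_norm,none": 0.5}}, {}]
rows = format_rows(demo_runs)
print("\n".join(rows))
```

In practice, each entry of `runs` would come from `json.load(...)["results"]` on the file written under the corresponding `--output_path`.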
Limitations
- Weight-only quantization: activations remain in BF16; no INT4 tensor core dispatch at runtime.
- Wikipedia calibration bias: may underperform the ultrachat variant on conversational and instruction-following tasks.
- lm_head excluded: output projection kept in BF16.
- Evaluation pending: formal benchmarks not yet run.
Full Model Series
| Model | Scheme | Calibration | Link |
|---|---|---|---|
| SmolLM2-1.7B-Instruct | BF16 (base) | — | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| SmolLM2-1.7B-W8A8-Instruct | W8A8 INT8 | ultrachat | nakue/SmolLM2-1.7B-W8A8-instruct |
| SmolLM2-1.7B-W4A16-Instruct | W4A16 INT4 | ultrachat | nakue/SmolLM2-1.7B-W4A16-instruct |
| SmolLM2-1.7B-W4A16-Wiki | W4A16 INT4 | Wikipedia | This model |
License
Apache 2.0 — inherited from the base model. See LICENSE.
Citation
```bibtex
@misc{smollm2,
  title  = {SmolLM2: When Smol Goes Big},
  author = {HuggingFaceTB},
  year   = {2024},
  url    = {https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct}
}
```
Quantized by nakue · Portfolio · Part of a calibration dataset ablation study on W4A16 quantization.