README

License: apache-2.0

Model Details

| Property | Value |
| --- | --- |
| Base model | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| Architecture | LlamaForCausalLM |
| Parameters | ~1.7B |
| Quantization scheme | W4A16 (INT4 weights, BF16 activations) |
| Excluded layers | lm_head (kept in BF16) |
| Format | compressed-tensors (Safetensors) |
| Calibration dataset | Wikipedia / wikitext2 (512 samples, max_seq_length 2048) |
| Quantization tool | llm-compressor |

Why Calibration Dataset Matters

In static quantization, calibration data is used to compute the scale and zero-point for each weight tensor. The calibration samples are run through the model in a forward pass, activation distributions are recorded, and those statistics determine how the INT4 ranges are assigned.
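As a rough illustration of the scale and zero-point computation described above, the sketch below derives affine INT4 parameters from an observed value range and round-trips a few values. It assumes an asymmetric min/max scheme over the signed range [-8, 7]; the function names and sample values are hypothetical, and llm-compressor's own observers and the W4A16 preset may use a different strategy (for example symmetric, group-wise scales).

```python
QMIN, QMAX = -8, 7  # signed 4-bit integer range


def int4_affine_params(values):
    """Derive (scale, zero_point) from the min/max of the observed values."""
    lo = min(min(values), 0.0)  # keep 0.0 exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(QMIN - lo / scale)
    return scale, max(QMIN, min(QMAX, zero_point))


def quantize(values, scale, zero_point):
    """Map floats to INT4 codes with round-to-nearest and clamping."""
    return [max(QMIN, min(QMAX, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Map INT4 codes back to approximate float values."""
    return [(q - zero_point) * scale for q in codes]


weights = [-0.62, -0.10, 0.0, 0.07, 0.31, 0.88]  # made-up sample values
scale, zero_point = int4_affine_params(weights)
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(scale, 4), zero_point, round(max_error, 4))
```

A wider observed range gives a larger scale and therefore a coarser grid, which is why the choice of statistics can matter at only 16 levels.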

| Calibration data | Strengths | Weaknesses |
| --- | --- | --- |
| Wikipedia (this model) | Factual recall, long-form prose, knowledge-dense text | Conversational tasks, instruction following |

This model is expected to perform better on knowledge-intensive or document-style tasks (summarization, factual Q&A, retrieval-augmented generation), while the ultrachat-calibrated variant (nakue/SmolLM2-1.7B-W4A16-instruct) may be stronger for conversational and instruction-following use cases.

Empirical comparison pending — see the Evaluation section.


How to Use

Option 1 — vLLM (recommended for serving)

bash

pip install vllm

python

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "nakue/SmolLM2-1.7B-W4A16-wiki"

llm = LLM(
    model=model_id,
    quantization="compressed-tensors",
    dtype="bfloat16",
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the history of the transformer architecture."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

Option 2 — Transformers + compressed-tensors (no vLLM)

bash

pip install compressed-tensors transformers accelerate torch

python

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "nakue/SmolLM2-1.7B-W4A16-wiki"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# compressed-tensors backend auto-detected from quantization_config in config.json
# device_map="auto" relies on the accelerate package being installed
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the Pythagorean theorem?"},
]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))

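Note: W4A16 stores weights in INT4 and dequantizes them to BF16 at runtime, so compute stays in BF16. Weight memory for the quantized layers shrinks by roughly 4x relative to BF16 (slightly less once quantization scales and the BF16 lm_head are counted); throughput gains depend on whether your workload is memory-bandwidth-bound.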
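The memory figure can be checked with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: it treats all ~1.7B parameters as quantized (which overstates the savings, since lm_head stays in BF16), and it assumes one 16-bit scale per group of 128 weights. The group size and the helper name are assumptions, not values taken from this model's config.

```python
def weight_bytes(n_params, weight_bits, group_size=None, scale_bits=16):
    """Approximate storage for n_params weights, plus one scale per group if grouped."""
    total_bits = n_params * weight_bits
    if group_size is not None:
        total_bits += (n_params / group_size) * scale_bits
    return total_bits / 8


N_PARAMS = 1.7e9  # approximate parameter count from the Model Details table

bf16 = weight_bytes(N_PARAMS, 16)
int4 = weight_bytes(N_PARAMS, 4, group_size=128)  # group size 128 is an assumption
ratio = bf16 / int4

print(f"BF16: {bf16 / 1e9:.2f} GB, INT4: {int4 / 1e9:.2f} GB, ratio: {ratio:.2f}x")
```

Under these assumptions the ratio comes out a little below 4x, because the per-group scales add a small overhead on top of the 4-bit weights.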


Option 3 — llmcompressor

bash

pip install llmcompressor

python

# Note: newer llmcompressor releases deprecate SparseAutoModelForCausalLM;
# if this import fails, loading with transformers.AutoModelForCausalLM
# (as in Option 2) is the likely replacement.
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "nakue/SmolLM2-1.7B-W4A16-wiki"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

inputs = tokenizer("The transformer architecture was introduced in", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Option 4 — Dequantize to BF16

bash

pip install llmcompressor

python

from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import torch
model_id = "nakue/SmolLM2-1.7B-W4A16-wiki"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = SparseAutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.save_pretrained("smollm2-bf16-dequantized")
tokenizer.save_pretrained("smollm2-bf16-dequantized")

Load smollm2-bf16-dequantized with plain AutoModelForCausalLM. This removes the compressed-tensors runtime dependency, but the memory savings are lost.
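To make the dequantization step concrete, here is a minimal sketch of what expanding packed INT4 weights back to floating point involves. It is illustrative only: the two-codes-per-byte layout, the single symmetric scale, and all function names are assumptions chosen for clarity, and the actual on-disk packing used by compressed-tensors may differ.

```python
def pack_int4(codes):
    """Pack signed 4-bit codes (each in [-8, 7]) two per byte, low nibble first."""
    if len(codes) % 2:
        raise ValueError("expected an even number of codes")
    return bytes((lo & 0xF) | ((hi & 0xF) << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_int4(packed):
    """Recover the signed 4-bit codes from the packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1 (two's complement)
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes


def dequantize_packed(packed, scale):
    """Expand packed INT4 codes to floats using a single symmetric scale."""
    return [code * scale for code in unpack_int4(packed)]


codes = [-8, -3, 0, 7]
packed = pack_int4(codes)
print(len(packed), unpack_int4(packed), dequantize_packed(packed, 0.5))
```

Four codes occupy two bytes when packed, whereas the dequantized BF16 values would take eight, which is the saving that Option 4 gives up.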


Quantization Recipe

python

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = SparseAutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
)

oneshot(
    model=model,
    dataset="wikitext2",  # Wikipedia calibration
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("SmolLM2-1.7B-W4A16-wiki")
tokenizer.save_pretrained("SmolLM2-1.7B-W4A16-wiki")
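After saving, it can be useful to confirm that the written config.json records the intended scheme. The sketch below summarizes a quantization_config dictionary. The key names (config_groups, weights, num_bits, group_size, input_activations, ignore) reflect a best understanding of the compressed-tensors config layout and should be treated as assumptions; the sample dictionary is illustrative, not copied from this model's repository.

```python
import json


def summarize_quant_config(config):
    """Return one human-readable line per quantization group in a model config dict."""
    qcfg = config.get("quantization_config", {})
    lines = []
    for name, group in qcfg.get("config_groups", {}).items():
        weights = group.get("weights") or {}
        acts = group.get("input_activations")  # None means activations are not quantized
        lines.append(
            f"{name}: targets={group.get('targets')} "
            f"weight_bits={weights.get('num_bits')} "
            f"group_size={weights.get('group_size')} "
            f"activations={'quantized' if acts else 'unquantized'}"
        )
    lines.append(f"ignored: {qcfg.get('ignore', [])}")
    return lines


# Illustrative sample only; in practice, load the saved file instead, e.g.
#   with open("SmolLM2-1.7B-W4A16-wiki/config.json") as f: config = json.load(f)
sample = json.loads("""
{"quantization_config": {
    "config_groups": {"group_0": {
        "targets": ["Linear"],
        "weights": {"num_bits": 4, "group_size": 128},
        "input_activations": null}},
    "ignore": ["lm_head"]}}
""")

summary = summarize_quant_config(sample)
for line in summary:
    print(line)
```

For a W4A16 checkpoint one would expect 4-bit weights, unquantized activations, and lm_head in the ignore list.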

Evaluation

⚠️ Evaluation pending. Accuracy vs. the BF16 base and the ultrachat-calibrated variant has not yet been formally benchmarked. Results will be added once lm-evaluation-harness evals complete.

Planned evaluation — run all three variants for a direct calibration comparison:

bash

# BF16 baseline
lm_eval --model hf \
  --model_args "pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,dtype=bfloat16" \
  --tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
  --num_fewshot 0 --batch_size 32 --output_path results/baseline

# W4A16 ultrachat calibration
lm_eval --model hf \
  --model_args "pretrained=nakue/SmolLM2-1.7B-W4A16-instruct,dtype=bfloat16" \
  --tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
  --num_fewshot 0 --batch_size 32 --output_path results/w4a16_ultrachat

# W4A16 Wikipedia calibration (this model)
lm_eval --model hf \
  --model_args "pretrained=nakue/SmolLM2-1.7B-W4A16-wiki,dtype=bfloat16" \
  --tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
  --num_fewshot 0 --batch_size 32 --output_path results/w4a16_wiki

Results table (to be filled):

| Task | BF16 Base | W4A16 ultrachat | W4A16 wiki (this) |
| --- | --- | --- | --- |
| HellaSwag (acc_norm) | | | |
| WinoGrande (acc) | | | |
| ARC-Easy (acc_norm) | | | |
| ARC-Challenge (acc_norm) | | | |
| PIQA (acc_norm) | | | |
| WikiText-2 PPL ↓ | | | |
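Once the runs finish, the table can be filled by comparing each quantized variant against the baseline. The sketch below computes accuracy recovery (quantized score as a percentage of the baseline score) for each task. The "name,filter" metric keys are an assumption about how recent lm-evaluation-harness releases label entries under "results", the helper name is hypothetical, and the numbers are placeholders for demonstration only, not measured results.

```python
def recovery_rows(baseline, quantized, metrics):
    """Build (task, baseline, quantized, recovery %) rows for the chosen metric per task."""
    rows = []
    for task, metric in metrics.items():
        base = baseline[task][metric]
        quant = quantized[task][metric]
        rows.append((task, base, quant, 100.0 * quant / base))
    return rows


# Assumed metric keys; check them against the JSON that lm_eval actually writes.
METRICS = {"hellaswag": "acc_norm,none", "winogrande": "acc,none"}

# Placeholder numbers for demonstration only; these are NOT measured results.
baseline = {"hellaswag": {"acc_norm,none": 0.50}, "winogrande": {"acc,none": 0.40}}
quantized = {"hellaswag": {"acc_norm,none": 0.45}, "winogrande": {"acc,none": 0.40}}

rows = recovery_rows(baseline, quantized, METRICS)
for task, base, quant, pct in rows:
    print(f"{task:<12} {base:.3f} {quant:.3f} {pct:6.1f}%")
```

Recovery as defined here only makes sense for higher-is-better metrics; for WikiText-2 perplexity, where lower is better, the absolute or relative increase is the more natural comparison.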

Limitations

  • Weight-only quantization — activations remain in BF16; no INT4 tensor core dispatch at runtime.
  • Wikipedia calibration bias — may underperform the ultrachat variant on conversational and instruction-following tasks.
  • lm_head excluded — output projection kept in BF16.
  • Evaluation pending — formal benchmarks not yet run.

Full Model Series

| Model | Scheme | Calibration | Link |
| --- | --- | --- | --- |
| SmolLM2-1.7B-Instruct | BF16 (base) | | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| SmolLM2-1.7B-W8A8-Instruct | W8A8 INT8 | ultrachat | nakue/SmolLM2-1.7B-W8A8-instruct |
| SmolLM2-1.7B-W4A16-Instruct | W4A16 INT4 | ultrachat | nakue/SmolLM2-1.7B-W4A16-instruct |
| SmolLM2-1.7B-W4A16-Wiki | W4A16 INT4 | Wikipedia | This model |

License

Apache 2.0 — inherited from the base model. See LICENSE.


Citation

bibtex

@misc{smollm2,
  title  = {SmolLM2: When Smol Goes Big},
  author = {HuggingFaceTB},
  year   = {2024},
  url    = {https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct}
}

Quantized by nakue · Portfolio · Part of a calibration dataset ablation study on W4A16 quantization.

Model provider

nakue

Model tree

Base

HuggingFaceTB/SmolLM2-1.7B-Instruct

Quantized

this model

Modalities

Input

Text

Output

Text
