License: apache-2.0
Model Details
| Property | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| Architecture | LlamaForCausalLM |
| Parameters | ~1.7B |
| Quantization | W8A8 (INT8 weights + INT8 activations) |
| Format | compressed-tensors (Safetensors) |
| Calibration dataset | ultrachat (512 samples) |
| Quantization tool | llm-compressor |
Motivation
W8A8 quantization reduces the memory footprint and enables INT8 tensor core throughput on modern NVIDIA GPUs, typically with less accuracy degradation than lower-bit weight-only schemes such as W4A16. This model is useful for:
- Serving on memory-constrained GPUs (e.g., T4, L4, A10G)
- High-throughput batched inference via vLLM's INT8 kernel path
- Benchmarking quantization accuracy vs. latency trade-offs
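As a rough illustration of the memory-footprint claim, the sketch below estimates weight storage for BF16 versus INT8. It assumes the approximate ~1.7B parameter count from the table and ignores layers left unquantized, quantization scales, activations, and the KV cache, so treat the numbers as a ballpark only.

```python
# Back-of-the-envelope weight memory estimate (illustrative only).
# PARAMS is the approximate figure from the Model Details table.
PARAMS = 1.7e9

# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"bf16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return approximate weight storage in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


bf16_gib = weight_memory_gib(PARAMS, "bf16")
int8_gib = weight_memory_gib(PARAMS, "int8")

print(f"BF16 weights: {bf16_gib:.2f} GiB")
print(f"INT8 weights: {int8_gib:.2f} GiB")
print(f"Reduction:    {bf16_gib / int8_gib:.1f}x")
```

In practice the saving is somewhat smaller than 2x, because some tensors (such as `lm_head`) stay in BF16 and the quantization scales take additional space.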
How to Use
With vLLM (recommended)
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

llm = LLM(
    model="nakue/SmolLM2-1.7B-W8A8-instruct",
    quantization="compressed-tensors",
    dtype="bfloat16",
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what W8A8 quantization means."},
]

# Apply chat template
tokenizer = AutoTokenizer.from_pretrained("nakue/SmolLM2-1.7B-W8A8-instruct")
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
With Transformers (CPU / non-quantized path)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nakue/SmolLM2-1.7B-W8A8-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Tokenize the chat and move the input IDs to the model's device
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
Quantization Recipe
Produced with llm-compressor using static per-tensor INT8 quantization for both weights and activations:
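```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

# The original snippet does not show how `model` was created. Loading the
# base model like this is an assumption, and it requires an llm-compressor
# version that still exports SparseAutoModelForCausalLM.
model = SparseAutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-1.7B-Instruct",
    device_map="auto",
    torch_dtype="auto",
)

# Quantize every Linear layer to W8A8, leaving lm_head untouched
recipe = QuantizationModifier(
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head"],
)

# One-shot calibration on 512 ultrachat samples
oneshot(
    model=model,
    dataset="ultrachat",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```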
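To make "per-tensor INT8 quantization" more concrete, here is a minimal plain-Python sketch of symmetric per-tensor quantization: one scale for the whole tensor, values rounded and clipped to the signed 8-bit range. The function names and the sample values are hypothetical, and this is an illustration of the general idea rather than llm-compressor's actual implementation.

```python
# Symmetric per-tensor INT8 quantization, sketched on a flat list of floats.
# Assumption: the symmetric range [-127, 127] is used (the value -128 is unused).

def compute_scale(values, num_bits=8):
    """One scale for the whole tensor: largest magnitude maps to qmax."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer step and clip to [-qmax, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integers back to approximate float values."""
    return [q * scale for q in q_values]


weights = [0.02, -0.5, 0.31, 1.27, -0.9]  # made-up example tensor
scale = compute_scale(weights)
q = quantize(weights, scale)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("scale:", scale)
print("int8 values:", q)
print("max round-trip error:", max_error)
```

For values inside the calibrated range, the round-trip error is bounded by half a quantization step (`scale / 2`).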
Limitations
- Activations are quantized statically (calibrated on `ultrachat`); accuracy may degrade on domains far from the calibration distribution.
- `lm_head` is excluded from quantization (left in BF16) to preserve output logit precision.
- Best served via vLLM with `compressed-tensors` support; Transformers inference falls back to dequantized BF16.
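The first limitation can be illustrated with a toy example: a scale fixed at calibration time clips any activation larger than what calibration saw. The sketch below uses made-up numbers and hypothetical helper names; it is not a measurement of this model.

```python
# Toy illustration of a static (calibration-time) activation scale.
# All values are invented for the example.

QMAX = 127  # symmetric INT8 range [-127, 127]


def calibrate_scale(calibration_values):
    """Fix the scale from the largest magnitude seen during calibration."""
    return max(abs(v) for v in calibration_values) / QMAX


def fake_quantize(values, scale):
    """Quantize then dequantize, so the error is visible in float space."""
    result = []
    for v in values:
        q = max(-QMAX, min(QMAX, round(v / scale)))
        result.append(q * scale)
    return result


def max_abs_error(values, scale):
    """Largest absolute difference after a quantize/dequantize round trip."""
    return max(abs(a - b) for a, b in zip(values, fake_quantize(values, scale)))


calibration = [-1.0, -0.4, 0.2, 0.9]  # activations seen during calibration
scale = calibrate_scale(calibration)

in_range = [0.5, -0.8]      # similar to the calibration data
out_of_range = [3.0, -2.5]  # far outside the calibrated range

in_err = max_abs_error(in_range, scale)
out_err = max_abs_error(out_of_range, scale)

print("error on in-range activations:    ", in_err)
print("error on out-of-range activations:", out_err)
```

Values within the calibrated range only suffer rounding error, while out-of-range values saturate at the clipping boundary, which is why inputs far from the calibration distribution can hurt accuracy.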
License
This model inherits the Apache 2.0 license from the base model.
Citation
If you use this model, please also cite the original SmolLM2:
```bibtex
@misc{smollm2,
  title={SmolLM2: When Smol Goes Big},
  author={HuggingFaceTB},
  year={2024},
  url={https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct}
}
```
Quantized by nakue as part of an LLM inference optimization portfolio.