nakue/SmolLM2-1.7B-W8A8-instruct API & Inference Endpoint

Model Details

| Property | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| Architecture | LlamaForCausalLM |
| Parameters | ~1.7B |
| Quantization | W8A8 (INT8 weights + INT8 activations) |
| Format | `compressed-tensors` (Safetensors) |
| Calibration dataset | `ultrachat` (512 samples) |
| Quantization tool | llm-compressor |

Motivation

W8A8 quantization roughly halves weight memory relative to BF16 and lets supported NVIDIA GPUs run matrix multiplications on their INT8 tensor cores, while typically losing less accuracy than more aggressive 4-bit weight-only schemes such as W4A16. This model is useful for:

Serving on memory-constrained GPUs (e.g., T4, L4, A10G)
High-throughput batched inference via vLLM's INT8 kernel path
Benchmarking quantization accuracy vs. latency trade-offs
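
To make the memory claim concrete, here is a back-of-the-envelope sketch of weight storage at BF16 versus INT8. It assumes the approximate 1.7B parameter count from the table and deliberately ignores layers kept in higher precision (such as `lm_head`), quantization scales, activations, and the KV cache, so treat the numbers as rough bounds rather than measurements.

```python
# Rough weight-memory estimate. The parameter count is the approximate
# figure from the model card, not an exact count.
PARAMS = 1.7e9

BYTES_PER_PARAM = {"bf16": 2, "int8": 1}


def weight_gib(num_params: float, dtype: str) -> float:
    """Return approximate weight storage in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


bf16 = weight_gib(PARAMS, "bf16")
int8 = weight_gib(PARAMS, "int8")
print(f"BF16 weights: ~{bf16:.2f} GiB")
print(f"INT8 weights: ~{int8:.2f} GiB")
```

Under these assumptions the INT8 weights take half the space of the BF16 weights; real checkpoints land somewhat above the INT8 figure because not every tensor is quantized.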

How to Use

With vLLM (recommended)

python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "nakue/SmolLM2-1.7B-W8A8-instruct"

llm = LLM(
    model=model_id,
    quantization="compressed-tensors",
    # bfloat16 needs an Ampere-or-newer GPU; on older cards such as the T4, use "float16".
    dtype="bfloat16",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what W8A8 quantization means."},
]

# Render the conversation into a single prompt string using the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# generate() returns one RequestOutput per prompt; each holds a list of completions.
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
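
For use as an inference endpoint, vLLM can also expose the model through its OpenAI-compatible HTTP server. The sketch below assumes a server was started separately with `vllm serve nakue/SmolLM2-1.7B-W8A8-instruct` and is listening on vLLM's default port 8000; `BASE_URL`, `build_chat_request`, and `chat` are illustrative names, not part of this model card or of vLLM.

```python
import json
import urllib.request

# Assumption: a local vLLM server is already running on the default port.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "nakue/SmolLM2-1.7B-W8A8-instruct"


def build_chat_request(user_message: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload for the model."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.7,
        "max_tokens": max_tokens,
    }


def chat(user_message: str) -> str:
    """POST the payload to /chat/completions and return the reply text."""
    payload = json.dumps(build_chat_request(user_message)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Once the server is up, `chat("Explain what W8A8 quantization means.")` should return the assistant's reply as a string. The server applies the chat template itself, so no tokenizer is needed on the client side.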

With Transformers (CPU / non-quantized path)

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nakue/SmolLM2-1.7B-W8A8-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Loading a compressed-tensors checkpoint is expected to need the `compressed-tensors` package.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# return_dict=True yields input_ids plus attention_mask, so generate() does not have to infer the mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))

Quantization Recipe

Produced with llm-compressor using static per-tensor INT8 quantization for both weights and activations:

python
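from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# The original snippet did not define `model`. Passing the base model ID
# from the table above is an assumption about how the recipe was run.
model = "HuggingFaceTB/SmolLM2-1.7B-Instruct"

# Quantize every Linear layer to W8A8, leaving lm_head in its original precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head"],
)

# Calibrate on 512 samples and apply the recipe in a single pass.
oneshot(
    model=model,
    dataset="ultrachat",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)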
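
The static per-tensor INT8 scheme described above can be illustrated with a few lines of plain Python. This is a minimal sketch of symmetric per-tensor quantization in general, not llm-compressor's implementation; the helper names and the example values are made up for illustration.

```python
def compute_scale(values, num_bits=8):
    """Symmetric per-tensor scale: map the largest magnitude to the top of the int range."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Divide by the scale, round to the nearest integer, and clamp to the INT8 range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integers back to approximate float values."""
    return [q * scale for q in q_values]


# Hypothetical weight values, chosen only for illustration.
weights = [0.42, -1.27, 0.03, 0.88]
scale = compute_scale(weights)
q_weights = quantize(weights, scale)
restored = dequantize(q_weights, scale)

print(q_weights)
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

One scale is shared by the whole tensor, so a single outlier stretches the range and costs resolution for every other value. Within the calibrated range, the round-trip error of any value is at most half the scale.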

Limitations

Activations are quantized statically (calibrated on ultrachat); accuracy may degrade on domains far from the calibration distribution.
lm_head is excluded from quantization (left in BF16) to preserve output logit precision.
Best served via vLLM with compressed-tensors support; Transformers inference falls back to dequantized BF16.
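
The first limitation follows from how a static scale works: it is fixed from calibration data, so activations larger than anything seen during calibration saturate at the edge of the INT8 range. The sketch below shows the effect with hypothetical activation values; it is a simplified illustration, not a measurement of this model.

```python
QMAX = 127  # largest magnitude representable in symmetric INT8


def fake_quantize(values, scale):
    """Quantize to INT8 with a fixed scale, then map back to floats."""
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in values]


def max_error(values, scale):
    """Largest absolute round-trip error over the given values."""
    return max(abs(a - b) for a, b in zip(values, fake_quantize(values, scale)))


# Hypothetical activation values, chosen only for illustration.
calibration_acts = [0.5, -1.0, 2.0, -1.5]
out_of_domain_acts = [0.5, -1.0, 6.0, -4.0]

# Static scheme: the scale is computed once from calibration data and then frozen.
static_scale = max(abs(v) for v in calibration_acts) / QMAX

in_domain_error = max_error(calibration_acts, static_scale)
out_of_domain_error = max_error(out_of_domain_acts, static_scale)

print(f"in-domain max error:     {in_domain_error:.4f}")
print(f"out-of-domain max error: {out_of_domain_error:.4f}")
```

Values inside the calibrated range only pay rounding error, while the out-of-range activation of 6.0 is clipped to about 2.0. Evaluating on your own domain before deployment is the practical check for this.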

License

This model inherits the Apache 2.0 license from the base model.

Citation

If you use this model, please also cite the original SmolLM2:

bibtex
@misc{smollm2,
  title={SmolLM2: When Smol Goes Big},
  author={HuggingFaceTB},
  year={2024},
  url={https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct}
}

Quantized by nakue as part of an LLM inference optimization portfolio.
