README

License: apache-2.0

Model Details

| Property | Value |
| --- | --- |
| Base model | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| Architecture | LlamaForCausalLM |
| Parameters | ~1.7B |
| Quantization | W8A8 (INT8 weights + INT8 activations) |
| Format | compressed-tensors (Safetensors) |
| Calibration dataset | ultrachat (512 samples) |
| Quantization tool | llm-compressor |

Motivation

W8A8 quantization roughly halves the memory used by the quantized weights relative to BF16 and lets inference use INT8 tensor-core throughput on modern NVIDIA GPUs, typically with less accuracy degradation than more aggressive 4-bit weight-only schemes such as W4A16. This model is useful for:

  • Serving on memory-constrained GPUs (e.g., T4, L4, A10G)
  • High-throughput batched inference via vLLM's INT8 kernel path
  • Benchmarking quantization accuracy vs. latency trade-offs
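
The memory saving can be checked with back-of-the-envelope arithmetic. The sketch below assumes a round 1.7e9 parameters and treats every weight as quantized, which slightly overstates the saving because lm_head (and any other unquantized tensors) stay in BF16; it also ignores quantization scales, activations, and the KV cache. The helper name `weight_memory_gib` is illustrative.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GiB for a given parameter count and width."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 1.7e9  # assumption: rounded parameter count from the table above

bf16_gib = weight_memory_gib(NUM_PARAMS, 2)  # BF16 stores 2 bytes per weight
int8_gib = weight_memory_gib(NUM_PARAMS, 1)  # INT8 stores 1 byte per weight

print(f"BF16 weights: {bf16_gib:.2f} GiB")
print(f"INT8 weights: {int8_gib:.2f} GiB")
```

The real checkpoint will be somewhat larger than the INT8 estimate for the reasons listed above.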

How to Use

With vLLM (recommended)

python

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "nakue/SmolLM2-1.7B-W8A8-instruct"

# Load the INT8 checkpoint through vLLM's compressed-tensors path.
llm = LLM(
    model=model_id,
    quantization="compressed-tensors",
    dtype="bfloat16",  # use "float16" on GPUs without bfloat16 support (e.g., T4)
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what W8A8 quantization means."},
]

# Render the chat messages into a single prompt string with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

With Transformers (CPU / non-quantized path)

python

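import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nakue/SmolLM2-1.7B-W8A8-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Loading a compressed-tensors checkpoint in Transformers is expected to need the
# compressed-tensors package; device_map="auto" needs accelerate and will place
# the model on a GPU when one is available.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Tokenize the chat-formatted prompt and move it to the model's device.
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))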

Quantization Recipe

Produced with llm-compressor using static per-tensor INT8 quantization for both weights and activations:

python

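from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the unquantized base model. Newer llm-compressor releases may expect
# transformers.AutoModelForCausalLM and `from llmcompressor import oneshot`.
model = SparseAutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-1.7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

# Quantize every Linear layer to INT8 weights and activations, skipping lm_head.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head"],
)

# One-shot calibration over 512 ultrachat samples.
oneshot(
    model=model,
    dataset="ultrachat",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)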
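
To make the arithmetic behind INT8 quantization concrete, here is a minimal pure-Python sketch of symmetric per-tensor quantization: a single scale maps floats onto the integer range [-127, 127], and dequantization multiplies back. It illustrates the idea only and is not llm-compressor's implementation; the function names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: one scale shared by all values."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude onto 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from INT8 codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

print("codes:", codes)
print("max abs error:", max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored)))
```

The rounding error per value is bounded by half the scale, which is why a calibration set that captures the typical value range matters when the scale is fixed ahead of time.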

Limitations

  • Activations are quantized statically (calibrated on ultrachat); accuracy may degrade on domains far from the calibration distribution.
  • lm_head is excluded from quantization (left in BF16) to preserve output logit precision.
  • Best served via vLLM with compressed-tensors support; Transformers inference falls back to dequantized BF16.
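
How weights and activations are quantized is recorded in the checkpoint's config, so the points above can be verified directly. The sketch below assumes the compressed-tensors layout, in which `quantization_config.config_groups` holds `weights` and `input_activations` entries with `num_bits`, `strategy`, and `dynamic` keys. The `example` dict is an illustrative fragment, not copied from this model, and `summarize_quantization` is a hypothetical helper.

```python
def summarize_quantization(config: dict) -> list[str]:
    """Return one line per quantized tensor role in each config group."""
    qcfg = config.get("quantization_config", {})
    lines = []
    for name, group in sorted(qcfg.get("config_groups", {}).items()):
        for role in ("weights", "input_activations"):
            args = group.get(role)
            if not args:
                continue  # this role is not quantized in the group
            mode = "dynamic" if args.get("dynamic") else "static"
            lines.append(
                f"{name}.{role}: {args.get('num_bits')}-bit, "
                f"{args.get('strategy')} strategy, {mode}"
            )
    return lines


# Illustrative config fragment (assumed layout, not taken from this model).
example = {
    "quantization_config": {
        "quant_method": "compressed-tensors",
        "ignore": ["lm_head"],
        "config_groups": {
            "group_0": {
                "targets": ["Linear"],
                "weights": {"num_bits": 8, "strategy": "tensor", "dynamic": False},
                "input_activations": {"num_bits": 8, "strategy": "tensor", "dynamic": False},
            }
        },
    }
}

for line in summarize_quantization(example):
    print(line)
```

With a downloaded checkpoint, the parsed contents of its config.json would be passed in place of the example dict.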

License

This model inherits the Apache 2.0 license from the base model.

Citation

If you use this model, please also cite the original SmolLM2:

bibtex

@misc{smollm2,
  title={SmolLM2: When Smol Goes Big},
  author={HuggingFaceTB},
  year={2024},
  url={https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct}
}

Quantized by nakue as part of an LLM inference optimization portfolio.

