README

License: apache-2.0

Quantization summary

  • Quantization backend: TorchAO
  • Quantization type: INT4 weight-only
  • Packing/layout used during working export: Int4TilePackedTo4dTensor
  • Targeted modules: language model attention and MLP Linear layers
  • Kept dense: embeddings, lm_head, vision tower, multimodal projector
  • Final checkpoint size: ~2.656 GiB
  • Original BF16 checkpoint size measured locally: ~14.339 GiB
  • Checkpoint size reduction: ~81.48%
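The reduction figure follows from the two checkpoint sizes listed above. A minimal sketch of the arithmetic, using the rounded sizes from this summary; the helper name `size_reduction_percent` is illustrative and not part of TorchAO.

```python
def size_reduction_percent(original_gib: float, quantized_gib: float) -> float:
    """Return how much smaller the quantized checkpoint is, in percent."""
    if original_gib <= 0:
        raise ValueError("original size must be positive")
    return (original_gib - quantized_gib) / original_gib * 100.0


# Rounded sizes from the summary above, in GiB.
reduction = size_reduction_percent(14.339, 2.656)
print(f"Checkpoint size reduction: ~{reduction:.2f}%")
```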

Smoke test result

Test problem:

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

Expected answer: 72

Results:

| Model | Parsed answer | Correct |
| --- | --- | --- |
| Original BF16 | 72 | yes |
| TorchAO INT4 | 72 | yes |
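The README does not show how the answer was parsed from the model output. One plausible approach, sketched here as an assumption, is to take the last number in the generated text and compare it with the reference answer. The function `parse_final_number` and the sample output string are hypothetical.

```python
import re
from typing import Optional

# Matches integers and decimals, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_final_number(text: str) -> Optional[float]:
    """Return the last number found in text, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


# Reference answer: 48 clips in April plus half as many in May.
april = 48
may = april // 2
expected = april + may

# Hypothetical model output, for illustration only.
sample_output = "In May she sold 48 / 2 = 24 clips, so the total is 48 + 24 = 72."
parsed = parse_final_number(sample_output)
is_correct = parsed == expected
print(f"parsed={parsed}, expected={expected}, correct={is_correct}")
```

Taking the last number is a common heuristic for arithmetic word problems, but it can misfire when the model keeps writing after its final answer, so a single passing smoke test is a sanity check rather than an accuracy measurement.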

Local benchmark summary

| Metric | Original BF16 | TorchAO INT4 |
| --- | --- | --- |
| Checkpoint size (GiB) | 14.3391 | 2.6557 |
| Load time (s) | 6.9686 | 9.8574 |
| Generation latency (s) | 1.0092 | 1.0404 |
| Generated tok/s | 64.4074 | 53.8238 |
| Total tok/s (including prompt) | 165.4775 | 151.8600 |
| Peak VRAM allocated (GiB) | 15.9948 | 11.4769 |
| Peak VRAM reserved (GiB) | 16.0820 | 14.9023 |
| CPU RSS after load (GiB) | 8.0434 | 8.0436 |
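To read the two columns side by side, it helps to express each INT4 value as a signed change relative to BF16. The sketch below does that and also backs out the token counts implied by the throughput rows. It rests on two assumptions: each table cell holds two four-decimal values (for example 14.3391 and 2.6557), and tok/s was computed as tokens divided by generation latency. The helper name is illustrative.

```python
# Benchmark values read from the table above: (Original BF16, TorchAO INT4).
benchmark = {
    "checkpoint_gib": (14.3391, 2.6557),
    "load_s": (6.9686, 9.8574),
    "latency_s": (1.0092, 1.0404),
    "generated_tok_s": (64.4074, 53.8238),
    "total_tok_s": (165.4775, 151.8600),
    "vram_allocated_gib": (15.9948, 11.4769),
    "vram_reserved_gib": (16.0820, 14.9023),
}


def relative_change_percent(bf16: float, int4: float) -> float:
    """Signed change of the INT4 value relative to BF16, in percent."""
    return (int4 - bf16) / bf16 * 100.0


for name, (bf16, int4) in benchmark.items():
    print(f"{name:>20}: {relative_change_percent(bf16, int4):+7.2f}%")

# Assumption: tok/s = tokens / generation latency, so tokens = tok/s * latency.
bf16_latency, int4_latency = benchmark["latency_s"]
bf16_generated = round(benchmark["generated_tok_s"][0] * bf16_latency)
int4_generated = round(benchmark["generated_tok_s"][1] * int4_latency)
bf16_total = round(benchmark["total_tok_s"][0] * bf16_latency)
int4_total = round(benchmark["total_tok_s"][1] * int4_latency)
print("generated tokens:", bf16_generated, int4_generated)
print("prompt tokens:", bf16_total - bf16_generated, int4_total - int4_generated)
```

If those assumptions hold, the two runs produced different numbers of generated tokens from the same prompt, so the latency row is not a like-for-like comparison and the tok/s rows are the fairer speed measure.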

Notes

This checkpoint is intended as a TorchAO-quantized research artifact. The first validation targets were a successful reload, a smaller checkpoint, lower VRAM allocation, and a correct answer on an arithmetic smoke test.

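The quantized model may not be faster than BF16 on every GPU/backend. In the local benchmark above it was slower: about 53.8 vs. 64.4 generated tok/s, and about 9.86 s vs. 6.97 s to load. For this checkpoint, the main gains are smaller storage (about 2.66 GiB vs. 14.34 GiB) and lower peak allocated VRAM (about 11.48 GiB vs. 15.99 GiB).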

Model provider

Kiffaz11


Model tree

Base

mistralai/Ministral-3-3B-Reasoning-2512

Quantized

this model

Modalities

Input

Text

Output

Text
