Kiffaz11/ministral3-3b-reasoning-torchao-int4 API & Inference Endpoint

Quantization summary

Quantization backend: TorchAO
Quantization type: INT4 weight-only
Packing/layout used during working export: Int4TilePackedTo4dTensor
Targeted modules: language model attention and MLP Linear layers
Kept dense: embeddings, lm_head, vision tower, multimodal projector
Final checkpoint size: ~2.656 GiB
Original BF16 checkpoint size measured locally: ~14.339 GiB
Checkpoint size reduction: ~81.48%
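
The split between quantized and dense modules listed above can be expressed as a name-based predicate. The sketch below is illustrative only: the name fragments (`language_model`, `self_attn`, `mlp`, `embed_tokens`, `lm_head`, `vision_tower`, `multi_modal_projector`) are assumptions about how the modules are named, not values taken from this checkpoint, and should be checked against `model.named_modules()`.

```python
# Hypothetical module-name fragments; verify against model.named_modules().
LANGUAGE_MODEL_PARTS = ("language_model",)
ATTENTION_AND_MLP_PARTS = ("self_attn", "mlp")
KEEP_DENSE_PARTS = ("embed_tokens", "lm_head", "vision_tower", "multi_modal_projector")


def should_quantize(fqn: str, is_linear: bool) -> bool:
    """Return True only for language-model attention/MLP Linear layers.

    fqn is the fully qualified module name, e.g.
    "model.language_model.layers.0.self_attn.q_proj" (assumed naming).
    """
    if not is_linear:
        return False
    parts = fqn.split(".")
    # Anything under a kept-dense component is never quantized.
    if any(name in parts for name in KEEP_DENSE_PARTS):
        return False
    in_language_model = any(name in parts for name in LANGUAGE_MODEL_PARTS)
    in_attn_or_mlp = any(name in parts for name in ATTENTION_AND_MLP_PARTS)
    return in_language_model and in_attn_or_mlp
```

With TorchAO, a predicate like this would typically be adapted to the `filter_fn(module, fqn)` argument of `quantize_`, passing `isinstance(module, torch.nn.Linear)` as `is_linear`. Whether the original export used this approach is not stated in the summary.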

Smoke test result

Test problem:

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

Expected answer: 72

Results:

Model	Parsed answer	Correct
Original BF16	72	yes
TorchAO INT4	72	yes
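
The "Parsed answer" column implies that a final number is extracted from each model's generated text and compared with the expected answer. The exact parser used for this test is not given; the following is a minimal sketch that assumes the answer is the last number in the output. The helper names `parse_final_number` and `is_correct` are hypothetical.

```python
import re


def parse_final_number(text: str):
    """Return the last non-negative number in text as a string, or None.

    Thousands separators are stripped; negative numbers are not handled.
    """
    matches = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(generated: str, expected: str) -> bool:
    """Compare the parsed final number with the expected answer numerically."""
    parsed = parse_final_number(generated)
    return parsed is not None and float(parsed) == float(expected)


# Example generation for the smoke-test problem (made up for illustration).
sample = "In May she sold 48 / 2 = 24 clips, so 48 + 24 = 72 clips altogether."
print(parse_final_number(sample), is_correct(sample, "72"))
```

A single problem only confirms that the quantized model still produces a sensible answer; it does not measure accuracy.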

Local benchmark summary

Metric	Original BF16	TorchAO INT4
Checkpoint size (GiB)	14.3391	2.6557
Load time (s)	6.9686	9.8574
Generation latency (s)	1.0092	1.0404
Generated tok/s	64.4074	53.8238
Total tok/s (including prompt)	165.4775	151.8600
Peak VRAM allocated (GiB)	15.9948	11.4769
Peak VRAM reserved (GiB)	16.0820	14.9023
CPU RSS after load (GiB)	8.0434	8.0436
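
To make the trade-offs in the table easier to read, the relative change of each metric can be computed from the two columns. This sketch copies a subset of the values from the table; the `percent_change` helper is introduced here for illustration and is not part of the original benchmark.

```python
# (Original BF16, TorchAO INT4) pairs copied from the benchmark table.
BENCHMARK = {
    "Checkpoint size GiB": (14.3391, 2.6557),
    "Load time seconds": (6.9686, 9.8574),
    "Generation latency seconds": (1.0092, 1.0404),
    "Generated tok/s": (64.4074, 53.8238),
    "Peak VRAM allocated GiB": (15.9948, 11.4769),
    "Peak VRAM reserved GiB": (16.0820, 14.9023),
}


def percent_change(bf16: float, int4: float) -> float:
    """Relative change of the INT4 value versus BF16, in percent."""
    return (int4 - bf16) / bf16 * 100.0


for metric, (bf16, int4) in BENCHMARK.items():
    print(f"{metric}: {percent_change(bf16, int4):+.2f}%")
```

The checkpoint-size row reproduces the ~81.48% reduction reported in the quantization summary. Load time and generation latency increase, while throughput and both VRAM figures decrease.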

Notes

This checkpoint is intended as a TorchAO-quantized research artifact. The first validation targets were a successful reload, a smaller checkpoint, lower VRAM allocation, and a correct answer on a single arithmetic smoke test.

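The quantized model may not be faster than BF16 on every GPU or backend. In the local benchmark above, generated throughput fell from about 64.4 to 53.8 tok/s and load time rose from about 7.0 to 9.9 seconds, so the main gains for this checkpoint are reduced storage and lower allocated VRAM.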
