License: apache-2.0

Quantization summary
- Quantization backend: TorchAO
- Quantization type: INT4 weight-only
- Packing/layout used during working export: Int4TilePackedTo4dTensor
- Targeted modules: language model attention and MLP Linear layers
- Kept dense: embeddings, lm_head, vision tower, multimodal projector
- Final checkpoint size: ~2.656 GiB
- Original BF16 checkpoint size measured locally: ~14.339 GiB
- Checkpoint size reduction: ~81.48%
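
The split between targeted and dense modules can be expressed as a name-based predicate. The following is a minimal sketch, not the export script used for this checkpoint: the substrings `language_model`, `self_attn`, `mlp`, `embed_tokens`, `lm_head`, `vision_tower`, and `multi_modal_projector` are assumed module-name fragments and should be checked against the real names from `model.named_modules()`.

```python
# Hypothetical name fragments; verify against model.named_modules() before use.
DENSE_MARKERS = ("embed_tokens", "lm_head", "vision_tower", "multi_modal_projector")
TARGET_MARKERS = ("self_attn", "mlp")


def should_quantize(fqn: str, is_linear: bool) -> bool:
    """Return True only for language-model attention/MLP Linear layers."""
    if not is_linear:
        # Only Linear layers are candidates for INT4 weight-only quantization.
        return False
    if any(marker in fqn for marker in DENSE_MARKERS):
        # Embeddings, lm_head, vision tower, and projector stay dense.
        return False
    return "language_model" in fqn and any(m in fqn for m in TARGET_MARKERS)
```

A predicate of this shape could be adapted into the module filter that a quantization backend accepts, so that everything it rejects stays in its original dtype.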
Smoke test result
Test problem:
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Expected answer: 72
Results:
| Model | Parsed answer | Correct |
|---|---|---|
| Original BF16 | 72 | yes |
| TorchAO INT4 | 72 | yes |
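
The "Parsed answer" column implies extracting a number from free-form model output. One way to do that, sketched under assumptions: `parse_final_number` is a hypothetical helper (not the parser used for this card), and `sample_output` is an illustrative string, not actual output from either model.

```python
import re


def parse_final_number(text: str):
    """Return the last number found in the text, or None if there is none."""
    # Drop commas so thousands separators such as "1,000" parse as one number.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


# Illustrative text only; not real model output.
sample_output = "April: 48 clips. May: 48 / 2 = 24 clips. Total: 48 + 24 = 72."
parsed = parse_final_number(sample_output)
is_correct = parsed == 72  # expected answer from the test problem
```

Taking the last number is a simple heuristic that works when the response ends with its final answer; it would misfire if the model appends further numbers after the answer.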
Local benchmark summary
| Metric | Original BF16 | TorchAO INT4 |
|---|---|---|
| Checkpoint size GiB | 14.3391 | 2.6557 |
| Load time seconds | 6.9686 | 9.8574 |
| Generation latency seconds | 1.0092 | 1.0404 |
| Generated tok/s | 64.4074 | 53.8238 |
| Total tok/s including prompt | 165.4775 | 151.8600 |
| Peak VRAM allocated GiB | 15.9948 | 11.4769 |
| Peak VRAM reserved GiB | 16.0820 | 14.9023 |
| CPU RSS after load GiB | 8.0434 | 8.0436 |
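
The relative changes behind the table can be recomputed directly from its values. This is a small sketch using only the numbers reported above; `pct_change` is a helper name introduced here for illustration.

```python
def pct_change(bf16: float, int4: float) -> float:
    """Percentage change from the BF16 value to the INT4 value."""
    return (int4 - bf16) / bf16 * 100.0


# (BF16, INT4) pairs copied from the benchmark table.
metrics = {
    "checkpoint_size_gib": (14.3391, 2.6557),
    "load_time_s": (6.9686, 9.8574),
    "generated_tok_s": (64.4074, 53.8238),
    "peak_vram_allocated_gib": (15.9948, 11.4769),
    "peak_vram_reserved_gib": (16.0820, 14.9023),
}

changes = {name: pct_change(bf16, int4) for name, (bf16, int4) in metrics.items()}

for name, delta in changes.items():
    print(f"{name}: {delta:+.2f}%")
```

The checkpoint size change comes out to about -81.48%, matching the reduction stated in the summary; allocated VRAM drops by roughly 28%, while generated tok/s falls by roughly 16%.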
Notes
This checkpoint is intended as a TorchAO-quantized research artifact. The first validation targets were a successful reload, reduced checkpoint size, lower VRAM allocation, and a correct answer on an arithmetic smoke test.
The quantized model may not be faster than BF16 on every GPU/backend. In the local benchmark above it was slower to generate (about 53.8 vs. 64.4 generated tok/s) and slower to load (about 9.9 s vs. 7.0 s). For this checkpoint, the main gains are smaller storage and lower allocated VRAM.
Model provider: Kiffaz11
Model tree
- Base: mistralai/Ministral-3-3B-Reasoning-2512
- Quantized: this model
Modalities
- Input: Text
- Output: Text