Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Overview
This is a 24-billion-parameter multimodal (text + image) instruction-tuned language model, structurally identical to mistralai/Mistral-Small-3.2-24B-Instruct-2506. The only thing that has changed are the language-model attention and MLP weights, which have been adapted toward a specific domain via a low-rank LoRA fine-tune and then merged back into the full BF16 weight tensors. Everything else about the model — its vision tower, multimodal projector, tokenizer, chat template, image processor, and overall architecture — is byte-identical to the base.
Practically, that means:
- It's a drop-in replacement for the vanilla base model. Any serving command, prompt template, or inference pipeline that works against the base model works against this directory with no changes other than the model path.
- Image input still works exactly as it does on the base. The Pixtral vision encoder and the multimodal projector were frozen during training, so vision behavior is preserved.
- Behavior diverges from base only on the kind of conversations the fine-tuning dataset covers. On out-of-domain prompts, output should be close to the base model. On in-domain prompts, the model has shifted toward the training data's style and conventions.
If you're already running vanilla Mistral Small 3.2 24B in your stack, deploying this is a matter of swapping the model path. No tokenizer change, no chat-template change, no extra launch flags.
What's different vs the vanilla base
| Layer | Status vs base | Notes |
|---|---|---|
Language-model attention (q,k,v,o) | Adapted | LoRA rank 16, alpha 32, then merged back into BF16 |
Language-model MLP (gate,up,down) | Adapted | Same LoRA config |
| Vision tower (Pixtral encoder) | Unchanged | Frozen during training; bit-identical to base |
| Multimodal projector | Unchanged | Frozen during training; bit-identical to base |
| Token embedding table | Unchanged | No new tokens added |
| LM head | Unchanged | Vocab unchanged |
Tokenizer (tokenizer.json) | Unchanged | This is the Unsloth HF port — same vocab as Mistral's native tekken.json, but in HF format so AutoTokenizer works directly |
Chat template (chat_template.jinja) | Unchanged | Original mistral_small template |
| Image processor | Unchanged | Same preprocessor_config.json / processor_config.json |
Architecture (config.json) | Unchanged | Mistral3ForConditionalGeneration — same dimensions, layer counts, context length |
| Weight dtype | Unchanged | BF16 (no on-disk quantization) |
| License | Unchanged | Apache 2.0 (inherited from base) |
In summary: same model in shape and capability, with the language-model weights tilted by the fine-tune. Trainable parameters were 92.4 M out of 24.1 B (0.38%), so the deviation from base is bounded — this is a refinement of style on top of the existing capabilities, not a reshaping of what the model can do.
How the fine-tune was made
| Setting | Value |
|---|---|
| Base checkpoint | unsloth/Mistral-Small-3.2-24B-Instruct-2506-bnb-4bit (weight-identical to mistralai/Mistral-Small-3.2-24B-Instruct-2506; redistributed by Unsloth with an HF-format tokenizer) |
| Method | QLoRA (NF4 4-bit base weights, BF16 compute, double-quant enabled) |
| Trainable adapter | LoRA, rank 16, alpha 32, dropout 0 |
| LoRA targets | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (LM only) |
| Frozen | Vision tower, multimodal projector, embeddings, LM head |
| Dataset | ~8,000 conversation examples (OpenAI messages format) + 1,000-conversation held-out validation set + 1,000-conversation held-out test set |
| Sequence cutoff | 2,048 tokens |
| Optimizer | AdamW (Torch) |
| Learning rate | 5e-5, cosine schedule, no warmup |
| Effective batch size | 16 (per-device 2 × grad-accum 8) |
| Epochs trained | 3 |
| Selected checkpoint | Epoch 2 — eval loss trajectory was 1.025 → 1.017 → 1.082; epoch 3 was overfit |
| Final merge | LoRA → BF16 via Unsloth's save_pretrained_merged(save_method="merged_16bit") |
| Training framework | LLaMA-Factory 0.9.4 + Unsloth backend, transformers 4.57.1 |
| Hardware | Single RTX A6000 (48 GB) |
Quality caveat (one minor thing worth knowing)
The LoRA → BF16 merge was performed against a 4-bit-quantized base, not against the original BF16 base. This is a memory-driven choice — merging a 24B BF16 model on CPU OOM-kills inside Docker even on hosts with plenty of RAM, and the GPU path through Unsloth requires the 4-bit base.
In practice this introduces a small dequantization artifact in the merged weights (NF4 → BF16 round-trip) that is well below typical quantization noise for any runtime quant scheme you might apply on top (AWQ, GPTQ, FP8, etc.). If you ever need a "perfect" merge — for instance, if you're serving at full BF16 and benchmarking against the base directly — we can redo the merge against the original BF16 weights on GPU and ship a fresh directory. Ask if you need that.
Deployment
Verify the directory loads
bash
# File inventory: should show 10 safetensor shards + the JSON sidecarsls -lh FrndoBrain-1.0.1-24b/# Total size should be ~45 GBdu -sh FrndoBrain-1.0.1-24b/# Smoke test in transformers (loads in ~1 min, no inference yet)python -c "from transformers import AutoModelForImageTextToText, AutoProcessorm = AutoModelForImageTextToText.from_pretrained('FrndoBrain-1.0.1-24b', torch_dtype='bfloat16')p = AutoProcessor.from_pretrained('FrndoBrain-1.0.1-24b')print('OK:', type(m).__name__, '|', sum(x.numel() for x in m.parameters())/1e9, 'B params')"
If transformers loads cleanly, vLLM will too.
Running in vLLM
This is Mistral3ForConditionalGeneration — vLLM auto-detects the architecture from config.json. Use vLLM ≥ 0.8.x; earlier versions don't have the Mistral 3.x multimodal class.
Example launch (adjust for your serving setup):
bash
vllm serve /path/to/FrndoBrain-1.0.1-24b \--served-model-name FrndoBrain-1.0.1-24b \--dtype bfloat16 \--max-model-len 32768 \--limit-mm-per-prompt image=4
A few notes:
--tokenizer-mode mistralis not required and should not be passed — this directory ships an HF-formattokenizer.json, not Mistral's nativetekken.json. The default tokenizer path is correct.chat_template.jinjais picked up automatically by vLLM when serving via/v1/chat/completions. No--chat-templateflag needed unless you're intentionally overriding it.- The base model's full context is 131,072 tokens; the LoRA was trained at
cutoff_len=2048but the merged model still accepts the full base context. Long-context behavior on this fine-tune has not been benchmarked. - The directory name itself is not referenced anywhere inside the files — rename it freely on your end.
File inventory
| File | Purpose |
|---|---|
model-0000{1..10}-of-00010.safetensors | BF16 weights, 10 shards (~4.5 GB each) |
model.safetensors.index.json | Shard manifest |
config.json | Model config (architecture, dimensions, vision config) |
tokenizer.json | Fast tokenizer, HF format |
tokenizer_config.json | Tokenizer settings |
special_tokens_map.json | Special-token IDs |
chat_template.jinja | mistral_small chat template (unchanged from base) |
preprocessor_config.json | Image preprocessing parameters |
processor_config.json | Combined processor config |
Model provider
BigBlueCeiling
Model tree
Base
mistralai/Mistral-Small-3.2-24B-Instruct-2506
Fine-tuned
this model
Modalities
Input
Text
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information