This is the differentiator vs ALIA-40b-instruct-2601. When given a tools array, the model emits <tool_call>{...}</tool_call> blocks whose JSON body follows the OpenAI function-call shape (name plus arguments). The output is compatible with vLLM's --tool-call-parser hermes and llama.cpp's hermes parser.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "City and country"}},
            "required": ["location"],
        },
    },
}]
resp = client.chat.completions.create(
    model="alia-fc-nvfp4",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
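Each entry in tool_calls carries an id, a function name, and the arguments as a JSON string. The following is a minimal sketch of turning those entries into `role: tool` messages for a follow-up request. The `get_weather` body, the `TOOL_REGISTRY` mapping, and the `run_tool_calls` helper are hypothetical names introduced for illustration; they are not part of the model or of any library.

```python
import json


def get_weather(location: str) -> str:
    # Hypothetical stand-in: a real implementation would query a weather service.
    return json.dumps({"location": location, "temperature_c": 21})


# Hypothetical registry mapping tool names to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each tool call and return `role: tool` messages for the next request."""
    results = []
    for call in tool_calls:
        name = call.function.name
        try:
            # The arguments field is a JSON string, not a dict.
            args = json.loads(call.function.arguments)
            content = registry[name](**args)
        except (KeyError, TypeError, json.JSONDecodeError) as exc:
            # Report the failure back to the model instead of raising.
            content = json.dumps({"error": f"{type(exc).__name__}: {exc}"})
        results.append({"role": "tool", "tool_call_id": call.id, "content": content})
    return results
```

To continue the conversation, append `resp.choices[0].message` and the returned tool messages to `messages`, then call `client.chat.completions.create` again with the same `tools`.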
Recommended runtime: vLLM
pip install vllm # 0.20+
# DGX Spark / Blackwell: CUDA 13 nvcc required for flashinfer JIT on sm_120
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=/usr/local/cuda-13.0/bin:$PATH
vllm serve montevive/ALIA-40b-fc-2605-NVFP4 \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.55 \
  --served-model-name alia-fc-nvfp4 \
  --tool-call-parser hermes \
  --enable-auto-tool-choice
Or offline:
from vllm import LLM, SamplingParams
llm = LLM(
    model="montevive/ALIA-40b-fc-2605-NVFP4",
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.55,
)
sp = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=200)
out = llm.chat(
    [[{"role": "user", "content": "What's the temperature in Madrid?"}]],
    sp,
    tools=tools,  # the same `tools` list defined in the OpenAI-client example
)
print(out[0].outputs[0].text)
Recommended sampling (per the base model card): temperature between 0 and 0.2; avoid repetition penalties — they degrade instruction-following and tool-call validity.
Architecturally identical to ALIA-40b-instruct-2601, so per-token throughput matches: ~8.7 tok/s steady-state single-prompt with vLLM 0.20+, ~10 tok/s with llama.cpp --jinja on the same hardware. Continuous batching across multiple users scales considerably higher.
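As a rough worked example of what those single-prompt rates mean in practice, the sketch below estimates decode time for a 200-token reply (the max_tokens used in the offline example). It assumes decode-only time and ignores prompt prefill, so treat the result as a lower bound; `decode_seconds` is a hypothetical helper.

```python
def decode_seconds(new_tokens: int, tokens_per_second: float) -> float:
    """Rough decode-only time; ignores prompt prefill and any network overhead."""
    return new_tokens / tokens_per_second


# Rates are the single-prompt figures quoted above.
for runtime, rate in [("vLLM", 8.7), ("llama.cpp", 10.0)]:
    print(f"{runtime}: {decode_seconds(200, rate):.1f} s for 200 tokens")
```

At these rates a 200-token reply takes roughly 23 s with vLLM and 20 s with llama.cpp.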
Chat template
The FC model uses ChatML with OpenAI-format tool-call rendering. tokenizer.apply_chat_template(..., tools=tools) produces the canonical prompt; vLLM honors this automatically. The model emits:
<tool_call>
{"name": "get_weather", "arguments": {"location": "Madrid, Spain"}}
</tool_call>
Configure vLLM with --tool-call-parser hermes so the OpenAI-style tool_calls field is populated on the response.
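When the raw completion text is consumed directly (for example, from the offline `llm.chat` path, which returns text rather than a populated tool_calls field), the blocks can be extracted by hand. This is a minimal sketch, not vLLM's hermes parser; `TOOL_CALL_RE` and `parse_tool_calls` are hypothetical names, and the sketch assumes each block contains a single JSON object.

```python
import json
import re

# Matches the <tool_call>...</tool_call> wrapper; DOTALL lets the JSON span lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list:
    """Return the JSON payload of every well-formed <tool_call> block in `text`."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than raising
        if isinstance(payload, dict) and "name" in payload:
            calls.append(payload)
    return calls


raw = '<tool_call>\n{"name": "get_weather", "arguments": {"location": "Madrid, Spain"}}\n</tool_call>'
print(parse_tool_calls(raw))
```

Malformed blocks are skipped silently here; a stricter caller may prefer to log or count them, since invalid JSON is itself a useful quality signal for a quantized model.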
Compatibility matrix
| Runtime | Native NVFP4 on Blackwell | Tool calling | Notes |
|---|---|---|---|
| vLLM (recommended) | ✅ Yes | ✅ via --tool-call-parser hermes | Production-ready. CUDA 13 nvcc required on host for first-run JIT. |
| TensorRT-LLM | ✅ Yes | ✅ (parse from response) | Same compressed-tensors format. Heavier setup. |
| HuggingFace transformers | ❌ Dequant-to-BF16 at load | ✅ via | |
Quantization details
- Tool: llmcompressor 0.10+ + compressed-tensors 0.14+
- Scheme: NVFP4 (nvfp4-pack-quantized)
- Weights: 4-bit float, group_size=16, symmetric, scale_dtype=float8_e4m3fn
- Input activations: 4-bit float, dynamic local, group_size=16, symmetric
- Ignored layers: lm_head
- Calibration: 512 samples of HuggingFaceH4/ultrachat_200k (train_sft split), max_seq_len 2048
- Calibration runtime: ~1.5 h on dual RTX 3090 Ti (sequential per-layer offload)
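To make "4-bit float, group_size=16, symmetric" concrete, here is a toy pure-Python sketch of group-wise FP4 (E2M1) quantization. It is an illustration only, not the llmcompressor implementation: the scale is kept as a Python float rather than being rounded to float8_e4m3fn, and the weights are made-up values.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16


def quantize_group(values):
    """Symmetric FP4 quantization of one group; returns (scale, codes)."""
    assert len(values) == GROUP_SIZE
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the group onto the largest grid value (6.0).
    scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        # Round the scaled magnitude to the nearest representable grid point.
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-mag if v < 0 else mag)
    return scale, codes


def dequantize_group(scale, codes):
    """Reconstruct approximate values from the shared scale and FP4 codes."""
    return [scale * c for c in codes]


weights = [0.01 * i - 0.08 for i in range(GROUP_SIZE)]  # toy data, not real model weights
scale, codes = quantize_group(weights)
restored = dequantize_group(scale, codes)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.5f} max_abs_error={max_err:.5f}")
```

Because the grid spacing widens toward 6.0, the worst-case rounding error in a group is one scale unit, which is why a single outlier in a group of 16 costs precision for its neighbours.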
Calibration caveat
ultrachat_200k is English-only synthetic chat. The originally planned mix with Salesforce/xlam-function-calling-60k was blocked at run time because that dataset is gated on HF, so calibration fell back to the proven 2601 recipe. The model's FC distribution is already encoded in its SFT weights, but per-tensor scales were computed against a chat-only activation distribution, which may underweight tool-call activation patterns. BSC themselves note the FC fine-tune is "primarily evaluated and optimized for English" — so English tool calling and chat should be well-served, but multilingual tool calling (Spanish/Catalan/Basque/Galician) may show more quantization degradation than prose. Evaluate on your own multilingual tool-call task before deploying.
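One way to run that check is to score tool-call accuracy per language on a small labelled set. The sketch below assumes a `predict` callable that sends a prompt to the model and returns the parsed tool call (or None); that callable, the case format, and both helper names are hypothetical. It uses exact matching on name and arguments, which is a simplification of what a full benchmark would do.

```python
from collections import defaultdict
from typing import Callable, Optional


def call_matches(predicted: Optional[dict], expected: dict) -> bool:
    """Exact match on tool name and arguments."""
    return (
        predicted is not None
        and predicted.get("name") == expected["name"]
        and predicted.get("arguments") == expected["arguments"]
    )


def accuracy_by_language(cases: list, predict: Callable[[str], Optional[dict]]) -> dict:
    """cases: dicts with 'lang', 'prompt', 'expected'; predict: prompt -> parsed tool call."""
    hits, totals = defaultdict(int), defaultdict(int)
    for case in cases:
        totals[case["lang"]] += 1
        hits[case["lang"]] += call_matches(predict(case["prompt"]), case["expected"])
    return {lang: hits[lang] / totals[lang] for lang in totals}
```

Running the same cases against the BF16 base model and the NVFP4 checkpoint, then comparing the per-language scores, separates quantization loss from limitations already present in the fine-tune.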
License & attribution
Released under the same Apache 2.0 license as the source.
- Base model: BSC-LT/ALIA-40b-fc-2605 by Barcelona Supercomputing Center (BSC). Please cite their work if you use this model in research:
@misc{alia-40b-fc-2605,
author = {Barcelona Supercomputing Center},
title = {ALIA-40b-fc-2605},
year = {2026},
url = {https://huggingface.co/BSC-LT/ALIA-40b-fc-2605}
}
- NVFP4 quantization: Montevive AI.
Limitations
Inherits all limitations of the base ALIA-40b-fc-2605 model:
- Not safety-aligned. Instruction-tuned but lacks value alignment per BSC's card.
- Tool calling is English-optimized. BSC's BFCL numbers (Non-Live Multiple AST 94.5%, Live Multiple AST 74.4%) are English; multilingual coverage is on BSC's roadmap.
- Multi-turn tool calling is weak (BFCL Multi-Turn Base 15.5%); avoid long agentic loops without external scaffolding.
Plus standard NVFP4 caveats:
- NVFP4 inference quality is below BF16/FP8. Empirically fine on 40B but verify on your task.
- NVFP4 + Blackwell + vLLM is recent — expect API churn. Tested against vLLM 0.20.0 / torch 2.11+cu130 / flashinfer 0.6.8.post1.
- transformers users will fall back to BF16 (no native NVFP4 in stock transformers).
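Given the weak multi-turn score noted above, one simple form of external scaffolding is a hard cap on tool rounds. The sketch below is an assumption-laden illustration: `generate` is a hypothetical callable that takes the message history and returns an assistant message dict, and the message shape is simplified (tool calls as plain dicts with `name` and `arguments`) rather than the exact OpenAI schema.

```python
import json
from typing import Callable


def run_bounded_agent(
    messages: list,
    generate: Callable[[list], dict],
    tools: dict,
    max_steps: int = 3,
):
    """Run at most `max_steps` tool rounds; returns (history, finished)."""
    history = list(messages)  # avoid mutating the caller's list
    for _ in range(max_steps):
        reply = generate(history)  # hypothetical: returns an assistant message dict
        history.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return history, True  # plain answer, so the loop is done
        for call in calls:
            fn = tools.get(call["name"])
            if fn is None:
                result = json.dumps({"error": "unknown tool"})
            else:
                result = fn(**call["arguments"])
            history.append({"role": "tool", "name": call["name"], "content": result})
    # Budget exhausted: hand control back to the caller instead of looping on.
    return history, False
```

When `finished` is False, the caller can fall back to a canned reply or escalate, rather than letting the model keep issuing calls.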