montevive

ALIA-40b-fc-2605-NVFP4


Tool calling

Tool calling is what differentiates this model from ALIA-40b-instruct-2601. Given a tools array, the model emits <tool_call>{...}</tool_call> blocks containing OpenAI-style function-call JSON. The output is compatible with vLLM's --tool-call-parser hermes and llama.cpp's hermes parser.

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "City and country"}},
            "required": ["location"],
        },
    },
}]

resp = client.chat.completions.create(
    model="alia-fc-nvfp4",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
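The request above only prints the returned tool calls. A minimal sketch of the next step, executing each call locally and packaging the result as a `role="tool"` message, could look like the following. The `get_weather` body, `TOOL_REGISTRY`, `run_tool_call`, and `tool_messages` are hypothetical names introduced for illustration; only the message shape (`tool_call_id`, JSON-string `arguments`) follows the OpenAI chat format.

```python
import json


def get_weather(location: str) -> str:
    """Hypothetical stand-in for a real weather lookup."""
    return f"18 C in {location}"


# Hypothetical registry mapping tool names to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str, registry=TOOL_REGISTRY) -> str:
    """Decode the JSON arguments string and invoke the matching callable."""
    if name not in registry:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"invalid arguments: {exc}"})
    if not isinstance(kwargs, dict):
        return json.dumps({"error": "arguments must be a JSON object"})
    return json.dumps({"result": registry[name](**kwargs)})


def tool_messages(tool_calls, registry=TOOL_REGISTRY) -> list:
    """Build one role="tool" message per tool call, ready to append to the chat."""
    return [
        {
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool_call(
                call.function.name, call.function.arguments, registry
            ),
        }
        for call in tool_calls
    ]
```

To continue the conversation, append the assistant message and the output of `tool_messages(resp.choices[0].message.tool_calls)` to `messages`, then call `client.chat.completions.create` again.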

Recommended runtime: vLLM

bash
pip install vllm   # 0.20+

# DGX Spark / Blackwell: CUDA 13 nvcc required for flashinfer JIT on sm_120
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=/usr/local/cuda-13.0/bin:$PATH

vllm serve montevive/ALIA-40b-fc-2605-NVFP4 \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.55 \
    --served-model-name alia-fc-nvfp4 \
    --tool-call-parser hermes \
    --enable-auto-tool-choice

Or offline:

python
from vllm import LLM, SamplingParams

llm = LLM(
    model="montevive/ALIA-40b-fc-2605-NVFP4",
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.55,
)
sp = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=200)
# `tools` is the list defined in the tool-calling example above.
out = llm.chat(
    [[{"role": "user", "content": "What's the temperature in Madrid?"}]],
    sp,
    chat_template_kwargs={"tools": tools},
)
print(out[0].outputs[0].text)

Recommended sampling (per the base model card): temperature between 0 and 0.2; avoid repetition penalties — they degrade instruction-following and tool-call validity.
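As a rough illustration of the sampling advice, the sketch below builds request arguments for the OpenAI-compatible endpoint with a low temperature and no repetition-related penalties. `build_chat_request` is a hypothetical helper, the model name is the `--served-model-name` from the serve command, and the 200-token limit is borrowed from the offline example.

```python
def build_chat_request(messages, tools, temperature=0.1, max_tokens=200):
    """Return kwargs for client.chat.completions.create with conservative sampling."""
    if not 0.0 <= temperature <= 0.2:
        raise ValueError("temperature should stay between 0 and 0.2 for this model")
    # No frequency, presence, or repetition penalty is set on purpose.
    return {
        "model": "alia-fc-nvfp4",
        "messages": messages,
        "tools": tools,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
```

Usage would be `client.chat.completions.create(**build_chat_request(messages, tools))`.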

Performance (DGX Spark, GB10 Blackwell, NVFP4)

This model is architecturally identical to ALIA-40b-instruct-2601, so per-token throughput matches: ~8.7 tok/s steady-state for a single prompt with vLLM 0.20+, and ~10 tok/s with llama.cpp --jinja on the same hardware. Aggregate throughput under continuous batching across multiple users is considerably higher.
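To put the single-prompt figures in perspective, here is a back-of-the-envelope estimate of decode time for a 200-token reply. It is a sketch that assumes the steady-state rates quoted above and ignores prompt prefill time; `decode_seconds` is a hypothetical helper.

```python
def decode_seconds(new_tokens: int, tokens_per_second: float) -> float:
    """Estimate generation time from a steady-state decode rate."""
    return new_tokens / tokens_per_second


for runtime, rate in {"vLLM": 8.7, "llama.cpp": 10.0}.items():
    print(f"{runtime}: {decode_seconds(200, rate):.1f} s for 200 tokens")
# vLLM: 23.0 s for 200 tokens
# llama.cpp: 20.0 s for 200 tokens
```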

Chat template

The FC model uses ChatML with OpenAI-format tool-call rendering. tokenizer.apply_chat_template(..., tools=tools) produces the canonical prompt; vLLM honors this automatically. The model emits:

text
<tool_call>
{"name": "get_weather", "arguments": {"location": "Madrid, Spain"}}
</tool_call>

Configure vLLM with --tool-call-parser hermes (together with --enable-auto-tool-choice, as in the serve command) so that the OpenAI-style tool_calls field is populated on the response.
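When the raw text is all you have (for example the offline `llm.chat` output, or a runtime that leaves parsing to the caller), the `<tool_call>` blocks can be extracted by hand. The following is a minimal sketch, not the parser vLLM uses; `parse_tool_calls` and `TOOL_CALL_RE` are hypothetical names.

```python
import json
import re

# Non-greedy match so several <tool_call> blocks in one reply are split correctly.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list:
    """Return the well-formed tool calls found in raw model output."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole reply
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    return calls
```

Malformed blocks are skipped silently here; a stricter caller may prefer to log or retry them.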

Compatibility matrix

| Runtime | Native NVFP4 on Blackwell | Tool calling | Notes |
| --- | --- | --- | --- |
| vLLM (recommended) | ✅ Yes | ✅ via `--tool-call-parser hermes` | Production-ready. CUDA 13 nvcc required on host for first-run JIT. |
| TensorRT-LLM | ✅ Yes | ✅ (parse from response) | Same `compressed-tensors` format. Heavier setup. |
| HuggingFace `transformers` | ❌ Dequant-to-BF16 at load | ✅ via | |

Quantization details

Tool: llmcompressor 0.10+ with compressed-tensors 0.14+
Scheme: NVFP4 (nvfp4-pack-quantized)
Weights: 4-bit float, group_size=16, symmetric, scale_dtype=float8_e4m3fn
Input activations: 4-bit float, dynamic local, group_size=16, symmetric
Ignored layers: lm_head
Calibration: 512 samples of HuggingFaceH4/ultrachat_200k (train_sft split), max_seq_len 2048
Calibration runtime: ~1.5 h on dual RTX 3090 Ti (sequential per-layer offload)
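To make the weight scheme above concrete, here is a simplified pure-Python sketch of symmetric 4-bit float quantization with one scale per group of 16 values. It is an illustration under stated assumptions, not the llmcompressor implementation: the scale is kept in full precision instead of being rounded to float8_e4m3fn, and the additional per-tensor scale used by NVFP4 is omitted. `quantize_group` and `dequantize_group` are hypothetical names.

```python
# Magnitudes representable by a 4-bit E2M1 float (a sign bit is stored separately).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16


def quantize_group(values):
    """Return (scale, values snapped to the FP4 grid) for one group of 16 weights."""
    if len(values) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} values per group")
    amax = max(abs(v) for v in values)
    # Symmetric scaling: map the largest magnitude onto the largest FP4 value (6.0).
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        magnitude = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(abs(v) / scale - m))
        codes.append(-magnitude if v < 0 else magnitude)
    return scale, codes


def dequantize_group(scale, codes):
    """Reconstruct approximate weights from the group scale and FP4 values."""
    return [scale * c for c in codes]
```

With an 8-bit scale shared by 16 four-bit values, storage works out to 4 + 8/16 = 4.5 bits per quantized weight, before counting unquantized layers such as lm_head.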

Calibration caveat

ultrachat_200k is English-only synthetic chat. The originally planned mix with Salesforce/xlam-function-calling-60k was blocked at run time because that dataset is gated on HF, so calibration fell back to the proven recipe from ALIA-40b-instruct-2601. The model's FC distribution is already encoded in its SFT weights, but per-tensor scales were computed against a chat-only activation distribution, which may underweight tool-call activation patterns. BSC themselves note the FC fine-tune is "primarily evaluated and optimized for English", so English tool calling and chat should be well served, but multilingual tool calling (Spanish/Catalan/Basque/Galician) may degrade more under quantization than prose does. Evaluate on your own multilingual tool-call task before deploying.
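One way to run such an evaluation is to score tool-call exact match per language. The sketch below is hypothetical scaffolding, not a benchmark shipped with the model: `tool_call_accuracy`, the case format, and the `predict` callable (which would wrap a request to the served model and return the first parsed tool call, or `None`) are all assumptions.

```python
def tool_call_accuracy(cases, predict):
    """Return per-language exact-match accuracy.

    cases:   list of dicts with "language", "prompt", and "expected",
             where "expected" is {"name": ..., "arguments": {...}}.
    predict: callable taking a prompt and returning a tool-call dict or None.
    """
    totals, hits = {}, {}
    for case in cases:
        lang = case["language"]
        totals[lang] = totals.get(lang, 0) + 1
        # Exact match on both the function name and every argument.
        if predict(case["prompt"]) == case["expected"]:
            hits[lang] = hits.get(lang, 0) + 1
    return {lang: hits.get(lang, 0) / totals[lang] for lang in totals}
```

Exact match is deliberately strict; for free-text arguments such as city names, a normalized comparison may be more appropriate. Comparing the scores against the unquantized base model on the same cases isolates the effect of quantization.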

License & attribution

Released under the same Apache 2.0 license as the source.

Base model: BSC-LT/ALIA-40b-fc-2605 by Barcelona Supercomputing Center (BSC). Please cite their work if you use this model in research:

bibtex
@misc{alia-40b-fc-2605,
  author = {Barcelona Supercomputing Center},
  title  = {ALIA-40b-fc-2605},
  year   = {2026},
  url    = {https://huggingface.co/BSC-LT/ALIA-40b-fc-2605}
}

NVFP4 quantization: Montevive AI.

Limitations

Inherits all limitations of the base ALIA-40b-fc-2605 model:

Not safety-aligned. Instruction-tuned but lacks value alignment per BSC's card.
Tool calling is English-optimized. BSC's BFCL numbers (Non-Live Multiple AST 94.5%, Live Multiple AST 74.4%) are English; multilingual coverage is on BSC's roadmap.
Multi-turn tool calling is weak (BFCL Multi-Turn Base 15.5%); avoid long agentic loops without external scaffolding.
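One simple form of external scaffolding for the multi-turn weakness is a hard cap on the number of tool-calling turns. The sketch below is an assumption about how that could be structured, not a recommendation from BSC's card; `run_bounded_loop` and the `step` callable (which would send `messages` to the model, execute any tool calls, and return the assistant message plus the resulting tool messages) are hypothetical.

```python
def run_bounded_loop(step, messages, max_turns=3):
    """Run a tool-calling loop that stops after max_turns model calls.

    step: callable taking the message list and returning
          (assistant_message, tool_result_messages).
    Returns (messages, finished), where finished is False if the budget ran out.
    """
    messages = list(messages)  # avoid mutating the caller's list
    for _ in range(max_turns):
        assistant_message, tool_results = step(messages)
        messages.append(assistant_message)
        if not tool_results:
            return messages, True  # the model answered without requesting tools
        messages.extend(tool_results)
    return messages, False
```

When `finished` is `False`, the caller can fall back to a fixed reply or hand the conversation to a human rather than letting the loop continue.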

Plus standard NVFP4 caveats:

NVFP4 inference quality is below BF16/FP8. Empirically fine on 40B but verify on your task.
NVFP4 + Blackwell + vLLM is recent — expect API churn. Tested against vLLM 0.20.0 / torch 2.11+cu130 / flashinfer 0.6.8.post1.
transformers users will fall back to BF16 (no native NVFP4 in stock transformers).

Model provider: montevive
Model tree: base BSC-LT/ALIA-40b-fc-2605; quantized: this model
Modalities: text input, text output


