
License: apache-2.0

Model Details

Description

ALIA-40b is a transformer-based, decoder-only language model that was pre-trained from scratch on 9.37 trillion tokens of meticulously curated data. It subsequently underwent continued pretraining on an additional 424 billion high-quality tokens, and was further extended with a supplementary 39 billion tokens drawn from a similarly diverse mixture, totalling 9.83 trillion tokens.
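As a quick sanity check, the three stage sizes quoted above do add up to the stated total. The variable names below are illustrative only.

```python
# Token counts per training stage, as quoted in the description.
pretraining_tokens = 9_370_000_000_000          # initial pre-training (9.37T)
continued_pretraining_tokens = 424_000_000_000  # continued pretraining (424B)
supplementary_tokens = 39_000_000_000           # supplementary extension (39B)

total_tokens = (
    pretraining_tokens + continued_pretraining_tokens + supplementary_tokens
)
print(f"{total_tokens / 1e12:.2f} trillion tokens")
```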

ALIA-40b-fc is a fine-tuned variant of ALIA-40b. In contrast to previous versions, its development process comprises only two consecutive stages, each targeting a specific capability: (1) long-context adaptation to extend the model’s context window, and (2) supervised fine-tuning to improve function calling capabilities. This means that this checkpoint has not yet undergone an alignment process, unlike previous versions.

After long-context adaptation, our post-training process consists of a single supervised fine-tuning (SFT) stage that strengthens function calling and adds conversational capabilities.

Although the base model is highly multilingual, the post-training process focused primarily on English due to the limited availability of high-quality datasets in other languages. Evaluation coverage outside English also remains limited. Future releases aim to further strengthen multilingual capabilities through the generation of high-quality synthetic data.

Hyperparameters

Here we list the specific hyperparameters used during the different training stages.

Long context CPT

| Hyperparameter | Value |
|---|---|
| Learning rate | 9e-7 |
| LR Scheduler | Constant |
| Tokens per update | 4M |
| Training tokens (4k → 32k) | 2B |
| Training tokens (32k → 160k) | 36.8B |
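To give a feel for these numbers, the sketch below derives the approximate number of optimizer updates in each long-context phase. It assumes that "4M" means 4,000,000 tokens (4 × 2^20 is also plausible) and that every update consumes exactly that many tokens; neither detail is stated above.

```python
# Assumption: "4M" tokens per update is read as a decimal 4,000,000.
TOKENS_PER_UPDATE = 4_000_000

# Training tokens per context-extension phase, from the table above.
stage_tokens = {
    "4k -> 32k": 2_000_000_000,
    "32k -> 160k": 36_800_000_000,
}

# Approximate number of optimizer updates per phase.
updates = {
    stage: tokens // TOKENS_PER_UPDATE for stage, tokens in stage_tokens.items()
}

for stage, count in updates.items():
    print(f"{stage}: ~{count:,} updates")
```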

Supervised Fine-Tuning (SFT)

| Hyperparameter | Value |
|---|---|
| Learning rate | 5e-6 |
| Batch size | 256 |
| Epochs | 1 |
| LR Scheduler | Cosine |
| Warmup Ratio | 4% |
| Total Steps | 5,687 |
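The SFT values above can be turned into a learning-rate curve. The sketch below is one common reading of "cosine with 4% warmup": linear warmup to the peak rate followed by cosine decay. The linear warmup shape and the decay floor of zero are assumptions, not details confirmed by this card.

```python
import math

PEAK_LR = 5e-6
TOTAL_STEPS = 5_687
WARMUP_STEPS = round(0.04 * TOTAL_STEPS)  # 4% warmup ratio


def learning_rate(step: int, min_lr: float = 0.0) -> float:
    """Learning rate at a given optimizer step (0-indexed).

    Assumes linear warmup followed by cosine decay down to `min_lr`.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{learning_rate(step):.3e}")
```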

Architecture

| Attribute | Value |
|---|---|
| Total Parameters | 40,433,885,184 |
| Embedding Parameters | 2,097,152,000 |
| Layers | 48 |
| Hidden size | 8,192 |
| Attention heads | 64 |
| Context length | 163,840 |
| Vocabulary size | 256,000 |
| Precision | bfloat16 |
| Embedding type | RoPE |
| Activation Function | SwiGLU |
| Layer normalization | RMS Norm |
| Flash attention | |
| Grouped Query Attention | |
| Num. query groups | 8 |
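A few useful quantities follow directly from this table. The sketch below checks the embedding parameter count and estimates the key/value cache size at the full context length. It assumes that the head dimension is the hidden size divided by the number of attention heads, and that "Num. query groups" equals the number of key/value heads; both are common conventions but are not stated explicitly here.

```python
# Values copied from the architecture table.
LAYERS = 48
HIDDEN_SIZE = 8_192
ATTENTION_HEADS = 64
KV_HEADS = 8  # assumption: "Num. query groups" = number of key/value heads
VOCAB_SIZE = 256_000
CONTEXT_LENGTH = 163_840
BYTES_PER_VALUE = 2  # bfloat16

# Assumption: head dimension = hidden size / attention heads.
head_dim = HIDDEN_SIZE // ATTENTION_HEADS

# One embedding vector of size HIDDEN_SIZE per vocabulary entry.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE

# Keys and values (factor 2), for every layer and every KV head.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * head_dim * BYTES_PER_VALUE
kv_gib_full_context = kv_bytes_per_token * CONTEXT_LENGTH / 2**30

print(f"head dim: {head_dim}")
print(f"embedding parameters: {embedding_params:,}")
print(f"KV cache per token: {kv_bytes_per_token:,} bytes")
print(f"KV cache at full context: {kv_gib_full_context:.1f} GiB per sequence")
```

Under these assumptions, grouped-query attention keeps the cache eight times smaller than it would be with one key/value head per attention head.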

Intended Use

Direct Use

ALIA‑40b‑fc is primarily optimized for robust and reliable function calling in tool-augmented and multi-turn conversational settings, while remaining capable of supporting other general-purpose language tasks. As with all models in the ALIA family, it is released openly to support both research and commercial use in any of the covered languages.

Out-of-scope Use

The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.


Hardware and Software

Training Framework

The post-training process was conducted in NeMo-RL, with minor modifications to adapt it to our infrastructure.

Compute Infrastructure

All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:

  • 4x NVIDIA Hopper GPUs with 64 GB HBM2 memory
  • 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz and 32c each (64 cores)
  • 4x NDR200 (BW per node 800 Gb/s)
  • 512 GB of main memory (DDR5)
  • 460 GB of NVMe storage

The SFT stage was run across 8 nodes with a total of 32 GPUs.
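The figures above also hint at what is needed to host the model. The rough estimate below covers the bfloat16 weights only (no activations, optimizer state, or KV cache) and treats "64 GB" as decimal gigabytes, which is an assumption.

```python
import math

TOTAL_PARAMETERS = 40_433_885_184  # from the architecture table
BYTES_PER_PARAMETER = 2            # bfloat16
GPU_MEMORY_GB = 64                 # per GPU, assumed decimal gigabytes
GPUS_PER_NODE = 4
SFT_NODES = 8

# Memory taken by the weights alone.
weights_gb = TOTAL_PARAMETERS * BYTES_PER_PARAMETER / 1e9

# Lower bound on the number of such GPUs needed just to hold the weights.
min_gpus_for_weights = math.ceil(weights_gb / GPU_MEMORY_GB)

# GPUs used in the SFT stage.
sft_gpus = SFT_NODES * GPUS_PER_NODE

print(f"weights: {weights_gb:.2f} GB")
print(f"GPUs needed for weights alone: {min_gpus_for_weights}")
print(f"GPUs used for SFT: {sft_gpus}")
```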


How to use

The model can be used either directly in Python using the transformers library or deployed as a service and used through standard API calls.

The former gives the most control over the inference process, but it requires running the code on a machine with a sufficiently powerful GPU to host the model locally, and it is more error-prone than the alternative. We therefore strongly recommend the latter: deploying the model as a service can be done either locally or on a remote server, and it makes the model available to multiple clients in parallel, among other advantages.

Unless you have very specific needs (e.g. for research) that require adapting the inference process, it is preferable to follow the "deployment as a service" guidelines below.

In any case, we recommend using a temperature setting close to zero (0.0–0.2) to achieve optimal performance.
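To illustrate why a low temperature makes outputs close to deterministic, the sketch below applies temperature scaling to a made-up set of logits. The logits are purely illustrative and the function is a generic softmax, not code from this model. A temperature of exactly 0.0 would divide by zero here; inference servers typically treat it as greedy decoding instead.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits into probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

probs_low = softmax_with_temperature(logits, 0.1)
probs_default = softmax_with_temperature(logits, 1.0)

print("temperature 0.1:", [round(p, 4) for p in probs_low])
print("temperature 1.0:", [round(p, 4) for p in probs_default])
```

At temperature 0.1 almost all probability mass falls on the top-scoring token, which is usually what you want when the output must be a well-formed tool call.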

Local inference with Python / transformers

The model utilizes the widely adopted ChatML template to structure conversational inputs and outputs. Using this standardized chat format ensures a consistent and enhanced conversational experience. The template can be easily applied through the tokenizer’s built-in functions, as illustrated in the example snippet below:

python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "BSC-LT/ALIA-40b-fc-2605"
text = "What is the weather like in Paris today?"

# Load the tokenizer and the model in bfloat16, spread across the available GPUs.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

message = [{"role": "user", "content": text}]

# Tool made available to the model; "parameters" is a JSON Schema object.
tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Get current temperature for a given location.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City and country e.g. Bogotá, Colombia",
            },
        },
        "required": ["location"],
        "additionalProperties": False,
    },
}]

# Render the ChatML prompt, including the tool definitions.
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    tools=tools,
)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=1000)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))

Output:

text

<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris, France"}}
</tool_call>
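When using local inference, the tool call arrives as plain text and has to be parsed by the caller. The sketch below is one minimal way to do that, assuming the output follows the `<tool_call>` format shown above; the helper name `extract_tool_calls` is hypothetical and not part of any library.

```python
import json
import re

# Matches the JSON payload between <tool_call> and </tool_call> tags.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(generated_text):
    """Return the list of tool calls found in the generated text.

    Payloads that are not valid JSON are skipped rather than raising.
    """
    calls = []
    for match in TOOL_CALL_PATTERN.finditer(generated_text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue
    return calls


sample_output = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"location": "Paris, France"}}\n'
    "</tool_call>"
)

calls = extract_tool_calls(sample_output)
print(calls[0]["name"], calls[0]["arguments"])
```

An empty list means the model answered in plain text, in which case the generated text can be shown to the user directly.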

Deployment as service and remote use (Messages API)

  1. Deploy the model using the vLLM Docker image:

shell

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 80:80 \
    vllm/vllm-openai:latest \
    --model BSC-LT/ALIA-40b-fc-2605 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --max_model_len 8196 \
    --port 80

  2. Once the deployment is running, interact with the model through the OpenAI-compatible API:

python

from openai import OpenAI

# The port must match the one published by the container (-p 80:80).
client = OpenAI(
    base_url="http://localhost:80/v1/",
    api_key="hf_xxxx",
)

# Use the first (and only) model served by the deployment.
models = client.models.list()
model = models.data[0].id

# Tool definition in the Chat Completions format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Bogotá, Colombia",
                },
            },
            "required": ["location"],
            "additionalProperties": False,
        },
    },
}]

system_message = ""
messages = [{"role": "system", "content": system_message}] if system_message else []
messages.append({"role": "user", "content": "What is the weather like in Paris today?"})
print(messages)

chat_completion = client.chat.completions.create(
    model=model,
    tools=tools,
    messages=messages,
    stream=False,
    max_tokens=1000,
    temperature=0.1,
    frequency_penalty=0.2,
)
msg = chat_completion.choices[0].message

# --- HANDLE TOOL CALL OR NORMAL CONTENT ---
if not getattr(msg, "tool_calls", None):
    # Normal assistant message
    print(msg.content)
    messages.append({
        "role": "assistant",
        "content": msg.content,
    })
else:
    # Assistant tool call message
    print(msg.tool_calls)
    messages.append({"role": "assistant", "tool_calls": msg.tool_calls})

    # --- Fake tool execution example ---
    tool_call = msg.tool_calls[0]

    # Example: handle the get_weather tool
    if tool_call.function.name == "get_weather":
        # Fake tool result (this would come from your actual backend)
        fake_tool_result = '{"temperature": 18, "unit": "C", "description": "Partly cloudy in Paris"}'

        # Append the tool result message so the model can use it in the next turn
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": fake_tool_result,
        })
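In a real application the hard-coded tool result would be replaced by a call into your own backend. The sketch below shows one possible dispatch pattern: a registry that maps tool names to Python functions. `TOOL_REGISTRY`, `run_tool`, and the stub `get_weather` are hypothetical names introduced for illustration; the only assumption taken from the API is that tool arguments arrive as a JSON string.

```python
import json


def get_weather(location):
    """Hypothetical stand-in for a real weather backend."""
    return {
        "temperature": 18,
        "unit": "C",
        "description": f"Partly cloudy in {location}",
    }


# Maps the tool names exposed to the model onto Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool(name, arguments_json):
    """Execute one tool call and return a JSON string for the "tool" message.

    Errors are returned as JSON instead of raised, so the model can see
    what went wrong in the next turn.
    """
    if name not in TOOL_REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        arguments = json.loads(arguments_json)
        result = TOOL_REGISTRY[name](**arguments)
    except (json.JSONDecodeError, TypeError) as exc:
        return json.dumps({"error": str(exc)})
    return json.dumps(result)


result = run_tool("get_weather", '{"location": "Paris, France"}')
print(result)
```

The returned string can be used as the `content` of the `"tool"` message, after which a second request with the extended message list lets the model phrase the final answer.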

Training Data

The dataset used in the supervised fine-tuning stage is built from a mixture of high-quality, permissively licensed datasets developed by third parties and synthetic data generated in-house using DeepSeek-V3-0324.

The table below provides a detailed breakdown of the datasets included in this mixture:

| Dataset | Generation Method | License | Instances |
|---|---|---|---|
| nvidia/When2Call | Synthetic | cc-by-4.0 | 14,800 |
| Salesforce/xlam-function-calling-60k | Synthetic | cc-by-4.0 | 59,800 |
| glaiveai/glaive-function-calling-v2 | Synthetic | apache-2.0 | 102,891 |
| Team-ACE/ToolACE | Synthetic | apache-2.0 | 11,068 |
| Agent-Ark/Toucan-1.5M | Synthetic | apache-2.0 | 119,079 |
| allenai/Dolci-Instruct-SFT-Tool-Use-SA | Synthetic | cc-by-sa-4.0 | 1,369 |
| In-house function calling data (synthetically generated) | Synthetic | apache-2.0 | 19,227 |
| Instruction-tuning data (see ALIA-40b-instruct) | Mix | apache-2.0 | 399,800 |
| Total | | | 728,034 |

Note: Counts may differ slightly from the original datasets due to quality filtering (e.g., removal of poorly formatted or invalid samples) and because a small portion of each dataset was held out for validation purposes (total of 2,000 instances).
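To make the composition of the mixture easier to read, the sketch below recomputes the total and each dataset's share from the instance counts listed above. The shortened dictionary keys are for illustration only.

```python
# Instance counts copied from the table above.
instances = {
    "nvidia/When2Call": 14_800,
    "Salesforce/xlam-function-calling-60k": 59_800,
    "glaiveai/glaive-function-calling-v2": 102_891,
    "Team-ACE/ToolACE": 11_068,
    "Agent-Ark/Toucan-1.5M": 119_079,
    "allenai/Dolci-Instruct-SFT-Tool-Use-SA": 1_369,
    "In-house function calling data": 19_227,
    "Instruction-tuning data": 399_800,
}

total = sum(instances.values())
shares = {name: count / total for name, count in instances.items()}

# Everything except the general instruction-tuning data targets function calling.
function_calling_share = 1 - shares["Instruction-tuning data"]

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.1%}")
print(f"function calling data overall: {function_calling_share:.1%}")
```

By instance count, a little over half of the mixture is general instruction-tuning data and the remainder is function calling data.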

Evaluation

The model’s function-calling (FC) capabilities were evaluated using the BFCL benchmark, which is widely regarded as a standard and comprehensive suite for assessing tool-use and function invocation performance in large language models.

| Metric | Category | Score |
|---|---|---|
| Simple AST | Non-Live | 71.0% |
| Multiple AST | Non-Live | 94.5% |
| Parallel AST | Non-Live | 80.5% |
| Parallel Multiple AST | Non-Live | 81.5% |
| Simple AST | Live | 74.8% |
| Multiple AST | Live | 74.4% |
| Parallel AST | Live | 56.3% |
| Parallel Multiple AST | Live | 70.8% |
| Base | Multi-Turn | 15.5% |
| Miss Func | Multi-Turn | 2.0% |
| Miss Param | Multi-Turn | 12.0% |
| Long Context | Multi-Turn | 7.0% |
| Relevance Detection | Hallucination | 81.3% |
| Irrelevance Detection | Hallucination | 84.0% |
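For a quick per-category summary, the sketch below takes the unweighted mean of the scores within each category. This is only a reading aid: it is not the official BFCL aggregate, which may weight its subcategories differently.

```python
from collections import defaultdict

# (metric, category, score in %) copied from the table above.
scores = [
    ("Simple AST", "Non-Live", 71.0),
    ("Multiple AST", "Non-Live", 94.5),
    ("Parallel AST", "Non-Live", 80.5),
    ("Parallel Multiple AST", "Non-Live", 81.5),
    ("Simple AST", "Live", 74.8),
    ("Multiple AST", "Live", 74.4),
    ("Parallel AST", "Live", 56.3),
    ("Parallel Multiple AST", "Live", 70.8),
    ("Base", "Multi-Turn", 15.5),
    ("Miss Func", "Multi-Turn", 2.0),
    ("Miss Param", "Multi-Turn", 12.0),
    ("Long Context", "Multi-Turn", 7.0),
    ("Relevance Detection", "Hallucination", 81.3),
    ("Irrelevance Detection", "Hallucination", 84.0),
]

# Group the scores by category, then average each group.
by_category = defaultdict(list)
for _metric, category, score in scores:
    by_category[category].append(score)

category_means = {
    category: sum(values) / len(values) for category, values in by_category.items()
}

for category, mean in category_means.items():
    print(f"{category}: {mean:.1f}%")
```

Seen this way, single-turn function calling is considerably stronger than multi-turn use, which is worth keeping in mind when planning longer tool-use sessions.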

Ethical Considerations and Limitations

The ALIA-40b-fc model is an instruction-tuned variant. It has several limitations that users should be aware of. Ongoing work is addressing these areas, including comprehensive evaluation of societal and cognitive biases as well as safety.

Functional Limitations:

  • Reasoning & Math: The model is not guaranteed to perform robust chain-of-thought reasoning or advanced mathematics. Complex logical puzzles or multi-step inferences may fail or produce inconsistent answers.
  • Code Generation: Although exposed to code during pretraining, ALIA-40b-fc is not a specialized code-generation model. It may produce code-like text, but outputs should be verified and tested before use in production codebases.
  • Agentive Capabilities: The model does not have agentive or autonomous action capabilities. It cannot act as an autonomous agent or execute multi-step workflows.

Recommendations:

Developers should implement additional safety filters, human oversight, targeted evaluation suites, and secondary evaluation models when deploying this model. Do not deploy ALIA-40b-fc in critical applications without extensive testing and mitigation. Users are responsible for assessing and mitigating harmful behavior or misinformation resulting from model outputs, and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.
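One simple safety filter is to validate the arguments of every tool call against the tool's declared schema before executing anything. The sketch below is a deliberately small validator, not a full JSON Schema implementation: it checks only required fields, unexpected fields, and a few basic types. The function name `validate_arguments` is hypothetical.

```python
# JSON Schema types checked by this sketch; other types are not validated.
JSON_TYPES = {"string": str, "boolean": bool, "object": dict, "array": list}


def validate_arguments(arguments, schema):
    """Return a list of problems found in the arguments (empty if none)."""
    if not isinstance(arguments, dict):
        return ["arguments must be a JSON object"]

    errors = []
    properties = schema.get("properties", {})

    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing required argument: {name}")

    for name, value in arguments.items():
        if name not in properties:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected argument: {name}")
            continue
        expected = JSON_TYPES.get(properties[name].get("type"))
        if expected is not None and not isinstance(value, expected):
            errors.append(f"wrong type for argument: {name}")

    return errors


# Parameter schema of the get_weather tool used in the examples above.
weather_schema = {
    "type": "object",
    "properties": {"location": {"type": "string"}},
    "required": ["location"],
    "additionalProperties": False,
}

print(validate_arguments({"location": "Paris, France"}, weather_schema))
print(validate_arguments({"city": "Paris"}, weather_schema))
```

Calls that fail validation can be rejected, logged, or sent to a human reviewer instead of being executed.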


Additional information

Author

The Language Modeling team from AI Institute at Barcelona Supercomputing Center.

Contact

For further information, please send an email to ai_institute_languagemodeling@bsc.es.

Copyright

Copyright (c) 2026 by The Language Modeling team from AI Institute at Barcelona Supercomputing Center.

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.

This work has been promoted and supported by the Government of Catalonia through the Aina Project.

Acknowledgements

This project has benefited from the contributions of numerous teams and institutions, mainly through data contributions, knowledge transfer or technical support.

We are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria. Many other institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà. We thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration.

We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially to: Marcelo Sanchez, Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipe Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.

Their valuable efforts have been instrumental in the development of this work.

Disclaimer

Be aware that the model may show biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

Citation

bibtex

@misc{gonzalezagirre2025salamandratechnicalreport,
    title={Salamandra Technical Report},
    author={Aitor Gonzalez-Agirre and Marc Pàmies and Joan Llop and Irene Baucells and Severino Da Dalt and Daniel Tamayo and José Javier Saiz and Ferran Espuña and Jaume Prats and Javier Aula-Blasco and Mario Mina and Adrián Rubio and Alexander Shvets and Anna Sallés and Iñaki Lacunza and Iñigo Pikabea and Jorge Palomar and Júlia Falcão and Lucía Tormo and Luis Vasquez-Reina and Montserrat Marimon and Valle Ruíz-Fernández and Marta Villegas},
    year={2025},
    eprint={2502.08489},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2502.08489},
}

License

Apache License, Version 2.0

Model Index

| Model | Base | Instruct | Function Calling |
|---|---|---|---|
| 2b | Link | Link | N/A |
| 7b | Link | Link | N/A |
| 40b | Link | Link | Link |

Model provider

BSC-LT

Model tree

Base

BSC-LT/ALIA-40b

Fine-tuned

this model

Modalities

Input

Text

Output

Text
