BSC-LT

ALIA-40b-fc-2605

README

License: apache-2.0

Model Details

Description

ALIA-40b is a transformer-based, decoder-only language model that was pre-trained from scratch on 9.37 trillion tokens of meticulously curated data. It subsequently underwent continued pretraining on an additional 424 billion high-quality tokens, and was further extended with a supplementary 39 billion tokens drawn from a similarly diverse mixture, totalling 9.83 trillion tokens.
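As a quick sanity check, the three token counts quoted above can be added up to confirm the stated total. The variable names are illustrative only.

```python
# Token counts quoted in the description above.
PRETRAINING_TOKENS = 9.37e12    # initial pre-training from scratch
CONTINUED_TOKENS = 424e9        # continued pretraining
SUPPLEMENTARY_TOKENS = 39e9     # supplementary extension

total_tokens = PRETRAINING_TOKENS + CONTINUED_TOKENS + SUPPLEMENTARY_TOKENS

# Express the sum in trillions, rounded to two decimals as in the text.
print(f"{total_tokens / 1e12:.2f} trillion tokens")
```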

ALIA-40b-fc is a fine-tuned variant of ALIA-40b. In contrast to previous versions, its development process comprises only two consecutive stages, each targeting a specific capability: (1) long-context adaptation to extend the model’s context window, and (2) supervised fine-tuning to improve function calling capabilities. This means that this checkpoint, unlike previous versions, has not yet undergone an alignment process.

After long-context adaptation, our post-training process consists of a supervised fine-tuning (SFT) stage that strengthens function calling and adds conversational capabilities.

Although the base model is highly multilingual, the post-training process focused primarily on English due to the limited availability of high-quality datasets in other languages. Evaluation coverage outside English also remains limited. Future releases aim to further strengthen multilingual capabilities through the generation of high-quality synthetic data.

Hyperparameters

Here we list the specific hyperparameters used during the different training stages.

Long context CPT

Table with columns: Hyperparameter, Value
Hyperparameter	Value
Learning rate	9e-7
LR Scheduler	Constant
Tokens per update	4M
Training tokens (4k → 32k)	2B
Training tokens (32k → 160k)	36.8B
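The table implies a number of optimizer updates per long-context stage. The sketch below derives it under the assumption that "4M", "2B" and "36.8B" are decimal quantities (4 × 10^6, 2 × 10^9, 36.8 × 10^9); if the original figures are powers of two, the results shift slightly.

```python
# Assumption: "4M" tokens per update is read as 4 * 10**6 tokens.
TOKENS_PER_UPDATE = 4_000_000

# Training tokens per stage, taken from the table above (decimal reading).
stage_tokens = {
    "4k -> 32k": 2_000_000_000,
    "32k -> 160k": 36_800_000_000,
}

# Approximate number of optimizer updates per stage.
updates = {stage: tokens // TOKENS_PER_UPDATE for stage, tokens in stage_tokens.items()}
total_long_context_tokens = sum(stage_tokens.values())

for stage, n in updates.items():
    print(f"{stage}: ~{n} updates")
print(f"total: {total_long_context_tokens / 1e9:.1f}B tokens")
```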

Supervised Fine-Tuning (SFT)

Table with columns: Hyperparameter, Value
Hyperparameter	Value
Learning rate	5e-6
Batch size	256
Epochs	1
LR Scheduler	Cosine
Warmup Ratio	4 %
Total Steps	5,687
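To make the schedule concrete, here is a minimal sketch of a cosine learning-rate schedule using the values in the table. The exact shape is an assumption: linear warmup over the first 4% of steps, then cosine decay to zero. The actual NeMo-RL configuration may differ, for example in its minimum learning rate. The sample count also assumes that "Batch size" means examples per optimizer step.

```python
import math

PEAK_LR = 5e-6
TOTAL_STEPS = 5_687
WARMUP_RATIO = 0.04
BATCH_SIZE = 256

# 4% of 5,687 steps is 227.48, rounded here to 227 warmup steps.
WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_RATIO)

def lr_at(step: int) -> float:
    """Learning rate at a given step (assumed: linear warmup, cosine decay to 0)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

# Upper bound on examples seen in one epoch under the batch-size assumption.
examples_seen = TOTAL_STEPS * BATCH_SIZE

print(WARMUP_STEPS, examples_seen)
print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(TOTAL_STEPS - 1))
```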

Architecture

Table with columns: Attribute, Value
Attribute	Value
Total Parameters	40,433,885,184
Embedding Parameters	2,097,152,000
Layers	48
Hidden size	8,192
Attention heads	64
Context length	163,840
Vocabulary size	256,000
Precision	bfloat16
Embedding type	RoPE
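A few quantities follow directly from the table and are useful when planning deployment. The sketch below derives them. The memory figure covers the weights only (no KV cache or activations), and reading the embedding count as a single vocabulary × hidden-size matrix is an inference from the numbers, not a statement from the authors.

```python
TOTAL_PARAMS = 40_433_885_184
EMBEDDING_PARAMS = 2_097_152_000
HIDDEN_SIZE = 8_192
ATTENTION_HEADS = 64
VOCAB_SIZE = 256_000
CONTEXT_LENGTH = 163_840
BYTES_PER_PARAM = 2  # bfloat16 stores each parameter in 2 bytes

head_dim = HIDDEN_SIZE // ATTENTION_HEADS             # per-head dimension
embedding_matrix = VOCAB_SIZE * HIDDEN_SIZE           # one vocab x hidden matrix
non_embedding_params = TOTAL_PARAMS - EMBEDDING_PARAMS
weights_gib = TOTAL_PARAMS * BYTES_PER_PARAM / 2**30  # weights only
context_k = CONTEXT_LENGTH // 1024                    # context length in units of 1,024 tokens

print(f"head dim: {head_dim}")
print(f"embedding matrix matches table: {embedding_matrix == EMBEDDING_PARAMS}")
print(f"non-embedding parameters: {non_embedding_params:,}")
print(f"bfloat16 weights: {weights_gib:.1f} GiB")
print(f"context: {context_k}k tokens")
```

Since the weights alone exceed the 64 GB of a single GPU of the kind described under Compute Infrastructure, local inference generally needs the model sharded across several devices.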

Intended Use

Direct Use

ALIA-40b-fc is primarily optimized for robust and reliable function calling in tool-augmented and multi-turn conversational settings, while remaining capable of supporting other general-purpose language tasks. As with all models in the ALIA family, it is released openly to support both research and commercial use in any of the covered languages.

Out-of-scope Use

The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.

Hardware and Software

Training Framework

The post-training process was conducted in NeMo-RL, with minor modifications to adapt it to our infrastructure.

Compute Infrastructure

All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by the Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:

4x Nvidia Hopper GPUs with 64 GB HBM2 memory
2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz and 32 cores each (64 cores)
4x NDR200 (bandwidth per node 800 Gb/s)
512 GB of main memory (DDR5)
460 GB of NVMe storage

The SFT stage was run across 8 nodes with a total of 32 GPUs.

How to use

The model can be used either directly in Python using the transformers library or deployed as a service and used through standard API calls.

While the former gives the most control over the inference process, it requires a machine with a GPU powerful enough to run the model locally and is more error-prone than the alternative. We therefore strongly recommend the latter: a deployed service can run either locally or on a remote server and, among other advantages, makes the model available to multiple clients in parallel.

Unless you have very specific needs (e.g. for research) that require adapting the inference process, it is preferable to follow the "deployment as a service" guidelines below.

In any case, we recommend using a temperature setting close to zero (0.0–0.2) to achieve optimal performance.
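To illustrate why a low temperature tends to produce more deterministic tool calls, the sketch below applies temperature scaling to a set of made-up logits. The logit values are hypothetical and the function is a plain softmax, not the sampling code of any particular inference engine. A temperature of exactly 0.0 cannot be used in this formula and is typically treated by inference engines as greedy decoding.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature (> 0)."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]

probs_default = softmax_with_temperature(logits, 1.0)
probs_low = softmax_with_temperature(logits, 0.2)

# The lower the temperature, the more mass the top candidate receives.
print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_low])
```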

Local inference with Python / transformers

The model uses the widely adopted ChatML template to structure conversational inputs and outputs. The template can be applied through the tokenizer’s built-in functions, as illustrated in the example snippet below:

python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "BSC-LT/ALIA-40b-fc-2605"

text = "What is the weather like in Paris today?"

# Load the tokenizer and the bfloat16 weights, spread across the available GPUs
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

message = [{"role": "user", "content": text}]

# JSON-schema description of the tool the model is allowed to call
tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Get current temperature for a given location.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City and country e.g. Bogotá, Colombia"
            }
        },
        "required": [
            "location"
        ],
        "additionalProperties": False
    }
}]

# Render the conversation and the tool definitions with the chat template
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    tools=tools
)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=1000)

# Decode only the newly generated tokens, leaving out the prompt
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))

Output:

text
<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris, France"}}
</tool_call>
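When running the model locally, the generated text has to be parsed before a tool can be executed. The sketch below extracts the JSON payload from output shaped like the example above. The helper name `extract_tool_calls` is hypothetical, and the parser assumes the `<tool_call>` tags appear verbatim in the decoded text.

```python
import json
import re

# Matches the payload between <tool_call> tags, across line breaks.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list:
    """Return every well-formed tool call found in the generated text."""
    calls = []
    for match in TOOL_CALL_PATTERN.finditer(text):
        try:
            calls.append(json.loads(match.group(1).strip()))
        except json.JSONDecodeError:
            # Skip malformed payloads instead of failing the whole response.
            continue
    return calls

generated = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"location": "Paris, France"}}\n'
    "</tool_call>"
)

calls = extract_tool_calls(generated)
print(calls[0]["name"], calls[0]["arguments"])
```

An empty list signals a plain-text answer, so the caller can fall back to showing the text to the user.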

Deployment as service and remote use (Messages API)

Deploy the model using the vLLM Docker image:

bash
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8080:80 \
    vllm/vllm-openai:latest \
    --model BSC-LT/ALIA-40b-fc-2605 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --max-model-len 8196 \
    --port 80

Once the deployment is running, interact with the model through the OpenAI-compatible API:

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1/",
    api_key="hf_xxxx",
)

# Use the first (and only) model served by the deployment
models = client.models.list()
model = models.data[0].id

# Tool definition in the Chat Completions format
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Bogotá, Colombia"
                }
            },
            "required": ["location"],
            "additionalProperties": False
        }
    }
}]

system_message = ""
messages = [{"role": "system", "content": system_message}] if system_message else []
messages.append({"role": "user", "content": "What is the weather like in Paris today?"})
print(messages)
chat_completion = client.chat.completions.create(
    model=model,
    tools=tools,
    messages=messages,
    stream=False,
    max_tokens=1000,
    temperature=0.1,
    frequency_penalty=0.2,
)

msg = chat_completion.choices[0].message

# --- HANDLE TOOL CALL OR NORMAL CONTENT ---

if not getattr(msg, "tool_calls", None):
    # Normal assistant message
    print(msg.content)

    messages.append({
        "role": "assistant",
        "content": msg.content
    })

else:
    # Assistant tool call message
    print(msg.tool_calls)

    messages.append({"role": "assistant", "tool_calls": msg.tool_calls})

    # --- Fake tool execution example ---
    tool_call = msg.tool_calls[0]
    # Example: handle the get_weather tool
    if tool_call.function.name == "get_weather":
        # Fake tool result (this would come from your actual backend)
        fake_tool_result = '{"temperature": 18, "unit": "C", "description": "Partly cloudy in Paris"}'

        # Append the tool result message so the model can use it in the next turn
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": fake_tool_result,
        })
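In a real application, the hard-coded tool result would be replaced by a dispatch step that maps the tool name to a Python function. The following is a minimal sketch of such a dispatcher. `TOOL_REGISTRY`, `run_tool` and the stand-in `get_weather` implementation are hypothetical names introduced for illustration and are not part of any library.

```python
import json

def get_weather(location: str) -> dict:
    """Hypothetical stand-in for a real weather backend."""
    return {"temperature": 18, "unit": "C", "description": f"Partly cloudy in {location}"}

# Maps tool names (as declared in the tool schema) to Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool(name: str, arguments_json: str, tool_call_id: str) -> dict:
    """Execute one tool call and wrap the result as a "tool" message."""
    if name not in TOOL_REGISTRY:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            result = TOOL_REGISTRY[name](**json.loads(arguments_json))
        except (TypeError, json.JSONDecodeError) as exc:
            # Malformed JSON or arguments that do not fit the function signature.
            result = {"error": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "name": name,
        "content": json.dumps(result),
    }

tool_message = run_tool("get_weather", '{"location": "Paris, France"}', "call_0")
print(tool_message["content"])
```

With the OpenAI client, `tool_call.function.arguments` is a JSON string and `tool_call.id` is the identifier to pass through. After appending the returned message to `messages`, a second `client.chat.completions.create(...)` call lets the model turn the tool result into a final answer.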

Training Data

The dataset used in the supervised fine-tuning stage is built from a mixture of high-quality, permissively licensed datasets developed by third parties and synthetic data generated in-house using DeepSeek-V3-0324.

The table below provides a detailed breakdown of the datasets included in this mixture:

Table with columns: Dataset, Generation Method, License, Instances
Dataset	Generation Method	License	Instances
nvidia/When2Call	Synthetic	cc-by-4.0	14,800
Salesforce/xlam-function-calling-60k	Synthetic	cc-by-4.0	59,800
glaiveai/glaive-function-calling-v2	Synthetic

Note: Counts may differ slightly from the original datasets due to quality filtering (e.g., removal of poorly formatted or invalid samples) and because a small portion of each dataset was held out for validation purposes (total of 2,000 instances).

Evaluation

The model’s function-calling (FC) capabilities were evaluated using the BFCL benchmark, which is widely regarded as a standard and comprehensive suite for assessing tool-use and function invocation performance in large language models.

Table with columns: Metric, Category, Score
Metric	Category	Score
Simple AST	Non-Live	71.0%
Multiple AST	Non-Live	94.5%
Parallel AST	Non-Live	80.5%
Parallel Multiple AST	Non-Live	81.5%
Simple AST	Live	74.8%
Multiple AST	Live
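For readers unfamiliar with AST-style scoring, the sketch below shows the general idea: a predicted call is correct when the function name matches and every argument takes one of the accepted values. This is a simplified illustration written for this card, not BFCL's actual implementation, which additionally handles optional parameters, type checks and multiple calls. The function name `ast_style_match` and the example data are hypothetical.

```python
def ast_style_match(predicted: dict, expected_name: str, accepted_values: dict) -> bool:
    """Check a predicted call against one expected function and its accepted argument values."""
    if predicted.get("name") != expected_name:
        return False
    arguments = predicted.get("arguments", {})
    # Reject calls that pass parameters the reference does not know about.
    if set(arguments) - set(accepted_values):
        return False
    # Every reference parameter must be present with an accepted value.
    for param, accepted in accepted_values.items():
        if param not in arguments or arguments[param] not in accepted:
            return False
    return True

# Hypothetical reference: several surface forms of the same location are accepted.
reference = {"location": ["Paris, France", "Paris"]}

good = {"name": "get_weather", "arguments": {"location": "Paris, France"}}
wrong_value = {"name": "get_weather", "arguments": {"location": "Lyon, France"}}

print(ast_style_match(good, "get_weather", reference))
print(ast_style_match(wrong_value, "get_weather", reference))
```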

Ethical Considerations and Limitations

The ALIA-40b-fc model is an instruction-tuned variant. It has several limitations that users should be aware of. Ongoing work is addressing these areas, including comprehensive evaluation of societal and cognitive biases as well as safety.

Functional Limitations:

Reasoning & Math: The model is not guaranteed to perform robust chain-of-thought reasoning or advanced mathematics. Complex logical puzzles or multi-step inferences may fail or produce inconsistent answers.
Code Generation: Although exposed to code during pretraining, ALIA-40b-fc is not a specialized code-generation model. It may produce code-like text, but outputs should be verified and tested before use in production codebases.
Agentive Capabilities: The model does not have agentive or autonomous action capabilities. It cannot act as an autonomous agent or execute multi-step workflows.

Recommendations:

Developers should implement additional safety filters, human oversight, targeted evaluation suites, and secondary evaluation models when deploying this model. Do not deploy ALIA-40b-fc in critical applications without extensive testing and mitigation. Users are responsible for assessing and mitigating harmful behavior or misinformation resulting from model outputs, and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

Additional information

Author

The Language Modeling team from the AI Institute at the Barcelona Supercomputing Center.

Contact

For further information, please send an email to ai_institute_languagemodeling@bsc.es.

Copyright

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.

This work has been promoted and supported by the Government of Catalonia through the Aina Project.

Acknowledgements

This project has benefited from the contributions of numerous teams and institutions, mainly through data contributions, knowledge transfer or technical support.

We are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria. Many other institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà. We thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration.

We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially to: Marcelo Sanchez, Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipe Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.

Their valuable efforts have been instrumental in the development of this work.

Disclaimer

Be aware that the model may show biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

Citation

markdown
@misc{gonzalezagirre2025salamandratechnicalreport,
      title={Salamandra Technical Report}, 
      author={Aitor Gonzalez-Agirre and Marc Pàmies and Joan Llop and Irene Baucells and Severino Da Dalt and Daniel Tamayo and José Javier Saiz and Ferran Espuña and Jaume Prats and Javier Aula-Blasco and Mario Mina and Adrián Rubio and Alexander Shvets and Anna Sallés and Iñaki Lacunza and Iñigo Pikabea and Jorge Palomar and Júlia Falcão and Lucía Tormo and Luis Vasquez-Reina and Montserrat Marimon and Valle Ruíz-Fernández and Marta Villegas},
      year={2025},
      eprint={2502.08489},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08489}, 
}

License

Apache License, Version 2.0

Model Index

Table with columns: Model, Base, Instruct, Function Calling
Model	Base	Instruct	Function Calling
2b	Link	Link	N/A
7b	Link	Link	N/A
40b
