License: apache-2.0

Model Description

| Property | Value |
| --- | --- |
| Base Model | BSC-LT/salamandra-2b |
| Architecture | Transformer decoder-only |
| Parameters | ~2.25B |
| Languages | Valencian, Spanish, English |
| License | Apache 2.0 |

Aitana-2B-S-base extends the multilingual Salamandra foundation with additional training on domain-specific Valencian, Spanish, and English data. The training emphasizes administrative, legal, and tourism domains.

Training Data

This model was trained on the following ALIA datasets:

| Dataset ID | Name | Language | Source |
| --- | --- | --- | --- |
| dc8 | dogv_va_2025 | Valencian | gplsi/alia_dogv |
| dc9 | dogv_es_2025 | Spanish | gplsi/alia_dogv |
| dc10 | corts_es_va_2025 | Spanish/Valencian | gplsi/alia_les_corts |
| dc11 | amic_va_2025 | Valencian | gplsi/alia_amic |
| dc12 | boua_va_2025 | Valencian | gplsi/alia_boua |
| dc13 | boua_es_2025 | Spanish | gplsi/alia_boua |
| dc14 | tourism_va_2025 | Valencian | gplsi/alia_tourism |
| dc15 | tourism_es_2025 | Spanish | gplsi/alia_tourism |
| dc16 | tourism_en_2025 | English | gplsi/alia_tourism |
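
As a rough illustration, the table above can be held in a small Python structure and grouped by language. The names `ALIA_DATASETS` and `datasets_by_language` are hypothetical helpers introduced here; only the IDs, names, languages, and sources come from the table. The sketch counts datasets, not tokens or documents, so it says nothing about the actual language mix of the training data.

```python
from collections import defaultdict

# (dataset_id, name, language, source) rows copied from the table above.
ALIA_DATASETS = [
    ("dc8", "dogv_va_2025", "Valencian", "gplsi/alia_dogv"),
    ("dc9", "dogv_es_2025", "Spanish", "gplsi/alia_dogv"),
    ("dc10", "corts_es_va_2025", "Spanish/Valencian", "gplsi/alia_les_corts"),
    ("dc11", "amic_va_2025", "Valencian", "gplsi/alia_amic"),
    ("dc12", "boua_va_2025", "Valencian", "gplsi/alia_boua"),
    ("dc13", "boua_es_2025", "Spanish", "gplsi/alia_boua"),
    ("dc14", "tourism_va_2025", "Valencian", "gplsi/alia_tourism"),
    ("dc15", "tourism_es_2025", "Spanish", "gplsi/alia_tourism"),
    ("dc16", "tourism_en_2025", "English", "gplsi/alia_tourism"),
]


def datasets_by_language(rows):
    """Group dataset names by language; a mixed entry counts for each language."""
    grouped = defaultdict(list)
    for _dataset_id, name, language, _source in rows:
        for lang in language.split("/"):
            grouped[lang].append(name)
    return dict(grouped)


by_lang = datasets_by_language(ALIA_DATASETS)
for lang, names in by_lang.items():
    print(f"{lang}: {len(names)} dataset(s)")
```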

Data Sources

  • DOGV (Diari Oficial de la Generalitat Valenciana): Official communications of the Valencian Community including laws and public sector communications
  • Les Corts Valencianes: Transcripts from the Valencian Parliament plenary sessions and committee meetings
  • AMIC: Valencian language corpus
  • BOUA (Butlletí Oficial de la Universitat d'Alacant): Official University of Alicante documents including grants, regulations, and resolutions
  • Tourism: Multilingual tourism domain content

Intended Uses

This model can be used for:

  • Text generation in Valencian, Spanish, and English
  • Fine-tuning for specific downstream tasks
  • Domain adaptation for administrative, legal, or tourism applications

Note: Due to the formal register of training data (administrative and legal domains), generated text tends toward formal language.

How to Use

Transformers

python

import torch
from transformers import pipeline, AutoTokenizer

model_id = "gplsi/Aitana-2B-S-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a text-generation pipeline in bfloat16.
# device_map="auto" requires the accelerate package.
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Valencian example
text = "Les corts valencianes han pres la decisió de"
result = generator(text, do_sample=True, top_k=10, max_new_tokens=100)
print(result[0]["generated_text"])

# Spanish example
text = "El turismo en la Comunidad Valenciana"
result = generator(text, do_sample=True, top_k=10, max_new_tokens=100)
print(result[0]["generated_text"])
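
The example above samples with `do_sample=True` and `top_k=10`. As a conceptual sketch of what top-k sampling does (not the Transformers implementation), the snippet below keeps only the `k` highest-scoring entries of a toy logit vector, renormalizes them with a softmax, and draws one index. The function name `top_k_sample` and the toy logits are hypothetical and exist only for illustration.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the surviving logits only (subtract the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # Draw one index in proportion to its renormalized probability.
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # scores for a 5-token toy vocabulary
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only indices among the two best logits can appear
```

A smaller `top_k` restricts generation to the most likely tokens; with `k=1` the sketch reduces to greedy decoding.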

GGUF for LM Studio

This repository includes GGUF quantized versions for use with LM Studio, Ollama, and other llama.cpp-based tools.

| File | Quantization | Size | Quality |
| --- | --- | --- | --- |
| Aitana-s2b-c0dc17-Q4_K_M.gguf | Q4_K_M | ~1.3 GB | Good balance |
| Aitana-s2b-c0dc17-f16.gguf | F16 | ~4.5 GB | Full precision |
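
As a back-of-the-envelope check, the file sizes above can be related to the ~2.25B parameter count. The sketch below assumes 1 GB = 1e9 bytes and ignores file metadata, so the figures are approximate; `bits_per_weight` is a hypothetical helper, not part of any library.

```python
PARAMS = 2.25e9  # approximate parameter count from the model description


def bits_per_weight(file_size_gb, params=PARAMS):
    """Average bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    return file_size_gb * 1e9 * 8 / params


for name, size_gb in [("Q4_K_M", 1.3), ("F16", 4.5)]:
    print(f"{name}: ~{bits_per_weight(size_gb):.1f} bits per weight")
```

Under these assumptions the F16 file works out to about 16 bits per weight and the Q4_K_M file to roughly 4.6, which is consistent with a 4-bit quantization that keeps some tensors at higher precision.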

Using with llama-cpp-python

python

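from llama_cpp import Llama

# Download the Q4_K_M GGUF file from the Hugging Face Hub and load it.
# Llama.from_pretrained requires the huggingface-hub package.
llm = Llama.from_pretrained(
    repo_id="gplsi/Aitana-2B-S-base",
    filename="Aitana-s2b-c0dc17-Q4_K_M.gguf",
)

output = llm("Les corts valencianes han decidit", max_tokens=100)
print(output["choices"][0]["text"])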

Evaluation

The tables below compare Aitana-2B-S-base with Salamandra-2B, the base model used for continual pre-training, on benchmarks from lm-evaluation-harness. All results are for the pre-trained model; no instruction tuning or fine-tuning of any kind was applied.

Normalized score per language

| Language | Salamandra 2B | Aitana-2B-S-base |
| --- | --- | --- |
| Spanish | 0.150 | 0.163 |
| Catalan | 0.224 | 0.220 |
| English | 0.168 | 0.161 |
| Valencian | 0.603 | 0.608 |
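
To make the comparison easier to read, the per-language difference can be computed directly from the table above. This is a minimal sketch with hypothetical names (`SCORES`, `score_deltas`); it only subtracts the reported values and does not assess whether the differences are statistically meaningful.

```python
# Normalized scores copied from the table above: (Salamandra 2B, Aitana-2B-S-base).
SCORES = {
    "Spanish": (0.150, 0.163),
    "Catalan": (0.224, 0.220),
    "English": (0.168, 0.161),
    "Valencian": (0.603, 0.608),
}


def score_deltas(scores):
    """Return Aitana minus Salamandra per language, rounded to 3 decimals."""
    return {lang: round(aitana - base, 3) for lang, (base, aitana) in scores.items()}


deltas = score_deltas(SCORES)
for lang, delta in deltas.items():
    print(f"{lang}: {delta:+.3f}")
```

With the reported values, Spanish and Valencian come out slightly higher for Aitana-2B-S-base, while Catalan and English come out slightly lower.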

Valencian

Classification Benchmarks

| Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S-base |
| --- | --- | --- | --- | --- | --- |
| XNLI | va | Natural Language Inference | acc | 0.475 | 0.474 |

Generation Benchmarks

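| Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S-base |
| --- | --- | --- | --- | --- | --- |
| Cocoteros | va | Reading Comprehension | bleu | 6.32 | 6.61 |
| Phrases ca-va | va-ca | Translation - Adaptation | bleu | 79.82 | 81.57 |
| Phrases va-ca | va-ca | Translation - Adaptation | bleu | 78.05 | 75.68 |
| Phrases va-es | va-es | Translation | bleu | 76.04 | 76.31 |
| Phrases es-va | es-va | Translation | bleu | 58.86 | 62.86 |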

Catalan

Classification Benchmarks

| Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S-base |
| --- | --- | --- | --- | --- | --- |
| Belebele Cat_latn | ca | Reading Comprehension | acc | 0.231 | 0.257 |
| COPA | ca | Commonsense Reasoning | acc | 0.700 | 0.690 |
| XStoryCloze | ca | Commonsense Reasoning | acc | 0.655 | 0.655 |
| OpenBookQA | ca | Question Answering | acc | 0.294 | 0.300 |
| PAWS | ca | Paraphrasing | acc | 0.556 | 0.566 |
| PiQA | ca | Question Answering | acc | 0.643 | 0.641 |
| SiQA | ca | Question Answering | acc | 0.434 | 0.425 |
| ARC Easy | ca | Question Answering | acc | 0.551 | 0.553 |
| ARC Challenge | ca | Question Answering | acc | 0.290 | 0.282 |
| XNLI | ca | Natural Language Inference | acc | 0.473 | 0.469 |
| Teca | ca | Natural Language Inference | acc | 0.465 | 0.430 |
| WNLI | ca | Natural Language Inference | acc | 0.577 | 0.577 |
| Catcola | ca | Linguistic Acceptability | acc | 0.543 | 0.596 |
| Catcola | ca | Linguistic Acceptability | mcc | 0.046 | -0.002 |
| Catalanqa | ca | Question Answering | F1 | 0.668 | 0.643 |
| Mgsm direct | ca | Math | exact match | 0.024 | 0.024 |
| Catalanqa | ca | Question Answering | exact match | 0.437 | 0.405 |
| Xquad | ca | Question Answering | exact match | 0.371 | 0.344 |
| Xquad | ca | Question Answering | F1 | 0.579 | 0.568 |

Generation Benchmarks

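| Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S-base |
| --- | --- | --- | --- | --- | --- |
| Cabreu abstractive | ca | Summarization | bleu | 5.78 | 6.52 |
| Cabreu extractive | ca | Summarization | bleu | 42.89 | 41.61 |
| Cabreu extreme | ca | Summarization | bleu | 3.29 | 3.01 |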

Spanish

Classification Benchmarks

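| Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S-base |
| --- | --- | --- | --- | --- | --- |
| Belebele | es | Reading Comprehension | acc | 0.228 | 0.263 |
| PAWS | es | Paraphrasing | acc | 0.561 | 0.553 |
| XNLI | es | Natural Language Inference | acc | 0.439 | 0.422 |
| WNLI | es | Natural Language Inference | acc | 0.563 | 0.563 |
| XStoryCloze | es | Commonsense Reasoning | acc | 0.653 | 0.655 |
| Escola | es | Linguistic Acceptability | acc | 0.593 | 0.618 |
| Escola | es | Linguistic Acceptability | mcc | 0.031 | -0.020 |
| OpenbookQA | es | Question Answering | acc | 0.308 | 0.316 |
| MGSM Direct | es | Math | exact match | 0.020 | 0.032 |
| XQUAD | es | Question Answering | exact match | 0.377 | 0.341 |
| XQUAD | es | Question Answering | F1 | 0.584 | 0.559 |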

Generation Benchmarks

| Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S-base |
| --- | --- | --- | --- | --- | --- |
| Cocoteros | es | Reading Comprehension | bleu | 8.46 | 7.043 |
| XLSum | es | Summarization | bleu | 0.801 | 1.622 |

English

Classification Benchmarks

| Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S-base |
| --- | --- | --- | --- | --- | --- |
| Arc Challenge | en | Question Answering | acc | 0.370 | 0.360 |
| Arc Easy | en | Question Answering | acc | 0.722 | 0.712 |
| Belebele | en | Reading Comprehension | acc | 0.216 | 0.252 |
| PAWS | en | Paraphrasing | acc | 0.561 | 0.574 |
| XNLI | en | Natural Language Inference | acc | 0.462 | 0.452 |
| XStoryCloze | en | Commonsense Reasoning | acc | 0.711 | 0.713 |
| OpenBookQA | en | Question Answering | acc | 0.300 | 0.270 |
| PiQA | en | Question Answering | acc | 0.737 | 0.742 |
| Social iqa | en | Question Answering | acc | 0.454 | 0.450 |
| WNLI | en | Natural Language Inference | acc | 0.465 | 0.380 |
| MGSM Direct | en | Math | exact match | 0.064 | 0.06 |
| TriviaQA | en | Question Answering | exact match | 0.376 | 0.352 |

Additional Information

Author

The model has been developed by the Language and Information Systems Group (GPLSI) and the Centro de Inteligencia Digital (CENID), both part of the University of Alicante (UA), as part of their ongoing research in Natural Language Processing (NLP).

Part of the Aitana Family

This model is part of the Aitana model family, which includes:

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública, co-financed by the EU – NextGenerationEU, within the framework of the project Desarrollo de Modelos ALIA. This work has also been partially supported by Project HEART-NLP (PID2024-156263OB-C22).

Acknowledgments

We would like to express our gratitude to all individuals and institutions that have contributed to the development of this work.

Special thanks to:

We also acknowledge the financial, technical, and scientific support of the Ministerio para la Transformación Digital y de la Función Pública, funded by the EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA, whose contribution has been essential to the completion of this research.

License

Apache License, Version 2.0

Disclaimer

This model is intended for general purposes and is available under a permissive Apache License 2.0. Be aware that the model may have biases and/or undesirable outputs. Users deploying systems based on this model are responsible for mitigating risks and complying with applicable AI regulations.

Reference

bibtex

@misc{gplsi-Aitana-2B-S-base,
author = {Estevanell-Valladares, Ernesto L. and Yáñez-Romero, Fabio and Sepúlveda-Torres, Robiert and Consuegra-Ayala, Juan Pablo and Galeano, Santiago and Miró Maestre, María and Martínez-Murillo, Iván and Grande, Eduardo and Canal-Esteve, Miquel and Bonora, Mar and Gutierrez, Yoan and Abreu Salas, José Ignacio and Lloret, Elena and Montoyo, Andrés and Muñoz-Guillena and Palomar, Manuel},
title = {Aitana 2B base: Continually pre-trained on Valencian},
year = {2025},
institution = {Language and Information Systems Group (GPLSI) and Centro de Inteligencia Digital (CENID), University of Alicante (UA)},
howpublished = {\url{https://huggingface.co/gplsi/Aitana-2B-S-base}},
note = {Accessed: 2025-12-12}
}

Copyright © 2025 Language and Information Systems Group (GPLSI) and Centro de Inteligencia Digital (CENID), University of Alicante (UA). Distributed under the Apache License 2.0.
