Model details
- Developed by: Paula Guerrero and Iker Gutierrez
- Affiliation: University of the Basque Country (EHU)
- Model type: LoRA adapter for HiTZ/Latxa-Qwen3-VL-8B-Instruct
- Languages: Catalan (ca), Basque (eu)
- Domain: General
- Base model: HiTZ/Latxa-Qwen3-VL-8B-Instruct
- Repository: pguerrero-igutierrez/Latxa-Qwen3-8B-General-eu-ca
- Collection: pguerrero-igutierrez/mt-domain-adaptation-ca-eu
Intended use
This model is intended as the general-domain CA-EU baseline of the project and as a warm-start checkpoint for continued literary and clinical adaptation.
Supported prompting directions:
- eu->ca: Itzuli testu hau euskaratik katalanera:\n\n{source}
- ca->eu: Tradueix aquest text del català al basc:\n\n{source}
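The two prompt templates above can be wrapped in a small helper so that callers never hand-type the instruction text. This is a minimal sketch: the names `PROMPT_TEMPLATES` and `build_prompt` are hypothetical and not part of the released code, and stripping surrounding whitespace from the source text is an assumption.

```python
# Prompt templates copied from the two supported directions listed above.
PROMPT_TEMPLATES = {
    "eu->ca": "Itzuli testu hau euskaratik katalanera:\n\n{source}",
    "ca->eu": "Tradueix aquest text del català al basc:\n\n{source}",
}


def build_prompt(source: str, direction: str) -> str:
    """Return the instruction prompt for one translation direction.

    Hypothetical helper: the name and the whitespace stripping are
    assumptions made for illustration.
    """
    if direction not in PROMPT_TEMPLATES:
        raise ValueError(f"Unsupported direction: {direction!r}")
    return PROMPT_TEMPLATES[direction].format(source=source.strip())


print(build_prompt("Kaixo mundua.", "eu->ca"))
```

Rejecting unknown directions early keeps requests outside the Catalan-Basque pair from silently reaching the model.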
Out-of-scope use
- High-stakes use without human review
- Specialized clinical translation
- Professional literary translation without post-editing
- Translation outside the Catalan-Basque pair
Training data
The adapter was trained on a 50k-pair sample from projecte-aina/CA-EU_Parallel_Corpus, converted into bidirectional instruction examples for both translation directions.
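One way to picture the conversion into bidirectional instruction examples is sketched below. It is an illustration rather than the project's actual preprocessing script: the column names `ca` and `eu`, the `prompt`/`completion` output keys, and the choice to emit both directions for every pair are assumptions, and the sample sentence pair is made up for the demo.

```python
from typing import Dict, Iterable, List

# Instruction templates taken from the supported prompting directions.
EU_TO_CA = "Itzuli testu hau euskaratik katalanera:\n\n{source}"
CA_TO_EU = "Tradueix aquest text del català al basc:\n\n{source}"


def to_bidirectional_examples(pairs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn each parallel pair into one example per translation direction."""
    examples = []
    for pair in pairs:
        ca, eu = pair["ca"], pair["eu"]  # assumed column names
        examples.append({"prompt": EU_TO_CA.format(source=eu), "completion": ca})
        examples.append({"prompt": CA_TO_EU.format(source=ca), "completion": eu})
    return examples


# Made-up sample pair, used only to show the output shape.
sample = [{"ca": "Hola món.", "eu": "Kaixo mundua."}]
for example in to_bidirectional_examples(sample):
    print(example)
```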
Training procedure
- LoRA rank: 16
- LoRA alpha: 32
- LoRA dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Quantization: 4-bit NF4
- Max sequence length: 768
- Epochs: 3
- Batch size: 4
- Gradient accumulation: 8
- Learning rate: 5e-5
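Two quantities follow directly from the hyperparameters above: the LoRA scaling factor and the effective batch size. The sketch below collects the listed values in a hypothetical `TrainingSetup` dataclass and derives both. It assumes the standard LoRA scaling of alpha divided by rank, that the listed batch size is per device, and that training ran on a single device; the card does not state the device count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hyperparameters as listed in the training procedure."""

    lora_rank: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    max_seq_length: int = 768
    epochs: int = 3
    batch_size: int = 4
    grad_accum_steps: int = 8
    learning_rate: float = 5e-5

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank

    def effective_batch_size(self, num_devices: int = 1) -> int:
        # Sequences contributing to each optimizer step.
        # num_devices=1 is an assumption, not a documented fact.
        return self.batch_size * self.grad_accum_steps * num_devices


setup = TrainingSetup()
print(setup.lora_scaling)            # 2.0
print(setup.effective_batch_size())  # 32
```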
Evaluation
Results on the general-domain held-out test set:
| Direction | chrF++ | BLEU | TER | COMET |
|---|---|---|---|---|
| eu->ca | 45.03 | 18.51 | 75.53 | 80.61 |
| ca->eu | 41.63 | 9.92 | 88.02 | 80.75 |
This model substantially improved over the base zero-shot baseline and served as the continued-fine-tuning starting point for literaryv2 and clinicalv2.
Limitations
- General-domain data does not capture literary style or clinical terminology well enough for strong in-domain specialization
- CA->EU remains harder than EU->CA under strict overlap metrics (BLEU 9.92 vs. 18.51), while COMET is nearly identical for both directions
- Results are specific to the Latxa-Qwen3-VL-8B-Instruct base model and LoRA setup used in the project
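The directional asymmetry noted above can be read off the reported scores with a few lines of arithmetic. The sketch below copies the values from the evaluation table and prints the eu->ca minus ca->eu difference per metric; the names `RESULTS` and `direction_gap` are hypothetical. TER is an error rate, so for that metric a negative difference favors eu->ca.

```python
# Scores copied from the evaluation table (general-domain held-out test set).
RESULTS = {
    "eu->ca": {"chrF++": 45.03, "BLEU": 18.51, "TER": 75.53, "COMET": 80.61},
    "ca->eu": {"chrF++": 41.63, "BLEU": 9.92, "TER": 88.02, "COMET": 80.75},
}


def direction_gap(metric: str) -> float:
    """Return the eu->ca score minus the ca->eu score for one metric."""
    return round(RESULTS["eu->ca"][metric] - RESULTS["ca->eu"][metric], 2)


for metric in ("chrF++", "BLEU", "TER", "COMET"):
    print(f"{metric}: {direction_gap(metric):+.2f}")
```

The overlap-based metrics differ by several points between directions, while the COMET difference is 0.14 points.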
Usage
import torch
from peft import PeftModel
from transformers import AutoTokenizer, Qwen3VLForConditionalGeneration

base_id = "HiTZ/Latxa-Qwen3-VL-8B-Instruct"
adapter_id = "pguerrero-igutierrez/Latxa-Qwen3-8B-General-eu-ca"

# Load the tokenizer and the base model, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
base_model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_id,
    device_map="auto",
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Basque-to-Catalan prompt in the format used during training.
prompt = "Itzuli testu hau euskaratik katalanera:\n\nKaixo mundua."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
Citation
@misc{guerrero-gutierrez-2026-caeu-mt,
title = {Domain Adaptation for Catalan-Basque Machine Translation via Synthetic Data and Continued Fine-Tuning},
author = {Guerrero, Paula and Gutierrez, Iker},
year = {2026},
note = {Unpublished manuscript}
}
Contact
- Paula Guerrero: pguerrero005@ikasle.ehu.eus
- Iker Gutierrez: igutierrez134@ikasle.ehu.eus