Model details
- Developed by: Paula Guerrero and Iker Gutierrez
- Affiliation: University of the Basque Country (EHU)
- Model type: LoRA adapter for HiTZ/Latxa-Qwen3-VL-8B-Instruct
- Languages: Catalan (ca), Basque (eu)
- Domain: Literary translation
- Base model: HiTZ/Latxa-Qwen3-VL-8B-Instruct
- Repository: pguerrero-igutierrez/Latxa-Qwen3-8B-Literary-v1-ca-eu
- Collection: pguerrero-igutierrez/mt-domain-adaptation-ca-eu
Sources
Intended use
This model is intended for literary translation research in the Catalan-Basque pair, especially when no direct in-domain parallel corpus is available.
Supported prompting directions:
eu->ca: Itzuli testu hau euskaratik katalanera:\n\n{source}
ca->eu: Tradueix aquest text del català al basc:\n\n{source}
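The two prompt templates above can be filled programmatically. The following is a minimal sketch; `PROMPT_TEMPLATES` and `build_prompt` are hypothetical names introduced for illustration and are not part of the released code.

```python
# Prompt templates copied from this card. The dictionary and helper names
# are illustrative assumptions, not part of the released adapter.
PROMPT_TEMPLATES = {
    "eu->ca": "Itzuli testu hau euskaratik katalanera:\n\n{source}",
    "ca->eu": "Tradueix aquest text del català al basc:\n\n{source}",
}


def build_prompt(direction: str, source: str) -> str:
    """Return the prompt for one supported translation direction."""
    if direction not in PROMPT_TEMPLATES:
        raise ValueError(f"Unsupported direction: {direction!r}")
    # Strip surrounding whitespace so the template controls the layout.
    return PROMPT_TEMPLATES[direction].format(source=source.strip())


print(build_prompt("ca->eu", "La nit era tranquil·la."))
```

Keeping the templates in one place makes it harder to query the adapter with a prompt it was not trained on.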
Out-of-scope use
- Publication for human readers without literary post-editing
- Translation outside the literary register
- High-stakes or professional workflows without review
Training data
The adapter was trained on two synthetic literary corpora:
backtranslated-corpus/ca-literary_trilingual.json
backtranslated-corpus/eu-literary-EhuHac.jsonl
The eu->ca direction uses synthetic (back-translated) Basque as the source and original Catalan as the target. The ca->eu direction uses synthetic Catalan as the source and original Basque as the target.
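The pairing rule above (synthetic text on the source side, original text on the target side) can be sketched as follows. The function name and the output keys `direction`, `source`, and `target` are assumptions made for illustration; the actual field names in the two corpus files are not documented in this card.

```python
def make_training_pair(original: str, synthetic: str, original_lang: str) -> dict:
    """Build one SFT example with synthetic text as source and original text as target.

    `original_lang` is the language of the human-written text ("ca" or "eu").
    The output keys are hypothetical, not the corpus schema.
    """
    if original_lang == "ca":
        direction = "eu->ca"  # synthetic Basque -> original Catalan
    elif original_lang == "eu":
        direction = "ca->eu"  # synthetic Catalan -> original Basque
    else:
        raise ValueError(f"Unsupported language: {original_lang!r}")
    return {"direction": direction, "source": synthetic, "target": original}


pair = make_training_pair(
    original="La nit era tranquil·la.",
    synthetic="Gaua lasaia zen.",
    original_lang="ca",
)
print(pair["direction"], "|", pair["source"], "->", pair["target"])
```

With this arrangement the model is always trained to produce human-written text, so errors in the synthetic side affect only the input.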
Training procedure
- LoRA rank: 16
- LoRA alpha: 32
- LoRA dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Quantization: 4-bit NF4
- Max sequence length: 768
- Epochs: 3
- Batch size: 4
- Gradient accumulation: 8
- Learning rate: 5e-5
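For readers who want to set up a comparable run, the hyperparameters listed above map onto PEFT and Transformers configuration objects roughly as follows. This is a sketch, not the authors' training script: `task_type="CAUSAL_LM"`, the bfloat16 compute dtype, and the single-device reading of "Batch size: 4" are assumptions, and the helper names are hypothetical.

```python
# LoRA hyperparameters as listed in this card.
LORA_HPARAMS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}


def effective_batch_size(batch_size: int = 4, grad_accum: int = 8, num_devices: int = 1) -> int:
    """Sequences per optimizer step; num_devices=1 is an assumption."""
    return batch_size * grad_accum * num_devices


def build_configs():
    """Build PEFT and bitsandbytes configs (imports are deferred to call time)."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    # task_type and compute dtype are assumptions, not stated in the card.
    lora_config = LoraConfig(task_type="CAUSAL_LM", **LORA_HPARAMS)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    return lora_config, bnb_config


print(effective_batch_size())  # 4 * 8 * 1
print(LORA_HPARAMS["lora_alpha"] / LORA_HPARAMS["r"])  # standard LoRA scaling alpha / r
```

Under the single-device assumption, each optimizer step sees 32 sequences of up to 768 tokens.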
Evaluation
Results on the literary held-out test set:
| Direction | chrF++ | BLEU | TER | COMET |
|---|---|---|---|---|
| eu->ca | 36.13 | 8.96 | 85.08 | 69.61 |
| ca->eu | 26.90 | 2.44 | 99.60 | 65.29 |
| Overall | 31.34 | 6.12 | 91.91 | |
In the project experiments, this direct literary SFT model slightly but consistently outperformed the continued-adaptation literary variant.
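The overlap-based metrics reported above (chrF++, BLEU, TER) can be computed with sacreBLEU, as sketched below. The card does not state which toolkit or settings produced the reported numbers, so the use of sacreBLEU with default options is an assumption, and `score_translations` is a hypothetical helper. COMET is omitted because the checkpoint used is not specified.

```python
def score_translations(hypotheses: list[str], references: list[str]) -> dict:
    """Corpus-level chrF++, BLEU, and TER for one translation direction.

    Uses sacreBLEU defaults; these may differ from the settings behind
    the numbers reported in this card.
    """
    from sacrebleu.metrics import BLEU, CHRF, TER

    refs = [references]  # sacreBLEU expects a list of reference streams
    return {
        "chrF++": CHRF(word_order=2).corpus_score(hypotheses, refs).score,
        "BLEU": BLEU().corpus_score(hypotheses, refs).score,
        "TER": TER().corpus_score(hypotheses, refs).score,
    }
```

Setting `word_order=2` is what turns chrF into chrF++. Scores should be computed separately for eu->ca and ca->eu, since the two directions differ substantially.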
Limitations
- Uses synthetic supervision rather than human-translated in-domain CA-EU literary parallel data
- Literary quality is only partially reflected by overlap-based metrics
- CA->EU remains the harder literary direction in the reported experiments
Usage
import torch
from peft import PeftModel
from transformers import AutoTokenizer, Qwen3VLForConditionalGeneration

base_id = "HiTZ/Latxa-Qwen3-VL-8B-Instruct"
adapter_id = "pguerrero-igutierrez/Latxa-Qwen3-8B-Literary-v1-ca-eu"

# Load the tokenizer and base model, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
base_model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_id,
    device_map="auto",
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Translate one Catalan sentence into Basque using the ca->eu prompt.
prompt = "Tradueix aquest text del català al basc:\n\nLa nit era tranquil·la."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# The decoded string contains the prompt followed by the generated translation.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Citation
@misc{guerrero-gutierrez-2026-caeu-mt,
  title  = {Domain Adaptation for Catalan-Basque Machine Translation via Synthetic Data and Continued Fine-Tuning},
  author = {Guerrero, Paula and Gutierrez, Iker},
  year   = {2026},
  note   = {Unpublished manuscript}
}
Contact
- Paula Guerrero: pguerrero005@ikasle.ehu.eus
- Iker Gutierrez: igutierrez134@ikasle.ehu.eus