ThaiLLM-Dev/openthaigpt-thaillm-8b-instruct-thaitravel-v0.0.5 API & Inference Endpoint

Evaluation

All scores use greedy decoding (temperature=0), evaluated as general_mcq via EvalScope on a local vLLM deployment. "Harness" is the strict-parser score; "corrected" additionally credits answers that the model clearly gave in a non-standard form (a Thai letter, or "the answer is X"), as recovered by a transparent Thai-aware re-scorer.
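The difference between the two scores can be illustrated with a minimal sketch. This is not the actual re-scorer: the regular expressions, the function names, and the mapping of the Thai letters ก/ข/ค/ง to A/B/C/D are all assumptions made for illustration.

```python
import re

# Assumed mapping from Thai option letters to Latin ones (not stated in the model card).
THAI_TO_LATIN = {"ก": "A", "ข": "B", "ค": "C", "ง": "D"}

# Strict form: the output must end with "ANSWER: X".
STRICT = re.compile(r"ANSWER:\s*([ABCD])\s*$")

# Hypothetical lenient patterns for answers given in a non-standard form.
LENIENT = [
    # "the answer is C" (phrase is case-insensitive, the letter must be uppercase)
    re.compile(r"(?i:the answer is)\s*\(?([ABCD])\)?(?![A-Za-z])"),
    # "คำตอบคือ ข" / "ตอบ ข้อ ค"; the lookahead rejects a letter that starts a Thai word
    re.compile(r"(?:คำตอบคือ|ตอบ)\s*(?:ข้อ)?\s*\(?([ABCDกขคง])\)?(?![\u0E00-\u0E7F])"),
]


def strict_parse(text):
    """Return the letter only if the output ends with 'ANSWER: X'."""
    match = STRICT.search(text.strip())
    return match.group(1) if match else None


def lenient_parse(text):
    """Try the strict form first, then fall back to the lenient patterns."""
    letter = strict_parse(text)
    if letter:
        return letter
    for pattern in LENIENT:
        match = pattern.search(text)
        if match:
            found = match.group(1)
            return THAI_TO_LATIN.get(found, found)
    return None


def score(outputs, golds, parser):
    """Fraction of outputs whose parsed letter equals the gold letter."""
    hits = sum(parser(out) == gold for out, gold in zip(outputs, golds))
    return hits / len(golds)


# Toy outputs, not taken from the evaluation sets.
outputs = ["Reasoning... ANSWER: B", "The answer is C", "คำตอบคือ ข", "I am not sure"]
golds = ["B", "C", "B", "A"]

harness = score(outputs, golds, strict_parse)
corrected = score(outputs, golds, lenient_parse)
print(harness, corrected)
```

When every output already ends in the strict form, as in the 135/135 and 483/483 compliance figures reported for v0.0.5, the two scores coincide.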

Thai Travel QA v2 (135 hand-curated MCQ — broad tourism knowledge)

Model	Accuracy
thaitravel-v0.0.5 (this model)	86.67%
qwen3.6-35b (reference, 35B)	83.70%
thaitravel-v0.0.1	82.22%
thaitravel-v0.0.2	80.74%
thaitravel-v0.0.4	78.52%
thaitravel-v0.0.3	72.59%

Thai Travel QA v3 (483 Wikipedia-synthetic balanced MCQ)

Model	Accuracy
thaitravel-v0.0.5 (this model)	57.35%
thaitravel-v0.0.3	54.24%
thaitravel-v0.0.4	50.10%

Detailed breakdown (v0.0.5)

v2 — harness 86.67%, corrected 86.67% · format compliance 135/135 · by gold letter A 85% / B 93% / C 84% / D 80%
v3 — harness 57.35%, corrected 57.35% · format compliance 483/483 · by gold letter A 55% / B 57% / C 54% / D 63%
v2 by category: แหล่งท่องเที่ยว (tourist attractions) 90.6% (n=53) · วัฒนธรรมและประเพณี (culture and traditions) 93.5% (n=46) · อาหารและเครื่องดื่ม (food and drink) 72.2% (n=36)
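As a consistency check, the per-category figures can be turned back into correct-answer counts and summed. The short sketch below uses only the numbers reported above; the English dictionary keys are glosses of the Thai category names, added here for readability.

```python
# (accuracy in %, number of questions) per v2 category, as reported above.
categories = {
    "tourist_attractions": (90.6, 53),
    "culture_and_traditions": (93.5, 46),
    "food_and_drink": (72.2, 36),
}


def correct_count(pct, n):
    """Recover the integer number of correct answers from a rounded percentage."""
    return round(pct / 100 * n)


total_n = sum(n for _, n in categories.values())
total_correct = sum(correct_count(pct, n) for pct, n in categories.values())
overall = 100 * total_correct / total_n

print(total_n, total_correct, f"{overall:.2f}%")  # 135 117 86.67%
```

The three categories cover all 135 questions, and the recovered counts (48 + 43 + 26 = 117) reproduce the 86.67% harness score.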

Honest note on the ceiling. 90% on both sets is not attainable for an 8B model here: even the 35B reference scores only 83.7% on v2, and v3 is a held-out generalization test (its questions come from Wikipedia articles deliberately excluded from training). v0.0.5 instead aims to raise both scores legitimately, by fixing parse and answer-position losses and by broadening knowledge, with no training on the test data.

Training

Base model: OpenThaiGPT-ThaiLLM-8B-ThaiKnowledge-v7.2
Method: LoRA — rank 64, alpha 128, dropout 0.05, target_modules=all-linear
Optimizer: AdamW (fused), lr 1e-4, cosine, warmup 5%, weight decay 0.1
Schedule: 3 epochs, max_length 4096, effective batch size 8
Hardware: 4× H100 80 GB (DDP)
Framework: ms-swift
Training data: 43,817 instruction pairs — the v0.0.4 corpus plus new landmark/food/geography Q/A (TAT, ChillPaiNai, PaiDuayKan) and a 3,000-example format/position-debias MCQ slice (balanced A/B/C/D, always ending ANSWER: X). Deduplicated and leak-checked against both evaluation sets (0 leaks).
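The model card does not describe how deduplication, leak checking, or the position-debias slice were implemented. The sketch below shows one plausible approach under stated assumptions: exact matching on normalized question text, and a gold letter that cycles through A/B/C/D. All function names and the toy data are hypothetical.

```python
import unicodedata
from collections import Counter


def normalize(text):
    """NFKC-normalize, casefold, and drop whitespace and punctuation."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return "".join(
        ch for ch in text
        if not ch.isspace() and unicodedata.category(ch)[0] not in ("P", "Z")
    )


def dedupe(pairs):
    """Keep the first (question, answer) pair for each normalized question."""
    seen, kept = set(), []
    for question, answer in pairs:
        key = normalize(question)
        if key not in seen:
            seen.add(key)
            kept.append((question, answer))
    return kept


def find_leaks(train_questions, eval_questions):
    """Return training questions whose normalized text also appears in the eval set."""
    eval_keys = {normalize(q) for q in eval_questions}
    return [q for q in train_questions if normalize(q) in eval_keys]


def make_mcq(question, correct, distractors, gold_letter):
    """Place the correct option at gold_letter; the target always ends with 'ANSWER: X'."""
    options = list(distractors)
    options.insert("ABCD".index(gold_letter), correct)
    lines = [question] + [f"{letter}. {opt}" for letter, opt in zip("ABCD", options)]
    return "\n".join(lines), f"ANSWER: {gold_letter}"


# Toy data for illustration only.
train = [
    ("Which province is Doi Inthanon in?", "Chiang Mai"),
    ("which province is  Doi Inthanon in", "Chiang Mai"),  # near-duplicate
    ("What is the capital of Thailand?", "Bangkok"),
]
eval_questions = ["What is the capital of Thailand"]

kept = dedupe(train)
leaks = find_leaks([q for q, _ in kept], eval_questions)

# Cycling the gold letter yields a slice balanced across A/B/C/D.
gold_letters = ["ABCD"[i % 4] for i in range(8)]
mcqs = [make_mcq("Q?", "right", ["w1", "w2", "w3"], g) for g in gold_letters]
balance = Counter(target for _, target in mcqs)
print(len(kept), leaks, dict(balance))
```

Exact matching after normalization only catches verbatim or near-verbatim overlap; paraphrased leaks would need fuzzy or embedding-based matching, which this sketch does not attempt.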

Usage

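```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ThaiLLM-Dev/openthaigpt-thaillm-8b-instruct-thaitravel-v0.0.5"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

# Thai prompt: "Recommend tourist attractions in Chiang Mai province"
messages = [{"role": "user", "content": "แนะนำสถานที่ท่องเที่ยวในจังหวัดเชียงใหม่"}]
# return_dict=True also returns the attention mask, which is passed on to generate()
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
# Greedy decoding, matching the evaluation setting
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[-1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
```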
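Since the evaluation ran against a local vLLM deployment, the model can presumably also be queried over vLLM's OpenAI-compatible HTTP API. The sketch below builds a greedy chat request with the standard library only. The base URL (vLLM's default port on localhost) and the assumption that the server registers the model under its repository id are not confirmed by the model card; `build_payload` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

MODEL_ID = "ThaiLLM-Dev/openthaigpt-thaillm-8b-instruct-thaitravel-v0.0.5"
# Assumption: a vLLM OpenAI-compatible server is running locally on the default port.
BASE_URL = "http://localhost:8000/v1"


def build_payload(prompt, max_tokens=512):
    """Build a chat-completions request body with greedy decoding (temperature 0)."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }


def chat(prompt):
    """POST the request and return the assistant's reply (needs a running server)."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the payload needs no server; call chat(...) once one is running.
payload = build_payload("แนะนำสถานที่ท่องเที่ยวในจังหวัดเชียงใหม่")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Setting `temperature` to 0 mirrors the greedy decoding used for the reported scores.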
