sitthisak17sm

qwen3-06b-th-distill-lora

License: apache-2.0

Base / Teacher

Base model: Qwen/Qwen3-0.6B
Teacher model used for synthetic generation: Qwen/Qwen3-32B
Adapter type: LoRA
Trainable parameters: 10,092,544
Training world size: 4 GPUs
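
The trainable-parameter count can be sanity-checked against the base architecture. The sketch below assumes the publicly documented Qwen3-0.6B shape (hidden size 1024, 28 layers, 16 query heads and 8 key/value heads of dimension 128, MLP width 3072) and assumes LoRA was applied to all seven linear projections in every decoder layer. The LoRA rank and target modules are not stated in this card, so treat both as guesses; under these assumptions, rank 16 reproduces the reported 10,092,544 exactly.

```python
# (in_features, out_features) of each linear projection in one Qwen3-0.6B
# decoder layer. These shapes come from the public model config and are an
# assumption here, not something this card states.
HIDDEN = 1024
Q_OUT = 16 * 128   # 16 query heads x head_dim 128
KV_OUT = 8 * 128   # 8 key/value heads x head_dim 128
MLP = 3072
NUM_LAYERS = 28

PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for each adapted matrix."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


print(lora_param_count(16))  # 10092544
```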

Data

The experiment sampled Thai text from an internal path:

/project/zz992000-zdevb/pretrain_raw/th

Data used in this run:

Teacher generation prompts: 1,000
Distillation train records: 950
Distillation eval records: 50
API eval subset reported below: 300 samples
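
The 950/50 split above corresponds to holding out 5% of the 1,000 teacher-generated records. How the split was actually made is not described in this card; the following is a minimal sketch of a seeded random hold-out, with `split_records`, the seed, and the placeholder records all hypothetical.

```python
import random


def split_records(records, eval_size=50, seed=0):
    """Shuffle a copy of the records and hold out eval_size of them."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[eval_size:], shuffled[:eval_size]


# Placeholder records standing in for the teacher-generated examples.
records = [{"id": i} for i in range(1000)]
train, eval_ = split_records(records)
print(len(train), len(eval_))  # 950 50
```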

Evaluation Summary

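This is an early proof-of-concept (PoC) adapter. In the current run, the adapter lowered teacher-style SFT eval loss (1.9157 to 1.6338) but degraded structured and exact-answer behavior relative to the base model: exact/accuracy fell from 0.6311 to 0.4800 and JSON validity from 0.5733 to 0.0667.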

Metric	Teacher Qwen3-32B (first 300 completed)	Base Qwen3-0.6B (300 samples)	Student LoRA (300 samples)
exact / accuracy	0.6476	0.6311	0.4800
JSON validity	0.2740	0.5733	0.0667
Thai fluency score	0.8700	0.8362	0.7956
serve tokens/sec	212.40	1029.27	362.05
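
This card does not define how "JSON validity" was scored. One plausible reading, sketched below as an assumption rather than the actual evaluation code, is the fraction of model outputs that parse as JSON. The function name and the sample strings are illustrative only.

```python
import json


def json_validity(predictions):
    """Fraction of predictions that parse as JSON after stripping whitespace."""
    if not predictions:
        return 0.0
    valid = 0
    for text in predictions:
        try:
            json.loads(text.strip())
            valid += 1
        except json.JSONDecodeError:
            pass  # free-form text, truncated JSON, or an empty string
    return valid / len(predictions)


# Two of these three made-up outputs are valid JSON.
samples = ['{"answer": "ก"}', "คำตอบคือ ก", "[1, 2, 3]"]
print(json_validity(samples))
```

A stricter scorer might also require a JSON object with specific keys; under a check like this one, any prose wrapped around the JSON counts as a failure.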

Training loss/eval loss:

Base eval loss before LoRA training: 1.9157
Student eval loss after LoRA training: 1.6338
Train loss: 1.6964
Estimated train tokens/sec: 17,365
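
To put the eval losses on a more intuitive scale, they can be converted to perplexity. This assumes the reported values are mean per-token cross-entropy in nats, which is common for PyTorch-based trainers but is not stated in this card. Under that assumption, perplexity drops from roughly 6.8 to 5.1.

```python
import math

base_eval_loss = 1.9157     # before LoRA training
student_eval_loss = 1.6338  # after LoRA training

# Perplexity is exp(mean cross-entropy) when the loss is measured in nats.
base_ppl = math.exp(base_eval_loss)
student_ppl = math.exp(student_eval_loss)
relative_drop = (base_eval_loss - student_eval_loss) / base_eval_loss

print(f"base perplexity:    {base_ppl:.2f}")
print(f"student perplexity: {student_ppl:.2f}")
print(f"relative loss drop: {relative_drop:.1%}")
```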

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Qwen/Qwen3-0.6B"
adapter_id = "<your-namespace>/qwen3-06b-th-distill-lora"

# Load the tokenizer from the adapter repo and the weights from the base model,
# then attach the LoRA adapter on top.
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto", torch_dtype="auto")
model = PeftModel.from_pretrained(model, adapter_id)

# Thai prompt: "Summarize this text briefly in Thai"
messages = [{"role": "user", "content": "สรุปข้อความนี้เป็นภาษาไทยสั้น ๆ"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# temperature only takes effect when sampling is enabled, so set do_sample explicitly.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For vLLM:

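```bash
# The adapter is exposed under the name "student"; the base model keeps its
# default name. Reusing "student" for --served-model-name as well would make
# the name ambiguous between the base model and the adapter.
vllm serve Qwen/Qwen3-0.6B \
  --enable-lora \
  --lora-modules student=<your-namespace>/qwen3-06b-th-distill-lora \
  --max-model-len 2048 \
  --trust-remote-code
```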
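
Once the server is running, the adapter can be queried through vLLM's OpenAI-compatible chat endpoint by passing the LoRA module name as the model. The sketch below uses only the standard library and assumes the server listens on vLLM's default `localhost:8000`. `BASE_URL`, `build_chat_request`, and `chat` are illustrative names, not part of this repository.

```python
import json
import urllib.request

# Assumption: the server was started with the command above on the default
# host and port. Adjust to match your deployment.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, model="student", max_tokens=256, temperature=0.2):
    """Build an OpenAI-style chat payload; `model` selects the LoRA module."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(prompt, **kwargs):
    """POST the payload to /chat/completions and return the generated text."""
    body = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


# Requires a running server:
# print(chat("สรุปข้อความนี้เป็นภาษาไทยสั้น ๆ"))
```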

Limitations

This adapter is not recommended for production yet.
JSON/exact-answer behavior is worse than the base model in this run.
The teacher's full 804-sample API eval was interrupted by Slurm/account scheduling, so the teacher numbers here come from the first 300 completed predictions.
The training data source is internal and not uploaded with this model.

Suggested Next Iteration

Mask loss to assistant-only tokens.
Add format-specific JSON/exact-answer examples.
Filter or shorten teacher outputs before SFT.
Train on a larger, cleaner Thai instruction dataset.
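
The first item, assistant-only loss masking, usually means setting the label of every non-assistant token to the ignore index so that the prompt does not contribute to the loss. The sketch below shows only that masking step on a toy example. The token ids are made up, and deriving `assistant_mask` from the chat template is left out.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_non_assistant(input_ids, assistant_mask):
    """Copy input_ids into labels, ignoring every position outside assistant turns."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, assistant_mask)]


# Toy example: the first three tokens are the user prompt, the rest the reply.
input_ids = [11, 12, 13, 21, 22, 23, 24]
assistant_mask = [0, 0, 0, 1, 1, 1, 1]
labels = mask_non_assistant(input_ids, assistant_mask)
print(labels)  # [-100, -100, -100, 21, 22, 23, 24]
```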
