License: apache-2.0
Base / Teacher
- Base model: Qwen/Qwen3-0.6B
- Teacher model used for synthetic generation: Qwen/Qwen3-32B
- Adapter type: LoRA
- Trainable parameters: 10,092,544
- Training world size: 4 GPUs
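The trainable-parameter count can be sanity-checked with a little arithmetic. The sketch below assumes the published Qwen3-0.6B dimensions (28 layers, hidden size 1024, 16 query heads and 8 key/value heads of dimension 128, MLP width 3072) and assumes LoRA was attached to all seven linear projections in every layer. Neither the rank nor the target modules are stated in this card, so both are assumptions.

```python
# Assumed Qwen3-0.6B dimensions (not stated in this card).
NUM_LAYERS = 28
HIDDEN = 1024
Q_DIM = 16 * 128   # query heads * head_dim
KV_DIM = 8 * 128   # key/value heads * head_dim
MLP = 3072

# (in_features, out_features) of each projection assumed to carry a LoRA adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to each adapted projection."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values())
    return NUM_LAYERS * per_layer


if __name__ == "__main__":
    for rank in (8, 16, 32):
        print(rank, lora_param_count(rank))
```

Under these assumptions, rank 16 gives exactly 10,092,544 parameters, which is consistent with the reported figure but does not confirm the actual configuration.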
Data
The experiment sampled Thai text from an internal path:
/project/zz992000-zdevb/pretrain_raw/th
Data used in this run:
- Teacher generation prompts: 1,000
- Distillation train records: 950
- Distillation eval records: 50
- API eval subset reported below: 300 samples
Evaluation Summary
This is an early proof-of-concept (PoC) adapter. In the current run, LoRA training lowered eval loss on the teacher-style SFT data, but the adapter scored below the base model on exact-answer accuracy, JSON validity, and Thai fluency.
| Metric | Teacher Qwen3-32B (partial run, first 300 samples) | Base Qwen3-0.6B (300 samples) | Student LoRA (300 samples) |
|---|---|---|---|
| exact / accuracy | 0.6476 | 0.6311 | 0.4800 |
| JSON validity | 0.2740 | 0.5733 | 0.0667 |
| Thai fluency score | 0.8700 | 0.8362 | 0.7956 |
| serve tokens/sec | 212.40 | 1029.27 | 362.05 |
| errors | 0 | 0 | 0 |
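To make the gap between student and base concrete, the table values can be compared directly. The snippet below only reuses the numbers reported above; the dictionary layout and function name are illustrative.

```python
# Values copied from the evaluation table above.
BASE = {"exact": 0.6311, "json_validity": 0.5733, "thai_fluency": 0.8362, "tokens_per_sec": 1029.27}
STUDENT = {"exact": 0.4800, "json_validity": 0.0667, "thai_fluency": 0.7956, "tokens_per_sec": 362.05}


def deltas(base: dict, student: dict) -> dict:
    """Absolute change (student - base) and student/base ratio for each metric."""
    return {
        name: {
            "abs": round(student[name] - base[name], 4),
            "ratio": round(student[name] / base[name], 3),
        }
        for name in base
    }


if __name__ == "__main__":
    for name, change in deltas(BASE, STUDENT).items():
        print(f"{name}: {change['abs']:+.4f} absolute, {change['ratio']:.3f}x base")
```

On these numbers the student loses about 15 points of exact-answer accuracy and about 51 points of JSON validity, and serves at roughly 35% of the base model's throughput.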
Training and eval loss:
- Base eval loss before LoRA training: 1.9157
- Student eval loss after LoRA training: 1.6338
- Train loss: 1.6964
- Estimated train tokens/sec: 17,365
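If the reported losses are mean per-token cross-entropy in nats (the usual convention for causal language model training, but an assumption here), they convert to perplexity via exp(loss). A small sketch of that conversion:

```python
import math

BASE_EVAL_LOSS = 1.9157     # before LoRA training, from the list above
STUDENT_EVAL_LOSS = 1.6338  # after LoRA training, from the list above


def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss given in nats."""
    return math.exp(loss)


def relative_drop(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    print(f"base perplexity:    {perplexity(BASE_EVAL_LOSS):.2f}")
    print(f"student perplexity: {perplexity(STUDENT_EVAL_LOSS):.2f}")
    print(f"eval loss drop:     {relative_drop(BASE_EVAL_LOSS, STUDENT_EVAL_LOSS):.1%}")
```

Under that assumption, eval perplexity falls from roughly 6.8 to roughly 5.1, a loss reduction of about 15%. This measures fit to teacher-style text only and says nothing about the structured-output metrics in the table.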
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Qwen/Qwen3-0.6B"
adapter_id = "<your-namespace>/qwen3-06b-th-distill-lora"

# Load the tokenizer from the adapter repo and the weights from the base model,
# then attach the LoRA adapter on top of the base weights.
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto", torch_dtype="auto")
model = PeftModel.from_pretrained(model, adapter_id)

# Prompt (Thai): "Summarize this text briefly in Thai"
messages = [{"role": "user", "content": "สรุปข้อความนี้เป็นภาษาไทยสั้น ๆ"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For vLLM:
```bash
vllm serve Qwen/Qwen3-0.6B \
  --enable-lora \
  --lora-modules student=<your-namespace>/qwen3-06b-th-distill-lora \
  --served-model-name student \
  --max-model-len 2048 \
  --trust-remote-code
```
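Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the default address http://localhost:8000 and reuses the model name `student` from the command above; the helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "student", max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }


def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Assumed address: requires the vLLM server above to be running locally.
    print(chat("สรุปข้อความนี้เป็นภาษาไทยสั้น ๆ"))
```

Because the command uses `student` both as the served model name and as the LoRA module name, it is worth listing `/v1/models` on the running server to confirm which weights that name resolves to.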
Limitations
- This adapter is not recommended for production yet.
- JSON/exact-answer behavior is worse than the base model in this run.
- The full 804-sample teacher API eval was interrupted by Slurm/account scheduling, so the teacher numbers here come from the first 300 completed predictions.
- The training data source is internal and not uploaded with this model.
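This card does not say how JSON validity was scored. One simple definition, assumed here purely for illustration, is the fraction of model outputs that parse as JSON in their entirety. Tracking such a number on a held-out set would make the regression easy to monitor across iterations.

```python
import json


def is_valid_json(text: str) -> bool:
    """True if the whole string (ignoring surrounding whitespace) parses as JSON."""
    try:
        json.loads(text.strip())
    except json.JSONDecodeError:
        return False
    return True


def json_validity(outputs: list[str]) -> float:
    """Fraction of outputs that are valid JSON; 0.0 for an empty list."""
    if not outputs:
        return 0.0
    return sum(is_valid_json(output) for output in outputs) / len(outputs)


if __name__ == "__main__":
    # Hypothetical model outputs: two parse, two do not.
    samples = ['{"answer": "ก"}', 'Sure! {"answer": 1}', "{answer: 1}", "[1, 2, 3]"]
    print(json_validity(samples))  # 0.5
```

This strict definition counts an answer wrapped in extra prose as invalid, which may or may not match the scoring used for the table.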
Suggested Next Iteration
- Mask loss to assistant-only tokens.
- Add format-specific JSON/exact-answer examples.
- Filter or shorten teacher outputs before SFT.
- Train on a larger, cleaner Thai instruction dataset.
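The first suggestion, assistant-only loss masking, usually means setting the label of every prompt token to -100, the index that PyTorch's cross-entropy loss ignores by default. The following is a minimal sketch on plain token-id lists, assuming each example has already been tokenized and split into prompt and response ids. The function name and the toy ids are hypothetical, and this is not the training code used for this adapter.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_masked_example(prompt_ids: list[int], response_ids: list[int]) -> dict:
    """Concatenate prompt and response; only response tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    # Labels stay aligned with input_ids. Causal LM heads in Transformers
    # shift them by one position internally when computing the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


if __name__ == "__main__":
    example = build_masked_example(prompt_ids=[11, 12, 13], response_ids=[21, 22])
    print(example["input_ids"])  # [11, 12, 13, 21, 22]
    print(example["labels"])     # [-100, -100, -100, 21, 22]
```

With this masking, the model is no longer trained to reproduce the user prompt, so the gradient signal concentrates on the assistant's answer format.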
Model provider
sitthisak17sm