README

License: apache-2.0

Base / Teacher

  • Base model: Qwen/Qwen3-0.6B
  • Teacher model used for synthetic generation: Qwen/Qwen3-32B
  • Adapter type: LoRA
  • Trainable parameters: 10,092,544
  • Training world size: 4 GPUs
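The reported trainable-parameter count can be sanity-checked with a little arithmetic. The sketch below assumes the layer shapes published in the Qwen3-0.6B config (hidden size 1024, 28 layers, 16 query heads and 8 key/value heads of dimension 128, MLP width 3072) and assumes LoRA was attached to all seven linear projections in every decoder layer. Neither the rank nor the target modules are stated in this card, so both are assumptions.

```python
# Sanity check of the reported LoRA parameter count.
# ASSUMPTIONS (not stated in this card): Qwen3-0.6B layer shapes as published
# in its config, and LoRA applied to every linear projection in each layer.
HIDDEN = 1024
LAYERS = 28
Q_OUT = 16 * 128   # query heads * head_dim
KV_OUT = 8 * 128   # key/value heads * head_dim
MLP = 3072

# (in_features, out_features) for each adapted projection in one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """A LoRA pair (A, B) adds rank * (in + out) weights per adapted matrix."""
    per_layer = sum(rank * (i + o) for i, o in PROJECTIONS.values())
    return per_layer * LAYERS


if __name__ == "__main__":
    for rank in (8, 16, 32):
        print(rank, lora_param_count(rank))
```

Under these assumptions, rank 16 gives exactly 10,092,544 parameters, which is consistent with the figure above but does not confirm the configuration that was actually used.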

Data

The experiment sampled Thai text from an internal path:

/project/zz992000-zdevb/pretrain_raw/th

Data used in this run:

  • Teacher generation prompts: 1,000
  • Distillation train records: 950
  • Distillation eval records: 50
  • API eval subset reported below: 300 samples
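The 950/50 figures correspond to a 95/5 split of the 1,000 teacher generations. A minimal sketch of a reproducible split follows; the function name, seed, and record shape are hypothetical, and the card does not say how the actual split was drawn.

```python
import random


def split_records(records, eval_size=50, seed=0):
    """Shuffle a copy of the records with a fixed seed and hold out eval_size items."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[eval_size:], shuffled[:eval_size]


# Placeholder records standing in for the 1,000 teacher generations.
records = [{"id": i} for i in range(1000)]
train, held_out = split_records(records)
print(len(train), len(held_out))
```

Fixing the seed keeps the eval set stable across reruns, so eval loss stays comparable between experiments.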

Evaluation Summary

This is an early proof-of-concept (PoC) adapter. In this run, LoRA training lowered eval loss on the teacher-style SFT data, but the adapter scored below the base model on exact-answer accuracy and JSON validity.

Metric               Teacher Qwen3-32B (partial, 300)   Base Qwen3-0.6B (300)   Student LoRA (300)
exact / accuracy     0.6476                             0.6311                  0.4800
JSON validity        0.2740                             0.5733                  0.0667
Thai fluency score   0.8700                             0.8362                  0.7956
serve tokens/sec     212.40                             1029.27                 362.05
errors               0                                  0                       0
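To make the regression easier to read, the student-versus-base differences can be computed directly from the scores above. This is plain arithmetic on the reported numbers; the helper name is illustrative only.

```python
# Scores copied from the evaluation table: (teacher, base, student).
RESULTS = {
    "exact / accuracy": (0.6476, 0.6311, 0.4800),
    "JSON validity": (0.2740, 0.5733, 0.0667),
    "Thai fluency score": (0.8700, 0.8362, 0.7956),
    "serve tokens/sec": (212.40, 1029.27, 362.05),
}


def student_vs_base(results):
    """Return (absolute change, relative change) of the student against the base."""
    out = {}
    for metric, (_teacher, base, student) in results.items():
        out[metric] = (student - base, (student - base) / base)
    return out


for metric, (abs_delta, rel_delta) in student_vs_base(RESULTS).items():
    print(f"{metric:20s} {abs_delta:+10.4f} {rel_delta:+8.1%}")
```

Note that the teacher column comes from a partial run, and that the teacher itself scores below the base model on JSON validity in these numbers.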

Training loss/eval loss:

  • Base eval loss before LoRA training: 1.9157
  • Student eval loss after LoRA training: 1.6338
  • Train loss: 1.6964
  • Estimated train tokens/sec: 17,365
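The loss figures are easier to interpret as perplexities. The sketch below assumes the reported losses are mean per-token cross-entropy in nats, which is the usual convention but is not stated in this card.

```python
import math

base_eval_loss = 1.9157     # before LoRA training
student_eval_loss = 1.6338  # after LoRA training

# ASSUMPTION: losses are mean per-token cross-entropy in nats.
base_ppl = math.exp(base_eval_loss)
student_ppl = math.exp(student_eval_loss)
relative_loss_drop = (base_eval_loss - student_eval_loss) / base_eval_loss

print(f"base perplexity:    {base_ppl:.2f}")
print(f"student perplexity: {student_ppl:.2f}")
print(f"relative loss drop: {relative_loss_drop:.1%}")
```

Under that assumption, perplexity on the teacher-style eval set falls from about 6.79 to about 5.12, a loss reduction of roughly 15%.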

Usage

python

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Qwen/Qwen3-0.6B"
adapter_id = "<your-namespace>/qwen3-06b-th-distill-lora"

# The tokenizer is loaded from the adapter repo; use base_model instead
# if the adapter repo does not ship tokenizer files.
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
# Load the base weights, then attach the LoRA adapter on top.
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto", torch_dtype="auto")
model = PeftModel.from_pretrained(model, adapter_id)

# Prompt (Thai): "Summarize this text briefly in Thai"
messages = [{"role": "user", "content": "สรุปข้อความนี้เป็นภาษาไทยสั้น ๆ"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# temperature only takes effect when sampling is enabled, so set do_sample explicitly.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

For vLLM:

bash

vllm serve Qwen/Qwen3-0.6B \
--enable-lora \
--lora-modules student=<your-namespace>/qwen3-06b-th-distill-lora \
--served-model-name student \
--max-model-len 2048 \
--trust-remote-code
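Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the standard library and assumes the server runs on the same machine on vLLM's default port 8000; the helper names are illustrative, and the host, port, and sampling values should be adjusted to your deployment.

```python
import json
import urllib.request

# ASSUMPTIONS: the server was started with the command above on this machine,
# on vLLM's default port 8000, and "student" is the name requests should use.
BASE_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(prompt, model="student", max_tokens=256, temperature=0.2):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(prompt, url=BASE_URL, timeout=60):
    """POST the payload and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


# Example call (requires the server to be running):
# print(chat("สรุปข้อความนี้เป็นภาษาไทยสั้น ๆ"))
```

Because the command gives the same name, student, to both the served base model and the LoRA module, it is worth confirming with a quick comparison against the base model that responses actually come from the adapter.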

Limitations

  • This adapter is not recommended for production yet.
  • JSON/exact-answer behavior is worse than the base model in this run.
  • The full 804-sample teacher API eval was interrupted by Slurm/account scheduling, so the teacher numbers reported here cover only the first 300 completed predictions.
  • The training data source is internal and not uploaded with this model.
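The evaluation script is not included with this card, so the exact definition of "JSON validity" is unknown. One strict interpretation, shown here only as an assumption, counts a response as valid when the entire output parses as a JSON object or array. Both function names are hypothetical.

```python
import json


def is_valid_json(text: str) -> bool:
    """True if the whole response parses as a JSON object or array."""
    try:
        parsed = json.loads(text.strip())
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(parsed, (dict, list))


def json_validity(responses) -> float:
    """Fraction of responses that are valid JSON; 0.0 for an empty list."""
    if not responses:
        return 0.0
    return sum(is_valid_json(r) for r in responses) / len(responses)


# Made-up responses: two parse cleanly, one has leading prose, one is truncated.
samples = ['{"answer": "ก"}', 'คำตอบคือ {"answer": "ก"}', "[1, 2, 3]", '{"answer": ']
print(json_validity(samples))
```

A strict check like this penalizes any text wrapped around the JSON, so a more lenient scorer that extracts the first JSON span could report noticeably different numbers.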

Suggested Next Iteration

  • Mask loss to assistant-only tokens.
  • Add format-specific JSON/exact-answer examples.
  • Filter or shorten teacher outputs before SFT.
  • Train on a larger, cleaner Thai instruction dataset.
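The first suggestion, assistant-only loss masking, can be sketched in a few lines. The idea is to set the label of every non-assistant token to -100, the default ignore index of PyTorch's cross-entropy loss, so that only the reply contributes to the gradient. The function name and the toy token ids are hypothetical; in practice the mask would come from the chat template or from the token offsets of the assistant turn.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default


def mask_non_assistant(input_ids, assistant_mask):
    """Copy input_ids into labels, replacing non-assistant positions with IGNORE_INDEX."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, assistant_mask)]


# Toy example with made-up token ids: three prompt tokens, then three reply tokens.
input_ids = [101, 102, 103, 201, 202, 203]
assistant_mask = [0, 0, 0, 1, 1, 1]
labels = mask_non_assistant(input_ids, assistant_mask)
print(labels)  # [-100, -100, -100, 201, 202, 203]
```

With labels built this way, the model is no longer rewarded for reproducing the prompt, which makes eval loss a more direct measure of response quality.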

Model provider

sitthisak17sm

