Training
Table with columns: Parameter, Value| Parameter | Value |
|---|
| Base model | Qwen/Qwen3-30B-A3B-Instruct-2507 (bf16; 48 layers, 128 experts, hidden=2048, moe_intermediate=768) |
| Framework | PEFT 0.19.1 + trl.SFTTrainer (transformers ≥ 5.8, for @use_experts_implementation on Qwen3MoE) |
| Dataset | HuggingFaceTB/smoltalk (config all, train[:1000]) |
| Rank (r) | 8 |
| Alpha | 16 |
| Dropout | 0 (required by PEFT ParamWrapper) |
| Optimizer / schedule | AdamW, lr 5e-5, linear schedule, 10% warmup |
| Batch | per-device 1 × grad-accum 8 (effective batch 8) |
| Steps | 10 — bf16; deliberately under-trained, this is a loader exerciser, not a quality run |
| Hardware | single H100 80GB |
LoRA targets
- Attention (
target_modules): q_proj, k_proj, v_proj, o_proj
- MoE experts (batched 3D, via
target_parameters): experts.down_proj, experts.gate_up_proj
The target_parameters argument is what's new in PEFT 0.18+ — it wraps individual nn.Parameter tensors instead of whole nn.Modules, so you can LoRA the batched 3D expert weights that HF transformers ≥ 5.8 introduced for MoE classes (Qwen3MoeExperts, Lfm2MoeExperts, Mixtral, DeepSeek-V2, etc. — anything decorated with @use_experts_implementation). The expert tensors here have shape (num_experts=128, 2·moe_intermediate, hidden) for gate_up_proj and (num_experts, hidden, moe_intermediate) for down_proj.
Reproducing
The training script is bundled with this repo:
huggingface-cli download tugot17/qwen3-30b-a3b-peft18-moe-lora-demo train_qwen3_moe_demo.py --repo-type model --local-dir .
python train_qwen3_moe_demo.py --output-dir ./outputs/qwen3-30b-a3b-peft18-demo --max-steps 10
Or read it inline: train_qwen3_moe_demo.py. Self-contained — only depends on peft>=0.18, transformers>=5.8, trl, datasets, torch.
Shapes (per layer × 48 layers)
Table with columns: Key, Shape, Leaf| Key | Shape | Leaf |
|---|
...mlp.experts.base_layer.lora_A.weight | (1024, 2048) = (E·r, hidden) | gate_up_proj |
...mlp.experts.base_layer.lora_B.weight | (1536, 1024) = (2·moe_inter, r·E) | gate_up_proj |
...mlp.experts.lora_A.weight |
num_experts=128, r=8, hidden=2048, moe_intermediate=768. The inner (base_layer) slot holds gate_up_proj regardless of target_parameters order — PEFT wraps in HF-attribute-definition order, not user-config order.
Usage
SGLang
Requires a build with native PEFT 0.18+ batched-MoE LoRA loading.
Launch:
python -m sglang.launch_server \
--model-path Qwen/Qwen3-30B-A3B-Instruct-2507 \
--port 30000 \
--mem-fraction-static 0.85 \
--dtype bfloat16 \
--enable-lora --max-lora-rank 8 \
--lora-paths "qwen3demo=tugot17/qwen3-30b-a3b-peft18-moe-lora-demo" \
--lora-target-modules q_proj k_proj v_proj o_proj gate_up_proj down_proj
Generate with LoRA:
curl -sS http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "Write a short story set in an unusual location.",
"sampling_params": {"temperature": 0.0, "max_new_tokens": 128},
"lora_path": "qwen3demo"
}'
Generate without LoRA (omit lora_path):
curl -sS http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "Write a short story set in an unusual location.",
"sampling_params": {"temperature": 0.0, "max_new_tokens": 128}
}'
Sample outputs
Prompt: "Write a short story set in an unusual location." (temperature 0, max_new_tokens 128).
Base (Qwen/Qwen3-30B-A3B-Instruct-2507):
Title: The Library Beneath the Clock
Beneath the cobblestone square of the forgotten town of Hollowwick, where the clock tower had stopped ticking a century ago, there existed a library no map would ever show.
It wasn't built—it was grown. Its shelves spiraled upward like the roots of an ancient tree, carved from petrified wood and glowing faintly with bioluminescent moss. The air hummed with the quiet rustle of pages turning on their own, as if the books themselves were dreaming.
With this LoRA (tugot17/qwen3-30b-a3b-peft18-moe-lora-demo):
In the heart of the Amazon rainforest, where the canopy blots out the sun and the air hangs thick with humidity, there exists a hidden grove known only to the forest's oldest spirits. It is not marked on any map, nor does it appear on satellite images. The grove is a place of silence—no birds sing, no insects buzz, and even the wind seems to hold its breath.
This is where the Library of Forgotten Things is kept.
Greedy decoding diverges immediately and produces a visibly different story — confirming the adapter's MoE expert + attention weights are actually loaded and active, not silently dropped. This is the only quality claim — at 10 training steps, neither answer should be treated as "better" than the base.