License: apache-2.0
What it does
Instruction tuning collapses a model toward a few favorite answers. Ask the base Qwen3-30B-A3B-Instruct to pick a random integer between 1 and 100 and it returns 42, 47, or 1 most of the time. This adapter widens that distribution back out. Trained only on integers, it also diversifies colors, fruits, first names, words, emoji, animals, and card suits.
On GSM8K it increases the number of distinct solution paths the model produces for a problem, with no drop in accuracy.
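One way to put a number on "widening the distribution" is to sample the same prompt many times and summarize the answers. The sketch below is illustrative only: the function name `diversity_stats`, the choice of metrics (distinct count, share of the most common answer, Shannon entropy in bits), and the sample lists are assumptions, not the measurements reported for this adapter.

```python
import math
from collections import Counter


def diversity_stats(answers):
    """Summarize how spread out a list of sampled answers is."""
    if not answers:
        raise ValueError("answers must be non-empty")
    counts = Counter(answers)
    total = len(answers)
    probs = [c / total for c in counts.values()]
    # Shannon entropy of the empirical answer distribution, in bits.
    entropy_bits = -sum(p * math.log2(p) for p in probs)
    return {
        "distinct": len(counts),
        "top_share": max(probs),
        "entropy_bits": entropy_bits,
    }


# Hypothetical samples: a collapsed sampler vs. a more spread-out one.
collapsed = [42, 42, 47, 42, 1, 42, 47, 42]
spread = [3, 88, 42, 17, 64, 9, 71, 55]
print(diversity_stats(collapsed))
print(diversity_stats(spread))
```

For reference, a perfectly uniform sampler over 1 to 100 has an entropy of log2(100), about 6.64 bits, which the empirical estimate approaches only with many samples.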
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "Qwen/Qwen3-30B-A3B-Instruct-2507"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", device_map="auto")
# Attach the LoRA adapter to the base model.
model = PeftModel.from_pretrained(model, "scasella91/qwen3-30b-a3b-answer-diversity-lora")

messages = [
    {"role": "system", "content": "You are a uniform random sampler. Output one integer between 1 and 100, nothing else."},
    {"role": "user", "content": "Pick a random integer."},
]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
# Sample rather than decode greedily, so repeated calls can return different integers.
out = model.generate(input_ids, max_new_tokens=8, do_sample=True, temperature=1.0)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][input_ids.shape[1]:], skip_special_tokens=True))
```
Training
- Base: Qwen/Qwen3-30B-A3B-Instruct-2507 (30B total parameters, 3B active per token, mixture-of-experts).
- Method: GRPO (Group Relative Policy Optimization), 50 steps, via the Tinker API.
- Reward: within a group of sampled responses to the same prompt, reward outputs that spread probability mass across the answer space instead of repeating a few favorites.
- LoRA: rank 16, alpha 32, dropout 0, applied to all linear layers.
- Training task: "pick a random integer between 1 and 100." The other nine tasks were held out.
- Compute: about $25.
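The reward description above does not give an exact formula. As a rough illustration only, the sketch below scores each response in a group by how rare its answer is within that group, then standardizes rewards within the group, which is the normalization GRPO uses for advantages. The inverse-frequency reward, the use of population standard deviation, and the helper names are assumptions made for this example, not the reward used to train the adapter.

```python
from collections import Counter
from statistics import mean, pstdev


def rarity_rewards(group_answers):
    """Hypothetical reward: 1 / (count of this answer within the group)."""
    counts = Counter(group_answers)
    return [1.0 / counts[a] for a in group_answers]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward standardized within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids divide-by-zero
    return [(r - mu) / (sigma + eps) for r in rewards]


group = [42, 42, 42, 47, 7, 93]  # hypothetical answers sampled for one prompt
rewards = rarity_rewards(group)
advantages = group_relative_advantages(rewards)
for answer, r, adv in zip(group, rewards, advantages):
    print(f"{answer:>3}  reward={r:.2f}  advantage={adv:+.2f}")
```

Under this toy reward, answers repeated within the group get negative advantages and unique answers get positive ones, so the policy update pushes probability away from the favorites.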
Scope
Where it helps:
- Categorical answer-space diversity (pick one of N). Trained directly on one such task; transfers across domains.
- Reasoning-path diversity on grade-school math (GSM8K), without an accuracy cost.
What I have not measured:
- Open-ended generation such as essays, brainstorming, or product naming.
- Other model families or sizes.
- Instruction-following, safety, or tool-use behavior beyond MMLU and GSM8K accuracy, which held steady.
How it compares to sampling knobs: on categorical tasks, this beats raising temperature to 1.5. On GSM8K solution-path diversity, it reaches the same diversity as temperature 1.5 but preserves accuracy where temperature pays a small tax. The writeup has the side-by-side numbers.
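For context on the temperature baseline: sampling temperature divides the logits by T before the softmax, so T > 1 flattens the entire next-token distribution, including tokens that are poor choices for the task. The toy example below uses made-up logits, not model outputs, to show that flattening; the function names are illustrative.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


toy_logits = [4.0, 3.0, 1.0, 0.0, -2.0]  # made-up values for illustration
for t in (1.0, 1.5):
    p = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: top prob={max(p):.3f}, entropy={entropy_bits(p):.3f} bits")
```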
Acknowledgments
Inspired by exmergo/research-chatgpt-guesses-between-1-and-100, which documents the same not-actually-random behavior in ChatGPT asked to pick a number between 1 and 100.
License
Apache 2.0, matching the base model. This is a personal research artifact, not affiliated with or endorsed by my employer.
Model provider: scasella91
Base model: Qwen/Qwen3-30B-A3B-Instruct-2507 (this model is an adapter on top of it)
Modalities: text input, text output