What it does
Instruction tuning collapses a model toward a few favorite answers. Ask the unadapted Qwen3-30B-A3B-Instruct-2507 to pick a random integer between 1 and 100 and it returns 42, 47, or 1 most of the time. This adapter widens that distribution back out. Although it was trained only on integers, it also diversifies held-out categories such as colors, fruits, first names, words, emoji, animals, and card suits.
On GSM8K it increases the number of distinct solution paths the model produces for a problem, with no drop in accuracy.
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the unadapted instruct model, then attach the LoRA adapter on top of it.
base = "Qwen/Qwen3-30B-A3B-Instruct-2507"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, "scasella91/qwen3-30b-a3b-answer-diversity-lora")

messages = [
    {"role": "system", "content": "You are a uniform random sampler. Output one integer between 1 and 100, nothing else."},
    {"role": "user", "content": "Pick a random integer."},
]

# return_dict=True also returns the attention mask, so generate() gets both tensors.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
# Sample rather than decode greedily, so repeated calls can return different answers.
out = model.generate(**inputs, max_new_tokens=8, do_sample=True, temperature=1.0)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
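A single sample says little about diversity, so it can help to call the snippet above many times and summarize the answers. The helper below is a minimal sketch of one way to do that. The name `diversity_stats` and the two toy lists are illustrative assumptions, not part of this adapter's evaluation code, and the toy lists are not measured results.

```python
import math
from collections import Counter


def diversity_stats(samples):
    """Summarize how spread out a non-empty list of sampled answers is."""
    if not samples:
        raise ValueError("samples must not be empty")
    counts = Counter(s.strip() for s in samples)
    total = sum(counts.values())
    # Shannon entropy of the empirical answer distribution, in bits.
    entropy_bits = -sum((c / total) * math.log2(c / total) for c in counts.values())
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "distinct": len(counts),         # number of different answers seen
        "entropy_bits": entropy_bits,    # higher means more evenly spread
        "top_answer": top_answer,        # the single most frequent answer
        "top_share": top_count / total,  # fraction of samples it accounts for
    }


# Toy inputs only: stand-ins for decoded outputs collected from repeated generate() calls.
collapsed = ["42"] * 6 + ["47"] * 2
spread = [str(n) for n in range(1, 9)]

print(diversity_stats(collapsed))
print(diversity_stats(spread))
```

A uniform sampler over 1 to 100 would approach log2(100), about 6.64 bits, given enough samples, so entropy well below that points to collapse.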
Training
- Base: Qwen/Qwen3-30B-A3B-Instruct-2507 (30B total parameters, 3B active per token, mixture-of-experts).
- Method: GRPO (Group Relative Policy Optimization), 50 steps, via the Tinker API.
- Reward: within a group of sampled responses to the same prompt, reward outputs that spread probability mass across the answer space instead of repeating a few favorites.
- LoRA: rank 16, alpha 32, dropout 0, applied to all linear layers.
- Training task: "pick a random integer between 1 and 100." The other nine tasks were held out.
- Compute: about $25.
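The reward described above is defined over a group of samples, not over a single output. The sketch below shows one way such a reward could look, followed by the group-relative normalization that GRPO applies to rewards. It is an assumption for illustration only: the exact reward used in training is not specified here, and `rarity_rewards`, `group_relative_advantages`, and the toy group are hypothetical names and data.

```python
from collections import Counter
from statistics import mean, pstdev


def rarity_rewards(answers, low=1, high=100):
    """Hypothetical reward: valid answers that are rare within the group score higher."""
    parsed = []
    for a in answers:
        text = a.strip()
        value = int(text) if text.isascii() and text.isdigit() else None
        # Treat anything outside [low, high] or not a plain integer as invalid.
        parsed.append(value if value is not None and low <= value <= high else None)
    counts = Counter(v for v in parsed if v is not None)
    n = len(answers)
    # Invalid outputs get 0; valid ones get 1 minus the share of the group giving the same answer.
    return [0.0 if v is None else 1.0 - counts[v] / n for v in parsed]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of eight responses to the same prompt (made-up data).
group = ["42", "42", "42", "7", "banana", "42", "88", "42"]
rewards = rarity_rewards(group)
advantages = group_relative_advantages(rewards)

for answer, r, adv in zip(group, rewards, advantages):
    print(f"{answer:>6}  reward={r:.3f}  advantage={adv:+.2f}")
```

Under this toy reward, the repeated "42" ends up with a negative advantage and the rarer valid answers with a positive one, which is the direction of pressure the reward bullet describes.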
Scope
Where it helps:
- Categorical answer-space diversity (pick one of N). Trained directly on one such task; transfers across domains.
- Reasoning-path diversity on grade-school math (GSM8K), without an accuracy cost.
What I have not measured:
- Open-ended generation such as essays, brainstorming, or product naming.
- Other model families or sizes.
- Instruction-following, safety, or tool-use behavior beyond MMLU and GSM8K accuracy, which held steady.
How it compares to sampling knobs: on categorical tasks, the adapter beats raising temperature to 1.5. On GSM8K solution-path diversity, the adapter reaches the same diversity as temperature 1.5 but preserves accuracy, whereas temperature 1.5 pays a small accuracy tax. The writeup has the side-by-side numbers.
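For context on why temperature is a blunt knob, the sketch below applies standard temperature scaling to a made-up logit vector. The logits and function names are assumptions for illustration and are not taken from this model. Dividing logits by a temperature above 1 flattens the whole distribution, so entropy rises, but probability also shifts toward the lowest-scoring tokens, which is one plausible source of the accuracy tax mentioned above.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Standard temperature scaling: divide logits by T, then apply softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Made-up logits: two strong candidates, two weak ones, one poor one.
logits = [4.0, 3.0, 1.0, 0.0, -2.0]

for t in (1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: entropy={entropy_bits(probs):.3f} bits, "
          f"p(lowest-logit token)={probs[-1]:.4f}")
```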
Acknowledgments
Inspired by exmergo/research-chatgpt-guesses-between-1-and-100, which documents the same not-actually-random behavior in ChatGPT when it is asked to pick a number between 1 and 100.
License
Apache 2.0, matching the base model. This is a personal research artifact, not affiliated with or endorsed by my employer.