scasella91

qwen3-30b-a3b-answer-diversity-lora

License: apache-2.0

What it does

Instruction tuning collapses a model toward a few favorite answers. Ask the base model, Qwen3-30B-A3B-Instruct-2507, to pick a random integer between 1 and 100 and it returns 42, 47, or 1 most of the time. This adapter widens that distribution back out. Although it was trained only on integers, it also diversifies answers on held-out tasks such as colors, fruits, first names, words, emoji, animals, and card suits.
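One way to put a number on the collapse described above is to sample the same prompt many times and summarize the empirical answer distribution. The sketch below is illustrative only: the helper name `answer_diversity` and the sample lists are made up for this example and are not taken from the writeup or from measured model output.

```python
import math
from collections import Counter

def answer_diversity(samples, num_choices):
    """Summarize how evenly a list of sampled answers covers the answer space."""
    counts = Counter(samples)
    total = len(samples)
    # Shannon entropy (in bits) of the empirical answer distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "distinct": len(counts),
        "entropy_bits": entropy,
        # Approaches 1.0 only when answers are uniform over all num_choices options.
        "normalized_entropy": entropy / math.log2(num_choices),
        "top_answer_share": counts.most_common(1)[0][1] / total,
    }

# Made-up samples for illustration only, not measured model output.
collapsed = [42] * 6 + [47] * 3 + [1]
spread = list(range(1, 11))

print(answer_diversity(collapsed, num_choices=100))
print(answer_diversity(spread, num_choices=100))
```

With only a handful of samples the normalized entropy is capped well below 1.0, so a comparison between the base model and the adapter should use the same, reasonably large, number of samples for both.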

On GSM8K it increases the number of distinct solution paths the model produces for a problem, with no drop in accuracy.

Usage

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter on top of it.
base = "Qwen/Qwen3-30B-A3B-Instruct-2507"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, "scasella91/qwen3-30b-a3b-answer-diversity-lora")

# Ask for a single integer in 1-100, the task the adapter was trained on.
messages = [
    {"role": "system", "content": "You are a uniform random sampler. Output one integer between 1 and 100, nothing else."},
    {"role": "user", "content": "Pick a random integer."},
]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
# Sample rather than decode greedily; greedy decoding would return the same answer every time.
out = model.generate(input_ids, max_new_tokens=8, do_sample=True, temperature=1.0)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][input_ids.shape[1]:], skip_special_tokens=True))

Training

Base: Qwen/Qwen3-30B-A3B-Instruct-2507 (30B total parameters, 3B active per token, mixture-of-experts).
Method: GRPO (Group Relative Policy Optimization), 50 steps, via the Tinker API.
Reward: within a group of sampled responses to the same prompt, reward outputs that spread probability mass across the answer space instead of repeating a few favorites.
LoRA: rank 16, alpha 32, dropout 0, applied to all linear layers.
Training task: "pick a random integer between 1 and 100." The other nine tasks were held out.
Compute: about $25.
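
The card does not spell out the exact reward formula, so the following is only a hypothetical sketch of what a group-level diversity reward could look like, not the reward actually used in training. It assumes each response has already been parsed into a single answer string, scores a response by how rare its answer is within the group, and then applies the usual group-relative normalization from GRPO. The function names `rarity_rewards` and `group_advantages` are invented for this example.

```python
from collections import Counter
from statistics import mean, pstdev

def rarity_rewards(answers):
    """Hypothetical reward: 1 minus the share of the group that gave the same answer."""
    counts = Counter(answers)
    n = len(answers)
    return [1.0 - counts[a] / n for a in answers]

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center on the group mean, scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# One made-up group of six responses to the same prompt.
group = ["42", "42", "42", "17", "88", "42"]
rewards = rarity_rewards(group)
advantages = group_advantages(rewards)

for answer, r, a in zip(group, rewards, advantages):
    print(f"answer={answer} reward={r:.3f} advantage={a:+.3f}")
```

Under this assumed scheme, repeated favorites such as "42" receive negative advantages and rare answers receive positive ones, which pushes the policy toward spreading probability mass. A group in which every response is identical yields zero advantage for all members.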

Scope

Where it helps:

Categorical answer-space diversity (pick one of N). Trained directly on one such task; transfers across domains.
Reasoning-path diversity on grade-school math (GSM8K), without an accuracy cost.

What I have not measured:

Open-ended generation such as essays, brainstorming, or product naming.
Other model families or sizes.
Instruction-following, safety, or tool-use behavior beyond MMLU and GSM8K accuracy, which held steady.

How it compares to sampling knobs: on categorical tasks, the adapter yields more diverse answers than raising the temperature to 1.5. On GSM8K, it matches the solution-path diversity of temperature 1.5 while preserving accuracy, whereas the higher temperature costs a small amount of accuracy. The writeup has the side-by-side numbers.
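
For context on the temperature baseline: temperature divides the logits before the softmax, which flattens the whole next-token distribution at once rather than only the choice among valid answers. The sketch below shows that effect on toy logits. The logit values are arbitrary and chosen for illustration; they are not taken from the model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the sampling temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy_bits(probs):
    """Shannon entropy of a probability vector, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Arbitrary toy logits: two strong candidates and a tail of weak ones.
toy_logits = [4.0, 3.0, 1.0, 0.0, -1.0]

for t in (1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: top prob={max(probs):.3f}, entropy={entropy_bits(probs):.3f} bits")
```

Raising the temperature lowers the top probability and raises entropy, but it also gives more mass to the low-scoring tail, which is one plausible reason a temperature increase can trade accuracy for diversity.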

Acknowledgments

Inspired by exmergo/research-chatgpt-guesses-between-1-and-100, which documents the same not-actually-random behavior in ChatGPT asked to pick a number between 1 and 100.

License

Apache 2.0, matching the base model. This is a personal research artifact, not affiliated with or endorsed by my employer.

