README

License: apache-2.0

What it does

Instruction tuning collapses a model toward a few favorite answers. Ask the unmodified Qwen3-30B-A3B-Instruct-2507 to pick a random integer between 1 and 100 and it returns 42, 47, or 1 most of the time. This adapter widens that distribution back out. Although it was trained only on integers, it also diversifies colors, fruits, first names, words, emoji, animals, and card suits.

On GSM8K it increases the number of distinct solution paths the model produces for a problem, with no drop in accuracy.
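One way to put a number on the collapse described above is to sample the same prompt many times and summarize the answer histogram. The sketch below is an illustration, not the evaluation used for this adapter: the function name `diversity_stats` and the choice of metrics (distinct count, Shannon entropy in bits, share of the three most common answers) are assumptions, and the data is a toy stand-in rather than measured model output.

```python
import math
from collections import Counter

def diversity_stats(samples, top_k=3):
    """Summarize how spread out a list of sampled answers is (hypothetical helper)."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    n = len(samples)
    probs = [c / n for c in counts.values()]
    # Shannon entropy of the empirical distribution, in bits.
    entropy_bits = -sum(p * math.log2(p) for p in probs)
    # Fraction of all samples taken up by the top_k most common answers.
    top_k_mass = sum(c for _, c in counts.most_common(top_k)) / n
    return {"distinct": len(counts), "entropy_bits": entropy_bits, "top_k_mass": top_k_mass}

# Toy data only: a collapsed sampler versus a perfectly even one.
collapsed = [42] * 50 + [47] * 30 + [1] * 20
spread = list(range(1, 101))
print(diversity_stats(collapsed))
print(diversity_stats(spread))
```

For reference, a perfectly uniform sampler over 1 to 100 has an entropy of log2(100), roughly 6.64 bits, so values well below that indicate favorites.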

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer, then attach the LoRA adapter on top.
base = "Qwen/Qwen3-30B-A3B-Instruct-2507"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, "scasella91/qwen3-30b-a3b-answer-diversity-lora")

messages = [
    {"role": "system", "content": "You are a uniform random sampler. Output one integer between 1 and 100, nothing else."},
    {"role": "user", "content": "Pick a random integer."},
]

# Render the chat prompt as token IDs and sample one short completion.
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(input_ids, max_new_tokens=8, do_sample=True, temperature=1.0)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][input_ids.shape[1]:], skip_special_tokens=True))
```
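A single draw says little about the distribution, so in practice the generation call would be repeated and the decoded strings parsed into integers. The helper below sketches only that parsing and tallying step. `parse_integer` and `tally` are hypothetical names, and discarding anything that is not a bare integer between 1 and 100 is an assumption about how to treat malformed outputs.

```python
import re
from collections import Counter

def parse_integer(text, low=1, high=100):
    """Return the integer in `text` if it is a bare number in [low, high], else None."""
    match = re.fullmatch(r"\s*([0-9]+)\s*", text)
    if match is None:
        return None
    value = int(match.group(1))
    return value if low <= value <= high else None

def tally(outputs):
    """Count valid answers and report how many outputs were rejected."""
    parsed = [parse_integer(o) for o in outputs]
    valid = [v for v in parsed if v is not None]
    return Counter(valid), len(parsed) - len(valid)

# Hand-written strings standing in for decoded generations.
counts, rejected = tally(["42", " 7\n", "seven", "101", "42"])
print(counts, rejected)
```

Here `outputs` would come from decoding many completions, for example by calling `model.generate` in a loop or by passing `num_return_sequences` to it.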

Training

  • Base: Qwen/Qwen3-30B-A3B-Instruct-2507 (30B total parameters, 3B active per token, mixture-of-experts).
  • Method: GRPO (Group Relative Policy Optimization), 50 steps, via the Tinker API.
  • Reward: within a group of sampled responses to the same prompt, reward outputs that spread probability mass across the answer space instead of repeating a few favorites.
  • LoRA: rank 16, alpha 32, dropout 0, applied to all linear layers.
  • Training task: "pick a random integer between 1 and 100." The other nine tasks were held out.
  • Compute: about $25.
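The group-relative part of GRPO can be illustrated without a model: each response's reward is compared with the mean of its own group and scaled by the group's standard deviation. The sketch below pairs that normalization with a stand-in diversity reward in which answers that are rarer within the group score higher. That reward is an assumption made for illustration, not the reward used to train this adapter, and `rarity_rewards` and `group_relative_advantages` are hypothetical names.

```python
import statistics
from collections import Counter

def rarity_rewards(answers):
    """Stand-in reward (assumption): 1 minus the answer's share of its group."""
    counts = Counter(answers)
    n = len(answers)
    return [1.0 - counts[a] / n for a in answers]

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: reward minus group mean, divided by group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]

# One toy group of six sampled answers to the same prompt.
group = [42, 42, 42, 47, 17, 88]
rewards = rarity_rewards(group)
advantages = group_relative_advantages(rewards)
for answer, adv in zip(group, advantages):
    print(answer, round(adv, 2))
```

In this toy group the repeated answer receives a negative advantage and the one-off answers a positive one, which is the direction of pressure the reward bullet describes.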

Scope

Where it helps:

  • Categorical answer-space diversity (pick one of N). Trained directly on one such task; transfers across domains.
  • Reasoning-path diversity on grade-school math (GSM8K), without an accuracy cost.

What I have not measured:

  • Open-ended generation such as essays, brainstorming, or product naming.
  • Other model families or sizes.
  • Instruction-following, safety, or tool-use behavior beyond MMLU and GSM8K accuracy, which held steady.

How it compares to sampling knobs: on categorical tasks, the adapter beats raising the temperature to 1.5. On GSM8K solution-path diversity, it reaches the same diversity as temperature 1.5 while preserving accuracy, whereas the higher temperature costs a small amount of accuracy. The writeup has the side-by-side numbers.
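For context on the temperature baseline: sampling temperature divides the logits by T before the softmax, so T > 1 flattens the whole next-token distribution at once, including tokens that lead to wrong answers. The sketch below uses made-up logits to show the effect on entropy. The values are illustrative and unrelated to the writeup's measurements.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [5.0, 4.0, 3.5, 1.0, 0.0]  # made-up values, not model output
for t in (1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, round(max(probs), 3), round(entropy_bits(probs), 3))
```

Raising T lowers the top token's probability and raises the entropy, but it does so uniformly across the vocabulary rather than only over the valid answer space.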

Acknowledgments

Inspired by exmergo/research-chatgpt-guesses-between-1-and-100, which documents the same not-actually-random behavior in ChatGPT when it is asked to pick a number between 1 and 100.

License

Apache 2.0, matching the base model. This is a personal research artifact, not affiliated with or endorsed by my employer.

Model provider

scasella91


Model tree

Base

Qwen/Qwen3-30B-A3B-Instruct-2507

Adapter

this model

Modalities

Input

Text

Output

Text
