Results
Pairwise Elo (Gemini 3 Flash judge, dual-position)
| Rank | Model | EQ-Bench Creative Elo |
|---|---|---|
| 1 | GPT-5.4 | 1911 |
| 2 | Claude Opus 4.6 | 1783 |
| 3 | POLARIS-9B | 1661 |
| 4 | Gemini 3.1 Pro | 1627 |
| 5 | Gemini 3 Flash | 1620 |
| 6 | Gemma 4 31B | 1514 |
| 7 | Qwen3.5-27B | 1503 |
| 9 | Qwen3.5-9B (base) | 1352 |
Story Quality by requested length (GPT-5.4 judge, 180 held-out prompts)
| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | Length ratio (8–12k) |
|---|---|---|---|---|
| POLARIS-9B | 57.4 | 48.2 | 44.1 | 0.72 |
| Qwen3.5-27B | 51.5 | 38.7 | 24.6 | 0.82 |
| Qwen3.5-9B (base) | 35.1 | 8.7 | −11.8 | 0.88 |
Length ratio is generated / requested word count (1.0 = exact). Gemma 4 31B maintains quality
at long lengths by writing substantially shorter stories than requested; POLARIS-9B is among the few
open-weight models in our comparison that largely avoids quality collapse, length runaway, and
severe under-generation at far-transfer lengths.
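The length ratio defined above is easy to compute for your own generations. The sketch below assumes whitespace-delimited word counting, since the card does not say how words were counted; `length_ratio` is a hypothetical helper name, not part of any released code.

```python
def length_ratio(generated_text: str, requested_words: int) -> float:
    """Generated / requested word count (1.0 = exact)."""
    if requested_words <= 0:
        raise ValueError("requested_words must be positive")
    # Assumption: words are whitespace-delimited; the paper's counting
    # method may differ.
    return len(generated_text.split()) / requested_words


# Synthetic example: a 1440-word output for a 2000-word request.
story = "word " * 1440
print(round(length_ratio(story, 2000), 2))  # 0.72
```

Values well below 1.0 indicate under-generation; values well above 1.0 indicate length runaway.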
Human evaluation (60 prompt–generation pairs, blinded, two annotators)
| Comparison | POLARIS-9B winrate | 95% CI |
|---|---|---|
| vs. Qwen3.5-9B | 67.5% | [55.0, 80.0] |
| vs. Qwen3.5-27B | 51.2% | [38.8, 58.8] |
Annotator comments most often highlight stronger atmosphere, voice, and scene realization
relative to the base model.
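The card does not state how the 95% intervals were computed. As one common option, a percentile bootstrap over per-judgment outcomes can be sketched as follows. The function name, the resample count, and the example data are all assumptions for illustration, not the study's procedure or data.

```python
import random


def bootstrap_winrate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of 0/1 win outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the judgments with replacement and record each winrate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lo, hi


# Illustrative data only: 27 wins in 40 judgments gives a 67.5% winrate.
outcomes = [1] * 27 + [0] * 13
winrate, lo, hi = bootstrap_winrate_ci(outcomes)
print(f"winrate={winrate:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

With samples this small the interval is wide, which is why a winrate near 50% cannot be distinguished from parity.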
Intended Use
- Long-form story generation (short stories, flash fiction, narrative scenes, etc.)
- Creative writing (essays, book reviews, podcast scripts, etc.)
POLARIS-9B is trained on short-story anthology data and transfers well to related narrative
tasks. Within WritingBench, it performs best on the categories closest to its training
distribution: character design, fan fiction, novel manuscript, and podcast scripting.
Out-of-Scope Use
- Factual or knowledge-intensive writing where correctness matters
- Legal, medical, or financial content
- Reproducing or recovering the withheld training stories
Usage
POLARIS-9B uses extended thinking during generation. Enable thinking and provide an adequate
token budget for long stories.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rishanthrajendhran/POLARIS-9B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = (
    "Write a 2000-word story about an archivist who discovers that missing "
    "library books are returning with handwritten notes from the future."
)
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.10,
)
generated = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)
print(generated)
Recommended Generation Settings
These match the settings used in the paper's main evaluation.
| Setting | Value | Notes |
|---|---|---|
| temperature | 0.4–1.0 | Lower temperature (0.4–0.6) is recommended for long-form story writing |
| top_p | 0.95 | |
| top_k | 20 | |
| repetition_penalty | 1.0–1.10 | |
Thinking tokens count toward max_new_tokens, but the thinking trace is stripped before
evaluation. If the model is producing very short stories, increasing max_new_tokens is usually
the first thing to try.
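Because the thinking trace shares the max_new_tokens budget, it can help to separate it from the story and check how many words of story actually came out. The sketch below assumes the decoded text closes its reasoning with a `</think>` tag, as Qwen3-family chat templates do; verify this against the POLARIS-9B tokenizer before relying on it. `split_thinking` is a hypothetical helper.

```python
def split_thinking(decoded: str, end_tag: str = "</think>"):
    """Return (thinking, story) from decoded model output.

    Assumption: reasoning ends at the last `end_tag`. If no tag is found,
    the whole text is treated as the story.
    """
    head, sep, tail = decoded.rpartition(end_tag)
    if not sep:
        return "", decoded.strip()
    # Drop an opening tag if the model emitted one.
    thinking = head.replace("<think>", "", 1).strip()
    return thinking, tail.strip()


decoded = "<think>Plan the arc.</think>\n\nThe archivist opened the ledger."
thinking, story = split_thinking(decoded)
print(len(thinking.split()), "thinking words;", len(story.split()), "story words")
```

If generation is cut off before the closing tag, this helper returns the unfinished reasoning as the story, so an unexpectedly short or plan-like "story" is a sign the budget ran out during thinking.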
Prompting
Include an explicit length request in the prompt. POLARIS-9B was trained with length-stratified
prompts and uses the requested word count to calibrate output length. Example:
Write a 3000-word story about [premise].
At far-transfer lengths (8–12k words), the model undershoots somewhat (length ratio ≈ 0.72
aggregated across the far-OOD bucket). This is still substantially better than much larger open-weight models that
write 0.36× the requested length while appearing to maintain quality scores.
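As a rough planning aid, the prompt pattern and the reported far-transfer ratio can be combined into a small helper. This is a minimal sketch: `build_prompt` and `expected_words` are hypothetical names, and the 0.72 figure is an aggregate over the 8–12k bucket, so it gives only a ballpark for any single prompt and says nothing about shorter requests.

```python
FAR_OOD_RATIO = 0.72  # reported aggregate length ratio for 8–12k-word requests


def build_prompt(premise: str, words: int) -> str:
    """Format a request with an explicit word count, as the card recommends."""
    return f"Write a {words}-word story about {premise}."


def expected_words(requested: int, ratio: float = FAR_OOD_RATIO) -> int:
    """Ballpark output length; only meaningful for far-transfer requests."""
    return round(requested * ratio)


print(build_prompt("a lighthouse keeper who receives letters from the sea", 10000))
print(expected_words(10000))  # 7200
```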
Known Limitations
Stylistic overloading. The model can push too hard on specificity, jargon, or figurative
density, making prose feel effortful to read even when individual sentences are well-crafted.
Annotators flagged this as a recurring pattern.
Local coherence failures. Contradictory details and confusing transitions may appear across
examples, particularly in longer stories. The narrative usually stays on track, but individual
passages may lose logical consistency.
Length undershooting at far transfer. On prompts requesting 8–12k words, the model
generates approximately 72% of the requested length on average. Quality is preserved relative
to other open-weight models, but the full length target is not reliably met.
Story-writing distribution. The training data is short-story anthology fiction (literary
realism, horror/gothic, sci-fi, regional/folk writing). Performance on non-narrative writing
categories (biography, essays, book reviews) is noticeably weaker.
Single-seed training. The reported checkpoint reflects one training run. Seed-to-seed
variance has not been characterized.
Training
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-9B |
| Training algorithm | GRPO |
| Training data | ~1,388 prompt–story pairs from 100 short-story anthologies |
| Max reference length | 4,000 words |
| GPUs | 4× A100 80GB |
| Training time | ~48 hours |
| Compute cost | ~$400 |
| Judge cost | ~$60 (Gemini 3 Flash, flex tier) |
| Training steps |
The human-written stories used in training are derived from commercially purchased anthologies
and are not released. The associated prompt dataset is released separately.
Citation
@misc{rajendhran2026polarisguidingsmallmodels,
  title={POLARIS: Guiding Small Models to Write Long Stories},
  author={Rishanth Rajendhran and Jenna Russell and Mohit Iyyer and John Frederick Wieting},
  year={2026},
  eprint={2606.04095},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.04095},
}