README

License: other

Results

Pairwise Elo (Gemini 3 Flash judge, dual-position)

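| Rank | Model | EQ-Bench Creative Elo |
|---|---|---|
| 1 | GPT-5.4 | 1911 |
| 2 | Claude Opus 4.6 | 1783 |
| 3 | POLARIS-9B | 1661 |
| 4 | Gemini 3.1 Pro | 1627 |
| 5 | Gemini 3 Flash | 1620 |
| 6 | Gemma 4 31B | 1514 |
| 7 | Qwen3.5-27B | 1503 |
| 9 | Qwen3.5-9B (base) | 1352 |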
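
To read the Elo gaps in the table as pairwise preference rates, the following is a minimal sketch. It assumes the leaderboard uses the conventional logistic Elo scale (base 10, 400-point divisor), which the source does not state, so the result is illustrative only. `elo_expected_score` is a hypothetical helper, not part of any released code.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    Assumption: base-10 logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings copied from the table above.
polaris_9b = 1661
qwen_9b_base = 1352

expected = elo_expected_score(polaris_9b, qwen_9b_base)
print(f"Expected preference rate for POLARIS-9B over the base model: {expected:.3f}")
```

Under that assumption, equal ratings give 0.5 and the two directions of a pairing always sum to 1.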

Story Quality by requested length (GPT-5.4 judge, 180 held-out prompts)

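| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | Length ratio (8–12k) |
|---|---|---|---|---|
| POLARIS-9B | 57.4 | 48.2 | 44.1 | 0.72 |
| Qwen3.5-27B | 51.5 | 38.7 | 24.6 | 0.82 |
| Qwen3.5-9B (base) | 35.1 | 8.7 | −11.8 | 0.88 |
| Gemma 4 31B | 53.9 | 49.7 | 47.1 | 0.36 |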

Length ratio is generated / requested word count (1.0 = exact). Gemma 4 31B maintains quality at long lengths by writing substantially shorter stories than requested; POLARIS-9B is among the few open-weight models in our comparison that largely avoids quality collapse, length runaway, and severe under-generation at far-transfer lengths.
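
The length ratio defined above can be computed as in the sketch below. It assumes words are counted by splitting on whitespace; the counter used in the paper may differ, and `length_ratio` is a hypothetical helper name.

```python
def length_ratio(generated_text: str, requested_words: int) -> float:
    """Generated / requested word count; 1.0 means the target was met exactly.

    Assumption: words are whitespace-separated tokens.
    """
    if requested_words <= 0:
        raise ValueError("requested_words must be positive")
    return len(generated_text.split()) / requested_words


# Stand-in for a generated story: 720 words against a 1000-word request.
story = "word " * 720
print(length_ratio(story, 1000))  # prints 0.72
```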

Human evaluation (60 prompt–generation pairs, blinded, two annotators)

| Comparison | POLARIS-9B win rate | 95% CI |
|---|---|---|
| vs. Qwen3.5-9B | 67.5% | [55.0, 80.0] |
| vs. Qwen3.5-27B | 51.2% | [38.8, 58.8] |

Annotator comments most often highlight stronger atmosphere, voice, and scene realization relative to the base model.

Intended Use

  • Long-form story generation (short stories, flash fiction, narrative scenes, etc.)
  • Creative writing (essays, book reviews, podcast scripts, etc.)

POLARIS-9B is trained on short-story anthology data and transfers well to related narrative tasks. Within WritingBench, it performs strongest on categories closest to its training distribution: character design, fan fiction, novel manuscript, and podcast scripting.

Out-of-Scope Use

  • Factual or knowledge-intensive writing where correctness matters
  • Legal, medical, or financial content
  • Reproducing or recovering the withheld training stories

Usage

POLARIS-9B uses extended thinking during generation. Enable thinking and provide an adequate token budget for long stories, since the thinking trace and the story share the same budget.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rishanthrajendhran/POLARIS-9B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = (
    "Write a 2000-word story about an archivist who discovers that missing "
    "library books are returning with handwritten notes from the future."
)
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    # Enable thinking: important for quality
    enable_thinking=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.10,
)

# Decode only the newly generated tokens (drop the prompt)
generated = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
# Strip the thinking trace; keep only the story. Assumes the trace ends with a
# closing </think> tag; if the tag is absent, the full text is kept.
story = generated.split("</think>")[-1].strip()
print(story)
```

Recommended Generation Settings

These match the settings used in the paper's main evaluation.

| Setting | Value | Notes |
|---|---|---|
| temperature | 0.4-1.0 | Lower temperature (0.4-0.6) is recommended for long-form story writing |
| top_p | 0.95 | |
| top_k | 20 | |
| repetition_penalty | 1.0-1.10 | |
| presence_penalty | 0.0-1.5 | Do not set repetition_penalty and presence_penalty together |
| max_new_tokens | 14336 | Minimum recommended for 8–12k target lengths |
| enable_thinking | True | Thinking traces are used at generation time |
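
One way to keep a configuration within the ranges in the table, and to enforce the rule that repetition_penalty and presence_penalty are not set together, is a small validating builder. This is a sketch only: `build_sampling_settings` is a hypothetical helper, the dictionary keys simply mirror the table, and not every inference backend accepts every key.

```python
from typing import Optional


def build_sampling_settings(
    temperature: float = 0.6,
    repetition_penalty: Optional[float] = None,
    presence_penalty: Optional[float] = None,
    max_new_tokens: int = 14336,
) -> dict:
    """Return sampling settings checked against the recommended ranges."""
    if repetition_penalty is not None and presence_penalty is not None:
        raise ValueError("Set repetition_penalty or presence_penalty, not both")
    if not 0.4 <= temperature <= 1.0:
        raise ValueError("temperature outside the recommended 0.4-1.0 range")

    settings = {
        "temperature": temperature,
        "top_p": 0.95,
        "top_k": 20,
        "max_new_tokens": max_new_tokens,
    }
    if repetition_penalty is not None:
        if not 1.0 <= repetition_penalty <= 1.10:
            raise ValueError("repetition_penalty outside the recommended 1.0-1.10 range")
        settings["repetition_penalty"] = repetition_penalty
    if presence_penalty is not None:
        if not 0.0 <= presence_penalty <= 1.5:
            raise ValueError("presence_penalty outside the recommended 0.0-1.5 range")
        settings["presence_penalty"] = presence_penalty
    return settings


print(build_sampling_settings(temperature=0.5, repetition_penalty=1.05))
```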

Thinking tokens count toward max_new_tokens, but the thinking trace is stripped before evaluation. If the model produces very short stories, increasing max_new_tokens is usually the first thing to try.
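
A quick way to tell whether a short story was caused by the token budget is to check if generation stopped at the limit without an end-of-sequence token. The sketch below works on plain lists of token ids; `hit_token_budget` is a hypothetical helper, and it assumes a single unpadded sequence.

```python
from typing import Iterable, Sequence


def hit_token_budget(
    new_token_ids: Sequence[int],
    max_new_tokens: int,
    eos_token_ids: Iterable[int],
) -> bool:
    """True if generation used the whole budget without emitting an end token."""
    if not new_token_ids:
        return False
    used_full_budget = len(new_token_ids) >= max_new_tokens
    ended_cleanly = new_token_ids[-1] in set(eos_token_ids)
    return used_full_budget and not ended_cleanly


# Toy example with made-up token ids; 2 stands in for the end-of-sequence id.
print(hit_token_budget([5, 9, 7, 7], max_new_tokens=4, eos_token_ids=[2]))  # prints True
print(hit_token_budget([5, 9, 7, 2], max_new_tokens=4, eos_token_ids=[2]))  # prints False
```

With a Hugging Face generation call, the new token ids would typically be the output sequence with the prompt tokens sliced off, converted to a list. If the check returns True, the story was probably cut off and a larger max_new_tokens is worth trying.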

Prompting

Include an explicit length request in the prompt. POLARIS-9B was trained with length-stratified prompts and uses the requested word count to calibrate output length. Example:

```markdown
Write a 3000-word story about [premise].
```

At far-transfer lengths (8–12k words), the model undershoots somewhat (length ratio ≈ 0.72, aggregated across the far-OOD bucket). This is still substantially closer to the target than Gemma 4 31B, a much larger open-weight model that writes only 0.36× the requested length while appearing to maintain quality scores.
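
The prompt pattern above can be generated programmatically, along with a label for which evaluated length bucket a request falls into. This is a sketch under assumptions: both function names are hypothetical, and treating each bucket's upper bound as inclusive is a guess, since the source does not say where a request of exactly 4,000 or 8,000 words belongs.

```python
def build_story_prompt(premise: str, requested_words: int) -> str:
    """Format a prompt with an explicit word-count request."""
    if requested_words <= 0:
        raise ValueError("requested_words must be positive")
    return f"Write a {requested_words}-word story about {premise}."


def length_bucket(requested_words: int) -> str:
    """Map a requested length to the buckets used in the results table.

    Assumption: upper bounds are inclusive (4000 -> ID, 8000 -> Near OOD).
    """
    if requested_words < 1000 or requested_words > 12000:
        return "outside evaluated range"
    if requested_words <= 4000:
        return "ID (1-4k)"
    if requested_words <= 8000:
        return "Near OOD (4-8k)"
    return "Far OOD (8-12k)"


print(build_story_prompt("a lighthouse keeper who collects lost radio signals", 3000))
print(length_bucket(3000))   # prints ID (1-4k)
print(length_bucket(10000))  # prints Far OOD (8-12k)
```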

Known Limitations

Stylistic overloading. The model can push too hard on specificity, jargon, or figurative density, making prose feel effortful to read even when individual sentences are well-crafted. Annotators flagged this as a recurring pattern.

Local coherence failures. Contradictory details and confusing transitions can appear, particularly in longer stories. The narrative usually stays on track, but individual passages may lose logical consistency.

Length undershooting at far transfer. On prompts requesting 8–12k words, the model generates approximately 72% of the requested length on average. Quality is preserved relative to other open-weight models, but the full length target is not reliably met.

Story-writing distribution. The training data is short-story anthology fiction (literary realism, horror/gothic, sci-fi, regional/folk writing). Performance on non-narrative writing categories (biography, essays, book reviews) is noticeably weaker.

Single-seed training. The reported checkpoint reflects one training run. Seed-to-seed variance has not been characterized.

Training

| Parameter | Value |
|---|---|
| Base model | Qwen3.5-9B |
| Training algorithm | GRPO |
| Training data | ~1,388 prompt–story pairs from 100 short-story anthologies |
| Max reference length | 4,000 words |
| GPUs | 4× A100 80GB |
| Training time | ~48 hours |
| Compute cost | ~$400 |
| Judge cost | ~$60 (Gemini 3 Flash, flex tier) |
| Training steps | 160 |
| Batch size | 8 GRPO groups |
| Group size | 6 (5 policy rollouts + 1 injected human reference) |
| Online reward judge | Gemini 3 Flash |
| Evaluation judge | GPT-5.4 |

The human-written stories used in training are derived from commercially purchased anthologies and are not released. The associated prompt dataset is released separately.
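
To make the group-size row of the training table concrete, the sketch below shows the group-relative normalization that GRPO is generally built on, applied to a group of 5 policy rollouts plus 1 human reference. The reward values are invented for illustration, `group_relative_advantages` is a hypothetical helper, and using the population standard deviation is a choice made here. How POLARIS actually uses the injected reference in its loss is not described in this card.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge rewards, not real training data.
policy_rewards = [0.2, 0.5, 0.4, 0.1, 0.3]  # 5 policy rollouts
reference_reward = 0.9                      # 1 injected human reference

group = policy_rewards + [reference_reward]
advantages = group_relative_advantages(group)
for reward, advantage in zip(group, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.2f}")
```

In this toy group, the high-scoring reference raises the group mean, so most policy rollouts receive negative advantages.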

Citation

```bibtex
@misc{rajendhran2026polarisguidingsmallmodels,
  title={POLARIS: Guiding Small Models to Write Long Stories},
  author={Rishanth Rajendhran and Jenna Russell and Mohit Iyyer and John Frederick Wieting},
  year={2026},
  eprint={2606.04095},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.04095},
}
```

Model provider

rishanthrajendhran

Model tree

Base

Qwen/Qwen3.5-9B

Fine-tuned

this model

Modalities

Input

Video, Text, Image

Output

Text
