License: other

Comparison with POLARIS-9B

The gap between this model and POLARIS-9B is small at in-distribution lengths and grows at longer requested lengths, consistent with HRI's role in maintaining gradient pressure toward stronger writing as generation extends beyond the training range.

Story Quality by requested length (GPT-5.4 judge, 180 held-out prompts)

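| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | Aggregate | Slope |
|---|---|---|---|---|---|
| POLARIS-9B | 57.4 | 48.2 | 44.1 | 52.1 | −3.0 |
| POLARIS-no-HRI-9B | 56.5 | 47.0 | 37.7 | 49.7 | −3.8 |
| Qwen3.5-9B (base) | 35.1 | 8.7 | −11.8 | 18.5 | −10.8 |
| Qwen3.5-27B | 51.5 | 38.7 | 24.6 | 42.8 | −5.9 |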

Slope is the linear fit across the six length buckets (points per step). A steeper negative slope indicates faster quality degradation as requested length increases.
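As a rough illustration of how such a slope is obtained, the sketch below fits an ordinary least-squares line to the three band means from the Story Quality table. It is a simplification under stated assumptions: the reported slopes use six length buckets that are not listed here, so these per-band values are not expected to match the Slope column. The function name `ols_slope` is hypothetical.

```python
def ols_slope(ys):
    """Least-squares slope of ys against equally spaced steps 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var


# Band means copied from the table: ID, Near OOD, Far OOD.
band_means = {
    "POLARIS-9B": [57.4, 48.2, 44.1],
    "POLARIS-no-HRI-9B": [56.5, 47.0, 37.7],
}

slopes = {name: ols_slope(ys) for name, ys in band_means.items()}
for name, slope in slopes.items():
    print(f"{name}: {slope:.2f} points per band")
```

Even at this coarser granularity, the no-HRI variant shows the more negative slope, which matches the ordering in the Slope column.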

EQ-Bench Longform by requested length (GPT-5.4 judge, uniform aggregation)

| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | Aggregate |
|---|---|---|---|---|
| POLARIS-9B | 63.1 | 57.5 | 54.3 | 59.8 |
| POLARIS-no-HRI-9B | 62.1 | 55.7 | 51.6 | 58.2 |
| Qwen3.5-9B (base) | 50.2 | 37.2 | 30.3 | 42.6 |

Length adherence (generated / requested word count)

| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | All |
|---|---|---|---|---|
| POLARIS-9B | 0.99 | 0.87 | 0.72 | 0.90 |
| POLARIS-no-HRI-9B | 0.94 | 0.86 | 0.70 | 0.87 |
| Qwen3.5-9B (base) | 1.09 | 0.96 | 0.88 | 1.01 |
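The length adherence ratio above is generated word count divided by requested word count. A minimal sketch of that calculation follows, assuming words are counted by splitting on whitespace; the card does not state which tokenization the evaluation used. The helper name `length_ratio` is hypothetical.

```python
def length_ratio(generated_text: str, requested_words: int) -> float:
    """Generated word count divided by requested word count."""
    if requested_words <= 0:
        raise ValueError("requested_words must be positive")
    # Assumption: whitespace splitting approximates the word count.
    return len(generated_text.split()) / requested_words


# Stand-in for a generated story of 1,800 words.
story = "word " * 1800
ratio = length_ratio(story, requested_words=2000)
print(f"{ratio:.2f}")
```

A ratio below 1.0 means the model undershot the requested length, and a ratio above 1.0 means it overshot.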

OOD benchmarks

| Model | WritingBench (D4) | LongBench-Write | EQ-Bench Creative |
|---|---|---|---|
| POLARIS-9B | 7.9 | 81.2 | 70.3 |
| POLARIS-no-HRI-9B | 7.8 | 82.1 | 69.7 |
| Qwen3.5-9B (base) | 6.8 | 67.1 | 59.2 |

On OOD benchmarks the two variants are essentially tied; the HRI advantage is concentrated in story generation at long requested lengths, where narrative coherence and arc completion must be sustained over many thousands of tokens.

Intended Use

  • Long-form story generation (short stories, flash fiction, narrative scenes)
  • Creative writing (essays, book reviews, podcast scripts, etc.)

Out-of-Scope Use

  • Factual or knowledge-intensive writing where correctness matters
  • Legal, medical, or financial content
  • Reproducing or recovering the withheld training stories

Usage

python

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rishanthrajendhran/POLARIS-no-HRI-9B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# State the target length explicitly in the prompt.
prompt = (
    "Write a 2000-word story about an archivist who discovers that missing "
    "library books are returning with handwritten notes from the future."
)
messages = [{"role": "user", "content": prompt}]

# Render the chat template as text, with thinking mode enabled.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.10,
)

# Decode only the newly generated tokens, skipping the prompt.
generated = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(generated)
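With `enable_thinking=True`, the decoded text may contain the model's reasoning ahead of the story. The sketch below separates the two, assuming the reasoning is wrapped in `<think>...</think>` tags that survive decoding; this card does not confirm that format, so treat the tag names and the helper `split_thinking` as hypothetical.

```python
def split_thinking(decoded: str, end_tag: str = "</think>"):
    """Return (reasoning, story); reasoning is '' if end_tag is absent."""
    head, sep, tail = decoded.rpartition(end_tag)
    if not sep:
        # No closing tag found: treat the whole text as the story.
        return "", decoded.strip()
    # Assumption: the reasoning opens with a single <think> tag.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>Plan the arc first.</think>\n\nThe archivist opened the ledger."
reasoning, story = split_thinking(sample)
print(story)
```

Splitting on the last closing tag keeps the story intact even if the reasoning itself mentions the tag.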

Recommended Generation Settings

Identical to POLARIS-9B.

| Setting | Value | Notes |
|---|---|---|
| temperature | 0.4–1.0 | Lower temperatures (0.4–0.6) recommended for long-form story writing |
| top_p | 0.95 | |
| top_k | 20 | |
| repetition_penalty | 1.0–1.10 | |
| presence_penalty | 0.0–1.5 | Do not set repetition_penalty and presence_penalty together |
| max_new_tokens | 14336 | Minimum recommended for 8–12k target lengths |
| enable_thinking | True | |
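One way to keep these recommendations from drifting is to validate them in a small helper before passing them to an inference backend. The sketch below is illustrative only: `build_sampling_config` is a hypothetical name, it returns a plain dictionary, and which backends accept `presence_penalty` is not stated in this card.

```python
def build_sampling_config(temperature=0.6, top_p=0.95, top_k=20,
                          repetition_penalty=None, presence_penalty=None,
                          max_new_tokens=14336):
    """Collect sampling settings and reject combinations the table warns against."""
    if not 0.4 <= temperature <= 1.0:
        raise ValueError("temperature is outside the recommended 0.4-1.0 range")
    if repetition_penalty is not None and presence_penalty is not None:
        raise ValueError("set repetition_penalty or presence_penalty, not both")

    config = {
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "max_new_tokens": max_new_tokens,
    }
    if repetition_penalty is not None:
        if not 1.0 <= repetition_penalty <= 1.10:
            raise ValueError("repetition_penalty is outside 1.0-1.10")
        config["repetition_penalty"] = repetition_penalty
    if presence_penalty is not None:
        if not 0.0 <= presence_penalty <= 1.5:
            raise ValueError("presence_penalty is outside 0.0-1.5")
        config["presence_penalty"] = presence_penalty
    return config


cfg = build_sampling_config(repetition_penalty=1.05)
print(cfg)
```

Leaving both penalties as `None` by default makes the caller choose one deliberately, rather than applying both by accident.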

Prompting

Include an explicit length request in the prompt:

markdown

Write a 3000-word story about [premise].

At far-transfer lengths (8–12k), this model undershoots more than POLARIS-9B (length ratio ≈ 0.70 vs 0.72). For generation targets above 6k words, POLARIS-9B is the recommended variant.
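The prompting advice above can be wrapped in a small helper that builds the length-explicit prompt and flags when the target exceeds the 6k-word threshold. This is a sketch; `build_story_request` and its return fields are hypothetical names, not part of any released tooling.

```python
def build_story_request(premise: str, target_words: int) -> dict:
    """Build a prompt with an explicit length request."""
    if target_words <= 0:
        raise ValueError("target_words must be positive")
    prompt = f"Write a {target_words}-word story about {premise}."
    return {
        "prompt": prompt,
        # The card recommends POLARIS-9B for targets above 6k words.
        "prefer_polaris_9b": target_words > 6000,
    }


request = build_story_request("a lighthouse keeper who hears the sea speak", 3000)
print(request["prompt"])
print(request["prefer_polaris_9b"])
```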

Known Limitations

The same qualitative failure modes present in POLARIS-9B (stylistic overloading and local coherence failures) apply here, since both models share the same base, training data, and reward. This variant has one key additional limitation relative to POLARIS-9B:

Steeper quality degradation at long lengths. Story Quality slope is −3.8 vs −3.0 for POLARIS-9B. At 8–12k words, the gap to POLARIS-9B is 6.4 Story Quality points, compared with about 1 point at in-distribution (0.9) and near-OOD (1.2) lengths. If your use case involves prompts requesting long stories, POLARIS-9B is the better choice.
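The per-band gaps can be checked directly against the Story Quality table. The short sketch below subtracts this model's score from POLARIS-9B's in each length band; the dictionary layout is illustrative only.

```python
# Story Quality scores copied from the comparison table.
story_quality = {
    "POLARIS-9B": {"ID (1-4k)": 57.4, "Near OOD (4-8k)": 48.2, "Far OOD (8-12k)": 44.1},
    "POLARIS-no-HRI-9B": {"ID (1-4k)": 56.5, "Near OOD (4-8k)": 47.0, "Far OOD (8-12k)": 37.7},
}

# Gap = POLARIS-9B minus this model, rounded to one decimal like the table.
gaps = {
    band: round(story_quality["POLARIS-9B"][band]
                - story_quality["POLARIS-no-HRI-9B"][band], 1)
    for band in story_quality["POLARIS-9B"]
}
for band, gap in gaps.items():
    print(f"{band}: {gap}")
```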

Training

Identical to POLARIS-9B except for the group composition.

| Parameter | Value |
|---|---|
| Base model | Qwen3.5-9B |
| Training algorithm | GRPO |
| Training data | ~1,388 prompt–story pairs from 100 short-story anthologies |
| Max reference length | 4,000 words |
| GPUs | 4× A100 80GB |
| Training time | ~48 hours |
| Compute cost | ~$400 |
| Judge cost | ~$60 (Gemini 3 Flash, flex tier) |
| Training steps | 160 |
| Batch size | 8 GRPO groups |
| Group size | 6 policy rollouts (no human reference) |
| HRI | Disabled |
| Online reward judge | Gemini 3 Flash |
| Evaluation judge | GPT-5.4 |
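For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage as it is commonly described: each rollout's reward is normalized against the mean and standard deviation of its own group. This is the generic formulation, not a confirmed copy of the training code for this model. The reward values are invented for illustration, and the group of 6 simply mirrors the group size in the table.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rollouts receive the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge scores for one group of 6 policy rollouts.
rewards = [52.0, 61.5, 47.0, 58.0, 55.5, 49.0]
advantages = group_relative_advantages(rewards)
for reward, adv in zip(rewards, advantages):
    print(f"reward={reward:5.1f}  advantage={adv:+.2f}")
```

Rollouts that score above their group's mean get a positive advantage and those below get a negative one, so the learning signal depends on how a rollout compares with the others in its group.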

Citation

bibtex

@misc{rajendhran2026polarisguidingsmallmodels,
  title={POLARIS: Guiding Small Models to Write Long Stories},
  author={Rishanth Rajendhran and Jenna Russell and Mohit Iyyer and John Frederick Wieting},
  year={2026},
  eprint={2606.04095},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.04095},
}

Model provider

rishanthrajendhran

Model tree

Base: Qwen/Qwen3.5-9B
Fine-tuned: this model

Modalities

Input: Video, Text, Image
Output: Text
