License: other
Comparison with POLARIS-9B
The gap between this model and POLARIS-9B is small at in-distribution lengths and grows at longer requested lengths, consistent with HRI's role in maintaining gradient pressure toward stronger writing as generation extends beyond the training range.
Story Quality by requested length (GPT-5.4 judge, 180 held-out prompts)
| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | Aggregate | Slope |
|---|---|---|---|---|---|
| POLARIS-9B | 57.4 | 48.2 | 44.1 | 52.1 | −3.0 |
| POLARIS-no-HRI-9B | 56.5 | 47.0 | 37.7 | 49.7 | −3.8 |
| Qwen3.5-9B (base) | 35.1 | 8.7 | −11.8 | 18.5 | −10.8 |
| Qwen3.5-27B | 51.5 | 38.7 | 24.6 | 42.8 | −5.9 |
Slope is the linear fit across the six length buckets (points per step). A steeper negative slope indicates faster quality degradation as requested length increases.
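As a rough illustration of how such a slope can be computed, the sketch below fits an ordinary least-squares line to per-bucket scores indexed 0 to 5. The function name `fit_slope` and the example scores are hypothetical: the six per-bucket values behind the table are not listed in this card, and the authors' exact fitting procedure may differ.

```python
def fit_slope(scores):
    """Least-squares slope of scores against bucket index 0, 1, 2, ..."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two buckets to fit a slope")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical scores for six length buckets (NOT taken from the tables).
example_scores = [58.0, 56.0, 50.0, 47.0, 45.0, 43.0]
print(f"slope: {fit_slope(example_scores):.2f} points per step")
```

A more negative result means quality falls faster with each step up in requested length.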
EQ-Bench Longform by requested length (GPT-5.4 judge, uniform aggregation)
| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | Aggregate |
|---|---|---|---|---|
| POLARIS-9B | 63.1 | 57.5 | 54.3 | 59.8 |
| POLARIS-no-HRI-9B | 62.1 | 55.7 | 51.6 | 58.2 |
| Qwen3.5-9B (base) | 50.2 | 37.2 | 30.3 | 42.6 |
Length adherence (generated / requested word count)
| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | All |
|---|---|---|---|---|
| POLARIS-9B | 0.99 | 0.87 | 0.72 | 0.90 |
| POLARIS-no-HRI-9B | 0.94 | 0.86 | 0.70 | 0.87 |
| Qwen3.5-9B (base) | 1.09 | 0.96 | 0.88 | 1.01 |
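The ratio in this table can be approximated for your own generations with a few lines of Python. This is a minimal sketch that assumes words are counted by splitting on whitespace; the helper name `length_ratio` is hypothetical, and the counting method used for the table may differ.

```python
def length_ratio(generated_text, requested_words):
    """Generated word count divided by the requested word count."""
    if requested_words <= 0:
        raise ValueError("requested_words must be positive")
    # Assumption: a "word" is any whitespace-separated token.
    return len(generated_text.split()) / requested_words


# A 1,800-word output for a 2,000-word request undershoots by 10%.
story = "word " * 1800
print(round(length_ratio(story, 2000), 2))  # 0.9
```

Values below 1.0 indicate undershooting the requested length; values above 1.0 indicate overshooting.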
OOD benchmarks
| Model | WritingBench (D4) | LongBench-Write | EQ-Bench Creative |
|---|---|---|---|
| POLARIS-9B | 7.9 | 81.2 | 70.3 |
| POLARIS-no-HRI-9B | 7.8 | 82.1 | 69.7 |
| Qwen3.5-9B (base) | 6.8 | 67.1 | 59.2 |
On OOD benchmarks the two variants are essentially tied; the HRI advantage is concentrated at long requested lengths on the in-distribution story-writing task, where narrative coherence and arc completion are required over many thousands of tokens.
Intended Use
- Long-form story generation (short stories, flash fiction, narrative scenes)
- Creative writing (essays, book reviews, podcast scripts, etc.)
Out-of-Scope Use
- Factual or knowledge-intensive writing where correctness matters
- Legal, medical, or financial content
- Reproducing or recovering the withheld training stories
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rishanthrajendhran/POLARIS-no-HRI-9B"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = (
    "Write a 2000-word story about an archivist who discovers that missing "
    "library books are returning with handwritten notes from the future."
)
messages = [{"role": "user", "content": prompt}]

# Render the chat template as text, with thinking enabled.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.10,
)

# Decode only the newly generated tokens, skipping the prompt.
generated = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)
print(generated)
```
Recommended Generation Settings
Identical to POLARIS-9B.
| Setting | Value | Notes |
|---|---|---|
| temperature | 0.4–1.0 | Lower temperatures (0.4–0.6) recommended for long-form story writing |
| top_p | 0.95 | |
| top_k | 20 | |
| repetition_penalty | 1.0–1.10 | |
| presence_penalty | 0.0–1.5 | Do not set repetition_penalty and presence_penalty together |
| max_new_tokens | 14336 | Minimum recommended for 8–12k target lengths |
| enable_thinking | True | |
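The constraints in the table can be checked before a request is sent. The helper below is a hypothetical sketch, not part of any library: it validates the ranges and rejects configurations that set both penalties. It assumes that 1.0 and 0.0 are the neutral (disabled) values for `repetition_penalty` and `presence_penalty` respectively. Whether a given inference backend accepts every key (for example `presence_penalty`) depends on the stack you use.

```python
def build_sampling_config(temperature=0.6, repetition_penalty=1.0,
                          presence_penalty=0.0, max_new_tokens=14336):
    """Return sampling settings, rejecting values outside the recommended ranges."""
    if not 0.4 <= temperature <= 1.0:
        raise ValueError("temperature should be within 0.4-1.0")
    if not 1.0 <= repetition_penalty <= 1.10:
        raise ValueError("repetition_penalty should be within 1.0-1.10")
    if not 0.0 <= presence_penalty <= 1.5:
        raise ValueError("presence_penalty should be within 0.0-1.5")
    # Assumption: 1.0 and 0.0 mean "penalty disabled" for the respective setting.
    if repetition_penalty != 1.0 and presence_penalty != 0.0:
        raise ValueError("set repetition_penalty or presence_penalty, not both")
    return {
        "temperature": temperature,
        "top_p": 0.95,
        "top_k": 20,
        "repetition_penalty": repetition_penalty,
        "presence_penalty": presence_penalty,
        "max_new_tokens": max_new_tokens,
    }


config = build_sampling_config(temperature=0.5, repetition_penalty=1.05)
print(config)
```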
Prompting
Include an explicit length request in the prompt:
```markdown
Write a 3000-word story about [premise].
```
At far-transfer lengths (8–12k), this model undershoots more than POLARIS-9B (length ratio ≈ 0.70 vs 0.72). For generation targets above 6k words, POLARIS-9B is the recommended variant.
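The prompting guidance above can be wrapped in two small helpers. Both function names are hypothetical. The 6,000-word threshold comes from the recommendation above; returning this model for shorter targets is an assumption of the sketch, since the card does not prescribe a variant below that threshold.

```python
def build_prompt(premise, target_words):
    """Format a story request with an explicit length, as recommended above."""
    return f"Write a {target_words}-word story about {premise}."


def recommend_variant(target_words):
    """Pick a variant name based on the requested length in words."""
    # Targets above 6k words: POLARIS-9B is the recommended variant.
    if target_words > 6000:
        return "POLARIS-9B"
    # Assumption: default to this model for shorter targets.
    return "POLARIS-no-HRI-9B"


print(build_prompt("a lighthouse keeper who hears radio signals from the sea floor", 3000))
print(recommend_variant(9000))
```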
Known Limitations
The same qualitative failure modes present in POLARIS-9B apply here — stylistic overloading and local coherence failures — since both models share the same base, training data, and reward. The key additional limitation of this variant relative to POLARIS-9B:
Steeper quality degradation at long lengths. Story Quality slope is −3.8 vs −3.0 for POLARIS-9B. At 8–12k words, the gap to POLARIS-9B is 6.4 Story Quality points, compared to about 1 point at in-distribution and near-OOD lengths (0.9 and 1.2 respectively). If your use case involves prompts requesting long stories, POLARIS-9B is the better choice.
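The per-bucket gaps can be recomputed directly from the Story Quality table. The snippet below copies the two relevant rows; the helper name `quality_gaps` is hypothetical.

```python
# Story Quality scores copied from the table above.
story_quality = {
    "POLARIS-9B": {"ID": 57.4, "Near OOD": 48.2, "Far OOD": 44.1},
    "POLARIS-no-HRI-9B": {"ID": 56.5, "Near OOD": 47.0, "Far OOD": 37.7},
}


def quality_gaps(table, reference, variant):
    """Reference score minus variant score per length bucket, to one decimal."""
    return {
        bucket: round(table[reference][bucket] - table[variant][bucket], 1)
        for bucket in table[reference]
    }


gaps = quality_gaps(story_quality, "POLARIS-9B", "POLARIS-no-HRI-9B")
print(gaps)  # {'ID': 0.9, 'Near OOD': 1.2, 'Far OOD': 6.4}
```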
Training
Identical to POLARIS-9B except for the group composition.
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-9B |
| Training algorithm | GRPO |
| Training data | ~1,388 prompt–story pairs from 100 short-story anthologies |
| Max reference length | 4,000 words |
| GPUs | 4× A100 80GB |
| Training time | ~48 hours |
| Compute cost | ~$400 |
| Judge cost | ~$60 (Gemini 3 Flash, flex tier) |
| Training steps | 160 |
| Batch size | 8 GRPO groups |
| Group size | 6 policy rollouts (no human reference) |
| HRI | Disabled |
| Online reward judge | Gemini 3 Flash |
| Evaluation judge | GPT-5.4 |
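For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage step in its commonly described form: rewards for rollouts of the same prompt are standardized within the group. This is a generic illustration, not the training code for this model. The reward values are hypothetical, and the use of population standard deviation and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge scores for one group of 6 policy rollouts.
rewards = [62.0, 55.0, 48.0, 70.0, 51.0, 58.0]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:5.1f}  advantage={advantage:+.2f}")
```

Rollouts scoring above the group mean receive positive advantages and are reinforced; those below the mean receive negative advantages.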
Citation
```bibtex
@misc{rajendhran2026polarisguidingsmallmodels,
  title={POLARIS: Guiding Small Models to Write Long Stories},
  author={Rishanth Rajendhran and Jenna Russell and Mohit Iyyer and John Frederick Wieting},
  year={2026},
  eprint={2606.04095},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.04095},
}
```
Model provider: rishanthrajendhran
Model tree: base Qwen/Qwen3.5-9B; fine-tuned: this model
Modalities: input Video, Text, Image; output Text