rishanthrajendhran

POLARIS-no-HRI-9B

License: other

Comparison with POLARIS-9B

The gap between this model and POLARIS-9B is small at in-distribution lengths and grows at longer requested lengths, consistent with HRI's role in maintaining gradient pressure toward stronger writing as generation extends beyond the training range.

Story Quality by requested length (GPT-5.4 judge, 180 held-out prompts)

Model	ID (1–4k)	Near OOD (4–8k)	Far OOD (8–12k)	Aggregate	Slope
POLARIS-9B	57.4	48.2	44.1	52.1	−3.0
POLARIS-no-HRI-9B	56.5	47.0	37.7	49.7	−3.8
Qwen3.5-9B (base)	35.1	8.7	−11.8	18.5	−10.8
Qwen3.5-27B	51.5	38.7	24.6	42.8	−5.9

Slope is the slope of a linear fit across the six length buckets, in points per bucket step. A more negative slope indicates faster quality degradation as requested length increases.
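To make the Slope column concrete, here is a minimal sketch of a per-bucket linear fit. It assumes an ordinary least-squares fit against the bucket index, which this card does not confirm, and the six scores are hypothetical placeholders because the per-bucket values are not listed here.

```python
def quality_slope(bucket_scores):
    """Least-squares slope of score vs. bucket index (points per bucket step)."""
    n = len(bucket_scores)
    if n < 2:
        raise ValueError("need at least two buckets to fit a line")
    mean_x = (n - 1) / 2                      # mean of indices 0..n-1
    mean_y = sum(bucket_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(bucket_scores))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


# Hypothetical scores for six length buckets (NOT taken from the evaluation).
example_scores = [60.0, 57.0, 54.0, 51.0, 48.0, 45.0]
print(quality_slope(example_scores))
```

A model that loses the same number of points at every bucket step has a slope equal to that per-step loss, so slopes are directly comparable across models evaluated on the same buckets.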

EQ-Bench Longform by requested length (GPT-5.4 judge, uniform aggregation)

Model	ID (1–4k)	Near OOD (4–8k)	Far OOD (8–12k)	Aggregate
POLARIS-9B	63.1	57.5	54.3	59.8
POLARIS-no-HRI-9B	62.1	55.7	51.6	58.2
Qwen3.5-9B (base)	50.2	37.2	30.3	42.6

Length adherence (generated / requested word count)

Model	ID (1–4k)	Near OOD (4–8k)	Far OOD (8–12k)	All
POLARIS-9B	0.99	0.87	0.72	0.90
POLARIS-no-HRI-9B	0.94	0.86	0.70	0.87
Qwen3.5-9B (base)	1.09	0.96	0.88	1.01
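The length-adherence ratio can be checked on your own generations with a few lines of Python. This is a sketch that assumes words are counted by splitting on whitespace; the counting method used for the table is not stated in this card.

```python
def length_ratio(generated_text, requested_words):
    """Generated / requested word count, with words split on whitespace."""
    if requested_words <= 0:
        raise ValueError("requested_words must be positive")
    return len(generated_text.split()) / requested_words


# A ratio below 1.0 means the story came out shorter than requested.
print(length_ratio("one two three four", 5))
```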

OOD benchmarks

Model	WritingBench (D4)	LongBench-Write	EQ-Bench Creative
POLARIS-9B	7.9	81.2	70.3
POLARIS-no-HRI-9B	7.8	82.1	69.7
Qwen3.5-9B (base)	6.8	67.1	59.2

On OOD benchmarks the two variants are essentially tied (this variant is slightly ahead on LongBench-Write, 82.1 vs 81.2). The HRI advantage is concentrated in the story-generation evaluations at long requested lengths, where narrative coherence and arc completion must be sustained over many thousands of tokens.

Intended Use

Long-form story generation (short stories, flash fiction, narrative scenes)
Creative writing (essays, book reviews, podcast scripts, etc.)

Out-of-Scope Use

Factual or knowledge-intensive writing where correctness matters
Legal, medical, or financial content
Reproducing or recovering the withheld training stories

Usage

python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rishanthrajendhran/POLARIS-no-HRI-9B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = (
    "Write a 2000-word story about an archivist who discovers that missing "
    "library books are returning with handwritten notes from the future."
)

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.10,
)

generated = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(generated)
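Because the example enables thinking, the decoded text may contain the model's reasoning ahead of the story. The helper below is a sketch that assumes the reasoning ends with a literal `</think>` marker, as in Qwen3-family chat templates; whether that marker survives decoding for this model is not confirmed by this card, so inspect the raw output first.

```python
def split_reasoning(decoded, end_marker="</think>"):
    """Split decoded text into (reasoning, story) at the last end marker.

    The marker string is an assumption; if it is absent, everything is
    treated as the story.
    """
    head, sep, tail = decoded.rpartition(end_marker)
    if not sep:
        return "", decoded.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, story = split_reasoning("<think>plan the arc</think>\n\nOnce upon a time.")
print(story)
```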

Recommended Generation Settings

Identical to POLARIS-9B.

Setting	Value	Notes
`temperature`	0.4–1.0	Lower temperatures (0.4–0.6) recommended for long-form story writing
`top_p`	0.95
`top_k`	20
`repetition_penalty`	1.0–1.10
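If you set sampling parameters programmatically, a small range check can catch values outside the table above. This is an illustrative sketch: `RECOMMENDED` and `check_settings` are hypothetical names, not part of any library, and the ranges are copied from the table.

```python
# Recommended ranges copied from the settings table (low, high), inclusive.
RECOMMENDED = {
    "temperature": (0.4, 1.0),
    "top_p": (0.95, 0.95),
    "top_k": (20, 20),
    "repetition_penalty": (1.0, 1.10),
}


def check_settings(**settings):
    """Return a list of messages for settings outside the recommended ranges."""
    problems = []
    for name, value in settings.items():
        if name not in RECOMMENDED:
            continue  # settings not covered by the table are ignored
        low, high = RECOMMENDED[name]
        if not (low <= value <= high):
            problems.append(f"{name}={value} is outside {low}-{high}")
    return problems


print(check_settings(temperature=0.6, top_p=0.95, top_k=20, repetition_penalty=1.10))
```

Settings that pass the check can be forwarded as keyword arguments to `model.generate`, as in the Usage example.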

Prompting

Include an explicit length request in the prompt:

markdown
Write a 3000-word story about [premise].

At Far OOD lengths (8–12k words), this model undershoots the requested length slightly more than POLARIS-9B (length ratio 0.70 vs 0.72). For generation targets above 6k words, POLARIS-9B is the recommended variant.
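The prompting advice can be wrapped in two small helpers, sketched below. The function names are hypothetical, and the `POLARIS_ID` repository id is an assumption inferred from this model's id; verify it before use.

```python
NO_HRI_ID = "rishanthrajendhran/POLARIS-no-HRI-9B"  # from the Usage section
POLARIS_ID = "rishanthrajendhran/POLARIS-9B"        # assumed repo id, not confirmed here


def build_prompt(premise, words):
    """Build a prompt that states the target length explicitly."""
    return f"Write a {words}-word story about {premise}."


def recommended_variant(words, threshold=6000):
    """Pick POLARIS-9B for targets above the 6k-word threshold."""
    return POLARIS_ID if words > threshold else NO_HRI_ID


print(build_prompt("a lighthouse keeper who collects lost radio signals", 3000))
print(recommended_variant(9000))
```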

Known Limitations

The same qualitative failure modes present in POLARIS-9B (stylistic overloading and local coherence failures) apply here, since both models share the same base, training data, and reward. This variant has one additional limitation relative to POLARIS-9B:

Steeper quality degradation at long lengths. The Story Quality slope is −3.8 vs −3.0 for POLARIS-9B. At 8–12k words, the gap to POLARIS-9B is 6.4 Story Quality points, compared to 0.9 points at in-distribution lengths (1–4k) and 1.2 points at 4–8k. If your use case involves prompts requesting long stories, POLARIS-9B is the better choice.
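The per-range gaps can be recomputed from the Story Quality table earlier in this card. The sketch below copies the two relevant rows; the dictionary layout and function name are illustrative.

```python
# Story Quality scores copied from the comparison table in this card.
STORY_QUALITY = {
    "POLARIS-9B": {"ID": 57.4, "Near OOD": 48.2, "Far OOD": 44.1},
    "POLARIS-no-HRI-9B": {"ID": 56.5, "Near OOD": 47.0, "Far OOD": 37.7},
}


def quality_gaps(scores):
    """POLARIS-9B minus POLARIS-no-HRI-9B, rounded to one decimal, per range."""
    full = scores["POLARIS-9B"]
    no_hri = scores["POLARIS-no-HRI-9B"]
    return {name: round(full[name] - no_hri[name], 1) for name in full}


print(quality_gaps(STORY_QUALITY))
```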

Training

Identical to POLARIS-9B except for the group composition.

Parameter	Value
Base model	Qwen3.5-9B
Training algorithm	GRPO
Training data	~1,388 prompt–story pairs from 100 short-story anthologies
Max reference length	4,000 words
GPUs	4× A100 80GB
Training time	~48 hours
Compute cost	~$400
Judge cost	~$60 (Gemini 3 Flash, flex tier)
Training steps

Citation

bibtex
@misc{rajendhran2026polarisguidingsmallmodels,
      title={POLARIS: Guiding Small Models to Write Long Stories}, 
      author={Rishanth Rajendhran and Jenna Russell and Mohit Iyyer and John Frederick Wieting},
      year={2026},
      eprint={2606.04095},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.04095}, 
}

Model Details

Model Provider

rishanthrajendhran

Model Tree

Base

Qwen/Qwen3.5-9B

Fine-tuned

this model

Input Modalities

Text, Image, Video

Output Modalities

Text
