qwen3-8b-gist-instruction-compression

Qwen3-8B gist-token instruction compression (model organism)

Reproduction of "Learning to Compress Prompts with Gist Tokens" (Mu, Li & Goodman 2023, arXiv:2304.08467) on Qwen/Qwen3-8B, packaged as a model organism for activation-verbalizer (activation-oracle) evals: an entire instruction is compressed into the activations of ONE learned <GIST> token (id 151669).

A 4D attention mask enforces the bottleneck during training AND inference: tokens after the gist cannot attend to the instruction, so the instruction reaches the completion only through the gist token's hidden states. Executable proof: corrupting all instruction K/V entries after prefill leaves greedy generations bit-identical.
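The masking rule and the K/V-corruption check can be sketched in a single toy attention layer with no dependencies. This sketch assumes one gist position and treats every earlier position as instruction (as in Mu et al.). `gist_mask` and `attend` are hypothetical names, not the gist_tokens/ utilities. The sketch builds the boolean query-by-key mask for one sequence; a real implementation would broadcast it to a (batch, 1, seq_len, seq_len) tensor for whatever attention backend the model uses, which is not shown here. It then overwrites the cached keys and values of the instruction positions and confirms that every post-gist output is unchanged.

```python
import math
import random


def gist_mask(seq_len: int, gist_pos: int) -> list[list[bool]]:
    """allowed[q][k] is True iff query position q may attend to key position k.

    Assumes a single gist token at gist_pos and treats every earlier position as
    instruction: causal attention everywhere, except that queries after the gist
    may not see keys before it.
    """
    return [
        [k <= q and not (q > gist_pos and k < gist_pos) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def attend(query, keys, values, allowed_row):
    """One-head scaled dot-product attention for a single query under a boolean mask."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [
        sum(a * b for a, b in zip(query, key)) * scale if ok else -math.inf
        for key, ok in zip(keys, allowed_row)
    ]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # masked keys get exp(-inf) == 0.0
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def random_vectors(n: int, dim: int) -> list[list[float]]:
    return [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]


random.seed(0)
seq_len, gist_pos, dim = 8, 3, 4  # positions 0-2: instruction, 3: <GIST>, 4-7: completion
queries = random_vectors(seq_len, dim)
keys = random_vectors(seq_len, dim)
values = random_vectors(seq_len, dim)
allowed = gist_mask(seq_len, gist_pos)


def post_gist_outputs(keys, values):
    """Attention outputs for every query after the gist token."""
    return [attend(queries[q], keys, values, allowed[q]) for q in range(gist_pos + 1, seq_len)]


clean = post_gist_outputs(keys, values)

# Overwrite the cached K/V of every instruction position (k < gist_pos) with noise,
# mimicking corruption after prefill; the gist's own K/V entry is left untouched.
noise_k, noise_v = random_vectors(gist_pos, dim), random_vectors(gist_pos, dim)
corrupted = post_gist_outputs(noise_k + keys[gist_pos:], noise_v + values[gist_pos:])

print("bit-identical:", clean == corrupted)
```

In the full multi-layer model, the check only makes sense after prefill. The gist token's own deeper-layer states are computed from the instruction K/V, so they must already be cached. The toy model mirrors this by leaving the gist's K/V entry intact.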

LoRA r=64 alpha=128 on all linear modules + trainable <GIST> embedding row (PEFT trainable_token_indices; the row's trained VALUES are stored in the adapter, so loading does not depend on base-row init).
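The adapter settings above can be expressed as a minimal PEFT configuration. This sketch assumes a recent PEFT release whose `LoraConfig` supports `trainable_token_indices`. Dropout, bias handling, and all other fields are left at PEFT defaults because the card does not state them, so treat this as an illustration rather than the training script behind the released adapter.

```python
from peft import LoraConfig

GIST_TOKEN_ID = 151669  # id of the added <GIST> token, as stated on this card

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules="all-linear",  # LoRA on every linear layer (PEFT excludes the output head)
    trainable_token_indices=[GIST_TOKEN_ID],  # also train the <GIST> embedding row only
    task_type="CAUSAL_LM",
)
```

Because only the listed row is trained and its values are saved with the adapter, the base checkpoint's initialization of that row does not affect the loaded model.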
Data: Alpaca+ (Self-Instruct + Alpaca, 128k), 3 epochs, eff. batch 128, lr 1e-4 cosine. Prompt format: Instruction: {instruction}\n<GIST>\n[Input: {input}\n]Output: (the bracketed Input line is present only when the example has an input).
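The prompt template can be sketched as a small formatter. This assumes the Input line is emitted only for a non-empty input, since Alpaca stores missing inputs as empty strings. `format_gist_prompt` is a hypothetical helper and does not come from the repo.

```python
from typing import Optional

GIST = "<GIST>"


def format_gist_prompt(instruction: str, input_text: Optional[str] = None) -> str:
    """Build the gist prompt; the Input line is included only when an input is given."""
    prompt = f"Instruction: {instruction}\n{GIST}\n"
    if input_text:  # None and "" both mean "no input"
        prompt += f"Input: {input_text}\n"
    return prompt + "Output:"


print(format_gist_prompt("Translate to French.", "Good morning"))
```

For scale, if "128k" is read as 128,000 examples, then 3 epochs at an effective batch of 128 comes to 128,000 × 3 / 128 = 3,000 optimizer steps.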
Held-out ROUGE-L (higher is better):
  split    gist mask   full attention (positive control)   no instruction (floor)
  seen     0.559       0.577                               0.212
  unseen   0.548       0.557                               0.252
  human    0.300       0.309                               0.149
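One way to read these scores is as the share of the gap between the no-instruction floor and the full-attention control that survives compression into the gist token. The ratio below is a derived summary computed from the reported numbers, not a metric reported on the card.

```python
ROUGE_L = {
    # split: (gist mask, full attention, no instruction), as reported above
    "seen": (0.559, 0.577, 0.212),
    "unseen": (0.548, 0.557, 0.252),
    "human": (0.300, 0.309, 0.149),
}


def gap_recovered(gist: float, full: float, floor: float) -> float:
    """Fraction of the floor-to-full-attention ROUGE-L gap that the gist mask retains."""
    return (gist - floor) / (full - floor)


recovered = {split: gap_recovered(*scores) for split, scores in ROUGE_L.items()}
for split, frac in recovered.items():
    print(f"{split}: {frac:.1%}")
```

This works out to roughly 95% (seen), 97% (unseen), and 94% (human).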
IMPORTANT for inference: the gist behavior assumes the gist mask (post-gist tokens must not attend to the instruction). Under plain causal attention, the model can still read the instruction directly and bypass the gist. Load the adapter UNMERGED (merge_and_unload adds bf16 rounding noise). Mask utilities and eval code: gist_tokens/ in the activation_oracles_dev repo.
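The bf16 point can be illustrated with a single weight. The sketch below emulates bfloat16 rounding in pure Python by applying round-to-nearest-even to the float32 bit pattern. The weight and delta values are hypothetical. bfloat16 keeps 8 significant bits, so representable values near 1.0 are 2^-7 apart, and folding a small LoRA update into the base weight can round that update away entirely. This shows the merged weight changing, not the size of the effect on this model's generations.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even), via float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-half-to-even on the 16 dropped bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


base_weight = to_bf16(1.0)
lora_delta = to_bf16(0.001)  # hypothetical per-entry LoRA update (B @ A scaled by alpha / r)
merged = to_bf16(base_weight + lora_delta)  # what merging writes back in bf16

print("delta kept separately:", lora_delta)
print("merged weight equals base weight:", merged == base_weight)
```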

Used by the gist_tokens/gist_instruction task in cds-jb/AVBench: the activation oracle (AO) reads the single gist position (token-exact rows) and must recover the compressed held-out instruction.
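A hypothetical helper for locating the position the AO reads is sketched below. It checks that the prompt contains exactly one <GIST> token (id 151669) and returns that position's row from a per-token activation matrix. Which layer AVBench reads and how it stores rows are not specified here.

```python
GIST_TOKEN_ID = 151669


def gist_position(input_ids: list[int]) -> int:
    """Return the index of the single <GIST> token; fail loudly if it is missing or repeated."""
    positions = [i for i, tok in enumerate(input_ids) if tok == GIST_TOKEN_ID]
    if len(positions) != 1:
        raise ValueError(f"expected exactly one <GIST> token, found {len(positions)}")
    return positions[0]


def gist_activation(hidden_states: list[list[float]], input_ids: list[int]) -> list[float]:
    """Select the activation row at the gist position from a (seq_len x d_model) matrix."""
    return hidden_states[gist_position(input_ids)]


ids = [101, 102, GIST_TOKEN_ID, 103]
rows = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
print(gist_position(ids), gist_activation(rows, ids))
```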
