cds-jb
qwen3-8b-gist-instruction-compression
README
Qwen3-8B gist-token instruction compression (model organism)
Reproduction of "Learning to Compress Prompts with Gist Tokens" (Mu, Li &
Goodman 2023, arXiv:2304.08467) on Qwen/Qwen3-8B, packaged as a model organism
for activation-verbalizer (activation-oracle) evals: an entire instruction is
compressed into the activations of ONE learned <GIST> token (id 151669).
A 4D attention mask enforces the bottleneck during training AND inference: tokens after the gist cannot attend the instruction, so the instruction reaches the completion only through the gist token's hidden states. Executable proof: corrupting all instruction K/V after prefill leaves greedy generations bit-identical.
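The bottleneck above can be illustrated with a small, framework-free sketch. This is not the repo's mask utility: `gist_mask`, `attend`, and the toy vectors are hypothetical, and the sketch assumes every position before the gist belongs to the instruction. A real model would expand the boolean grid into a 4D (batch, heads, query, key) mask; here it stays 2D for one attention head. The second half mimics the corruption check on that single head: instruction keys and values are overwritten, and the post-gist outputs are compared.

```python
import math


def gist_mask(seq_len, gist_pos):
    """Return mask[q][k]: True when query position q may attend to key k.

    Causal attention, except that queries after the gist are cut off from
    every key before the gist (assumed here to be the instruction).
    """
    if not 0 <= gist_pos < seq_len:
        raise ValueError("gist_pos must be a valid position in the sequence")
    return [
        [k <= q and not (q > gist_pos and k < gist_pos) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def attend(queries, keys, values, mask):
    """Single-head scaled dot-product attention that skips masked keys."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs = []
    for q_idx, query in enumerate(queries):
        allowed = [k for k in range(len(keys)) if mask[q_idx][k]]
        scores = [
            scale * sum(a * b for a, b in zip(query, keys[k])) for k in allowed
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[k][d] for w, k in zip(weights, allowed)) / total
            for d in range(dim)
        ])
    return outputs


# Toy sequence: positions 0-1 are the instruction, 2 is the gist, 3-5 follow it.
seq_len, gist_pos = 6, 2
mask = gist_mask(seq_len, gist_pos)
vecs = [[0.1 * (i + 1), -0.2 * i, 0.05 * i * i] for i in range(seq_len)]
keys = [row[:] for row in vecs]
values = [row[::-1] for row in vecs]
clean = attend(vecs, keys, values, mask)

# Overwrite the instruction's keys and values, leaving the gist's own K/V intact.
for k in range(gist_pos):
    keys[k] = [1e3, -1e3, 1e3]
    values[k] = [-7.0, 7.0, -7.0]
corrupted = attend(vecs, keys, values, mask)

post_gist_unchanged = clean[gist_pos + 1:] == corrupted[gist_pos + 1:]
print("post-gist outputs unchanged:", post_gist_unchanged)
```

In this toy the gist row itself does change, because its query is recomputed against the corrupted keys. In the setting described above the corruption happens after prefill, when the gist token's states are already cached, so only the post-gist rows are the relevant comparison.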
- LoRA r=64 alpha=128 on all linear modules + trainable <GIST> embedding row
(PEFT trainable_token_indices; the row's trained VALUES are stored in the
adapter, so loading does not depend on base-row init).
- Data: Alpaca+ (Self-Instruct + Alpaca, 128k), 3 epochs, eff. batch 128, lr 1e-4
cosine. Prompt format:
Instruction: {instruction}\n<GIST>\n[Input: {input}\n]Output:
- Held-out ROUGE-L (gist mask vs full-attention positive control vs
no-instruction floor): seen 0.559 / 0.577 / 0.212, unseen 0.548 / 0.557 /
0.252, human 0.300 / 0.309 / 0.149.
- IMPORTANT for inference: the gist behavior assumes the gist mask (post-gist
tokens must not attend the instruction). Under plain causal attention the
model can still read the instruction directly. Load the adapter UNMERGED
(merge_and_unload adds bf16 rounding noise). Mask utilities + eval code:
gist_tokens/ in the activation_oracles_dev repo.
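The prompt format listed above can be assembled with a short helper. This is a minimal sketch, not code from the repo: `build_gist_prompt` is a hypothetical name, the bracketed Input segment is read as optional, and the sketch assumes nothing follows "Output:" before the completion (the exact trailing whitespace is not stated here).

```python
from typing import Optional


def build_gist_prompt(instruction: str, input_text: Optional[str] = None) -> str:
    """Format one example as Instruction / <GIST> / optional Input / Output."""
    prompt = f"Instruction: {instruction}\n<GIST>\n"
    if input_text:
        # The Input line appears only when the example has an input field.
        prompt += f"Input: {input_text}\n"
    return prompt + "Output:"


print(build_gist_prompt("Summarize the text.", "Gist tokens compress prompts."))
```

The string only places the <GIST> token; the bottleneck itself still depends on applying the gist mask when the prompt is run.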
Used by the gist_tokens/gist_instruction task in
cds-jb/AVBench: the AO reads
the single gist position (token-exact rows) and must recover the compressed
held-out instruction.
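Reading "the single gist position" presupposes locating it in the tokenized prompt. A minimal sketch, assuming the prompt is already a list of token ids and using the <GIST> id 151669 given above; `find_gist_position` and the other ids in the example are hypothetical placeholders, not real tokenizer output.

```python
GIST_TOKEN_ID = 151669  # id of the learned <GIST> token, as stated in this README


def find_gist_position(input_ids):
    """Return the index of the one <GIST> token, or raise if it is not unique."""
    positions = [i for i, tok in enumerate(input_ids) if tok == GIST_TOKEN_ID]
    if len(positions) != 1:
        raise ValueError(f"expected exactly one <GIST> token, found {len(positions)}")
    return positions[0]


# Placeholder ids: three instruction tokens, the gist, then two post-gist tokens.
example_ids = [101, 102, 103, GIST_TOKEN_ID, 104, 105]
gist_pos = find_gist_position(example_ids)
print(gist_pos)
```

Raising on zero or multiple matches keeps a malformed prompt from silently yielding activations from the wrong position.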
Model provider: cds-jb
Model tree: Base: Qwen/Qwen3-8B; Adapter: this model
Modalities: Input: Text; Output: Text