SPEED Configuration
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3-8B-Base |
| Model family | qwen3 |
| Adapter checkpoint | true |
| Lower SPEED layers | 30 |
| Prompt prefill mode | lower |
| Upper prompt targets | bos,assistant |
| Context mode | 0 |
| Prefill attention | causal |
| Decode tokens | full-depth |
Installation
Use a CUDA/PyTorch environment suitable for the base model.
pip install "transformers>=4.57,<5" "peft>=0.19,<1" huggingface_hub accelerate safetensors
Install PyTorch separately if your server needs a specific CUDA wheel.
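Before loading the model, it can help to confirm that the installed packages fall inside the version bounds from the pip command above. The following is a minimal sketch using only the standard library; the helper names (`parse_release`, `in_range`, `check_environment`) are illustrative and not part of this repository, and pre-release suffixes such as `rc1` or `.dev0` are ignored for simplicity.

```python
from importlib import metadata

# Bounds copied from the pip command above: (inclusive minimum, exclusive maximum).
REQUIRED = {
    "transformers": ((4, 57), (5,)),
    "peft": ((0, 19), (1,)),
}


def parse_release(version: str) -> tuple:
    """Return the leading numeric components, e.g. '4.57.1' -> (4, 57, 1)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # component such as 'dev0' has no leading number
        parts.append(int(digits))
        if len(digits) != len(piece):
            break  # stop at a suffix such as '0rc1'
    return tuple(parts)


def in_range(version: str, minimum: tuple, maximum: tuple) -> bool:
    """True if minimum <= version < maximum, compared as integer tuples."""
    return minimum <= parse_release(version) < maximum


def check_environment() -> dict:
    """Map each required package to 'ok', 'not installed', or a version warning."""
    report = {}
    for name, (lo, hi) in REQUIRED.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        report[name] = "ok" if in_range(installed, lo, hi) else f"unsupported version {installed}"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```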
Basic SPEED Inference
import sys

import torch
from huggingface_hub import snapshot_download

model_id = "jeongseokoh/Qwen3-8B-Base_SPEED-30-BoS_PrefixBoS"
LOWER_K = 30
SPEED_UPPER_TARGETS = ('bos', 'assistant')

# Download the checkpoint repository and make its bundled
# speed_inference module importable.
repo_dir = snapshot_download(model_id)
sys.path.insert(0, repo_dir)
from speed_inference import load_speed_model

model, tokenizer = load_speed_model(
    repo_dir,
    dtype=torch.bfloat16,
    device_map="auto",
    speed_generate=True,
    speed_layers=LOWER_K,
    speed_attn='causal',
    speed_upper_targets=SPEED_UPPER_TARGETS,
)
model.eval()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

with torch.inference_mode():
    outputs = model.generate(
        speed_generate=True,
        messages=messages,
        lower_k=LOWER_K,
        speed_upper_targets=SPEED_UPPER_TARGETS,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        return_dict_in_generate=True,
    )

# Drop the prompt tokens so that only the generated continuation is decoded.
prompt_len = outputs["prompt_lengths"][0]
generated_ids = outputs["sequences"][0, prompt_len:]
print(tokenizer.decode(generated_ids, skip_special_tokens=True))
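The last three lines of the example slice the prompt off the first returned sequence using `outputs["prompt_lengths"]`. The sketch below shows the same slicing applied to every row, using plain lists as a stand-in for the returned tensor. Whether SPEED's `generate()` accepts more than one conversation per call is not stated here, so treat the multi-row case as an assumption; `split_generated` is a hypothetical helper, not part of the bundled code.

```python
def split_generated(sequences, prompt_lengths):
    """Return, for each row, only the token ids that follow its prompt."""
    return [row[int(n):] for row, n in zip(sequences, prompt_lengths)]


# Toy stand-in for outputs["sequences"] and outputs["prompt_lengths"].
sequences = [
    [101, 7, 8, 9, 42, 43],
    [101, 5, 6, 50, 51, 52],
]
prompt_lengths = [4, 3]

continuations = split_generated(sequences, prompt_lengths)
print(continuations)  # [[42, 43], [50, 51, 52]]
```

Each resulting row can then be passed to `tokenizer.decode(..., skip_special_tokens=True)` as in the example above.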
Document or Long-Context Inference
# Reuses model, tokenizer, LOWER_K, and SPEED_UPPER_TARGETS from the basic example.
question = "What are the key claims in the document?"
document = "..."

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": question},
]

with torch.inference_mode():
    outputs = model.generate(
        speed_generate=True,
        messages=messages,
        context=document,  # the long document is passed separately from the chat messages
        lower_k=LOWER_K,
        speed_upper_targets=SPEED_UPPER_TARGETS,
        max_new_tokens=512,
        do_sample=False,
        return_dict_in_generate=True,
    )

# Decode only the tokens generated after the prompt.
prompt_len = outputs["prompt_lengths"][0]
print(tokenizer.decode(outputs["sequences"][0, prompt_len:], skip_special_tokens=True))
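The basic and document examples differ only in the `context` argument and the sampling settings, so the shared arguments can be assembled in one place. This is a minimal sketch: `build_speed_kwargs` is a hypothetical helper, and it uses only the argument names that appear in the two examples above. It sets `speed_generate=True` last so that a caller cannot switch it off by accident.

```python
def build_speed_kwargs(
    question,
    document=None,
    lower_k=30,
    upper_targets=("bos", "assistant"),
    system_prompt="You are a helpful assistant.",
    **generation_kwargs,
):
    """Assemble keyword arguments for a SPEED generate() call."""
    kwargs = dict(generation_kwargs)  # e.g. max_new_tokens, do_sample, temperature
    kwargs["messages"] = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    if document is not None:
        kwargs["context"] = document  # only sent for document or long-context inference
    kwargs["lower_k"] = lower_k
    kwargs["speed_upper_targets"] = upper_targets
    kwargs["return_dict_in_generate"] = True
    kwargs["speed_generate"] = True  # always forced on, applied last
    return kwargs


doc_kwargs = build_speed_kwargs(
    "What are the key claims in the document?",
    document="...",
    max_new_tokens=512,
    do_sample=False,
)
print(sorted(doc_kwargs))
```

With a loaded model, the call would then read `model.generate(**doc_kwargs)` inside `torch.inference_mode()`.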
Important Notes
- Use snapshot_download() and the bundled speed_inference.load_speed_model() entrypoint as shown above. The original SPEED source repository is not needed on the inference server.
- For adapter checkpoints, do not pass SPEED-only arguments such as speed_generate directly to AutoModelForCausalLM.from_pretrained(model_id, ...); Transformers/PEFT may route that call through the base model class, which does not accept those arguments.
- Always pass speed_generate=True for SPEED inference. Ordinary generate() uses the normal generation path.
- For adapter checkpoints, the base model Qwen/Qwen3-8B-Base must be downloadable from the inference server.
- pipeline("text-generation", ...) is not recommended because SPEED needs structured arguments such as messages, context, and lower_k.
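Because adapter checkpoints need the base model to be downloadable, a quick pre-flight check on the inference server can surface network or access problems before the full load. The sketch below tries to fetch only the base model's `config.json` through `huggingface_hub.hf_hub_download`. The helper name `base_model_reachable` and the injectable `fetch` parameter are illustrative assumptions, and the broad `except` is a deliberate simplification for a diagnostic script.

```python
def base_model_reachable(repo_id="Qwen/Qwen3-8B-Base", fetch=None):
    """Return (ok, message) after trying to fetch a small file from the base model repo."""
    if fetch is None:
        # Imported lazily so the helper can be exercised without huggingface_hub installed.
        from huggingface_hub import hf_hub_download

        def fetch(repo):
            return hf_hub_download(repo_id=repo, filename="config.json")

    try:
        fetch(repo_id)
    except Exception as exc:  # diagnostic only: report any failure instead of raising
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"
```

Calling `base_model_reachable()` with no arguments performs the real download attempt; passing a custom `fetch` callable makes it possible to test the reporting logic offline.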
Bundled Modeling Files
Only the modeling files needed for qwen3 are bundled: