README
SPEED Configuration
| Setting | Value |
|---|---|
| Base model | meta-llama/Llama-3.3-70B-Instruct |
| Model family | llama |
| Adapter checkpoint | true |
| Lower SPEED layers | 60 |
| Prompt prefill mode | lower |
| Upper prompt targets | bos,assistant |
| Context mode | 0 |
| Prefill attention | causal |
| Decode tokens | full-depth |
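The settings above appear to line up with the keyword arguments used in the inference examples in this README. The following is a minimal sketch of that mapping, assuming `Lower SPEED layers` corresponds to `speed_layers`, `Prefill attention` to `speed_attn`, and the comma-separated `Upper prompt targets` to the `speed_upper_targets` tuple. `SPEED_SETTINGS` and `to_load_kwargs` are hypothetical names introduced for illustration and are not part of the bundled code.

```python
# Hypothetical helper: the dict keys and function name are illustrative only.
SPEED_SETTINGS = {
    "lower_speed_layers": 60,
    "upper_prompt_targets": "bos,assistant",
    "prefill_attention": "causal",
}


def to_load_kwargs(settings):
    """Translate table-style settings into SPEED keyword arguments (assumed mapping)."""
    # Split "bos,assistant" into ("bos", "assistant"), ignoring stray whitespace.
    targets = tuple(
        t.strip() for t in settings["upper_prompt_targets"].split(",") if t.strip()
    )
    return {
        "speed_generate": True,
        "speed_layers": settings["lower_speed_layers"],
        "speed_attn": settings["prefill_attention"],
        "speed_upper_targets": targets,
    }


kwargs = to_load_kwargs(SPEED_SETTINGS)
print(kwargs)
```

Keeping the values in one place like this may help avoid the load-time and generate-time arguments drifting apart.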
Installation
Use a CUDA/PyTorch environment suitable for the base model.
bash
pip install "transformers>=4.57,<5" "peft>=0.19,<1" huggingface_hub accelerate safetensors
Install PyTorch separately if your server needs a specific CUDA wheel.
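To confirm that an existing environment matches the version bounds in the pip command, a quick check along these lines may help. It is a simplified sketch using only the standard library: it compares numeric release parts and ignores pre-release suffixes, so it can disagree with pip for versions such as `4.57.0.dev0`. The helper names (`parse`, `in_range`, `check_installed`) are illustrative.

```python
from importlib import metadata

# Bounds copied from the pip command: (inclusive lower, exclusive upper).
BOUNDS = {"transformers": ("4.57", "5"), "peft": ("0.19", "1")}


def parse(version):
    """Turn '4.57.1' into (4, 57, 1); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def in_range(version, lower, upper):
    """True if lower <= version < upper, comparing numeric release parts only."""
    return parse(lower) <= parse(version) < parse(upper)


def check_installed(bounds=BOUNDS):
    """Report 'ok', 'out of range (x.y.z)' or 'missing' for each package."""
    report = {}
    for name, (lower, upper) in bounds.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = in_range(found, lower, upper)
        report[name] = "ok" if ok else f"out of range ({found})"
    return report


if __name__ == "__main__":
    print(check_installed())
```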
Basic SPEED Inference
python
import sys

import torch
from huggingface_hub import snapshot_download

model_id = "jeongseokoh/Llama-3.3-70B-Instruct_SPEED-60-BoS"
LOWER_K = 60
SPEED_UPPER_TARGETS = ('bos', 'assistant')

repo_dir = snapshot_download(model_id)
sys.path.insert(0, repo_dir)
from speed_inference import load_speed_model

model, tokenizer = load_speed_model(
    repo_dir,
    dtype=torch.bfloat16,
    device_map="auto",
    speed_generate=True,
    speed_layers=LOWER_K,
    speed_attn='causal',
    speed_upper_targets=SPEED_UPPER_TARGETS,
)
model.eval()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

with torch.inference_mode():
    outputs = model.generate(
        speed_generate=True,
        messages=messages,
        lower_k=LOWER_K,
        speed_upper_targets=SPEED_UPPER_TARGETS,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        return_dict_in_generate=True,
    )

prompt_len = outputs["prompt_lengths"][0]
generated_ids = outputs["sequences"][0, prompt_len:]
print(tokenizer.decode(generated_ids, skip_special_tokens=True))
Document or Long-Context Inference
python
question = "What are the key claims in the document?"
document = "..."  # long document text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": question},
]

with torch.inference_mode():
    outputs = model.generate(
        speed_generate=True,
        messages=messages,
        context=document,
        lower_k=LOWER_K,
        speed_upper_targets=SPEED_UPPER_TARGETS,
        max_new_tokens=512,
        do_sample=False,
        return_dict_in_generate=True,
    )

prompt_len = outputs["prompt_lengths"][0]
print(tokenizer.decode(outputs["sequences"][0, prompt_len:], skip_special_tokens=True))
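Both examples drop the prompt tokens by slicing each sequence at its entry in `prompt_lengths`. The same slicing could be applied to several rows at once, as sketched below. This assumes `outputs["sequences"]` has one row per input and `outputs["prompt_lengths"]` has one length per row, which the single-input examples suggest but this README does not confirm for batches. `split_generated` is a hypothetical helper, and plain lists stand in for token-id tensors.

```python
def split_generated(sequences, prompt_lengths):
    """Return the generated-token suffix of each row, dropping its prompt prefix."""
    if len(sequences) != len(prompt_lengths):
        raise ValueError("expected one prompt length per sequence")
    return [row[int(n):] for row, n in zip(sequences, prompt_lengths)]


# Plain lists stand in for token-id tensors here.
fake_outputs = {
    "sequences": [[1, 2, 3, 10, 11], [1, 2, 20, 21, 22]],
    "prompt_lengths": [3, 2],
}
generated = split_generated(fake_outputs["sequences"], fake_outputs["prompt_lengths"])
print(generated)  # [[10, 11], [20, 21, 22]]
```

Padding tokens, if any are present in a batch, are not handled by this sketch.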
Important Notes
- Use `snapshot_download()` and the bundled `speed_inference.load_speed_model()` entrypoint as shown above. The original SPEED source repository is not needed on the inference server.
- For adapter checkpoints, do not pass SPEED-only arguments such as `speed_generate` directly to `AutoModelForCausalLM.from_pretrained(model_id, ...)`; Transformers/PEFT may route that call through the base model class, which does not accept those arguments.
- Always pass `speed_generate=True` for SPEED inference. Ordinary `generate()` uses the normal generation path.
- For adapter checkpoints, the base model `meta-llama/Llama-3.3-70B-Instruct` must be downloadable from the inference server.
- `pipeline("text-generation", ...)` is not recommended because SPEED needs structured arguments such as `messages`, `context`, and `lower_k`.
- vLLM serving is not covered by this upload artifact.
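Because an ordinary `generate()` call takes the normal generation path, one way to avoid forgetting `speed_generate=True` is a thin wrapper that always sets it. The sketch below is illustrative only: `speed_generate_call` is a hypothetical helper, not part of the bundled code, and it assumes `model.generate()` accepts the keyword arguments used in the examples above. A stub model that echoes its arguments stands in for the real one.

```python
def speed_generate_call(model, messages, lower_k, speed_upper_targets,
                        context=None, **gen_kwargs):
    """Call model.generate() on the SPEED path with the structured arguments set."""
    # Caller-supplied generation options come first so the SPEED arguments
    # below always take precedence.
    call_kwargs = {"return_dict_in_generate": True, **gen_kwargs}
    call_kwargs.update(
        speed_generate=True,
        messages=messages,
        lower_k=lower_k,
        speed_upper_targets=speed_upper_targets,
    )
    if context is not None:
        call_kwargs["context"] = context
    return model.generate(**call_kwargs)


class EchoModel:
    """Stand-in for the real model: returns the keyword arguments it received."""

    def generate(self, **kwargs):
        return kwargs


sent = speed_generate_call(
    EchoModel(),
    messages=[{"role": "user", "content": "Hi"}],
    lower_k=60,
    speed_upper_targets=("bos", "assistant"),
    max_new_tokens=32,
)
print(sorted(sent))
```

With the real model, the wrapper would be called the same way, passing `context=document` for long-context inputs.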
Bundled Modeling Files
Only the modeling files needed for llama are bundled:
modeling_speed_llama.py
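Before loading, it may be worth confirming that the downloaded snapshot actually contains the bundled files. This is a minimal sketch that assumes both files sit at the top level of the snapshot directory: `speed_inference.py` is implied by the import in the inference examples, and `modeling_speed_llama.py` is the file named above. `missing_bundled_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed layout: both files at the root of the downloaded snapshot.
REQUIRED_FILES = ("speed_inference.py", "modeling_speed_llama.py")


def missing_bundled_files(repo_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present under repo_dir."""
    root = Path(repo_dir)
    return [name for name in required if not (root / name).is_file()]
```

For example, `missing_bundled_files(repo_dir)` could be called right after `snapshot_download(model_id)`; an empty list means both files were found.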
Model provider: jeongseokoh
Model tree: base meta-llama/Llama-3.3-70B-Instruct; adapter: this model
Modalities: text input, text output