README

SPEED Configuration

| Setting | Value |
| --- | --- |
| Base model | meta-llama/Llama-3.3-70B-Instruct |
| Model family | llama |
| Adapter checkpoint | true |
| Lower SPEED layers | 60 |
| Prompt prefill mode | lower |
| Upper prompt targets | bos,assistant |
| Context mode | 0 |
| Prefill attention | causal |
| Decode tokens | full-depth |
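
Three of the settings above reappear as loader arguments in the Basic SPEED Inference example. The sketch below shows one way to keep them in a single place. The dictionary keys and the `loader_kwargs` helper are hypothetical names made up for illustration; only the resulting keyword names (`speed_layers`, `speed_attn`, `speed_upper_targets`) come from this README, and the mapping from table rows to keywords is an assumption based on the matching values.

```python
# Values copied from the SPEED Configuration table.
# The key names here are illustrative, not part of the checkpoint.
SPEED_CONFIG = {
    "lower_speed_layers": 60,
    "prefill_attention": "causal",
    "upper_prompt_targets": "bos,assistant",
}


def loader_kwargs(config):
    """Translate table values into the keywords passed to load_speed_model()."""
    # "bos,assistant" -> ("bos", "assistant")
    targets = tuple(
        part.strip()
        for part in config["upper_prompt_targets"].split(",")
        if part.strip()
    )
    return {
        "speed_layers": int(config["lower_speed_layers"]),
        "speed_attn": config["prefill_attention"],
        "speed_upper_targets": targets,
    }


print(loader_kwargs(SPEED_CONFIG))
```

The returned dictionary can be unpacked into the loader call with `**`, so the table and the code cannot drift apart silently.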

Installation

Use a CUDA/PyTorch environment suitable for the base model.

bash

pip install "transformers>=4.57,<5" "peft>=0.19,<1" huggingface_hub accelerate safetensors

Install PyTorch separately if your server needs a specific CUDA wheel.
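
To confirm that an existing environment already satisfies the version bounds in the pip command, a small standard-library check along these lines may help. It is a rough sketch: it compares only the leading numeric release parts, so a pre-release such as `5.0.0rc1` is treated like `5.0.0`. The helper names `release_tuple` and `in_range` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Turn '4.57.1' or '0.19.0.dev0' into a tuple of leading integer parts."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def in_range(version_string, minimum, below):
    """True if minimum <= version < below, comparing release tuples."""
    return minimum <= release_tuple(version_string) < below


# Bounds taken from the pip command: transformers>=4.57,<5 and peft>=0.19,<1.
REQUIREMENTS = {"transformers": ((4, 57), (5,)), "peft": ((0, 19), (1,))}

for name, (minimum, below) in REQUIREMENTS.items():
    try:
        found = version(name)
    except PackageNotFoundError:
        print(f"{name}: not installed")
        continue
    status = "ok" if in_range(found, minimum, below) else "outside the pinned range"
    print(f"{name} {found}: {status}")
```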

Basic SPEED Inference

python

import sys

import torch
from huggingface_hub import snapshot_download

model_id = "jeongseokoh/Llama-3.3-70B-Instruct_SPEED-60-BoS"
LOWER_K = 60
SPEED_UPPER_TARGETS = ('bos', 'assistant')

# Download the repository and make its bundled speed_inference module importable.
repo_dir = snapshot_download(model_id)
sys.path.insert(0, repo_dir)
from speed_inference import load_speed_model

model, tokenizer = load_speed_model(
    repo_dir,
    dtype=torch.bfloat16,
    device_map="auto",
    speed_generate=True,
    speed_layers=LOWER_K,
    speed_attn='causal',
    speed_upper_targets=SPEED_UPPER_TARGETS,
)
model.eval()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

with torch.inference_mode():
    outputs = model.generate(
        speed_generate=True,
        messages=messages,
        lower_k=LOWER_K,
        speed_upper_targets=SPEED_UPPER_TARGETS,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        return_dict_in_generate=True,
    )

# Drop the prompt tokens and decode only the newly generated ones.
prompt_len = outputs["prompt_lengths"][0]
generated_ids = outputs["sequences"][0, prompt_len:]
print(tokenizer.decode(generated_ids, skip_special_tokens=True))
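
The last three lines of the example slice off the prompt for the first sequence only. For more than one sequence, the same slicing can be applied per row, as in the sketch below. It assumes that `outputs["sequences"]` and `outputs["prompt_lengths"]` are index-aligned, as the example suggests; how batched rows are padded is not stated in this README, so the toy data here uses plain lists. `strip_prompts` is a hypothetical helper name.

```python
def strip_prompts(sequences, prompt_lengths):
    """Return only the newly generated token ids for each row in a batch."""
    if len(sequences) != len(prompt_lengths):
        raise ValueError("need one prompt length per sequence")
    # Each row keeps everything after its own prompt length.
    return [row[int(n):] for row, n in zip(sequences, prompt_lengths)]


# Toy data standing in for outputs["sequences"] and outputs["prompt_lengths"].
toy_sequences = [[1, 11, 12, 101, 102], [1, 21, 201, 202, 203]]
toy_prompt_lengths = [3, 2]
print(strip_prompts(toy_sequences, toy_prompt_lengths))
# prints [[101, 102], [201, 202, 203]]
```

Each returned row can then be passed to `tokenizer.decode(..., skip_special_tokens=True)` as in the example.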

Document or Long-Context Inference

python

question = "What are the key claims in the document?"
document = "..."  # long document text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": question},
]

with torch.inference_mode():
    outputs = model.generate(
        speed_generate=True,
        messages=messages,
        context=document,  # the document is passed separately from the chat messages
        lower_k=LOWER_K,
        speed_upper_targets=SPEED_UPPER_TARGETS,
        max_new_tokens=512,
        do_sample=False,  # greedy decoding
        return_dict_in_generate=True,
    )

# Decode only the tokens generated after the prompt.
prompt_len = outputs["prompt_lengths"][0]
print(tokenizer.decode(outputs["sequences"][0, prompt_len:], skip_special_tokens=True))
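
The two `generate()` calls on this page differ only in whether `context` is passed and whether sampling is enabled. One way to keep them consistent is to build the keyword arguments in a single function, as sketched below. `build_generate_kwargs` is a hypothetical helper, not part of the bundled code; the argument names and default values it emits are copied from the two examples, and the `max_new_tokens` default of 256 is simply the value from the basic example.

```python
def build_generate_kwargs(messages, context=None, sample=False, max_new_tokens=256,
                          lower_k=60, upper_targets=("bos", "assistant")):
    """Assemble keyword arguments matching the generate() calls shown on this page."""
    kwargs = {
        "speed_generate": True,
        "messages": messages,
        "lower_k": lower_k,
        "speed_upper_targets": tuple(upper_targets),
        "max_new_tokens": max_new_tokens,
        "do_sample": sample,
        "return_dict_in_generate": True,
    }
    if context is not None:
        # Long-document case: the text travels in `context`, not in the messages.
        kwargs["context"] = context
    if sample:
        # Sampling settings from the basic example.
        kwargs.update(temperature=0.6, top_p=0.95, top_k=20)
    return kwargs


chat = [{"role": "user", "content": "What are the key claims in the document?"}]
greedy = build_generate_kwargs(chat, context="...", max_new_tokens=512)
print(sorted(greedy))
```

With a loaded model, the result would be used as `model.generate(**greedy)` inside `torch.inference_mode()`.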

Important Notes

  • Use snapshot_download() and the bundled speed_inference.load_speed_model() entrypoint as shown above. The original SPEED source repository is not needed on the inference server.
  • For adapter checkpoints, do not pass SPEED-only arguments such as speed_generate directly to AutoModelForCausalLM.from_pretrained(model_id, ...); Transformers/PEFT may route that call through the base model class, which does not accept those arguments.
  • Always pass speed_generate=True to generate() for SPEED inference. A generate() call without it uses the normal generation path.
  • For adapter checkpoints, the base model meta-llama/Llama-3.3-70B-Instruct must be downloadable from the inference server.
  • pipeline("text-generation", ...) is not recommended because SPEED needs structured arguments such as messages, context, and lower_k.
  • vLLM serving is not covered by this upload artifact.

Bundled Modeling Files

Only the modeling files needed for llama are bundled:

  • modeling_speed_llama.py

Model provider

jeongseokoh

Model tree

Base: meta-llama/Llama-3.3-70B-Instruct
Adapter: this model

Modalities

Input: Text
Output: Text
