jeongseokoh

Llama-3.3-70B-Instruct_SPEED-60-BoS

SPEED Configuration

| Setting | Value |
| --- | --- |
| Base model | `meta-llama/Llama-3.3-70B-Instruct` |
| Model family | `llama` |
| Adapter checkpoint | `true` |
| Lower SPEED layers | `60` |
| Prompt prefill mode | `lower` |
| Upper prompt targets | `bos,assistant` |
| Context mode | `0` |
| Prefill attention | `causal` |
| Decode tokens | full-depth |

Installation

Use a CUDA/PyTorch environment suitable for the base model.

```bash
pip install "transformers>=4.57,<5" "peft>=0.19,<1" huggingface_hub accelerate safetensors
```

Install PyTorch separately if your server needs a specific CUDA wheel.
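Before loading a 70B model, it can help to confirm that the installed packages fall inside the version bounds from the `pip` command. The following is a minimal sketch using only the standard library. The helper names (`release_tuple`, `in_range`, `check_environment`) are illustrative and not part of this repository, and the comparison looks at release numbers only, so pre-release suffixes such as `rc1` are ignored.

```python
from importlib import metadata

# Bounds copied from the pip command: (minimum inclusive, maximum exclusive).
REQUIRED = {
    "transformers": ((4, 57), (5,)),
    "peft": ((0, 19), (1,)),
}


def release_tuple(version: str) -> tuple:
    """Turn '4.57.1' or '0.19.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def in_range(version: str, minimum: tuple, maximum: tuple) -> bool:
    """True when minimum <= version < maximum, comparing release numbers only."""
    return minimum <= release_tuple(version) < maximum


def check_environment() -> dict:
    """Map each required package to 'ok', 'missing', or 'out of range (<version>)'."""
    report = {}
    for name, (minimum, maximum) in REQUIRED.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = in_range(installed, minimum, maximum)
        report[name] = "ok" if ok else f"out of range ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```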

Basic SPEED Inference

```python
import sys
import torch
from huggingface_hub import snapshot_download

model_id = "jeongseokoh/Llama-3.3-70B-Instruct_SPEED-60-BoS"
LOWER_K = 60  # "Lower SPEED layers" from the configuration table
SPEED_UPPER_TARGETS = ("bos", "assistant")  # "Upper prompt targets"

# Download the repository and make its bundled speed_inference module importable.
repo_dir = snapshot_download(model_id)
sys.path.insert(0, repo_dir)

from speed_inference import load_speed_model

model, tokenizer = load_speed_model(
    repo_dir,
    dtype=torch.bfloat16,
    device_map="auto",
    speed_generate=True,
    speed_layers=LOWER_K,
    speed_attn="causal",
    speed_upper_targets=SPEED_UPPER_TARGETS,
)
model.eval()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

with torch.inference_mode():
    outputs = model.generate(
        speed_generate=True,
        messages=messages,
        lower_k=LOWER_K,
        speed_upper_targets=SPEED_UPPER_TARGETS,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        return_dict_in_generate=True,
    )

# Drop the prompt tokens so that only the newly generated text is decoded.
prompt_len = outputs["prompt_lengths"][0]
generated_ids = outputs["sequences"][0, prompt_len:]
print(tokenizer.decode(generated_ids, skip_special_tokens=True))
```
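The example above decodes only the first row of the output. If `outputs` ever holds more than one sequence, the same slicing can be applied per row. This is a minimal sketch that assumes `outputs["prompt_lengths"]` carries one entry per row of `outputs["sequences"]`, which the example suggests but this README does not state. Whether the SPEED path accepts batched input is not confirmed here either. `decode_generated` is a hypothetical helper, not part of the bundled code.

```python
def decode_generated(outputs, tokenizer):
    """Return one decoded completion per row, with that row's prompt tokens removed.

    Assumption: outputs["prompt_lengths"][i] is the prompt length of
    outputs["sequences"][i], as in the single-row example.
    """
    texts = []
    for row, prompt_len in zip(outputs["sequences"], outputs["prompt_lengths"]):
        new_tokens = row[int(prompt_len):]  # int() also accepts a 0-d tensor
        texts.append(tokenizer.decode(new_tokens, skip_special_tokens=True))
    return texts
```

With the single-prompt call shown earlier, `decode_generated(outputs, tokenizer)[0]` should give the same text as the final `print` line.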

Document or Long-Context Inference

```python
# Reuses torch, model, tokenizer, LOWER_K, and SPEED_UPPER_TARGETS from the previous example.
question = "What are the key claims in the document?"
document = "..."  # long document text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": question},
]

with torch.inference_mode():
    outputs = model.generate(
        speed_generate=True,
        messages=messages,
        context=document,  # the document is passed separately from the chat messages
        lower_k=LOWER_K,
        speed_upper_targets=SPEED_UPPER_TARGETS,
        max_new_tokens=512,
        do_sample=False,  # greedy decoding
        return_dict_in_generate=True,
    )

prompt_len = outputs["prompt_lengths"][0]
print(tokenizer.decode(outputs["sequences"][0, prompt_len:], skip_special_tokens=True))
```
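When several questions target the same document, the call above can be wrapped in a loop. The sketch below reuses only the arguments shown in this README. `build_messages` and `ask_document` are hypothetical helper names. It does not assume that any document computation is cached between calls, so each question may reprocess the full context.

```python
def build_messages(question, system_prompt="You are a helpful assistant."):
    """Build the system + user message list used in the example above."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


def ask_document(model, tokenizer, document, questions,
                 lower_k, upper_targets, max_new_tokens=512):
    """Ask each question against the same document and return {question: answer}."""
    answers = {}
    for question in questions:
        outputs = model.generate(
            speed_generate=True,
            messages=build_messages(question),
            context=document,
            lower_k=lower_k,
            speed_upper_targets=upper_targets,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            return_dict_in_generate=True,
        )
        # Keep only the tokens produced after the prompt.
        prompt_len = int(outputs["prompt_lengths"][0])
        new_tokens = outputs["sequences"][0][prompt_len:]
        answers[question] = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return answers
```

A call would look like `ask_document(model, tokenizer, document, ["..."], LOWER_K, SPEED_UPPER_TARGETS)`, wrapped in `torch.inference_mode()` as in the examples above.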

Important Notes

- Use `snapshot_download()` and the bundled `speed_inference.load_speed_model()` entrypoint as shown above. The original SPEED source repository is not needed on the inference server.
- For adapter checkpoints, do not pass SPEED-only arguments such as `speed_generate` directly to `AutoModelForCausalLM.from_pretrained(model_id, ...)`; Transformers/PEFT may route that call through the base model class, which does not accept those arguments.
- Always pass `speed_generate=True` for SPEED inference. Without it, `generate()` uses the normal generation path.
- For adapter checkpoints, the base model `meta-llama/Llama-3.3-70B-Instruct` must be downloadable from the inference server.
- `pipeline("text-generation", ...)` is not recommended because SPEED needs structured arguments such as `messages` and `context`.
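Because a `generate()` call without `speed_generate=True` falls back to the normal path rather than failing, a small guard can catch that mistake early. The following is a minimal sketch. `checked_speed_kwargs` is a hypothetical helper, and the check that `messages` is a list of role/content dicts is an assumption based on the examples in this README, not a documented contract.

```python
def checked_speed_kwargs(**kwargs):
    """Validate keyword arguments meant for a SPEED model.generate(...) call.

    Returns the kwargs unchanged, or raises if the call would not take
    the SPEED path or if messages is not in the structured chat form.
    """
    if kwargs.get("speed_generate") is not True:
        raise ValueError("pass speed_generate=True; otherwise generate() uses the normal path")

    messages = kwargs.get("messages")
    structured = isinstance(messages, list) and all(
        isinstance(m, dict) and {"role", "content"} <= set(m) for m in messages
    )
    if not structured:
        raise TypeError("messages must be a list of {'role': ..., 'content': ...} dicts")
    return kwargs
```

It would be used as `model.generate(**checked_speed_kwargs(speed_generate=True, messages=messages, lower_k=LOWER_K, speed_upper_targets=SPEED_UPPER_TARGETS))`.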

Bundled Modeling Files

Only the modeling files needed for `llama` are bundled:

- `modeling_speed_llama.py`
