Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Model details

Base modelaws-prototyping/MegaBeam-Mistral-7B-512k
ArchitectureMistralForCausalLM (7B, 32 layers, GQA with 8 KV heads)
Max context524,288 tokens (rope_theta = 7.5e7)
QuantizationFP8 W8A8 — weights and input activations
StrategyPer-tensor, static, symmetric (minmax observer)
KV cacheFP8 (per-tensor, static, symmetric)
Ignored moduleslm_head (kept in higher precision)
Formatcompressed-tensors (float-quantized, v0.13.0)
Produced withllm-compressor

The full quantization recipe is included in recipe.yaml.

Usage

vLLM (recommended)

bash

vllm serve JongYeop/MegaBeam-Mistral-7B-512k-FP8-W8A8 \
--kv-cache-dtype fp8 \
--max-model-len 524288

python

from vllm import LLM, SamplingParams
llm = LLM(model="JongYeop/MegaBeam-Mistral-7B-512k-FP8-W8A8", kv_cache_dtype="fp8")
out = llm.generate("Summarize the following document:\n...", SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)

Transformers

python

from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "JongYeop/MegaBeam-Mistral-7B-512k-FP8-W8A8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=128)[0]))

Note: FP8 W8A8 inference requires an FP8-capable GPU (NVIDIA Hopper / Ada Lovelace or newer, compute capability ≥ 8.9) and a recent compressed-tensors runtime.

Quantization reproduction

python

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
recipe = QuantizationModifier(
targets="Linear",
scheme="FP8", # W8A8 per-tensor float
ignore=["lm_head"],
kv_cache_scheme={"num_bits": 8, "type": "float", "symmetric": True, "strategy": "tensor"},
)
oneshot(model="aws-prototyping/MegaBeam-Mistral-7B-512k", recipe=recipe)

(See recipe.yaml for the exact configuration used.)

License

Inherits the Apache 2.0 license of the base model. Please also review the base model's card for intended use and limitations.

Model provider

JongYeop

JongYeop

Model tree

Base

aws-prototyping/MegaBeam-Mistral-7B-512k

Quantized

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today