JongYeop/MegaBeam-Mistral-7B-512k-FP8-W8A8 API & Inference Endpoint

Model details


Base model	`aws-prototyping/MegaBeam-Mistral-7B-512k`
Architecture	`MistralForCausalLM` (7B, 32 layers, GQA with 8 KV heads)
Max context	524,288 tokens (`rope_theta = 7.5e7`)
Quantization	FP8 W8A8 — weights and input activations
Strategy	Per-tensor, static, symmetric (`minmax` observer)
KV cache	FP8 (per-tensor, static, symmetric)
Ignored modules	`lm_head` (kept in higher precision)
Format	`compressed-tensors` (`float-quantized`, v0.13.0)
Produced with	`llm-compressor`

The full quantization recipe is included in recipe.yaml.

Usage

vLLM (recommended)

bash
vllm serve JongYeop/MegaBeam-Mistral-7B-512k-FP8-W8A8 \
    --kv-cache-dtype fp8 \
    --max-model-len 524288

python
from vllm import LLM, SamplingParams

llm = LLM(model="JongYeop/MegaBeam-Mistral-7B-512k-FP8-W8A8", kv_cache_dtype="fp8")
out = llm.generate("Summarize the following document:\n...", SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)

Transformers

python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/MegaBeam-Mistral-7B-512k-FP8-W8A8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=128)[0]))

Note: FP8 W8A8 inference requires an FP8-capable GPU (NVIDIA Hopper / Ada Lovelace or newer, compute capability ≥ 8.9) and a recent compressed-tensors runtime.

Quantization reproduction

python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8",            # W8A8 per-tensor float
    ignore=["lm_head"],
    kv_cache_scheme={"num_bits": 8, "type": "float", "symmetric": True, "strategy": "tensor"},
)

oneshot(model="aws-prototyping/MegaBeam-Mistral-7B-512k", recipe=recipe)

(See recipe.yaml for the exact configuration used.)

License

Inherits the Apache 2.0 license of the base model. Please also review the base model's card for intended use and limitations.

MegaBeam-Mistral-7B-512k-FP8-W8A8

Get help setting up a custom Dedicated Endpoints.

README