Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Model details
| Base model | aws-prototyping/MegaBeam-Mistral-7B-512k |
| Architecture | MistralForCausalLM (7B, 32 layers, GQA with 8 KV heads) |
| Max context | 524,288 tokens (rope_theta = 7.5e7) |
| Quantization | FP8 W8A8 — weights and input activations |
| Strategy | Per-tensor, static, symmetric (minmax observer) |
| KV cache | FP8 (per-tensor, static, symmetric) |
| Ignored modules | lm_head (kept in higher precision) |
| Format | compressed-tensors (float-quantized, v0.13.0) |
| Produced with | llm-compressor |
The full quantization recipe is included in recipe.yaml.
Usage
vLLM (recommended)
bash
vllm serve JongYeop/MegaBeam-Mistral-7B-512k-FP8-W8A8 \--kv-cache-dtype fp8 \--max-model-len 524288
python
from vllm import LLM, SamplingParamsllm = LLM(model="JongYeop/MegaBeam-Mistral-7B-512k-FP8-W8A8", kv_cache_dtype="fp8")out = llm.generate("Summarize the following document:\n...", SamplingParams(max_tokens=256))print(out[0].outputs[0].text)
Transformers
python
from transformers import AutoModelForCausalLM, AutoTokenizermodel_id = "JongYeop/MegaBeam-Mistral-7B-512k-FP8-W8A8"tokenizer = AutoTokenizer.from_pretrained(model_id)model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")messages = [{"role": "user", "content": "Hello!"}]inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)print(tokenizer.decode(model.generate(inputs, max_new_tokens=128)[0]))
Note: FP8 W8A8 inference requires an FP8-capable GPU (NVIDIA Hopper / Ada Lovelace or newer, compute capability ≥ 8.9) and a recent
compressed-tensorsruntime.
Quantization reproduction
python
from llmcompressor import oneshotfrom llmcompressor.modifiers.quantization import QuantizationModifierrecipe = QuantizationModifier(targets="Linear",scheme="FP8", # W8A8 per-tensor floatignore=["lm_head"],kv_cache_scheme={"num_bits": 8, "type": "float", "symmetric": True, "strategy": "tensor"},)oneshot(model="aws-prototyping/MegaBeam-Mistral-7B-512k", recipe=recipe)
(See recipe.yaml for the exact configuration used.)
License
Inherits the Apache 2.0 license of the base model. Please also review the base model's card for intended use and limitations.
Model provider
JongYeop
Model tree
Base
aws-prototyping/MegaBeam-Mistral-7B-512k
Quantized
this model
Modalities
Input
Text
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information