Jackrong

Qwopus3.6-27B-Coder-FP8

Quantization

Source model: Jackrong/Qwopus3.6-27B-Coder
Output format: Hugging Face safetensors
Quantization: FP8 E4M3, dynamic activations, 128x128 weight blocks
Runtime target: vLLM FP8 loading path
MTP tensors: preserved; 7 MTP projection weights quantized to FP8 and indexed in mtp.safetensors
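To make the "128x128 weight blocks" setting concrete, here is a minimal sketch of how one scale per tile can be derived for an E4M3 target. It is an illustration of the general block-scaling idea, not the script used to produce this checkpoint: the function names `block_scales` and `scaled_values` are hypothetical, the actual FP8 rounding step is omitted, and the amax/448 scaling convention is an assumption.

```python
E4M3_MAX = 448.0  # largest finite magnitude in float8 E4M3 (the "fn" variant)


def block_scales(weight, block=128):
    """Return one scale per (block x block) tile of a 2-D weight.

    `weight` is a list of equal-length rows of floats. Edge tiles may be
    smaller than `block` when the shape is not a multiple of it.
    Hypothetical helper: assumes scale = tile_amax / E4M3_MAX.
    """
    rows, cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            amax = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            # An all-zero tile gets scale 1.0 to avoid division by zero.
            row_scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
        scales.append(row_scales)
    return scales


def scaled_values(weight, scales, block=128):
    """Divide each element by its tile's scale; results lie in [-448, 448].

    A real quantizer would then round these values to FP8 (not done here).
    """
    return [
        [w / scales[r // block][c // block] for c, w in enumerate(row)]
        for r, row in enumerate(weight)
    ]
```

Under these assumptions, a 256x384 weight would carry a 2x3 grid of scales, and multiplying a stored value by its tile's scale recovers an approximation of the original weight.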

Validation

Validated locally on GB10 with vLLM before upload.

vLLM smoke test: passed; the model produced a normal recursive Python factorial function (garbled=false, has_answer=true)
Structural check: all 64 gate_proj.weight_scale_inv tensors present
Dtype check: language-model and MTP gate_proj.weight tensors are torch.float8_e4m3fn
30-question no-thinking batch validation: 30/30 completed, empty outputs 0, dangerous repetition flags 0, control-character flags 0
Validation artifacts: test_data/vllm_fp8_30q_no_think_reviewed_report.md and JSON results
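A structural check like the one listed above could be sketched as follows. This is not the validation script itself: the helper name `missing_gate_scale_layers` and the tensor-name prefixes in the pattern are assumptions, and the real checkpoint may name its tensors differently.

```python
import re

# Assumed naming for language-model layers; adjust the prefix to match the
# actual checkpoint. Names under other prefixes (for example MTP) are ignored.
_GATE_SCALE = re.compile(
    r"^model\.(?:language_model\.)?layers\.(\d+)\.mlp\.gate_proj\.weight_scale_inv$"
)


def missing_gate_scale_layers(tensor_names, num_layers=64):
    """Return sorted layer indices that lack a gate_proj.weight_scale_inv tensor."""
    found = set()
    for name in tensor_names:
        match = _GATE_SCALE.match(name)
        if match:
            found.add(int(match.group(1)))
    return sorted(set(range(num_layers)) - found)
```

For a sharded checkpoint, the tensor names can typically be taken from the keys of the `weight_map` object in `model.safetensors.index.json`; an empty return value means every expected layer has its scale tensor.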

Note: some validation prompts reached the 768-token cap because the answers were long; the reviewed outputs showed no garbled text, empty responses, or mechanical looping.
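The kinds of flags reported in the batch validation (empty outputs, repetition, control characters) can be approximated with simple heuristics. The sketch below is a hypothetical stand-in, not the checker used for the reported results: the function name `output_flags` and the thresholds `ngram` and `max_repeats` are assumptions chosen for illustration.

```python
import unicodedata


def output_flags(text, ngram=8, max_repeats=4):
    """Return heuristic quality flags for one generated answer.

    Hypothetical thresholds: an answer is flagged for repetition when any
    window of `ngram` consecutive words occurs at least `max_repeats` times.
    """
    words = text.split()
    counts = {}
    for i in range(len(words) - ngram + 1):
        key = tuple(words[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1

    # Control characters other than ordinary whitespace usually indicate
    # decoding problems; newlines, carriage returns, and tabs are allowed.
    has_control = any(
        unicodedata.category(ch) == "Cc" and ch not in "\n\r\t" for ch in text
    )
    return {
        "empty": not text.strip(),
        "control_chars": has_control,
        "repetition": any(n >= max_repeats for n in counts.values()),
    }
```

Heuristics like these only narrow down which outputs need a human look; a long answer cut off at the token cap is not flagged, which matches the note above that truncated answers were reviewed manually.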

Loading Example

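```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint by its Hugging Face model ID.
llm = LLM(
    model="Jackrong/Qwopus3.6-27B-Coder-FP8",
    trust_remote_code=True,
    max_model_len=8192,  # cap the context length to limit KV-cache memory
)

# Generate one completion of up to 256 new tokens and print its text.
outputs = llm.generate(["Write a Python factorial function."], SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```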