groxaxo/Code-Writer-V2-Obliterated API & Inference Endpoint

The pitch, in one breath

A vision-capable, long-context (up to 200,000 tokens), free writer-and-coder — quantized to FP8 so it runs on a pair of consumer GPUs without surrendering the spark. It writes prose that breathes and code that compiles, and it does both on hardware you can reach out and touch.

That is the whole idea. Everything below is just how we kept the promise.

What it is

Code Writer V2 — Obliterated is an FP8-Dynamic quantization of Qwen3.5-27B-Writer-V2-uncensored-heretic, merged with a purpose-trained coding LoRA (coding_mix_8k, checkpoint-25, rank-16 / alpha-32) and cast down to 8-bit floating point with surgical care.

Architecture: Qwen3.5 (qwen3_5) — a hybrid mind. 64 decoder layers, of which only 16 carry full attention while the rest run GDN linear attention. This is the secret of its long memory.
Modalities: a full vision tower rides along in BF16 (served text-only by default; vision is wired but untested — light the candle at your own pleasure).
Character: heretic by lineage and free by intent — it does not flinch, and it does not lecture. It simply does the work.

The craft beneath the curtain

Genius, said one famous man, is in the details. Here are ours — the parts most quantizations get wrong, and the parts we refused to:

We quantized only what should be quantized. The 256 text-model Linear layers (q/k/v/o_proj on the full-attention layers; gate/up/down_proj everywhere) became channel-wise FP8 weights with dynamic per-token activations — calibration-free, no dataset, no drift. Every one of them is 64-aligned, so it loads through vLLM's FP8 Marlin (W8A16) kernels on Ampere and newer.

We kept sacred what must stay whole. The lm_head, the entire GDN linear-attention subtree, and the whole vision tower remain in BF16. An earlier attempt quantized them by accident and the dimensions (2152, 48) shattered Marlin on Ampere. We learned. The recipe now guards them with regex, not hope: ignore: [lm_head, "re:.*linear_attn.*", "re:.*visual.*"].

The result is the rarest thing in this field: a quantization that is smaller, faster, and still itself.

Serving it (validated)

Built and smoke-tested on vLLM 0.19.1:

bash
vllm serve groxaxo/Code-Writer-V2-Obliterated \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.92 \
  --reasoning-parser qwen3 \
  --disable-custom-all-reduce

A few hard-won truths:

Tensor parallel must be 2 (or 4). num_key_value_heads = 4 is not divisible by 3 — TP=3 is invalid.
200k context fits because only 16 of 64 layers grow their KV cache, and the KV cache itself is FP8. Expect ~1 full-length request in flight at once; shorter prompts pack far more densely.
No MTP head, no native tool-calling — this is a pure decoder, layers 0–63.

Sampling (official Qwen3.5-27B recommendations)

Table
Mode	temp	top_p	notes
instruct	1.0	0.95	top_k 20, min_p 0
general	0.7	0.80	top_k 20, min_p 0
coding	0.6	0.95	thinking on
thinking	1.0	0.95	thinking on
roleplay	1.0	0.95	top_k 20, min_p 0

What it's for

Writing — fiction, screenplay, copy, the long dark prose of the soul.
Code — the LoRA was trained for it; the temperament was kept for it.
Long work — 200k tokens means whole codebases, whole manuscripts, whole conversations held in a single thought.

What to know before you sail

It is free. Freedom is a tool; you are the hand that holds it. You own what you make with it.
Vision is present but unproven here — validate an image path before you trust it in production.
FP8 is faithful, not identical. For a golden reference, the BF16 parent stands behind it.

Provenance

Base: llmfan46/Qwen3.5-27B-Writer-V2-uncensored-heretic (BF16)
LoRA: coding_mix_8k checkpoint-25 (r16, α32), merged to BF16
Quant: llmcompressor 0.12.0 —
markdown
```
QuantizationModifier(targets=Linear, scheme=FP8_DYNAMIC)
```
, compressed-tensors float-quantized
Built: 2026-06-22

Real artists ship. So we shipped a poet that codes.

Now go make something.

Code-Writer-V2-Obliterated

Get help setting up a custom Dedicated Endpoints.

README