groxaxo

Code-Writer-V2-Obliterated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

The pitch, in one breath

A vision-capable, long-context (up to 200,000 tokens), free writer-and-coder — quantized to FP8 so it runs on a pair of consumer GPUs without surrendering the spark. It writes prose that breathes and code that compiles, and it does both on hardware you can reach out and touch.

That is the whole idea. Everything below is just how we kept the promise.


What it is

Code Writer V2 — Obliterated is an FP8-Dynamic quantization of Qwen3.5-27B-Writer-V2-uncensored-heretic, merged with a purpose-trained coding LoRA (coding_mix_8k, checkpoint-25, rank-16 / alpha-32) and cast down to 8-bit floating point with surgical care.

  • Architecture: Qwen3.5 (qwen3_5) — a hybrid mind. 64 decoder layers, of which only 16 carry full attention while the rest run GDN linear attention. This is the secret of its long memory.
  • Modalities: a full vision tower rides along in BF16 (served text-only by default; vision is wired but untested — light the candle at your own pleasure).
  • Character: heretic by lineage and free by intent — it does not flinch, and it does not lecture. It simply does the work.

The craft beneath the curtain

Genius, said one famous man, is in the details. Here are ours — the parts most quantizations get wrong, and the parts we refused to:

We quantized only what should be quantized. The 256 text-model Linear layers (q/k/v/o_proj on the full-attention layers; gate/up/down_proj everywhere) became channel-wise FP8 weights with dynamic per-token activations — calibration-free, no dataset, no drift. Every one of them is 64-aligned, so it loads through vLLM's FP8 Marlin (W8A16) kernels on Ampere and newer.

We kept sacred what must stay whole. The lm_head, the entire GDN linear-attention subtree, and the whole vision tower remain in BF16. An earlier attempt quantized them by accident and the dimensions (2152, 48) shattered Marlin on Ampere. We learned. The recipe now guards them with regex, not hope: ignore: [lm_head, "re:.*linear_attn.*", "re:.*visual.*"].

The result is the rarest thing in this field: a quantization that is smaller, faster, and still itself.


Serving it (validated)

Built and smoke-tested on vLLM 0.19.1:

bash

vllm serve groxaxo/Code-Writer-V2-Obliterated \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--kv-cache-dtype fp8 \
--max-model-len 200000 \
--gpu-memory-utilization 0.92 \
--reasoning-parser qwen3 \
--disable-custom-all-reduce

A few hard-won truths:

  • Tensor parallel must be 2 (or 4). num_key_value_heads = 4 is not divisible by 3 — TP=3 is invalid.
  • 200k context fits because only 16 of 64 layers grow their KV cache, and the KV cache itself is FP8. Expect ~1 full-length request in flight at once; shorter prompts pack far more densely.
  • No MTP head, no native tool-calling — this is a pure decoder, layers 0–63.

Sampling (official Qwen3.5-27B recommendations)

Table
Modetemptop_pnotes
instruct1.00.95top_k 20, min_p 0
general0.70.80top_k 20, min_p 0
coding0.60.95thinking on
thinking1.00.95thinking on
roleplay1.00.95top_k 20, min_p 0

What it's for

  • Writing — fiction, screenplay, copy, the long dark prose of the soul.
  • Code — the LoRA was trained for it; the temperament was kept for it.
  • Long work — 200k tokens means whole codebases, whole manuscripts, whole conversations held in a single thought.

What to know before you sail

  • It is free. Freedom is a tool; you are the hand that holds it. You own what you make with it.
  • Vision is present but unproven here — validate an image path before you trust it in production.
  • FP8 is faithful, not identical. For a golden reference, the BF16 parent stands behind it.

Provenance

  • Base: llmfan46/Qwen3.5-27B-Writer-V2-uncensored-heretic (BF16)
  • LoRA: coding_mix_8k checkpoint-25 (r16, α32), merged to BF16
  • Quant: llmcompressor 0.12.0 —

    markdown

    QuantizationModifier(targets=Linear, scheme=FP8_DYNAMIC)
    , compressed-tensors float-quantized
  • Built: 2026-06-22

Real artists ship. So we shipped a poet that codes.

Now go make something.

Model provider

groxaxo

Model tree

Base

llmfan46/Qwen3.5-27B-Writer-V2-uncensored-heretic

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today