palmfuture

Qwopus3.6-27B-v2-GPTQ-Int4

README

License: apache-2.0

Quality

Table with columns: Metric, Value
Metric	Value
GPTQ success rate	100%
RTN fallback rate	0%
Loss mean	1.51e-04
Loss max	8.65e-04
Total modules	400

What's quantized vs kept bf16

Quantized (int4, g32 uniform):

Self-attention: q_proj, k_proj, v_proj, o_proj
Linear-attention (GDN/Mamba hybrid): in_proj_qkv, in_proj_z, out_proj
MLP: gate_proj, up_proj, down_proj

Kept bf16 (per Qwen3.6 FP8 recipe):

Linear-attention state dynamics: A_log, conv1d, dt_bias, in_proj_b, in_proj_a, in_proj_ba, linear_attn.norm
Attention norms: q_norm, k_norm
Layer norms: input_layernorm, post_attention_layernorm
Multi-token prediction head: mtp.* (15 keys)
Vision encoder: model.visual.* (333 keys — ViT blocks for image input)

Calibration recipe

Domain-mixed calibration set (same as base model quantization):

Table with columns: Source, Samples, Purpose
Source	Samples	Purpose
allenai/c4	102	General English text
allenai/tulu-3-sft-mixture	77	Instruction-following
codeparrot/codeparrot-clean-valid	51	Code generation
HuggingFaceH4/MATH-500

Verified compatibility

Tested on vLLM 0.21.0 with TP=4 on 4× RTX 3060 12GB:

Table with columns: Test, Result
Test	Result
Model load	✓ no shape mismatch
GPU KV cache pool	554,419 tokens
Max concurrency @ 262K context	2.11×
Multi-step reasoning	✓
Code generation	✓
`<think>` tag format	✓ preserved

Hardware used for quantization

GPUs: 4× NVIDIA RTX 3060 12GB
Motherboard: SuperMicro C9X299-RPGF (LGA 2066)
CPU: Intel i9-7900X (10c/20t, Skylake-X)
RAM: 32 GB DDR4-2666
Runtime: ~49 minutes wall-clock (identical to base model quantization)

Toolchain

Table with columns: Component, Version
Component	Version
GPTQModel	7.0.0
Transformers	5.9.0
PyTorch	2.12.0+cu130
TorchAO	0.17.0
Triton	3.7.0
Flash Linear Attention (FLA)	0.5.0
Python	3.13.13 (free-threading, no-GIL)
CUDA	13.0
cuDNN	9.2.0

Attribution

This quantization is a derivative work that builds on:

Jackrong/Qwopus3.6-27B-v2 — the reasoning-enhanced finetune via Trace Inversion + Curriculum SFT. All credit for reasoning quality and <think> tag training goes to Jackrong.
Kyle Hessling — hardware collaboration on the original Qwopus finetune.
Qwen Team (Alibaba DAMO) — base model Qwen3.6-27B.
palmfuture — GPTQ Int4 quantization recipe (uniform g32, vLLM-validated).

The original Qwopus3.6-27B-v2 model card by Jackrong follows below.

💡 1. Base Model, Training Library & Cooperation

[!TIP] Vision & Tool Calling Support: Qwopus3.6-27B-v2 natively supports vision and tool-use capabilities. To enable vision functionality, download mmproj.gguf from the GGUF Repository and place it in the same directory as the main .gguf file.

[!WARNING] Community Release Notice: Qwopus3.6-27B-v2 is an experimental community release and has not undergone complete safety evaluations or standard benchmarking. It is intended solely for research and exploration.

📖 2. Background & Motivation

⚡ 3. Reasoning Efficiency & MTP Speedup

📊 4. Evaluation & Benchmarks

🗺️ 5. Training & Data Pipeline Overview

The training process fuses Trace Inversion data augmentation with a Three-Stage Curriculum Learning pipeline. The core engineering focuses on expanding context length gradually while training on reconstructed reasoning traces to guarantee format stability.

text
[ 🗺️ Trace Inversion: Reconstructing Distillation Workflow ]

  A. Surrogate Model Training (Trace Inverter)
     Open-source Model (GLM-5.1 / DS-V4) ──► Complete Reasoning Chain ──► [ Qwen3-235B Compression ] ──► Reasoning Bubbles
                                              │                                   │
                                              └──────────► [ Training ] ◄─────────┘
                                                   (Base: Qwen3-4B-Instruct)
                                                   (Result: Trace-Inverter-4B)

  B. Inversion Phase: Reconstructing Claude-4.7-Max
     _______________________________________________________
    |                                                       |
    |  Claude-4.7-Max API ──► Compressed Bubbles + Answer   |
    |_______________________________________________________|
                      │
                      ▼
    [ 🧠 Trace-Inverter-4B (Logic Reconstructor) ] ──► Synthetic Deep Reasoning Trace (Learnable CoT)
                      │
                      ▼
    [ 🧩 Data Splicing ] ◄────────── (Original Prompt + Response)
    (Embed reconstructed CoT in <think> tags, splicing with original prompt/response)
                      │
                      ▼
             (Result: claude-opus-4.6/4.7 inverted sets)

  C. Final SFT Curriculum Pipeline
     ___________________________________________
    |                                           |
    |          Base Model (Qwen3.6-27B)         |
    |___________________________________________|
                      │
                      ▼
    [ 📦 Phase 1: Format Inception ] ──► [ 🛠️ Phase 2: Complexity Expansion ] ──► [ 🚀 Phase 3: Long-Context SFT ]
      ( < 4096 tokens )                     ( 4096 - 8192 tokens )                 ( 8192 - 32K tokens )
      (Short-context stable format)         (Medium-complexity reasoning)          (Long/Multi-turn / 10% replay)
                      │                                                                       │
                      └─────────────────────────────┬─────────────────────────────────────────┘
                                                    ▼
                                   _____________________________________________
                                  |                                             |
                                  |   🌟 Final Model: Qwopus3.6-27B-v2          |
                                  |_____________________________________________|

🎯 6. Three-Stage Curriculum Learning

To steadily scale up the reasoning quality under long-context inference, Qwopus3.6-27B-v2 adopts a Curriculum Learning strategy, progressively mixing longer and more complex reasoning templates:

🎨 7. Trace Inversion Case Studies (5 Key Domains Showcase)

To demonstrate how Trace Inversion reconstructs logical continuity and eliminates negative entropy, the following interactive panels show the contrast between raw compressed "Reasoning Bubbles" and the fully step-by-step reconstructed chain-of-thought (Learnable CoT) under 5 typical scenarios:

📐 Domain 1: Mathematics (Probability Calculation)

🚀 Domain 2: Physics (Kinematics)

💻 Domain 3: Coding (Algorithm Logic)

🧠 Domain 4: Logical Reasoning (Syllogism)

💡 Domain 5: Core Theory (Reasoning Bubble vs. Learnable CoT)

🤝 8. Collaboration & Training Details

This model is a collaborative milestone achieved with hardware engineer Kyle Hessling. You can follow him on X / Twitter: @KyleHessling1 to keep up with the latest hardware infrastructure and distributed training updates. 🙏

⚠️ 9. Known Training & Deployment Issues (IMPORTANT)

While the 27B dense model architecture is relatively stable, certain low-level framework compatibility issues may still surface during large-scale parameter updates and complex long-context training. It is highly recommended to monitor the following technical risk points during secondary fine-tuning and deployment:

[!CAUTION] Local Fine-Tuning & Deployment Warning: If you attempt to run secondary fine-tuning or merge adapter weights locally, please proceed with caution and be prepared to manually patch model definition files or pin dependency versions strictly.

📚 10. Resources & Guides

👉 GitHub Repository: Jackrong-llm-finetuning-guide Access the repository to dive into the codebase and reproduce our results locally or on Google Colab.

🙏 11. Acknowledgements

Special thanks to:

The Qwen team for providing the powerful Qwen3.6 base model.
Unsloth for providing the highly efficient fine-tuning framework.
Open-source datasets and community contributors.
Kyle Hessling for the close collaboration on this project.

📖 12. Citation

bibtex
@misc{jackrong_qwopus36_27b_v2,
  title        = {Qwopus3.6-27B-v2},
  author       = {Jackrong},
  year         = {2026},
  publisher    = {Hugging Face}
}

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider