iamrahulreddy/Quintus API & Inference Endpoint

Core Technical Points

Dense KD signal: the final training path streams the teacher's full vocabulary distribution live instead of relying on sparse cached top-k logits.
Base-student strategy: the student starts from Qwen/Qwen3-1.7B-Base, leaving more room for distillation before assistant-format tuning.
Assistant-only supervision: prompt text, chat headers, separators, and padding are masked out of the supervised target region.
Sequence packing: deterministic first-fit decreasing packing improves useful-token throughput at 4096-token context length.
Public benchmark controls: raw/chat prompt format, metric extraction, generation budget, and artifact hygiene are documented explicitly.
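
The assistant-only supervision point above can be illustrated with a minimal sketch. It assumes the assistant reply spans have already been located as half-open index ranges (how they are found depends on the chat template) and uses `-100` as the ignore label, which is the default `ignore_index` of PyTorch's cross-entropy loss. `build_labels` and `assistant_spans` are hypothetical names for illustration, not the repository's API.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch-style cross-entropy


def build_labels(token_ids, assistant_spans):
    """Return labels that supervise only tokens inside assistant spans.

    token_ids: list[int] for one tokenized conversation.
    assistant_spans: list of (start, end) half-open ranges covering
        assistant reply tokens (hypothetical input for this sketch).
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in assistant_spans:
        if not 0 <= start <= end <= len(token_ids):
            raise ValueError(f"span {(start, end)} is outside the sequence")
        # Copy the token ids back in only where the assistant is speaking.
        labels[start:end] = token_ids[start:end]
    return labels


# Toy sequence: positions 0-3 are prompt/header tokens, 4-6 the reply, 7 padding.
token_ids = [11, 12, 13, 14, 21, 22, 23, 0]
labels = build_labels(token_ids, [(4, 7)])
print(labels)
```

Everything outside the reply span keeps the ignore label, so prompt text, headers, separators, and padding contribute nothing to the loss.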

Training Summary

The release training path is a two-stage pipeline:

Online KD: train the 1.7B base student against live teacher logits from a Qwen3-8B teacher.
Targeted SFT: tune the distilled checkpoint for assistant-style interaction, persona consistency, and repetition control.

Reuse As A KD Framework

Quintus is released as a trained 1.7B assistant, but the repository is also a reusable reference pipeline for compact-model distillation. The same structure can be adapted to other teacher/student pairs with changes to the model IDs, tokenizer, dataset source, local paths, sequence length, batch schedule, and hardware-specific memory settings in configs/config.yaml.

The reusable pieces are split across the codebase: assistant-only masking, sequence packing, online full-vocabulary KD loss, checkpoint/resume metadata, validation, provenance checks, SFT, and evaluation. The final pattern is:

Distill a smaller base student from a stronger teacher with online KD.
Apply targeted SFT to recover assistant behavior, formatting, identity, and generation stability.

Quintus Architecture

Core KD objective:

$$\mathcal{L}_{\text{total}} = \alpha \, \mathcal{L}_{\text{CE}} + (1 - \alpha) \, \mathcal{L}_{\text{KD}}$$

For the final run,

$$\alpha = 0.3, \quad T = 2.0$$
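
As a rough illustration of the objective above, the following pure-Python sketch evaluates the combined loss for a single token position. It assumes forward KL (teacher to student) on temperature-softened distributions and the common $T^2$ scaling of the KD term; the source states neither, so treat both as assumptions. All function names are hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def ce_loss(student_logits, target_id):
    """Cross-entropy of the student against the hard target token."""
    return -log_softmax(student_logits)[target_id]


def kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened distributions (assumed direction)."""
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    kl = sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
    return temperature ** 2 * kl  # T^2 scaling is an assumed convention


def total_loss(student_logits, teacher_logits, target_id,
               alpha=0.3, temperature=2.0):
    """alpha * CE + (1 - alpha) * KD, with the final-run defaults."""
    ce = ce_loss(student_logits, target_id)
    kd = kd_loss(student_logits, teacher_logits, temperature)
    return alpha * ce + (1 - alpha) * kd


# Toy 4-token vocabulary; the numbers are illustrative only.
student = [0.2, 1.5, -0.3, 0.0]
teacher = [0.1, 2.5, -1.0, 0.4]
print(total_loss(student, teacher, target_id=1))
```

When the student matches the teacher exactly, the KD term vanishes and only the weighted cross-entropy remains.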

Configuration snapshot:

| Setting | Value |
| --- | --- |
| Teacher | `Qwen/Qwen3-8B` |
| Student | `Qwen/Qwen3-1.7B-Base` |
| Tokenizer | `Qwen/Qwen3-1.7B` |
| Data | ~90K English-only samples from DistilQwen_100k |
| Max sequence length | 4096 |
| Epochs | 1 |
| Learning rate | `5.0e-6` |
| Weight decay | `0.1` |
| Warmup ratio | `0.05` |
| Online KD token chunk | 2048 |
| Micro batch | 4 |
| Gradient accumulation | 2 |
| Sequence packing | enabled, `pack_length = 4096` |
| Attention | FlashAttention-2 when available |
| Liger kernels | enabled for compatible Qwen-family ops |
| Optimizer | fused AdamW |
| `torch.compile` | disabled |
| Gradient checkpointing | disabled |
| Seed | 25 |

[!NOTE] FlashAttention-2, Liger kernels, and fused AdamW are acceleration paths. Keep the baseline load path compatible with standard Transformers and vLLM APIs before publishing checkpoints. `torch.compile` stayed disabled because this KD shape showed high Inductor memory overhead, dynamic-shape graph breaks, recompile overhead, and a checkpoint portability risk: state-dict keys carry an `_orig_mod.` prefix when compiled modules are not unwrapped before saving.
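
The `_orig_mod.` portability risk mentioned in the note can be handled at save or load time by renaming keys. The sketch below is one possible cleanup, not the repository's checkpoint code; `strip_compile_prefix` is a hypothetical helper, and the toy dictionary uses strings where a real state dict would hold tensors.

```python
COMPILE_PREFIX = "_orig_mod."


def strip_compile_prefix(state_dict):
    """Return a copy of state_dict with every '_orig_mod.' segment removed.

    Handles both a compiled top-level module ('_orig_mod.layer.weight')
    and a compiled submodule ('model._orig_mod.layer.weight').
    """
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key.replace(COMPILE_PREFIX, "")
        if new_key in cleaned:
            # Two source keys collapsing to one name means the checkpoint
            # mixes wrapped and unwrapped copies; refuse to guess.
            raise KeyError(f"duplicate key after stripping prefix: {new_key}")
        cleaned[new_key] = value
    return cleaned


# Placeholder values stand in for tensors.
compiled_state = {
    "_orig_mod.embed.weight": "tensor-a",
    "_orig_mod.lm_head.weight": "tensor-b",
}
print(sorted(strip_compile_prefix(compiled_state)))
```

Saving from the unwrapped module avoids the problem at the source; a rename pass like this is mainly useful for checkpoints that were already written with the prefix.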

[!TIP] The B200-oriented defaults are conservative for the 8B-teacher to 1.7B-student workload. Smaller teacher/student pairs may tolerate larger micro-batches, but the logits and KL workspace of full-vocabulary KD grow in proportion to tokens × vocabulary size, so a wide vocabulary limits the usable batch quickly.
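
To get a feel for why vocabulary width dominates, here is a back-of-the-envelope estimate for one logits-sized tensor. It assumes fp32 values and the ~151,665-entry vocabulary quoted in this README; the model's actual output dimension may be padded larger, and a real step holds several such tensors (teacher logits, student logits, softmax intermediates), so read the result as a lower bound.

```python
def logits_bytes(num_tokens, vocab_size, bytes_per_value=4):
    """Bytes for one [num_tokens, vocab_size] tensor (fp32 by default)."""
    return num_tokens * vocab_size * bytes_per_value


VOCAB_SIZE = 151_665          # vocabulary size quoted in this README
KD_CHUNK_TOKENS = 2048        # "Online KD token chunk" from the config table
MICRO_BATCH, SEQ_LEN = 4, 4096

chunk_bytes = logits_bytes(KD_CHUNK_TOKENS, VOCAB_SIZE)
full_bytes = logits_bytes(MICRO_BATCH * SEQ_LEN, VOCAB_SIZE)

GIB = 2 ** 30
print(f"one 2048-token chunk : {chunk_bytes / GIB:.2f} GiB")
print(f"full packed micro-batch: {full_bytes / GIB:.2f} GiB")
print(f"ratio: {full_bytes // chunk_bytes}x")
```

With these settings a fully packed micro-batch covers eight chunks, which is the gap the token chunk size is meant to close.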

The editable run configuration lives in configs/config.yaml. Paths and Hub destinations are left as placeholders so each runner can set local directories and repository names directly.

Why Online KD Replaced Offline Top-K KD

Earlier experiments cached only the teacher's top-k logits. That made storage smaller, but with a Qwen vocabulary around 151K tokens, $k = 8$ exposes only:

$$\frac{k}{|V|} = \frac{8}{151{,}665} \approx 5.3 \times 10^{-5} = 0.0053\%$$

of the vocabulary support at each position. The sparse signal could perturb the student, but it did not consistently transfer deeper reasoning behavior.

The final online path keeps the teacher and student in memory together and computes KL divergence against the teacher's full-vocabulary distribution. Token chunking keeps that dense objective feasible without materializing a single large KL workspace.
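
The chunking idea can be sketched without any framework: process token positions in fixed-size slices, accumulate the KL sum, and divide once at the end. The sketch assumes forward KL on temperature-softened distributions and a plain mean over tokens; the repository's reduction and masking may differ, and all names here are hypothetical.

```python
import math
import random


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def token_kl(teacher_logits, student_logits, temperature):
    """KL(teacher || student) for one token position."""
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


def chunked_kd(teacher_rows, student_rows, temperature=2.0, chunk_size=2048):
    """Mean per-token KL, computed chunk_size positions at a time."""
    if len(teacher_rows) != len(student_rows) or not teacher_rows:
        raise ValueError("need equally sized, non-empty logit lists")
    total = 0.0
    for start in range(0, len(teacher_rows), chunk_size):
        stop = start + chunk_size
        # Only this slice's softmax workspace is alive at any moment.
        total += sum(
            token_kl(t, s, temperature)
            for t, s in zip(teacher_rows[start:stop], student_rows[start:stop])
        )
    return total / len(teacher_rows)


# Tiny random problem: 10 token positions over a 32-entry toy vocabulary.
random.seed(25)
VOCAB, TOKENS = 32, 10
teacher = [[random.gauss(0, 1) for _ in range(VOCAB)] for _ in range(TOKENS)]
student = [[random.gauss(0, 1) for _ in range(VOCAB)] for _ in range(TOKENS)]

chunked = chunked_kd(teacher, student, chunk_size=4)
unchunked = chunked_kd(teacher, student, chunk_size=TOKENS)
print(math.isclose(chunked, unchunked, rel_tol=1e-9))
```

Chunking changes peak memory, not the value of the loss: the chunked and single-pass results agree up to floating-point rounding.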

Benchmark Scoreboard

The final public scoreboard compares Qwen/Qwen3-1.7B-Base, Qwen/Qwen3-1.7B-Instruct, and Quintus-1.7B.

Model Evaluation Scoreboard

The strongest signal is the reasoning crossover: Quintus beats both the base model and the official 1.7B instruct model on GSM8K, ARC-Challenge, and WinoGrande while staying at the same parameter scale.

See docs/benchmarks.md for the numeric table and interpretation. See docs/evaluation_methodology.md for benchmark controls.

Evaluation Notes

Evaluation uses a mixture of EvalPlus and lm-evaluation-harness/vLLM style benchmarks. The repository keeps evaluation methodology separate because prompt format can change the result:

Raw completion comparisons are used for base capability.
Chat-template comparisons are used for assistant-format behavior.
Log-likelihood tasks such as ARC-Challenge and PIQA should usually stay raw.
GSM8K can differ between strict #### parsing and flexible number extraction.
Metric extraction must ignore stderr entries, alias fields, and values stored under the wrong filter key.
Runtime versions, checkpoint identity, generation budget, and stale output cleanup are part of the evaluation contract.
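
The GSM8K parsing note above is easy to see on a few strings. The two extractors below are simplified stand-ins written for this illustration, not the regular expressions used by lm-evaluation-harness: the strict one requires a `####` marker, and the flexible one takes the last number found anywhere in the text.

```python
import re

NUMBER = r"-?\d[\d,]*(?:\.\d+)?"
STRICT_RE = re.compile(r"####\s*(" + NUMBER + r")")
NUMBER_RE = re.compile(NUMBER)


def normalize(number_text):
    """Drop thousands separators so '1,250' and '1250' compare equal."""
    return number_text.replace(",", "")


def strict_answer(text):
    """Answer following a '####' marker, or None if the marker is absent."""
    match = STRICT_RE.search(text)
    return normalize(match.group(1)) if match else None


def flexible_answer(text):
    """Last number appearing anywhere in the text, or None."""
    found = NUMBER_RE.findall(text)
    return normalize(found[-1]) if found else None


formatted = "16 - 3 - 4 = 9 eggs, and 9 * 2 = 18\n#### 18"
unformatted = "So she earns $18 per day."
trailing = "She earns 18 dollars a day, or 126 over 7 days."

for text in (formatted, unformatted, trailing):
    print(repr(strict_answer(text)), repr(flexible_answer(text)))
```

A response without the marker scores zero under strict parsing even when the number is right, while flexible extraction can latch onto a trailing number that is not the answer, so the two metrics should be reported separately.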

The active benchmark runner is sft/evaluate.py. It covers EvalPlus code tasks and lm-evaluation-harness/vLLM tasks, including GSM8K 10-shot evaluation with an extended generation budget.
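
For the metric-extraction rule in the evaluation notes, a defensive lookup helps. The sketch assumes a per-task results dictionary keyed as `metric,filter`, with sibling `metric_stderr,filter` and `alias` entries, which is the layout recent lm-evaluation-harness versions write; confirm it against the version you run. `pick_metric` is a hypothetical helper, and the numbers are placeholders, not Quintus results.

```python
def pick_metric(task_results, metric, filter_key):
    """Return the value stored under exactly '<metric>,<filter_key>'.

    Fails loudly instead of falling back to a stderr entry, the alias
    field, or the same metric under a different filter.
    """
    key = f"{metric},{filter_key}"
    if key not in task_results:
        available = sorted(
            k for k in task_results if k != "alias" and "_stderr" not in k
        )
        raise KeyError(f"{key!r} not found; metric keys present: {available}")
    value = task_results[key]
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"{key!r} is not numeric: {value!r}")
    return float(value)


# Placeholder numbers for illustration only.
example_task_results = {
    "alias": "gsm8k",
    "exact_match,strict-match": 0.5,
    "exact_match_stderr,strict-match": 0.01,
    "exact_match,flexible-extract": 0.6,
    "exact_match_stderr,flexible-extract": 0.01,
}

print(pick_metric(example_task_results, "exact_match", "strict-match"))
```

Requiring the exact key means a renamed filter surfaces as an error during aggregation rather than as a silently wrong scoreboard entry.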

Repository Map

```text
configs/        Public run profile and DeepSpeed ZeRO-2 template.
src/            Data prep, online KD, losses, packing, checkpoints, provenance.
sft/            Post-KD SFT, local chat, and consolidated evaluation runner.
docs/           Public architecture, training, evaluation, and release notes.
weight_audit/   Checkpoint structure and weight-divergence audit material.
```

Key files:

src/train.py: SFT, offline KD compatibility, and final online_kd training entry point.
src/download.py: model setup, dataset loading, schema normalization, tokenization, and assistant-only loss masks.
src/losses.py: CE/KD objective, including online full-vocab KD token chunking.
src/sequence_packing.py: deterministic first-fit decreasing sequence packing.
src/checkpoints.py: checkpoint save/resume metadata and packing compatibility checks.
src/provenance.py: tokenizer/model/data contract checks.
sft/train_sft.py: post-KD supervised fine-tuning.
sft/evaluate.py: EvalPlus and lm-evaluation-harness/vLLM benchmark runner.
sft/chat.py: local interactive chat wrapper.
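
The first-fit decreasing packing attributed to `src/sequence_packing.py` above follows a well-known bin-packing heuristic. The sketch below is a generic version written for illustration, not the repository's implementation: it sorts samples by length (longest first, ties broken by index so the result is deterministic) and places each into the first pack with enough room.

```python
def pack_first_fit_decreasing(lengths, pack_length=4096):
    """Group sample indices into packs of at most pack_length tokens."""
    # Longest first; ties broken by original index for determinism.
    order = sorted(range(len(lengths)), key=lambda i: (-lengths[i], i))
    packs, free = [], []  # free[b] = tokens still available in packs[b]
    for index in order:
        size = lengths[index]
        if size > pack_length:
            raise ValueError(f"sample {index} ({size} tokens) exceeds pack_length")
        for b, room in enumerate(free):
            if size <= room:  # first pack that fits wins
                packs[b].append(index)
                free[b] -= size
                break
        else:
            packs.append([index])  # nothing fit: open a new pack
            free.append(pack_length - size)
    return packs


lengths = [3000, 1000, 2500, 1500, 900, 96]  # toy token counts
packs = pack_first_fit_decreasing(lengths)
used = sum(lengths)
capacity = len(packs) * 4096
print(packs, f"fill = {used / capacity:.1%}")
```

Unpacked, these six samples would occupy six padded rows; packed, they fit in three, which is where the useful-token throughput gain comes from.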

Commands

Install the base dependencies:

```bash
pip install -r requirements.txt
```

For training and benchmark runs, install the matching extras:

```bash
pip install -r requirements-train.txt
pip install -r requirements-eval.txt
```

Inspect or prepare data/model assets:

```bash
python -m src.download --help
```

Run the final KD path after editing configs/config.yaml for local paths and hardware:

```bash
python -m src.train --phase online_kd
```

Hub checkpoint uploads are off by default for local runs. Pass --upload_last_checkpoint or the step/epoch upload flags only after setting the target repository and HF_TOKEN.

Run the consolidated benchmark suite:

```bash
python sft/evaluate.py
```

Start local chat with a downloaded or local checkpoint:

```bash
python sft/chat.py --model_path path/to/quintus/checkpoint
```

Interactive Chat

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

PUBLIC_REPO_ID = "iamrahulreddy/Quintus"

print(f"Loading Quintus from {PUBLIC_REPO_ID}...")
tokenizer = AutoTokenizer.from_pretrained(PUBLIC_REPO_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    PUBLIC_REPO_ID,
    device_map="auto",
    dtype=torch.float16,
    trust_remote_code=True,
)

stop_tokens = ["<|endoftext|>", "<|im_end|>"]
eos_token_ids = [tokenizer.eos_token_id] if tokenizer.eos_token_id is not None else []
for token in stop_tokens:
    token_id = tokenizer.convert_tokens_to_ids(token)
    if token_id is not None and token_id not in eos_token_ids:
        eos_token_ids.append(token_id)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

conversation_history = [
    {
        "role": "system",
        "content": (
            "You are Quintus, a highly capable AI assistant created by "
            "Muskula Rahul. You are helpful, precise, and logically sound."
        ),
    }
]

print()
print("Quintus Chat (type 'quit' to exit)")
print()

while True:
    try:
        user_input = input("You: ").strip()
        if user_input.lower() in ["quit", "exit"]:
            print("\nGoodbye!")
            break
        if not user_input:
            continue

        conversation_history.append({"role": "user", "content": user_input})

        prompt = tokenizer.apply_chat_template(
            conversation_history,
            tokenize=False,
            add_generation_prompt=True,
        )

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        print("Quintus: ", end="", flush=True)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=0.7,
                top_p=0.9,
                do_sample=True,
                streamer=streamer,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=eos_token_ids,
            )

        generated_ids = outputs[0][inputs.input_ids.shape[-1]:]
        assistant_response = tokenizer.decode(
            generated_ids,
            skip_special_tokens=True,
        ).strip()
        conversation_history.append({"role": "assistant", "content": assistant_response})
        print()

    except KeyboardInterrupt:
        print("\n\nGoodbye!")
        break
```

Documentation

Documentation Index: recommended public reading order.
Architecture: end-to-end data flow, modules, and training phases.
Experiment Timeline: why the project moved from offline top-k KD to online full-vocabulary KD.
Training Playbook: memory rules, packing, kernels, checkpointing, and B200-oriented guidance.
Pipeline Hardening: silent-failure classes, artifact contracts, and safety checks.
Evaluation Methodology: raw/chat controls, parser traps, metric extraction, and qualitative evaluation rules.
Engineering Insights: condensed lessons and design decisions.
Benchmarks: verified scoreboard and interpretation.
Weight Audit: structural checkpoint sanity checks and weight-divergence summary.
Hugging Face Model Card: release-page copy for the public model card.

Limitations

Quintus is still a 1.7B model and inherits compact-model capacity limits.
Factual answers can be confidently wrong and should be verified.
Code generation may still contradict stated complexity or edge-case requirements.
Raw and chat-template results are not interchangeable.
Additional preference tuning or DPO would likely improve calibration, refusal behavior, and open-ended assistant polish.

Credits

Quintus builds on open model, dataset, and tooling work from the broader LLM community:

Qwen Team and the Qwen Hugging Face organization for the Qwen3 model family.
Qwen/Qwen3-8B, used as the distillation teacher.
Qwen/Qwen3-1.7B-Base, used as the base student checkpoint.
Qwen/Qwen3-1.7B, used for the tokenizer and chat-template contract.
Alibaba PAI for the DistilQwen_100k dataset used as the primary instruction source after filtering.
Hugging Face Transformers for model loading, tokenization, and generation APIs.
vLLM, EvalPlus, and lm-evaluation-harness for evaluation infrastructure.
FlashAttention and Liger Kernel for performance kernels used or validated during training.

License And Author

This software is distributed under the MIT License. Refer to the LICENSE file for full text.

Author: Muskula Rahul - @iamrahulreddy

Citation

If this model, codebase, or training pipeline is useful in your work, please cite this repository and acknowledge the upstream Qwen3 models.
