License: MIT
Core Technical Points
- Dense KD signal: the final training path streams the teacher's full vocabulary distribution live instead of relying on sparse cached top-k logits.
- Base-student strategy: the student starts from Qwen/Qwen3-1.7B-Base, leaving more room for distillation before assistant-format tuning.
- Assistant-only supervision: prompt text, chat headers, separators, and padding are masked out of the supervised target region.
- Sequence packing: deterministic first-fit decreasing packing improves useful-token throughput at 4096-token context length.
- Public benchmark controls: raw/chat prompt format, metric extraction, generation budget, and artifact hygiene are documented explicitly.
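The assistant-only supervision point can be illustrated with a minimal sketch. It assumes the conventional ignore index of -100 (the default `ignore_index` of PyTorch's cross-entropy loss) and assistant spans given as half-open token index ranges; `mask_non_assistant` is a hypothetical helper for illustration, not the function used in this repository.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_non_assistant(input_ids, assistant_spans):
    """Build labels that supervise only assistant tokens.

    assistant_spans is a list of (start, end) half-open index ranges
    (an assumed representation). Every other position, including prompt
    text, chat headers, separators, and padding, stays at IGNORE_INDEX
    and contributes nothing to the loss.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        for i in range(start, min(end, len(input_ids))):
            labels[i] = input_ids[i]
    return labels


# Toy sequence: 3 prompt tokens, 3 assistant tokens, 2 padding tokens.
input_ids = [11, 12, 13, 21, 22, 23, 0, 0]
labels = mask_non_assistant(input_ids, [(3, 6)])
print(labels)  # [-100, -100, -100, 21, 22, 23, -100, -100]
```

Only positions 3 to 5 carry a target, so the loss is driven by the assistant reply alone.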
Training Summary
The release training path is a two-stage pipeline:
- Online KD: train the 1.7B base student against live teacher logits from a Qwen3-8B teacher.
- Targeted SFT: tune the distilled checkpoint for assistant-style interaction, persona consistency, and repetition control.
Reuse As A KD Framework
Quintus is released as a trained 1.7B assistant, but the repository is also a reusable reference pipeline for compact-model distillation. The same structure can be adapted to other teacher/student pairs with changes to the model IDs, tokenizer, dataset source, local paths, sequence length, batch schedule, and hardware-specific memory settings in configs/config.yaml.
The reusable pieces are split across the codebase: assistant-only masking, sequence packing, online full-vocabulary KD loss, checkpoint/resume metadata, validation, provenance checks, SFT, and evaluation. The final pattern is:
- Distill a smaller base student from a stronger teacher with online KD.
- Apply targeted SFT to recover assistant behavior, formatting, identity, and generation stability.

Core KD objective:
$$L_{\text{total}} = \alpha L_{\text{CE}} + (1 - \alpha) L_{\text{KD}}$$
For the final run, $\alpha = 0.3$ and $T = 2.0$.
Configuration snapshot:
| Setting | Value |
|---|---|
| Teacher | Qwen/Qwen3-8B |
| Student | Qwen/Qwen3-1.7B-Base |
| Tokenizer | Qwen/Qwen3-1.7B |
| Data | ~90K English-only samples from DistilQwen_100k |
| Max sequence length | 4096 |
| Epochs | 1 |
| Learning rate | 5.0e-6 |
| Weight decay | 0.1 |
| Warmup ratio | 0.05 |
| Online KD token chunk | 2048 |
| Micro batch | 4 |
| Gradient accumulation | 2 |
| Sequence packing | enabled, pack_length = 4096 |
| Attention | FlashAttention-2 when available |
| Liger kernels | enabled for compatible Qwen-family ops |
| Optimizer | fused AdamW |
| torch.compile | disabled |
| Gradient checkpointing | disabled |
| Seed | 25 |
[!NOTE] FlashAttention-2, Liger kernels, and fused AdamW are acceleration paths. Keep the baseline load path compatible with standard Transformers and vLLM APIs before publishing checkpoints.
torch.compile stayed disabled because this KD shape showed high Inductor memory overhead, dynamic-shape graph breaks, recompile overhead, and checkpoint portability risk from "_orig_mod." state dict key prefixes when compiled modules are not unwrapped before saving.
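To make the checkpoint portability risk concrete: a module wrapped by torch.compile keeps the original module under an `_orig_mod` attribute, so its state dict keys gain an `_orig_mod.` prefix. The sketch below shows one way such keys could be normalized before saving. `strip_compile_prefix` is a hypothetical helper, and plain integers stand in for tensors so the example runs without PyTorch.

```python
COMPILE_PREFIX = "_orig_mod."


def strip_compile_prefix(state_dict):
    """Return a copy of state_dict with torch.compile wrapper prefixes removed.

    Uses str.replace so that a prefix introduced by compiling a submodule
    (for example "layers.0._orig_mod.weight") is handled as well.
    """
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key.replace(COMPILE_PREFIX, "")
        if new_key in cleaned:
            raise KeyError(f"duplicate key after stripping prefix: {new_key}")
        cleaned[new_key] = value
    return cleaned


# Toy state dict; integers stand in for tensors.
compiled_state = {
    "_orig_mod.embed.weight": 1,
    "_orig_mod.lm_head.weight": 2,
    "layers.0._orig_mod.weight": 3,
}
portable_state = strip_compile_prefix(compiled_state)
print(sorted(portable_state))
# ['embed.weight', 'layers.0.weight', 'lm_head.weight']
```

Saving from the unwrapped module avoids the problem at the source; a key rewrite like this is a fallback for checkpoints that were already written with the prefix.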
[!TIP] The B200-oriented defaults are conservative for the 8B teacher to 1.7B student workload. Smaller teacher/student pairs may tolerate larger micro-batches, but full-vocabulary KD scales sharply with vocabulary width.
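The scaling with vocabulary width can be checked with back-of-envelope arithmetic. The sketch below assumes fp32 logits (4 bytes per value) and uses the vocabulary size, micro batch, sequence length, and token chunk quoted in this README. `logits_bytes` is a hypothetical helper, and real peak memory also depends on activations, softmax workspaces, and any padding of the model's output dimension.

```python
def logits_bytes(num_tokens, vocab_size, bytes_per_value=4):
    """Size of one dense [num_tokens, vocab_size] logits tensor in bytes."""
    return num_tokens * vocab_size * bytes_per_value


VOCAB_SIZE = 151_665   # vocabulary size quoted in this README
MICRO_BATCH = 4
SEQ_LEN = 4096
TOKEN_CHUNK = 2048
GIB = 1024 ** 3

full = logits_bytes(MICRO_BATCH * SEQ_LEN, VOCAB_SIZE)
chunk = logits_bytes(TOKEN_CHUNK, VOCAB_SIZE)

print(f"full micro-batch logits: {full / GIB:.2f} GiB")  # 9.26 GiB
print(f"one 2048-token chunk:    {chunk / GIB:.2f} GiB")  # 1.16 GiB
print(f"reduction factor:        {full // chunk}x")       # 8x
```

Under these assumptions a single fp32 logits tensor for the whole micro batch is roughly 9 GiB per model, and both teacher and student produce one, which is why the KD term is evaluated in token chunks.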
The editable run configuration lives in configs/config.yaml. Paths and Hub destinations are left as placeholders so each runner can set local directories and repository names directly.
Why Online KD Replaced Offline Top-K KD
Earlier experiments cached only the teacher's top-k logits. That made storage smaller, but with a Qwen vocabulary around 151K tokens, k=8 exposes only:
$$\frac{k}{|V|} = \frac{8}{151{,}665} \approx 5.3 \times 10^{-5} = 0.0053\%$$
of the vocabulary support at each position. The sparse signal could perturb the student, but it did not consistently transfer deeper reasoning behavior.
The final online path keeps the teacher and student in memory together and computes KL divergence against the teacher's full-vocabulary distribution. Token chunking keeps that dense objective feasible without materializing a single large KL workspace.
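The objective and the chunked accumulation can be sketched on a toy vocabulary without any dependencies. This is an illustration under stated assumptions, not the code in src/losses.py: it assumes forward KL from teacher to student on temperature-softened distributions, the common T² scaling of the KD term, a mean over tokens, and that every position is supervised. The function names are hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def token_kd(student_logits, teacher_logits, T):
    """KL(teacher || student) over the full vocabulary, scaled by T^2 (assumed)."""
    s = log_softmax(student_logits, T)
    t = log_softmax(teacher_logits, T)
    return T * T * sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def token_ce(student_logits, target):
    """Cross-entropy of the student against the hard target token."""
    return -log_softmax(student_logits)[target]


def chunked_loss(student, teacher, targets, alpha=0.3, T=2.0, chunk=2048):
    """alpha * CE + (1 - alpha) * KD, accumulated one token chunk at a time."""
    n = len(targets)
    ce_sum = kd_sum = 0.0
    for start in range(0, n, chunk):
        # In a tensor implementation only this slice would be materialized.
        for i in range(start, min(start + chunk, n)):
            ce_sum += token_ce(student[i], targets[i])
            kd_sum += token_kd(student[i], teacher[i], T)
    return alpha * ce_sum / n + (1 - alpha) * kd_sum / n


# Toy batch: 3 token positions, vocabulary of 3.
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0]]
teacher = [[3.0, 0.0, -2.0], [0.0, 0.5, 1.0], [1.0, 1.0, 1.0]]
targets = [0, 2, 1]

loss_small_chunks = chunked_loss(student, teacher, targets, chunk=2)
loss_one_chunk = chunked_loss(student, teacher, targets, chunk=3)
print(loss_small_chunks, loss_one_chunk)
```

Because the loss is a sum over token positions, splitting it into chunks changes peak memory but not the value, which is what lets the dense full-vocabulary signal fit alongside both models.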
Benchmark Scoreboard
The final public scoreboard compares Qwen/Qwen3-1.7B-Base,
Qwen/Qwen3-1.7B-Instruct, and Quintus-1.7B.

The strongest signal is the reasoning crossover: Quintus beats both the base model and the official 1.7B instruct model on GSM8K, ARC-Challenge, and WinoGrande while remaining at the same parameter scale.
See docs/benchmarks.md for the numeric table and interpretation. See docs/evaluation_methodology.md for benchmark controls.
Evaluation Notes
Evaluation uses a mixture of EvalPlus and lm-evaluation-harness/vLLM style
benchmarks. The repository keeps evaluation methodology separate because prompt
format can change the result:
- Raw completion comparisons are used for base capability.
- Chat-template comparisons are used for assistant-format behavior.
- Log-likelihood tasks such as ARC-Challenge and PIQA should usually stay raw.
- GSM8K can differ between strict #### parsing and flexible number extraction.
- Metric extraction must ignore stderr, aliases, and wrong filter keys.
- Runtime versions, checkpoint identity, generation budget, and stale output cleanup are part of the evaluation contract.
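The GSM8K point is easiest to see with two simplified extractors. The regular expressions below are illustrative assumptions, not the exact filters used by lm-evaluation-harness or by sft/evaluate.py: the strict one requires the `####` answer marker, while the flexible one takes the last number found in the completion.

```python
import re

# Simplified patterns for illustration only.
STRICT = re.compile(r"####\s*(-?[0-9][0-9,]*(?:\.[0-9]+)?)")
NUMBER = re.compile(r"-?[0-9][0-9,]*(?:\.[0-9]+)?")


def strict_answer(text):
    """Return the number after '####', or None if the marker is missing."""
    match = STRICT.search(text)
    return match.group(1).replace(",", "") if match else None


def flexible_answer(text):
    """Return the last number in the text, or None if there is none."""
    numbers = NUMBER.findall(text)
    return numbers[-1].replace(",", "") if numbers else None


with_marker = "She sells 16 - 3 - 4 = 9 eggs at $2 each.\n#### 18"
without_marker = "She sells 9 eggs at $2 each, so she makes $18."

print(strict_answer(with_marker), flexible_answer(with_marker))        # 18 18
print(strict_answer(without_marker), flexible_answer(without_marker))  # None 18
```

A chat-tuned model that answers correctly in prose scores zero under the strict parser and full marks under the flexible one, so the reported metric has to name which extraction was used.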
The active benchmark runner is sft/evaluate.py. It covers
EvalPlus code tasks and lm-evaluation-harness/vLLM tasks, including GSM8K
10-shot evaluation with an extended generation budget.
Repository Map
text
configs/       Public run profile and DeepSpeed Zero-2 template.
src/           Data prep, online KD, losses, packing, checkpoints, provenance.
sft/           Post-KD SFT, local chat, and consolidated evaluation runner.
docs/          Public architecture, training, evaluation, and release notes.
weight_audit/  Checkpoint structure and weight-divergence audit material.
Key files:
- src/train.py: SFT, offline KD compatibility, and final online_kd training entry point.
- src/download.py: model setup, dataset loading, schema normalization, tokenization, and assistant-only loss masks.
- src/losses.py: CE/KD objective, including online full-vocab KD token chunking.
- src/sequence_packing.py: deterministic first-fit decreasing sequence packing.
- src/checkpoints.py: checkpoint save/resume metadata and packing compatibility checks.
- src/provenance.py: tokenizer/model/data contract checks.
- sft/train_sft.py: post-KD supervised fine-tuning.
- sft/evaluate.py: EvalPlus and lm-evaluation-harness/vLLM benchmark runner.
- sft/chat.py: local interactive chat wrapper.
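For readers unfamiliar with the packing strategy named above, here is a minimal sketch of deterministic first-fit decreasing packing. It illustrates the general algorithm only and is not the code in src/sequence_packing.py. `pack_first_fit_decreasing` is a hypothetical name, and ties are broken by sample index so the result is reproducible.

```python
def pack_first_fit_decreasing(lengths, pack_length=4096):
    """Group sample indices into bins whose total length fits pack_length.

    Samples are visited longest first (ties broken by index) and each one
    goes into the first bin with enough free space.
    """
    order = sorted(range(len(lengths)), key=lambda i: (-lengths[i], i))
    bins = []       # sample indices per bin
    remaining = []  # free token slots per bin
    for idx in order:
        n = lengths[idx]
        if n > pack_length:
            raise ValueError(f"sample {idx} has {n} tokens, more than {pack_length}")
        for b, free in enumerate(remaining):
            if n <= free:
                bins[b].append(idx)
                remaining[b] -= n
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([idx])
            remaining.append(pack_length - n)
    return bins


lengths = [3000, 1000, 2500, 1500, 600, 4096]
bins = pack_first_fit_decreasing(lengths)
print(bins)  # [[5], [0, 1], [2, 3], [4]]

used = sum(lengths)
print(f"packed rows: {len(bins)}, utilization: {used / (len(bins) * 4096):.2f}")
# packed rows: 4, utilization: 0.77
```

In this toy case six samples fit into four 4096-token rows instead of six padded rows, which is the useful-token throughput gain the packing step targets.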
Commands
Install the base dependencies:
bash
pip install -r requirements.txt
For training and benchmark runs, install the matching extras:
bash
pip install -r requirements-train.txt
pip install -r requirements-eval.txt
Inspect or prepare data/model assets:
bash
python -m src.download --help
Run the final KD path after editing configs/config.yaml for local paths and hardware:
bash
python -m src.train --phase online_kd
Hub checkpoint uploads are off by default for local runs. Pass
--upload_last_checkpoint or the step/epoch upload flags only after setting the
target repository and HF_TOKEN.
Run the consolidated benchmark suite:
bash
python sft/evaluate.py
Start local chat with a downloaded or local checkpoint:
bash
python sft/chat.py --model_path path/to/quintus/checkpoint
Interactive Chat
python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

PUBLIC_REPO_ID = "iamrahulreddy/Quintus"

print(f"Loading Quintus from {PUBLIC_REPO_ID}...")
tokenizer = AutoTokenizer.from_pretrained(PUBLIC_REPO_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    PUBLIC_REPO_ID,
    device_map="auto",
    dtype=torch.float16,
    trust_remote_code=True,
)

stop_tokens = ["<|endoftext|>", "<|im_end|>"]
eos_token_ids = [tokenizer.eos_token_id] if tokenizer.eos_token_id is not None else []
for token in stop_tokens:
    token_id = tokenizer.convert_tokens_to_ids(token)
    if token_id is not None and token_id not in eos_token_ids:
        eos_token_ids.append(token_id)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

conversation_history = [
    {
        "role": "system",
        "content": (
            "You are Quintus, a highly capable AI assistant created by "
            "Muskula Rahul. You are helpful, precise, and logically sound."
        ),
    }
]

print()
print("Quintus Chat (type 'quit' to exit)")
print()

while True:
    try:
        user_input = input("You: ").strip()
        if user_input.lower() in ["quit", "exit"]:
            print("\nGoodbye!")
            break
        if not user_input:
            continue

        conversation_history.append({"role": "user", "content": user_input})
        prompt = tokenizer.apply_chat_template(
            conversation_history,
            tokenize=False,
            add_generation_prompt=True,
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        print("Quintus: ", end="", flush=True)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=0.7,
                top_p=0.9,
                do_sample=True,
                streamer=streamer,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=eos_token_ids,
            )

        generated_ids = outputs[0][inputs.input_ids.shape[-1]:]
        assistant_response = tokenizer.decode(
            generated_ids,
            skip_special_tokens=True,
        ).strip()
        conversation_history.append({"role": "assistant", "content": assistant_response})
        print()
    except KeyboardInterrupt:
        print("\n\nGoodbye!")
        break
Documentation
- Documentation Index: recommended public reading order.
- Architecture: end-to-end data flow, modules, and training phases.
- Experiment Timeline: why the project moved from offline top-k KD to online full-vocabulary KD.
- Training Playbook: memory rules, packing, kernels, checkpointing, and B200-oriented guidance.
- Pipeline Hardening: silent-failure classes, artifact contracts, and safety checks.
- Evaluation Methodology: raw/chat controls, parser traps, metric extraction, and qualitative evaluation rules.
- Engineering Insights: condensed lessons and design decisions.
- Benchmarks: verified scoreboard and interpretation.
- Weight Audit: structural checkpoint sanity checks and weight-divergence summary.
- Hugging Face Model Card: release-page copy for the public model card.
Limitations
- Quintus is still a 1.7B model and inherits compact-model capacity limits.
- Factual answers can be confidently wrong and should be verified.
- Code generation may still contradict stated complexity or edge-case requirements.
- Raw and chat-template results are not interchangeable.
- Additional preference tuning or DPO would likely improve calibration, refusal behavior, and open-ended assistant polish.
Credits
Quintus builds on open model, dataset, and tooling work from the broader LLM community:
- Qwen Team and the Qwen Hugging Face organization for the Qwen3 model family.
  - Qwen/Qwen3-8B, used as the distillation teacher.
  - Qwen/Qwen3-1.7B-Base, used as the base student checkpoint.
  - Qwen/Qwen3-1.7B, used for the tokenizer and chat-template contract.
- Alibaba PAI for the DistilQwen_100k dataset used as the primary instruction source after filtering.
- Hugging Face Transformers for model loading, tokenization, and generation APIs.
- vLLM, EvalPlus, and lm-evaluation-harness for evaluation infrastructure.
- FlashAttention and Liger Kernel for performance kernels used or validated during training.
License And Author
This software is distributed under the MIT License. Refer to the LICENSE file for full text.
Author: Muskula Rahul - @iamrahulreddy
Citation
If this model, codebase, or training pipeline is useful in your work, please cite this repository and acknowledge the upstream Qwen3 models.