patnir41

kaetram-qwen3.5-2b-opd-r1

License: apache-2.0

Method

On-policy distillation: the student plays the game, and each action token it emits is scored with a reverse-KL advantage against the teacher, advantage = −(logp_student − logp_teacher). Training optimizes a PPO-clipped importance-sampling objective on those advantages. Settings: LoRA r=64, α=64, no rsLoRA, 7 projection modules, bf16, 1 epoch, advantage clamp ±3, early-turn step weight 1.5. Round 1 initializes a fresh LoRA on base Qwen3.5-2B. The full construction is described in the patnir41/kaetram-opd-2b dataset card.
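
As a rough illustration of the scoring described above, the following minimal sketch computes the clamped per-token reverse-KL advantage and a PPO-style clipped surrogate loss on plain Python lists. It is not the training code used for this model: the function names and the `clip_eps=0.2` value are assumptions made for illustration, and the early-turn step weight is left out because its exact definition lives in the dataset card.

```python
import math

def reverse_kl_advantages(logp_student, logp_teacher, clamp=3.0):
    """Per-token advantage = -(logp_student - logp_teacher), clamped to [-clamp, clamp]."""
    advantages = []
    for ls, lt in zip(logp_student, logp_teacher):
        adv = -(ls - lt)
        advantages.append(max(-clamp, min(clamp, adv)))
    return advantages

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate loss over tokens (a quantity to minimize).

    clip_eps is an assumed value; the model card does not state it.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance-sampling ratio per token
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Toy log-probabilities for three sampled tokens (made-up numbers).
student = [-1.0, -0.5, -6.0]
teacher = [-2.0, -0.5, -1.0]
adv = reverse_kl_advantages(student, teacher)
loss = ppo_clipped_loss(student, student, adv)  # ratio is 1 on the first update
```

A token the teacher finds more likely than the student does receives a positive advantage, so the update raises its probability; the clamp keeps a single very surprising token from dominating the batch.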

Chain: base Qwen3.5-2B → r1 → (merge) → r2 → (merge) → r3.
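
The "(merge)" steps in the chain fold each round's LoRA update into the dense weights before the next round starts. The sketch below shows the standard LoRA merge arithmetic, W' = W + scale · (B A), on toy matrices; the helper names are hypothetical and this is not the script used to produce the released weights. With r=64, α=64 and no rsLoRA, the scale α/r is 1, whereas rsLoRA would use α/√r.

```python
import math

def lora_scale(alpha, r, use_rslora=False):
    """Standard LoRA scaling: alpha / r, or alpha / sqrt(r) when rsLoRA is enabled."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

def merge_lora(W, A, B, alpha, r, use_rslora=False):
    """Return W + scale * (B @ A) for W (out x in), B (out x r), A (r x in)."""
    scale = lora_scale(alpha, r, use_rslora)
    merged = [row[:] for row in W]  # copy so the input weights stay untouched
    for i in range(len(W)):
        for j in range(len(W[0])):
            delta = sum(B[i][k] * A[k][j] for k in range(len(A)))
            merged[i][j] += scale * delta
    return merged

# Toy rank-1 example (made-up numbers, not real model weights).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.5]]
merged = merge_lora(W, A, B, alpha=1, r=1)
```

After a merge the adapter's contribution is part of the dense weights, so the next round can attach a fresh LoRA on top of them.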

Files

root: merged bf16 weights (Qwen3_5ForConditionalGeneration) — load directly with transformers/vLLM/SGLang.
adapter/: the LoRA adapter alone (apply on top of Qwen/Qwen3.5-2B).

This is a text-only fine-tune; the base architecture is multimodal-capable but no vision/audio path is trained or used. The included chat_template.jinja preserves <think> reasoning on every assistant turn.

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
m = AutoModelForCausalLM.from_pretrained("patnir41/kaetram-qwen3.5-2b-opd-r1", torch_dtype="bfloat16", device_map="auto")
t = AutoTokenizer.from_pretrained("patnir41/kaetram-qwen3.5-2b-opd-r1")
```

The model emits typed tool calls (observe, navigate, attack, gather, query_quest, …) and expects the Kaetram MCP tool harness; outside that harness it generates the same tool-call syntax as plain text.

Limitations

Trained for one narrow task, the Kaetram Core-3 benchmark; it is not a general assistant. It inherits the round's known failure modes: occasional malformed tool-call syntax, and the "Rick's Roll" quest, which stays unsolved across the whole program.

License & credits

Apache-2.0, inheriting Qwen3.5-2B (© 2026 Alibaba Cloud). Game environment and embedded game data (coordinates, NPC/mob/quest names) are from Kaetram-Open (MPL-2.0). See NOTICE. All training data was generated by Qwen self-play — no third-party proprietary model outputs were used.

Citation

```bibtex
@misc{kaetram_opd_2b_r1_2026,
  title        = {Kaetram Qwen3.5-2B OPD (Round 1)},
  author       = {patnir41},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/patnir41/kaetram-qwen3.5-2b-opd-r1}}
}
```
