patnir41

kaetram-qwen3.5-2b-opd-r1


README

License: apache-2.0

Method

On-policy distillation: the student plays the game, and each action token it emits is scored with a reverse-KL advantage against the teacher, advantage = −(logp_student − logp_teacher), so a token the teacher considers more likely than the student does receives a positive advantage. Training uses a PPO-style clipped importance-sampling objective on those advantages. Hyperparameters: LoRA r=64, α=64, no rsLoRA, 7 projection modules, bf16, 1 epoch, advantage clamp ±3, early-turn step-weight 1.5. Round 1 initializes a fresh LoRA on base Qwen3.5-2B. The full construction is described in the patnir41/kaetram-opd-2b dataset card.
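The per-token objective described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the training code: the clip range `CLIP_EPS = 0.2` is an assumption (the card does not state it), the function names are hypothetical, and the `weight` argument only stands in for the early-turn step-weight, whose exact application is not described here. A real implementation would operate on batched tensors.

```python
import math

ADV_CLAMP = 3.0   # from the card: advantage clamp of ±3
CLIP_EPS = 0.2    # ASSUMPTION: the card does not state the PPO clip range


def reverse_kl_advantage(logp_student, logp_teacher, clamp=ADV_CLAMP):
    """Per-token advantage -(logp_student - logp_teacher), clamped to [-clamp, clamp]."""
    adv = -(logp_student - logp_teacher)
    return max(-clamp, min(clamp, adv))


def ppo_clipped_token_loss(logp_new, logp_old, advantage,
                           clip_eps=CLIP_EPS, weight=1.0):
    """PPO clipped surrogate loss for one token.

    logp_old is the log-prob under the policy that sampled the token,
    logp_new the log-prob under the policy being updated.
    weight is a hypothetical per-step weight (e.g. 1.5 for early turns).
    """
    ratio = math.exp(logp_new - logp_old)                   # importance-sampling ratio
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)  # pessimistic bound
    return -weight * surrogate                               # minimize the negative surrogate


# The teacher assigns a higher log-prob than the student: positive advantage.
adv = reverse_kl_advantage(logp_student=-1.0, logp_teacher=-0.5)
loss = ppo_clipped_token_loss(logp_new=-1.0, logp_old=-1.0, advantage=adv, weight=1.5)
```

With the clamp in place, a token on which student and teacher disagree sharply contributes an advantage of at most 3 in magnitude, which bounds the size of any single token's update.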

Chain: base Qwen3.5-2B → r1 → (merge) → r2 → (merge) → r3.
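Each "(merge)" step in the chain folds the trained LoRA update into the dense weights before the next round starts. As a toy illustration of the standard LoRA merge, W' = W + s·B·A with s = α/r (or α/√r under rsLoRA), the sketch below uses plain nested lists. The function names and the 2×2 matrices are hypothetical; with the card's r=64, α=64 and no rsLoRA, the scale s is 1.

```python
import math


def lora_scale(alpha, r, rslora=False):
    """Standard LoRA scaling alpha/r; rsLoRA uses alpha/sqrt(r) instead."""
    return alpha / math.sqrt(r) if rslora else alpha / r


def merge_lora(W, A, B, alpha, r, rslora=False):
    """Return W + scale * (B @ A).

    W is (out x in), B is (out x r), A is (r x in), all as lists of lists.
    """
    s = lora_scale(alpha, r, rslora)
    rank = len(A)
    return [
        [W[i][j] + s * sum(B[i][k] * A[k][j] for k in range(rank))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]


# Toy example with alpha == r, mirroring the card's 64/64 setting (scale 1).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.0], [0.0, 0.25]]
merged = merge_lora(W, A, B, alpha=2, r=2)
```

Because α equals r here, the adapter's product B·A is added to the base weights unscaled; enabling rsLoRA with the same values would multiply it by 8 instead.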

Files

  • root: merged bf16 weights (Qwen3_5ForConditionalGeneration) — load directly with transformers/vLLM/SGLang.
  • adapter/: the LoRA adapter alone (apply on top of Qwen/Qwen3.5-2B).
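To use the adapter rather than the merged weights, one option is to load the base model and attach the LoRA from the `adapter/` subfolder with PEFT. The sketch below is an assumption about how that would look, not a tested recipe from the card: the helper name is hypothetical, it assumes `peft` and `accelerate` are installed, and it assumes the base checkpoint loads through `AutoModelForCausalLM` in the same way the merged one does.

```python
def load_adapter_on_base(base_id="Qwen/Qwen3.5-2B",
                         repo_id="patnir41/kaetram-qwen3.5-2b-opd-r1",
                         subfolder="adapter"):
    """Hypothetical helper: load the base model, then attach the LoRA adapter."""
    # Imports are local so that defining the helper does not require the
    # heavy dependencies to be installed.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype="bfloat16", device_map="auto"
    )
    # subfolder tells PEFT where the adapter config and weights live in the repo.
    return PeftModel.from_pretrained(base, repo_id, subfolder=subfolder)
```

Calling `load_adapter_on_base()` downloads both checkpoints, so it is worth doing only when the unmerged adapter is specifically needed, for example to continue training it.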

This is a text-only fine-tune; the base architecture is multimodal-capable but no vision/audio path is trained or used. The included chat_template.jinja preserves <think> reasoning on every assistant turn.

Usage

python

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "patnir41/kaetram-qwen3.5-2b-opd-r1"
# Load the merged bf16 weights from the repo root (device_map="auto" requires accelerate).
m = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="bfloat16", device_map="auto")
t = AutoTokenizer.from_pretrained(repo)

The model emits typed tool calls (observe, navigate, attack, gather, query_quest, …) and expects the Kaetram MCP tool harness; outside that harness it generates the same tool-call syntax as plain text.
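When running outside the harness, the plain-text tool calls have to be parsed by the caller. The card does not specify the exact syntax, so the sketch below assumes the Qwen-style wrapper, a JSON object with `name` and `arguments` between `<tool_call>` and `</tool_call>` tags. The function name and the sample arguments (`x`, `y`) are hypothetical. Payloads that fail to parse are returned separately, since occasional malformed calls are a known failure mode.

```python
import json
import re

# ASSUMPTION: Qwen-style <tool_call>{...}</tool_call> wrapper; not confirmed by the card.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return (calls, malformed).

    calls is a list of {"name", "arguments"} dicts; malformed is a list of
    raw payload strings that were not valid tool-call JSON.
    """
    calls, malformed = [], []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(payload)
        except json.JSONDecodeError:
            malformed.append(payload)
            continue
        if isinstance(obj, dict) and isinstance(obj.get("name"), str):
            calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
        else:
            malformed.append(payload)
    return calls, malformed


# Hypothetical model output: one valid call and one truncated call.
sample = (
    "<think>Head to the quest giver.</think>\n"
    '<tool_call>\n{"name": "navigate", "arguments": {"x": 10, "y": 20}}\n</tool_call>\n'
    '<tool_call>{"name": "attack", </tool_call>'
)
calls, malformed = extract_tool_calls(sample)
```

Keeping malformed payloads instead of dropping them makes it possible to log or retry them rather than silently skipping a turn.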

Limitations

Trained for one narrow task (the Kaetram Core-3 benchmark) — not a general assistant. Inherits the round's known failure modes (occasional malformed tool-call syntax; the "Rick's Roll" quest stays unsolved across the whole program).

License & credits

Apache-2.0, inheriting Qwen3.5-2B (© 2026 Alibaba Cloud). Game environment and embedded game data (coordinates, NPC/mob/quest names) are from Kaetram-Open (MPL-2.0). See NOTICE. All training data was generated by Qwen self-play — no third-party proprietary model outputs were used.

Citation

bibtex

@misc{kaetram_opd_2b_r1_2026,
  title        = {Kaetram Qwen3.5-2B OPD (Round 1)},
  author       = {patnir41},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/patnir41/kaetram-qwen3.5-2b-opd-r1}}
}

Model provider

patnir41

Model tree

Base

Qwen/Qwen3.5-2B

Fine-tuned

this model

Modalities

Input

Video, Text, Image

Output

Text
