Wenboz

TCOD-v1-OPD-Qwen2.5-3B-ALFWorld

README

License: apache-2.0

Training configuration

| Setting | Value |
| --- | --- |
| Framework | trinity-rft (TCOD), verl FSDP backend |
| Algorithm | on-policy distillation (`multi_turn_opd` advantage, KL coef 1.0) |
| Optimizer LR | 1e-6 |
| Training steps | 250 (`save_interval` 50) |
| batch_size / train_batch_size | 16 / 64, repeat_times 1 |
| Max prompt / response (train) | 2048 / 512 |
| Env steps per episode (train) | 50 |
| Staleness control | `max_staleness=2`, NCCL weight sync every step |
| Sequence parallel / grad clip | ulysses SP 2, grad_clip 1.0, `max_token_len_per_gpu=16384` |
| Rollout | vLLM, prefix caching on, CUDA graph on, 8×H100 |

Usage

This is a standard chat model. Use it inside an ALFWorld agent loop with the GiGPO/verl-agent prompt template: the model replies in the `<think>...</think> <action>...</action>` format, and the prompt carries a history window of the last 2 steps.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the bf16 weights; device_map="auto" requires the accelerate package.
model_id = "Wenboz/TCOD-v1-OPD-Qwen2.5-3B-ALFWorld"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")
```
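
The agent loop described above can be sketched roughly as follows. This is a minimal illustration, not the GiGPO/verl-agent implementation: the prompt wording in `build_prompt`, the helper names `parse_action` and `agent_step`, and the `generate` callable are all hypothetical. Only the `<think>...</think> <action>...</action>` reply format and the 2-step history window come from this card.

```python
import re
from collections import deque
from typing import Callable, Deque, Optional, Tuple

# Matches the first <action>...</action> span in a model reply.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def parse_action(text: str) -> Optional[str]:
    """Return the text inside the first <action> tag, or None if the tag is missing."""
    match = ACTION_RE.search(text)
    return match.group(1).strip() if match else None


def build_prompt(task: str, observation: str, history: Deque[Tuple[str, str]]) -> str:
    """Assemble a prompt from the task, recent steps, and the current observation.

    Hypothetical layout: the real GiGPO/verl-agent template uses its own wording.
    """
    lines = [f"Task: {task}"]
    for past_observation, past_action in history:
        lines.append(f"Observation: {past_observation}")
        lines.append(f"Action: {past_action}")
    lines.append(f"Current observation: {observation}")
    lines.append(
        "Reason inside <think>...</think>, then give one action inside <action>...</action>."
    )
    return "\n".join(lines)


def agent_step(
    generate: Callable[[str], str],
    task: str,
    observation: str,
    history: Deque[Tuple[str, str]],
) -> Optional[str]:
    """Run one step: build the prompt, query the model, parse and record the action."""
    prompt = build_prompt(task, observation, history)
    action = parse_action(generate(prompt))
    if action is not None:
        # With deque(maxlen=2), appending a third step drops the oldest one.
        history.append((observation, action))
    return action


# The history window keeps only the last 2 (observation, action) pairs.
history: Deque[Tuple[str, str]] = deque(maxlen=2)
```

In practice, `generate` would wrap the loaded model, for example by formatting the prompt with `tok.apply_chat_template` and decoding the result of `model.generate`. A reply without an `<action>` tag yields `None`, which the caller can treat as an invalid step.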