🧠 Model Details
- Base Model: ertghiu256/Qwen3.5-2b-ReMix (a fine‑tune of Qwen3.5‑2B)
- Fine‑tuning Method: GRPO with loss_type = "dr_grpo"
- Framework: Unsloth + TRL’s GRPOTrainer
- Max Optimal Length: 1024 tokens
- LoRA Rank: 16 (LoRA 16‑bit, not full fine‑tuning)
⚙️ Training Configuration
Trained with these GRPOConfig goodies:
learning_rate: 4e-6
adam_beta1: 0.9
adam_beta2: 0.99
weight_decay: 0.1
warmup_ratio: 0.1
lr_scheduler_type: cosine
optim: adamw_8bit
logging_steps: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
num_generations: 2
max_prompt_length: 1024
max_completion_length: 1024
max_steps: 100
max_grad_norm: 0.1
importance_sampling_level: sequence
mask_truncated_completions: False
loss_type: dr_grpo
Training ran on Unsloth’s Vision GRPO notebook, with just 100 steps of RL.
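To get a feel for what learning_rate, warmup_ratio, lr_scheduler_type and max_steps mean together, here is a minimal sketch of the resulting learning-rate curve. It assumes linear warmup followed by cosine decay to zero, which is the usual behaviour of the cosine scheduler in Transformers; `lr_at` is a hypothetical helper written for this card, not code from the training notebook, and the real scheduler may round the warmup step count slightly differently.

```python
import math

LEARNING_RATE = 4e-6  # learning_rate
WARMUP_RATIO = 0.1    # warmup_ratio
MAX_STEPS = 100       # max_steps


def lr_at(step: int) -> float:
    """Approximate learning rate after `step` optimizer updates.

    Assumption: linear warmup, then cosine decay to zero.
    """
    warmup_steps = round(MAX_STEPS * WARMUP_RATIO)  # 10 steps with these values
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return LEARNING_RATE * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, MAX_STEPS - warmup_steps)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 5, 10, 55, 100):
        print(step, lr_at(step))
```

Under these assumptions the peak of 4e-6 is reached at step 10 and the rate is back near zero by step 100, so only a small part of the run happens at the full learning rate.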
The model expects this structure:
REASONING_START = "<think>"
REASONING_END = "</think>"
SOLUTION_START = "<answer>"
SOLUTION_END = "</answer>"
Prompt in training (image + text):
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Your question here. Also first provide your reasoning... and then your final answer between <answer> and (put a single float here) </answer>"}
]
}
The model learned to spit out reasoning inside <think> and the final numeric answer inside <answer>. Short & clean. ✨
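If you want to pull the reasoning and the numeric answer out of a generated string, something like the following could work. It is a minimal sketch: `parse_response` and `_last_between` are hypothetical helper names, not part of the model or of Unsloth. It takes the last tagged span, on the assumption that the decoded text may still contain the prompt, which itself mentions the `<answer>` tags.

```python
import re
from typing import Optional, Tuple

REASONING_START = "<think>"
REASONING_END = "</think>"
SOLUTION_START = "<answer>"
SOLUTION_END = "</answer>"


def _last_between(text: str, start: str, end: str) -> Optional[str]:
    """Return the content of the last start...end span, or None if absent."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    matches = re.findall(pattern, text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def parse_response(text: str) -> Tuple[Optional[str], Optional[float]]:
    """Split a response into (reasoning, answer); missing parts become None."""
    reasoning = _last_between(text, REASONING_START, REASONING_END)
    raw_answer = _last_between(text, SOLUTION_START, SOLUTION_END)
    try:
        answer = float(raw_answer) if raw_answer is not None else None
    except ValueError:
        # The model wrote something that is not a single float.
        answer = None
    return reasoning, answer


if __name__ == "__main__":
    sample = "<think>Side is 3, so area is 9.</think><answer>9.0</answer>"
    print(parse_response(sample))
```

Returning None instead of raising keeps the helper usable in a loop over many generations, where some outputs will not follow the format.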
🎮 Recommended Inference Settings
🧘 Default (Balanced)
| Parameter | Value |
|---|---|
| temperature | 0.4 |
| top_k | 30 |
| repeat_penalty | 1.1 |
🧠 Reasoning Mode (More deterministic)
| Parameter | Value |
|---|---|
| temperature | 0.0 – 0.1 |
| top_k | 60 |
| repeat_penalty | 1.2 |
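The parameter names in these tables follow llama.cpp-style naming (repeat_penalty), while Transformers' `generate` calls the same knob `repetition_penalty`. Here is a small sketch of how the presets could be mapped to `generate` keyword arguments. `PRESETS` and `to_generate_kwargs` are hypothetical names, and switching to greedy decoding at temperature 0.0 is an assumption based on Transformers requiring a strictly positive temperature when sampling.

```python
from typing import Any, Dict

PRESETS: Dict[str, Dict[str, float]] = {
    "default": {"temperature": 0.4, "top_k": 30, "repeat_penalty": 1.1},
    # Upper end of the 0.0 – 0.1 range recommended for reasoning mode.
    "reasoning": {"temperature": 0.1, "top_k": 60, "repeat_penalty": 1.2},
}


def to_generate_kwargs(preset: Dict[str, float]) -> Dict[str, Any]:
    """Translate a preset into keyword arguments for model.generate()."""
    kwargs: Dict[str, Any] = {"repetition_penalty": preset["repeat_penalty"]}
    if preset["temperature"] > 0:
        kwargs["do_sample"] = True
        kwargs["temperature"] = preset["temperature"]
        kwargs["top_k"] = int(preset["top_k"])
    else:
        # Temperature 0.0 is treated as greedy decoding; the sampling-only
        # options (temperature, top_k) are left out in that case.
        kwargs["do_sample"] = False
    return kwargs


if __name__ == "__main__":
    print(to_generate_kwargs(PRESETS["default"]))
    print(to_generate_kwargs({"temperature": 0.0, "top_k": 60, "repeat_penalty": 1.2}))
```

The result can be splatted into a call such as `model.generate(**inputs, max_new_tokens=1024, **to_generate_kwargs(PRESETS["reasoning"]))`.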
💻 Usage Example (Unsloth style)
from unsloth import FastVisionModel
from transformers import AutoProcessor
from PIL import Image

# Load the model in 16-bit (no 4-bit quantization).
model, tokenizer = FastVisionModel.from_pretrained(
    "ertghiu256/Qwen3.5-2b-ReMix-Vision-GRPO",
    max_seq_length=1024,
    load_in_4bit=False,
    fast_inference=False,
)
processor = AutoProcessor.from_pretrained("ertghiu256/Qwen3.5-2b-ReMix-Vision-GRPO")

# Placeholder path (an assumption, not part of the original example): use your own image.
image = Image.open("your_image.png").convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            # Passing the PIL image under the "image" key assumes a recent
            # transformers release whose chat template loads images from the message.
            {"type": "image", "image": image},
            {"type": "text", "text": "What's the area? Also first provide your reasoning... and then your final answer between <answer> and (put a single float here) </answer>"}
        ]
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
# Move every input tensor to the model's device.
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.4,
    top_k=30,
    repetition_penalty=1.1,
    do_sample=True,
)
# outputs[0] contains the prompt tokens followed by the generated tokens.
response = processor.decode(outputs[0], skip_special_tokens=False)
print(response)
⚠️ Limitations
- Only 100 training steps – don’t expect much, but it’s okay!
- Batch size 1, generations = 2 – it’s a tiny RL run
- 2B parameters – it can sometimes hallucinate answers
Uploaded finetuned model
- Developed by: ertghiu256
- License: apache-2.0
- Finetuned from model: ertghiu256/Qwen3.5-2b-ReMix
This qwen3_5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.