ertghiu256

Qwen3.5-2b-ReMix-Vision-GRPO

License: apache-2.0

🧠 Model Details

Base Model: ertghiu256/Qwen3.5-2b-ReMix (a fine‑tune of Qwen3.5‑2B)
Fine‑tuning Method: GRPO, loss_type = "dr_grpo"
Framework: Unsloth + TRL’s GRPOTrainer
Max Optimal Length: 1024 tokens
LoRA Rank: 16 (LoRA 16‑bit, not full fine‑tuning)

⚙️ Training Configuration

Trained with these GRPOConfig goodies:

```yaml
learning_rate: 4e-6
adam_beta1: 0.9
adam_beta2: 0.99
weight_decay: 0.1
warmup_ratio: 0.1
lr_scheduler_type: cosine
optim: adamw_8bit
logging_steps: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
num_generations: 2
max_prompt_length: 1024
max_completion_length: 1024
max_steps: 100
max_grad_norm: 0.1
importance_sampling_level: sequence
mask_truncated_completions: False
loss_type: dr_grpo
```
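To make `loss_type: dr_grpo` a bit more concrete, here's a minimal pure-Python sketch of how token-level losses get aggregated. It assumes TRL's documented behaviour, where `dr_grpo` divides the summed loss by a constant (batch size × `max_completion_length`) instead of by each completion's own length. The function names and toy numbers are made up for illustration and are not taken from the training run.

```python
def grpo_loss(per_token_loss, mask):
    """Length-normalised: average each sequence over its own tokens, then average sequences."""
    per_seq = []
    for losses, m in zip(per_token_loss, mask):
        n_tokens = max(sum(m), 1)  # avoid dividing by zero for empty completions
        per_seq.append(sum(l * k for l, k in zip(losses, m)) / n_tokens)
    return sum(per_seq) / len(per_seq)


def dr_grpo_loss(per_token_loss, mask, max_completion_length):
    """Constant-normalised: sum all unmasked token losses, divide by batch * max length."""
    total = sum(
        l * k
        for losses, m in zip(per_token_loss, mask)
        for l, k in zip(losses, m)
    )
    return total / (len(per_token_loss) * max_completion_length)


# Toy batch: one completion of 4 tokens, one of 2 tokens (padded to 4).
per_token_loss = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 0.0, 0.0]]
mask = [[1, 1, 1, 1], [1, 1, 0, 0]]

print(grpo_loss(per_token_loss, mask))                              # 1.5
print(dr_grpo_loss(per_token_loss, mask, max_completion_length=4))  # 1.0
```

With the constant denominator, a short completion no longer gets its per-token loss scaled up just because it is short. In this run the constant would be based on `max_completion_length: 1024`.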

Training ran on Unsloth’s Vision GRPO notebook, with just 100 steps of RL.
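For a feel of what `warmup_ratio: 0.1` and `lr_scheduler_type: cosine` do over 100 steps, here's a rough sketch of the learning rate per step. It assumes linear warmup followed by a half-cosine decay to zero, which is the usual shape of the Hugging Face cosine schedule. `lr_at` is a hypothetical helper, not something from TRL or Unsloth.

```python
import math

PEAK_LR = 4e-6      # learning_rate
MAX_STEPS = 100     # max_steps
WARMUP_STEPS = 10   # warmup_ratio (0.1) x max_steps (100)


def lr_at(step):
    """Learning rate at a given optimizer step (assumed warmup + cosine shape)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 5, 10, 55, 100):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate peaks at 4e-6 at step 10, is back to about half of that by step 55, and reaches zero at step 100, so only a small part of this short run happens near the peak rate.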

📝 Prompt Format

The model expects this structure:

```python
REASONING_START = "<think>"
REASONING_END = "</think>"
SOLUTION_START = "<answer>"
SOLUTION_END = "</answer>"
```

Prompt in training (image + text):

```python
{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Your question here. Also first provide your reasoning... and then your final answer between <answer> and (put a single float here) </answer>"}
    ]
}
```

The model learned to spit out reasoning inside <think> and the final numeric answer inside <answer>. Short & clean. ✨
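If you want to pull the reasoning and the number back out of a response, something like this sketch could work. `parse_response` is a hypothetical helper (not part of the model or Unsloth), and it assumes the answer is a plain float. It takes the last match for each tag pair on purpose: the training prompt itself contains `<answer> ... </answer>`, so a decoded output that still includes the prompt would otherwise match the instructions instead of the model's answer.

```python
import re

REASONING_START, REASONING_END = "<think>", "</think>"
SOLUTION_START, SOLUTION_END = "<answer>", "</answer>"


def _last_between(text, start, end):
    """Return the text inside the last start/end pair, or None if there is none."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    matches = re.findall(pattern, text, re.DOTALL)
    return matches[-1].strip() if matches else None


def parse_response(text):
    """Return (reasoning, answer); answer is a float, or None if missing or not numeric."""
    reasoning = _last_between(text, REASONING_START, REASONING_END)
    raw_answer = _last_between(text, SOLUTION_START, SOLUTION_END)
    try:
        answer = float(raw_answer) if raw_answer is not None else None
    except ValueError:
        answer = None  # the model wrote something that is not a single float
    return reasoning, answer


sample = "<think>The square has side 3, so area = 3 * 3.</think><answer>9.0</answer>"
print(parse_response(sample))
```

Returning `None` instead of raising keeps a batch evaluation loop going when the model skips a tag or writes text where the float should be.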

🎮 Recommended Inference Settings

🧘 Default (Balanced)

| Parameter | Value |
| --- | --- |
| `temperature` | 0.4 |
| `top_k` | 30 |
| `repeat_penalty` | 1.1 |

🧠 Reasoning Mode (More deterministic)

| Parameter | Value |
| --- | --- |
| `temperature` | 0.0 – 0.1 |
| `top_k` | 60 |
| `repeat_penalty` | 1.2 |
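Here's one way these presets could be turned into keyword arguments for `model.generate` in Transformers. Treat it as a sketch: `PRESETS` and `to_generate_kwargs` are hypothetical names, and the reasoning preset picks 0.1 from the suggested 0.0 – 0.1 range. Two assumptions are baked in: `repeat_penalty` (the name used by llama.cpp-style runtimes) corresponds to `repetition_penalty` in Transformers, and a temperature of 0.0 is handled as greedy decoding, since Transformers expects a strictly positive temperature when sampling.

```python
PRESETS = {
    "default": {"temperature": 0.4, "top_k": 30, "repeat_penalty": 1.1},
    "reasoning": {"temperature": 0.1, "top_k": 60, "repeat_penalty": 1.2},
}


def to_generate_kwargs(preset, max_new_tokens=1024):
    """Map a preset from the tables above onto generate()-style keyword arguments."""
    kwargs = {
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": preset["repeat_penalty"],
    }
    if preset["temperature"] > 0:
        kwargs.update(
            do_sample=True,
            temperature=preset["temperature"],
            top_k=preset["top_k"],
        )
    else:
        # Temperature 0.0 means "always pick the top token": use greedy decoding.
        kwargs["do_sample"] = False
    return kwargs


print(to_generate_kwargs(PRESETS["default"]))
print(to_generate_kwargs({"temperature": 0.0, "top_k": 60, "repeat_penalty": 1.2}))
```

The result can be passed along as `model.generate(**inputs, **to_generate_kwargs(PRESETS["reasoning"]))`.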

💻 Usage Example (Unsloth style)

```python
from unsloth import FastVisionModel
from transformers import AutoProcessor
from PIL import Image

model, tokenizer = FastVisionModel.from_pretrained(
    "ertghiu256/Qwen3.5-2b-ReMix-Vision-GRPO",
    max_seq_length=1024,   # 👈 stop from overthinking!
    load_in_4bit=False,
    fast_inference=False,
)
FastVisionModel.for_inference(model)  # switch Unsloth into inference mode

processor = AutoProcessor.from_pretrained("ertghiu256/Qwen3.5-2b-ReMix-Vision-GRPO")

# Placeholder path: point this at the image you want to ask about.
image = Image.open("your_image.png").convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's the area? Also first provide your reasoning... and then your final answer between <answer> and (put a single float here) </answer>"}
        ]
    }
]

# Render the chat template to text; {"type": "image"} becomes an image placeholder.
prompt = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)

# Tokenize the text and preprocess the image together.
inputs = processor(
    text=prompt,
    images=image,
    add_special_tokens=False,
    return_tensors="pt",
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.4,
    top_k=30,
    repetition_penalty=1.1,
    do_sample=True,
)

# Decodes the full sequence, so the prompt is printed along with the response.
response = processor.decode(outputs[0], skip_special_tokens=False)
print(response)
```

⚠️ Limitations

Only 100 training steps – don’t expect much, but it’s okay!
Batch size 1, generations = 2 – it’s a tiny RL run
2B parameters – it can sometimes hallucinate answers
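To see why two generations per prompt makes for a thin RL signal, here's a tiny sketch of group-relative advantages. It assumes the core GRPO idea of scoring each completion against the mean reward of its group and leaves out any standard-deviation scaling. `group_advantages` is a made-up helper and the rewards are toy values.

```python
def group_advantages(rewards):
    """Advantage of each completion relative to the mean reward of its group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# One completion right, one wrong: there is a signal to learn from.
print(group_advantages([1.0, 0.0]))  # [0.5, -0.5]

# Both right (or both wrong): every advantage is zero, so the step teaches nothing.
print(group_advantages([1.0, 1.0]))  # [0.0, 0.0]
```

With a group of two, any prompt where both completions score the same contributes no gradient under this assumption, which is part of why 100 steps at this scale is only a light touch.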

Uploaded finetuned model

Developed by: ertghiu256
License: apache-2.0
Finetuned from model: ertghiu256/Qwen3.5-2b-ReMix

This qwen3_5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.

