License: apache-2.0

🧠 Model Details

  • Base Model: ertghiu256/Qwen3.5-2b-ReMix (a fine‑tune of Qwen3.5‑2B)
  • Fine‑tuning Method: GRPO, loss_type = "dr_grpo"
  • Framework: Unsloth + TRL’s GRPOTrainer
  • Recommended Max Length: 1024 tokens
  • LoRA Rank: 16 (16‑bit LoRA, not full fine‑tuning)

⚙️ Training Configuration

Trained with these GRPOConfig goodies:

yaml

learning_rate: 4e-6
adam_beta1: 0.9
adam_beta2: 0.99
weight_decay: 0.1
warmup_ratio: 0.1
lr_scheduler_type: cosine
optim: adamw_8bit
logging_steps: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
num_generations: 2
max_prompt_length: 1024
max_completion_length: 1024
max_steps: 100
max_grad_norm: 0.1
importance_sampling_level: sequence
mask_truncated_completions: False
loss_type: dr_grpo

Training ran on Unsloth’s Vision GRPO notebook, with just 100 steps of RL.
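To get a feel for what `learning_rate: 4e-6`, `warmup_ratio: 0.1`, `lr_scheduler_type: cosine` and `max_steps: 100` mean together, here is a minimal sketch of the schedule. It assumes the usual linear-warmup plus cosine-decay shape used by Hugging Face schedulers; the helper `lr_at` is hypothetical and is not taken from the training notebook, so treat the per-step values as approximate.

```python
import math

# Values copied from the training configuration above.
LEARNING_RATE = 4e-6
MAX_STEPS = 100
WARMUP_RATIO = 0.1


def lr_at(step: int) -> float:
    """Approximate learning rate at a given optimizer step.

    Assumption: linear warmup from 0 to LEARNING_RATE, then cosine decay to 0.
    """
    warmup_steps = math.ceil(MAX_STEPS * WARMUP_RATIO)
    if step < warmup_steps:
        return LEARNING_RATE * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, MAX_STEPS - warmup_steps)
    return LEARNING_RATE * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


if __name__ == "__main__":
    for step in (0, 5, 10, 55, 100):
        print(step, lr_at(step))
```

Under these assumptions the rate peaks at step 10 and is back near zero by step 100, so the model only ever sees very small updates.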


📝 Prompt Format

The model expects this structure:

python

REASONING_START = "<think>"
REASONING_END = "</think>"
SOLUTION_START = "<answer>"
SOLUTION_END = "</answer>"

Prompt in training (image + text):

python

{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Your question here. Also first provide your reasoning... and then your final answer between <answer> and (put a single float here) </answer>"}
    ]
}
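If you want to build that message programmatically, something like the sketch below should do. `build_user_message` and `INSTRUCTION_SUFFIX` are hypothetical names, and the suffix is copied from this card as-is, including the `...` where the card shortens the instruction, so the exact training wording is not reproduced here.

```python
SOLUTION_START = "<answer>"
SOLUTION_END = "</answer>"

# Copied verbatim from the card; the "..." is the card's own abbreviation.
INSTRUCTION_SUFFIX = (
    "Also first provide your reasoning... and then your final answer "
    f"between {SOLUTION_START} and (put a single float here) {SOLUTION_END}"
)


def build_user_message(question: str) -> dict:
    """Return one user turn with an image placeholder followed by the text prompt."""
    return {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": f"{question} {INSTRUCTION_SUFFIX}"},
        ],
    }


if __name__ == "__main__":
    print(build_user_message("What's the area?"))
```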

The model learned to spit out reasoning inside <think> and the final numeric answer inside <answer>. Short & clean. ✨
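To pull the reasoning and the number back out of a response, a small regex parser is usually enough. This is a sketch, not part of the original card: `parse_response` is a hypothetical helper, and it takes the last matching block on the assumption that the decoded text may also echo the prompt, which itself mentions `<answer>` and `</answer>`.

```python
import re
from typing import Optional, Tuple

REASONING_START, REASONING_END = "<think>", "</think>"
SOLUTION_START, SOLUTION_END = "<answer>", "</answer>"

_THINK_RE = re.compile(
    re.escape(REASONING_START) + r"(.*?)" + re.escape(REASONING_END), re.DOTALL
)
_ANSWER_RE = re.compile(
    re.escape(SOLUTION_START) + r"\s*(.*?)\s*" + re.escape(SOLUTION_END), re.DOTALL
)


def parse_response(text: str) -> Tuple[Optional[str], Optional[float]]:
    """Return (reasoning, answer); either part is None if it is missing or malformed."""
    thoughts = _THINK_RE.findall(text)
    reasoning = thoughts[-1].strip() if thoughts else None

    answers = _ANSWER_RE.findall(text)
    value = None
    if answers:
        try:
            # Use the last block so an echoed prompt does not shadow the real answer.
            value = float(answers[-1])
        except ValueError:
            value = None
    return reasoning, value


if __name__ == "__main__":
    print(parse_response("<think>side is 3, so 3*3</think><answer>9.0</answer>"))
```

Returning `None` instead of raising keeps the caller in charge of what to do when the model skips a tag or writes something that is not a float.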


🎮 Recommended Inference Settings

🧘 Default (Balanced)

| Parameter | Value |
|---|---|
| temperature | 0.4 |
| top_k | 30 |
| repeat_penalty | 1.1 |

🧠 Reasoning Mode (More deterministic)

| Parameter | Value |
|---|---|
| temperature | 0.0 – 0.1 |
| top_k | 60 |
| repeat_penalty | 1.2 |
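Here is one way to turn these presets into keyword arguments for `model.generate`. It is a sketch with a few assumptions: `PRESETS` and `to_generate_kwargs` are hypothetical names, the Reasoning preset uses 0.1 (the upper end of the range), and a temperature of 0 is mapped to greedy decoding because Transformers expects a strictly positive temperature when sampling. Note that `repeat_penalty` is the llama.cpp-style name; Transformers calls it `repetition_penalty`.

```python
from typing import Optional

# Values from the two tables above; 0.1 is an assumed pick from the 0.0 – 0.1 range.
PRESETS = {
    "default": {"temperature": 0.4, "top_k": 30, "repeat_penalty": 1.1},
    "reasoning": {"temperature": 0.1, "top_k": 60, "repeat_penalty": 1.2},
}


def to_generate_kwargs(preset: str, temperature: Optional[float] = None) -> dict:
    """Map a preset name to Transformers-style generation kwargs."""
    cfg = PRESETS[preset]
    temp = cfg["temperature"] if temperature is None else temperature
    if temp <= 0.0:
        # Temperature 0 is treated as greedy decoding: no sampling parameters.
        return {"do_sample": False, "repetition_penalty": cfg["repeat_penalty"]}
    return {
        "do_sample": True,
        "temperature": temp,
        "top_k": cfg["top_k"],
        "repetition_penalty": cfg["repeat_penalty"],
    }


if __name__ == "__main__":
    print(to_generate_kwargs("default"))
    print(to_generate_kwargs("reasoning", temperature=0.0))
```

The result can be splatted into a call such as `model.generate(**inputs, max_new_tokens=1024, **to_generate_kwargs("default"))`.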

💻 Usage Example (Unsloth style)

python

from unsloth import FastVisionModel
from transformers import AutoProcessor
from PIL import Image

MODEL_ID = "ertghiu256/Qwen3.5-2b-ReMix-Vision-GRPO"

model, tokenizer = FastVisionModel.from_pretrained(
    MODEL_ID,
    max_seq_length=1024,  # 👈 stop from overthinking!
    load_in_4bit=False,
    fast_inference=False,
)
FastVisionModel.for_inference(model)  # switch Unsloth to inference mode
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Placeholder path (not from the original card): point this at your own image.
image = Image.open("your_image.png").convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's the area? Also first provide your reasoning... and then your final answer between <answer> and (put a single float here) </answer>"}
        ]
    }
]

# Render the chat template to text, then let the processor pair it with the image.
prompt = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)
inputs = processor(
    images=image,
    text=prompt,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.4,
    top_k=30,
    repetition_penalty=1.1,
    do_sample=True,
)
# The decoded text includes the prompt as well as the model's reply.
response = processor.decode(outputs[0], skip_special_tokens=False)
print(response)

⚠️ Limitations

  • Only 100 training steps – don’t expect much, but it’s okay!
  • Batch size 1, generations = 2 – it’s a tiny RL run
  • 2B parameters – it can sometimes hallucinate answers

Uploaded finetuned model

  • Developed by: ertghiu256
  • License: apache-2.0
  • Finetuned from model: ertghiu256/Qwen3.5-2b-ReMix

This qwen3_5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.

Modalities

  • Input: Video, Text, Image
  • Output: Text
