License: apache-2.0
🧠 Model Details
- Base Model: ertghiu256/Qwen3.5-2b-ReMix (a fine‑tune of Qwen3.5‑2B)
- Fine‑tuning Method: GRPO, loss_type = "dr_grpo"
- Framework: Unsloth + TRL’s GRPOTrainer
- Max Optimal Length: 1024 tokens
- LoRA Rank: 16 (LoRA 16‑bit, not full fine‑tuning)
⚙️ Training Configuration
Trained with these GRPOConfig goodies:
```yaml
learning_rate: 4e-6
adam_beta1: 0.9
adam_beta2: 0.99
weight_decay: 0.1
warmup_ratio: 0.1
lr_scheduler_type: cosine
optim: adamw_8bit
logging_steps: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
num_generations: 2
max_prompt_length: 1024
max_completion_length: 1024
max_steps: 100
max_grad_norm: 0.1
importance_sampling_level: sequence
mask_truncated_completions: False
loss_type: dr_grpo
```
Training ran on Unsloth’s Vision GRPO notebook, with just 100 steps of RL.
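To make the `loss_type: dr_grpo` setting a bit more concrete, here is a minimal sketch contrasting two ways of aggregating per-token losses. It assumes that `dr_grpo` divides the summed token loss by a constant (number of completions × `max_completion_length`) instead of by each completion's own length, which is how this option is usually described for TRL's GRPOTrainer. The function names and toy numbers are made up for illustration and are not taken from the actual training run.

```python
def grpo_style_loss(per_token_loss, mask):
    """Average each completion over its own unmasked tokens, then average completions."""
    per_sequence = []
    for losses, m in zip(per_token_loss, mask):
        n_tokens = max(sum(m), 1)  # avoid dividing by zero on empty completions
        per_sequence.append(sum(l * k for l, k in zip(losses, m)) / n_tokens)
    return sum(per_sequence) / len(per_sequence)


def dr_grpo_style_loss(per_token_loss, mask, max_completion_length):
    """Sum all unmasked token losses and divide by a constant (assumed dr_grpo behaviour)."""
    total = sum(
        l * k
        for losses, m in zip(per_token_loss, mask)
        for l, k in zip(losses, m)
    )
    return total / (len(per_token_loss) * max_completion_length)


# Toy batch (hypothetical values): two completions padded to length 4.
per_token_loss = [[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
mask = [[1, 1, 0, 0], [1, 1, 1, 1]]

grpo_value = grpo_style_loss(per_token_loss, mask)
dr_value = dr_grpo_style_loss(per_token_loss, mask, max_completion_length=4)
print(grpo_value, dr_value)
```

With the constant denominator, the short completion contributes less to the total than the long one, while the per-sequence average treats both equally.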
📝 Prompt Format
The model expects this structure:
```python
REASONING_START = "<think>"
REASONING_END = "</think>"
SOLUTION_START = "<answer>"
SOLUTION_END = "</answer>"
```
Prompt in training (image + text):
```python
{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Your question here. Also first provide your reasoning... and then your final answer between <answer> and (put a single float here) </answer>"}
    ]
}
```
The model learned to spit out reasoning inside <think> and the final numeric answer inside <answer>. Short & clean. ✨
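Since the answer is wrapped in tags, it can be pulled out with a small regex. This is a minimal sketch, not part of the original training code: `extract_answer` is a hypothetical helper name, and it assumes the model sticks to the `<answer>` format described above.

```python
import re

SOLUTION_START = "<answer>"
SOLUTION_END = "</answer>"


def extract_answer(response):
    """Return the float inside the last <answer>...</answer> pair, or None."""
    pattern = re.escape(SOLUTION_START) + r"\s*(.*?)\s*" + re.escape(SOLUTION_END)
    matches = re.findall(pattern, response, flags=re.DOTALL)
    if not matches:
        return None
    try:
        # Use the last match: the decoded text may also contain the prompt,
        # which itself mentions <answer> ... </answer>.
        return float(matches[-1])
    except ValueError:
        return None


example = "<think>4 * 3 = 12</think><answer> 12.0 </answer>"
print(extract_answer(example))
```

Taking the last match matters when the full sequence is decoded, because the instruction text in the prompt contains its own `<answer>` pair with no number inside.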
🎮 Recommended Inference Settings
🧘 Default (Balanced)
| Parameter | Value |
|---|---|
| temperature | 0.4 |
| top_k | 30 |
| repeat_penalty | 1.1 |
🧠 Reasoning Mode (More deterministic)
| Parameter | Value |
|---|---|
| temperature | 0.0 – 0.1 |
| top_k | 60 |
| repeat_penalty | 1.2 |
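The two presets can be turned into keyword arguments for `model.generate`. The sketch below is one possible mapping, under a few assumptions: `PRESETS` and `to_generate_kwargs` are hypothetical names, the reasoning preset uses 0.1 (the upper end of the range above), `repeat_penalty` is passed as Transformers' `repetition_penalty`, and a temperature of 0.0 is handled by switching to greedy decoding, since Transformers appears to require a strictly positive temperature when sampling.

```python
PRESETS = {
    "default": {"temperature": 0.4, "top_k": 30, "repeat_penalty": 1.1},
    "reasoning": {"temperature": 0.1, "top_k": 60, "repeat_penalty": 1.2},
}


def to_generate_kwargs(preset_name, temperature=None, max_new_tokens=1024):
    """Build generate() keyword arguments from one of the presets above."""
    preset = PRESETS[preset_name]
    temp = preset["temperature"] if temperature is None else temperature
    kwargs = {
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": preset["repeat_penalty"],
    }
    if temp <= 0.0:
        # Temperature 0 means "always take the top token": use greedy decoding.
        kwargs["do_sample"] = False
    else:
        kwargs.update(do_sample=True, temperature=temp, top_k=preset["top_k"])
    return kwargs


print(to_generate_kwargs("default"))
print(to_generate_kwargs("reasoning", temperature=0.0))
```

The result can be splatted into the call, for example `model.generate(**inputs, **to_generate_kwargs("reasoning"))`.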
💻 Usage Example (Unsloth style)
```python
from unsloth import FastVisionModel
from transformers import AutoProcessor

MODEL_ID = "ertghiu256/Qwen3.5-2b-ReMix-Vision-GRPO"

model, tokenizer = FastVisionModel.from_pretrained(
    MODEL_ID,
    max_seq_length=1024,  # 👈 stop from overthinking!
    load_in_4bit=False,
    fast_inference=False,
)
FastVisionModel.for_inference(model)  # switch Unsloth to inference mode
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    {
        "role": "user",
        "content": [
            # Placeholder path: point this at your own image file.
            {"type": "image", "path": "path/to/image.png"},
            {"type": "text", "text": "What's the area? Also first provide your reasoning... and then your final answer between <answer> and (put a single float here) </answer>"},
        ],
    }
]

# Apply the chat template and tokenize text + image in one step.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.4,
    top_k=30,
    repetition_penalty=1.1,
    do_sample=True,
)

# Keep special tokens so the <think> / <answer> tags stay visible.
response = processor.decode(outputs[0], skip_special_tokens=False)
print(response)
```
⚠️ Limitations
- Only 100 training steps: don’t expect much, but it’s okay!
- Batch size 1, generations = 2: it’s a tiny RL run.
- 2B parameters: it can sometimes hallucinate answers.
Uploaded finetuned model
- Developed by: ertghiu256
- License: apache-2.0
- Finetuned from model: ertghiu256/Qwen3.5-2b-ReMix
This qwen3_5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
Modalities
- Input: Video, Text, Image
- Output: Text