License: apache-2.0
Model Description
Traditional reinforcement learning alignment (like standard GRPO) backpropagates formatting and correctness gradients across all layers of a language model. In smaller models (1.5B to 3B parameters), this triggers Central Engine Disruption: the destructive corruption of core mathematical and logical representations in the early and middle layers (L0-L23).
LF-GRPO addresses this by freezing the model's central logic core (L0-L23) and confining parameter updates to the late-layer behavioral periphery (L24-L27). This lets the model learn reasoning layout boundaries (such as step-by-step <think> tag monologues) without corrupting its underlying arithmetic capability.
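The split between the frozen core and the trainable periphery can be illustrated with a small name-based filter. This is a minimal sketch, not the author's training code: it assumes parameter names follow the Hugging Face Qwen2 convention (for example `model.layers.24.self_attn.q_proj.weight`), and the helper names `is_trainable` and `freeze_core` are hypothetical.

```python
import re

# Assumption: parameter names contain ".layers.<index>." as in Hugging Face Qwen2 models.
LAYER_PATTERN = re.compile(r"\.layers\.(\d+)\.")
FIRST_TRAINABLE_LAYER = 24  # L0-L23 stay frozen, L24-L27 may receive updates


def is_trainable(param_name: str) -> bool:
    """Return True only for parameters that belong to layer 24 or later."""
    match = LAYER_PATTERN.search(param_name)
    if match is None:
        # Embeddings, final norm, lm_head: outside the decoder layers, keep frozen.
        return False
    return int(match.group(1)) >= FIRST_TRAINABLE_LAYER


def freeze_core(named_parameters) -> int:
    """Disable gradients outside the periphery; return the number of tensors frozen.

    Only ever turns gradients off, so parameters that were already frozen
    (for example base weights under a LoRA adapter) are left untouched.
    """
    frozen = 0
    for name, param in named_parameters:
        if not is_trainable(name):
            param.requires_grad = False
            frozen += 1
    return frozen
```

With a PyTorch model this would be called as `freeze_core(model.named_parameters())`; any object exposing a `requires_grad` attribute works the same way.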
Functional Behaviors:
- Structured Thinking: The model breaks down word problems step-by-step using numbered reasoning steps.
- Conciseness Penalization: Through step-decay relative rewards, the model maintains a short, high-density reasoning path, preventing verbosity drift.
- Intact Core Arithmetic: Avoids the standard post-alignment reasoning decay, preserving raw calculation precision.
How to Get Started with the Model
You can load this adapter on top of the base Qwen/Qwen2.5-1.5B-Instruct model using peft and transformers.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load the LF-GRPO adapter
model = PeftModel.from_pretrained(base_model, "kridaydave/Qwen-1.5B-LFGRPO-OPTIM")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model.eval()

# Prompt format (Zero-Shot CoT with system guidance)
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The Assistant must think step-by-step "
    "inside <think>...</think> tags to solve the mathematical problem, and then provide "
    "the final numeric answer outside the tags."
)

prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\nJanet has 16 eggs. She eats 3 for breakfast and bakes muffins with 4. She sells the rest for $2 each. How much does she make?<|im_end|>\n<|im_start|>assistant\n<think>\n"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=300, do_sample=False)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:]))
```
Training Details
Training Data
The model was trained on a 1,000-sample subset of the OpenAI GSM8K dataset, optimized specifically for step-by-step math logic.
Training Procedure
- Regime: Two-stage optimization. Stage 1 (steps 0-100) focuses on format-priming and monologue tag alignment. Stage 2 (steps 101-300) optimizes for final math correctness and conciseness.
- Group Relative Search: Group size (N=4) is used to compute advantages relative to the group mean and standard deviation, bypassing the memory-heavy critic model.
- Autograd Periphery Insulation: Hard gradient masking applied at layer 24. 100% of parameters in layers 0-23 were kept frozen.
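The group-relative advantage from the procedure above can be sketched in a few lines. This is an illustrative sketch rather than the training code: the use of the population standard deviation and the `eps` stabilizer are assumptions, since the card only states that advantages are computed relative to the group mean and standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (here N=4 completions per prompt).

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every completion in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two correct (reward 1.0), two incorrect (reward 0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the baseline is the group mean, no separate critic network is needed: completions that beat their siblings get a positive advantage, and a group with identical rewards contributes no gradient signal.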
Training Hyperparameters
- LoRA Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- LoRA Rank / Alpha: 32 / 32
- Targeted Layers: [24, 25, 26, 27]
- Trainable Parameters: 5,275,648 (0.34% of base model)
- Optimizer: paged_adamw_8bit (with CUDA page offloading)
- Learning Rate: 1.5e-5
- Batch Configuration: Batch=1, Accumulation=4 (effective batch size = 4)
- Sequence Limits: Prompt=512, Completion=384
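The trainable parameter count can be cross-checked from the LoRA settings above. The sketch below assumes the projection shapes of the public Qwen2.5-1.5B configuration (hidden size 1536, intermediate size 8960, 12 attention heads, 2 key/value heads), which are not stated in this card, and counts one rank-32 A/B pair per target module with no bias terms.

```python
# Assumed Qwen2.5-1.5B dimensions (from the public model config, not from this card).
HIDDEN = 1536
INTERMEDIATE = 8960
NUM_HEADS = 12
NUM_KV_HEADS = 2
HEAD_DIM = HIDDEN // NUM_HEADS     # 128
KV_DIM = NUM_KV_HEADS * HEAD_DIM   # 256 (grouped-query attention)

RANK = 32
TARGET_LAYERS = [24, 25, 26, 27]

# (in_features, out_features) for each LoRA target module in one decoder layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is (rank x in_features), B is (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * len(TARGET_LAYERS)
print(per_layer, total)  # 1318912 5275648
```

Under these assumptions the total matches the 5,275,648 trainable parameters reported above.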
Evaluation Results
Evaluated on the OpenAI GSM8K test split (held-out prompts) under a zero-shot ChatML reasoning format:
- Qwen2.5-1.5B-Instruct (base model baseline): ~42.0% - 50.0%
- Standard GRPO (Full-Layer LoRA): ~42.0% (degraded due to alignment tax / engine disruption)
- LF-GRPO (This Work - Step 100): ~50.0%
- LF-GRPO (This Work - Step 200/300): ~58.0% - 65.0% OOD accuracy (highly structured, concise CoT)
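Accuracy figures like these depend on how the final answer is parsed from each completion. The sketch below shows one plausible scoring rule, assuming the answer is the last number that appears after the closing </think> tag. It is not the author's evaluation script, and the names `extract_answer` and `accuracy` are hypothetical.

```python
import re

# Matches integers and decimals, allowing thousands separators such as "1,000".
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_answer(completion: str):
    """Return the last number after </think> as a float, or None if there is none."""
    tail = completion.rsplit("</think>", 1)[-1]
    numbers = NUMBER.findall(tail)
    if not numbers:
        return None
    return float(numbers[-1].replace(",", ""))


def accuracy(completions, references) -> float:
    """Fraction of completions whose extracted answer equals the reference value."""
    correct = 0
    for completion, reference in zip(completions, references):
        predicted = extract_answer(completion)
        if predicted is not None and abs(predicted - reference) < 1e-6:
            correct += 1
    return correct / len(references)


sample = "16 - 3 - 4 = 9 eggs left. 9 * 2 = 18.\n</think>\nShe makes $18."
print(extract_answer(sample))  # 18.0
```

Stricter or looser parsing (for example, requiring a specific answer prefix) would shift the reported percentages, so the extraction rule is worth stating alongside any comparison.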
Environmental Impact
- Hardware Type: 1 x Tesla T4 GPU (16GB VRAM)
- Hours used: ~2.0 hours
- Cloud Provider: Google Colab
- Compute Region: us-central1
Technical Specifications
Model Architecture
The underlying architecture is Qwen2.5 (RoPE embeddings, SwiGLU gating, and RMSNorm layers) with 28 transformer layers.
Software
- TRL (Transformer Reinforcement Learning)
- Unsloth (Fast language model training & Triton kernels)
- vLLM (Fast CUDA graph decoders for advantage rollouts)
Model provider
kridaydave
Model tree
Base
Qwen/Qwen2.5-1.5B-Instruct
Adapter
this model
Modalities
Input
Text
Output
Text