kridaydave

Qwen-1.5B-LFGRPO-OPTIM

Deploy Dedicated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

Model Description

Traditional reinforcement learning alignment (like standard GRPO) backpropagates formatting and correctness gradients across all layers of a language model. In smaller models (1.5B to 3B parameters), this triggers Central Engine Disruption—the destructive corruption of core mathematical and logical representations in early and middle layers ( $L 0$ -- $L 23$ ).

LF-GRPO solves this by strictly freezing the model's central logic core ( $L 0$ -- $L 23$ ) and confining parameter updates to the late-layer behavioral periphery ( $L 24$ -- $L 27$ ). This allows the model to learn complex reasoning layout boundaries (such as step-by-step <think> tag monologues) without corrupting its underlying arithmetic capability.

Functional Behaviors:

Structured Thinking: The model breaks down word problems step-by-step using logical numbering arrays.
Conciseness Penalization: Through step-decay relative rewards, the model maintains a short, high-density reasoning path, preventing verbosity drift.
Intact Core Arithmetic: Avoids the standard post-alignment reasoning decay, preserving raw calculation precision.

How to Get Started with the Model

You can load this adapter on top of the base Qwen-1.5B model using peft and transformers.

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the LF-GRPO adapter
model = PeftModel.from_pretrained(base_model, "kridaydave/Qwen-1.5B-LFGRPO-OPTIM")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

model.eval()

# Prompt format (Zero-Shot CoT with system guidance)
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The Assistant must think step-by-step "
    "inside <think>...</think> tags to solve the mathematical problem, and then provide "
    "the final numeric answer outside the tags."
)

prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\nJanet has 16 eggs. She eats 3 for breakfast and bakes muffins with 4. She sells the rest for $2 each. How much does she make?<|im_end|>\n<|im_start|>assistant\n<think>\n"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=300, do_sample=False)

print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:]))

Training Details

Training Data

The model was trained on a 1,000-sample subset of the OpenAI GSM8K dataset, optimized specifically for step-by-step math logic.

Training Procedure

Regime: Two-stage optimization. Stage 1 (steps 0-100) focuses on format-priming and monologue tag alignment. Stage 2 (steps 101-300) optimizes for final math correctness and conciseness.
Group Relative Search: Group size ( $N = 4$ ) is used to compute advantages relative to the group mean and standard deviation, bypassing the memory-heavy critic model.
Autograd Periphery Insulation: Hard gradient masking applied at layer 24. 100% of parameters in layers 0-23 were kept frozen.

Training Hyperparameters

LoRA Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
LoRA Rank / Alpha: 32 / 32
Targeted Layers: [24, 25, 26, 27]
Trainable Parameters: 5,275,648 (0.34% of base model)
Optimizer: paged_adamw_8bit (with CUDA page offloading)
Learning Rate: 1.5e-5
Batch Configuration: Batch=1, Accumulation=4 (effective batch size = 4)

Evaluation Results

Evaluated on the OpenAI GSM8K test split (held-out prompts) under a zero-shot ChatML reasoning format:

Qwen2.5-1.5B-Instruct (Base Baseline): ~42.0% - 50.0%
Standard GRPO (Full-Layer LoRA): ~42.0% (degraded due to alignment tax / engine disruption)
LF-GRPO (This Work - Step 100): ~50.0%
LF-GRPO (This Work - Step 200/300): ~58.0% - 65.0% OOD accuracy (highly structured, concise CoT)

Environmental Impact

Hardware Type: 1 x Tesla T4 GPU (16GB VRAM)
Hours used: ~2.0 hours
Cloud Provider: Google Colab
Compute Region: us-central1

Technical Specifications

Model Architecture

The underlying architecture is based on Qwen2.5 (RoPE embeddings, SwiGLU gating, and RMSNorm layers) using a 28-layer parameter layout.

Software

TRL (Transformer Reinforcement Learning)
Unsloth (Fast language model training & Triton kernels)
vLLM (Fast CUDA graph decoders for advantage rollouts)

Model provider

kridaydave

Model tree

Base

Qwen/Qwen2.5-1.5B-Instruct

Adapter

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

Model Description

Functional Behaviors:

Structured Thinking: The model breaks down word problems step-by-step using logical numbering arrays.
Conciseness Penalization: Through step-decay relative rewards, the model maintains a short, high-density reasoning path, preventing verbosity drift.
Intact Core Arithmetic: Avoids the standard post-alignment reasoning decay, preserving raw calculation precision.

How to Get Started with the Model

You can load this adapter on top of the base Qwen-1.5B model using peft and transformers.

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the LF-GRPO adapter
model = PeftModel.from_pretrained(base_model, "kridaydave/Qwen-1.5B-LFGRPO-OPTIM")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

model.eval()

# Prompt format (Zero-Shot CoT with system guidance)
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The Assistant must think step-by-step "
    "inside <think>...</think> tags to solve the mathematical problem, and then provide "
    "the final numeric answer outside the tags."
)

prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\nJanet has 16 eggs. She eats 3 for breakfast and bakes muffins with 4. She sells the rest for $2 each. How much does she make?<|im_end|>\n<|im_start|>assistant\n<think>\n"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=300, do_sample=False)

print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:]))

Training Details

Training Data

The model was trained on a 1,000-sample subset of the OpenAI GSM8K dataset, optimized specifically for step-by-step math logic.

Training Procedure

Regime: Two-stage optimization. Stage 1 (steps 0-100) focuses on format-priming and monologue tag alignment. Stage 2 (steps 101-300) optimizes for final math correctness and conciseness.
Group Relative Search: Group size ( $N = 4$ ) is used to compute advantages relative to the group mean and standard deviation, bypassing the memory-heavy critic model.
Autograd Periphery Insulation: Hard gradient masking applied at layer 24. 100% of parameters in layers 0-23 were kept frozen.

Training Hyperparameters

LoRA Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
LoRA Rank / Alpha: 32 / 32
Targeted Layers: [24, 25, 26, 27]
Trainable Parameters: 5,275,648 (0.34% of base model)
Optimizer: paged_adamw_8bit (with CUDA page offloading)
Learning Rate: 1.5e-5
Batch Configuration: Batch=1, Accumulation=4 (effective batch size = 4)

Evaluation Results

Evaluated on the OpenAI GSM8K test split (held-out prompts) under a zero-shot ChatML reasoning format:

Qwen2.5-1.5B-Instruct (Base Baseline): ~42.0% - 50.0%
Standard GRPO (Full-Layer LoRA): ~42.0% (degraded due to alignment tax / engine disruption)
LF-GRPO (This Work - Step 100): ~50.0%
LF-GRPO (This Work - Step 200/300): ~58.0% - 65.0% OOD accuracy (highly structured, concise CoT)

Environmental Impact

Hardware Type: 1 x Tesla T4 GPU (16GB VRAM)
Hours used: ~2.0 hours
Cloud Provider: Google Colab
Compute Region: us-central1

Technical Specifications

Model Architecture

The underlying architecture is based on Qwen2.5 (RoPE embeddings, SwiGLU gating, and RMSNorm layers) using a 28-layer parameter layout.

Software

TRL (Transformer Reinforcement Learning)
Unsloth (Fast language model training & Triton kernels)
vLLM (Fast CUDA graph decoders for advantage rollouts)

Qwen-1.5B-LFGRPO-OPTIM

Get help setting up a custom Dedicated Endpoints.

README

Model Description

Functional Behaviors:

How to Get Started with the Model

Training Details

Training Data

Training Procedure

Training Hyperparameters

Evaluation Results

Environmental Impact

Technical Specifications

Model Architecture

Software

Explore FriendliAI today

README

Model Description

Functional Behaviors:

How to Get Started with the Model

Training Details

Training Data

Training Procedure

Training Hyperparameters

Evaluation Results

Environmental Impact

Technical Specifications

Model Architecture

Software