qwen3-1.7b-gsm8k-grpo API & Inference Endpoint

Training Details

Base Model: Qwen/Qwen3-1.7B
GPU: NVIDIA L4 (22GB VRAM)
Training Stages: SFT followed by GRPO
LoRA Rank (r): 32
LoRA Alpha: 64
Target Modules: all-linear
Quantization: 4-bit NF4 (BitsAndBytes)
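
As a small worked example of what the rank and alpha values above imply: in standard LoRA the low-rank update is multiplied by alpha / r before being added to the frozen weights. The sketch below assumes plain LoRA scaling was used (the card does not say otherwise); the variable names are illustrative only.

```python
# Hyperparameters copied from the Training Details list above.
lora_rank = 32    # LoRA Rank (r)
lora_alpha = 64   # LoRA Alpha

# Standard LoRA scales the low-rank update (B @ A) by alpha / r.
# Assumption: plain LoRA scaling. Rank-stabilized LoRA would instead
# use alpha / sqrt(r), which the card does not mention.
lora_scaling = lora_alpha / lora_rank
print(f"LoRA scaling factor: {lora_scaling}")  # prints: LoRA scaling factor: 2.0
```

In PEFT terms these settings correspond to the `r`, `lora_alpha`, and `target_modules` fields of `LoraConfig`.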

How to run

You can use this model by loading the base model and applying this PEFT adapter. Here is a standalone code snippet:

Caveats

!pip install --upgrade torchao
!pip install -U "bitsandbytes>=0.46.1"

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_model_id = "Qwen/Qwen3-1.7B"
adapter_id = "ehzawad/qwen3-1.7b-gsm8k-grpo"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(adapter_id)
hf_model = PeftModel.from_pretrained(base_model, adapter_id, is_trainable=False)
hf_model.eval()

prompt = "Janet has 3 bags with 4 apples each. She gives away 5 apples. How many remain?"
system_prompt = (
    "You are a careful math reasoning assistant. "
    "Solve the problem step by step, but keep the solution concise. "
    "End with exactly one final answer in the form \\boxed{answer}."
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(hf_model.device)

with torch.inference_mode():
    outputs = hf_model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.6,
        do_sample=True,
    )

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
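
Because the system prompt asks the model to end with a single `\boxed{answer}`, it can be handy to pull that value out of the generated text. The helper below is a minimal sketch, not part of the original card: `extract_boxed_answer` is a hypothetical name, and it assumes the answer of interest is the last `\boxed{...}` in the response.

```python
from typing import Optional


def extract_boxed_answer(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = r"\boxed{"
    start = text.rfind(marker)  # use the last occurrence as the final answer
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: return what was collected.
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced, e.g. output was truncated


# Illustrative string only, not real model output.
sample = r"3 * 4 = 12 apples, 12 - 5 = 7. \boxed{7}"
print(extract_boxed_answer(sample))
```

With the snippet above, `extract_boxed_answer(response)` would return the final answer as a string, or `None` if generation stopped before the box was closed (for example when `max_new_tokens` is reached).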

