xbruce22

gemma-4-e2b-reasoning-lora

README

License: apache-2.0

What's in this repo

| File | Why |
| --- | --- |
| adapter_model.safetensors | The trained LoRA weights (12.08M params, ~46 MB) |
| adapter_config.json | LoRA config (r=8, alpha=8, target modules) |
| tokenizer.json, tokenizer_config.json, chat_template.jinja | Gemma4 tokenizer + chat template |
| chat.py | Ready-to-run interactive chat script (streaming) |
| README.md | This file |

This is a LoRA adapter only, not a standalone model. You load the base model (unsloth/gemma-4-E2B-it) and apply this adapter on top — see below.

Quick start (chat)

bash

pip install torch transformers peft
python chat.py

chat.py auto-detects CUDA / Intel XPU / CPU, loads the base model, applies this adapter, merges it, and starts a streaming chat with thinking ON. In-chat commands: /q quit · /reset clear history · /raw show special-token markers · /think toggle thinking.
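As a rough illustration of the in-chat commands listed above, here is a minimal sketch of how such slash commands could be dispatched. This is not the actual chat.py: the `ChatState` class and `handle_command` function are hypothetical names, and treating /raw and /think as on/off toggles is an assumption based on the one-line descriptions.

```python
from dataclasses import dataclass, field


@dataclass
class ChatState:
    """Hypothetical container for the state the commands act on."""
    history: list = field(default_factory=list)
    show_raw: bool = False   # assumed default: special-token markers hidden
    thinking: bool = True    # the chat starts with thinking ON
    running: bool = True


def handle_command(line: str, state: ChatState) -> bool:
    """Apply a slash command to `state`.

    Returns True if `line` was a command, False if it is a normal message
    that should be sent to the model.
    """
    cmd = line.strip()
    if cmd == "/q":
        state.running = False                # quit
    elif cmd == "/reset":
        state.history.clear()                # clear history
    elif cmd == "/raw":
        state.show_raw = not state.show_raw  # assumed to be a toggle
    elif cmd == "/think":
        state.thinking = not state.thinking  # toggle thinking
    else:
        return False
    return True
```

In a chat loop, anything for which `handle_command` returns False would be appended to `state.history` as a user turn and passed on to generation.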

How to use the LoRA adapter (code)

python

import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel

BASE = "unsloth/gemma-4-E2B-it"
ADAPTER = "xbruce22/gemma-4-e2b-reasoning-lora"

# Pick the best available device: CUDA, then Intel XPU, then CPU
device = "cuda" if torch.cuda.is_available() else (
    "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu")
dtype = torch.float32 if device == "cpu" else torch.bfloat16

base = AutoModelForCausalLM.from_pretrained(BASE, dtype=dtype).to(device)
model = PeftModel.from_pretrained(base, ADAPTER)
# Optional: merge LoRA into the weights for faster inference
model = model.merge_and_unload()
model.eval()

processor = AutoProcessor.from_pretrained(BASE)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write DFS in python, keep short."},
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
inputs = processor(text=[text], return_tensors="pt").to(device)

# Text-only: drop multimodal-only fields generate() rejects
for k in list(inputs):
    if "token_type" in k or "pixel" in k or "audio" in k:
        inputs.pop(k)

with torch.inference_mode():
    out = model.generate(
        **inputs, max_new_tokens=1024, do_sample=True,
        temperature=1.0, top_p=0.95, top_k=64,
        pad_token_id=processor.tokenizer.pad_token_id)

# Keep only the newly generated tokens (strip the prompt)
gen = out[0][inputs["input_ids"].shape[1]:]
print(processor.decode(gen, skip_special_tokens=True))

Notes:

  • Pass enable_thinking=True to apply_chat_template so the template injects <|think|> and the model produces the <|channel>thought ... <channel|> reasoning block before the answer.
  • Recommended Gemma-4 sampling: temperature=1.0, top_p=0.95, top_k=64.
  • If you don't merge_and_unload(), keep using the PeftModel directly — both work.
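To make the recommended sampling settings more concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p filtering are commonly combined before a token is drawn. The function name `filter_probs` is hypothetical, and the order shown (temperature, then top-k, then top-p) is an assumption about typical implementations, not something this repo specifies. `generate()` performs the equivalent steps internally on tensors.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, and top-p.

    Illustrative only: operates on a plain list of logits for one step.
    """
    # 1. Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Top-k: keep the k most probable tokens and renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]
    k_mass = sum(probs[i] for i in ranked)
    k_probs = {i: probs[i] / k_mass for i in ranked}

    # 3. Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += k_probs[i]
        if cumulative >= top_p:
            break

    # 4. Renormalize the survivors so they form a distribution to sample from.
    p_mass = sum(k_probs[i] for i in kept)
    return {i: k_probs[i] / p_mass for i in kept}
```

With temperature=1.0 the logits are left unscaled, so the settings above shape the distribution only through the top-k and top-p cutoffs.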

Expected output style

Prompt: Write DFS in python, keep short.

markdown

── thinking ──
- User wants a DFS implementation in Python, explicitly requesting it be "short"
- Settled on iterative version using a stack and visited set ...
- Concise version: no classes, just a function — keeps it short while remaining correct
── answer ──
def dfs(graph, start, visited=None):
...

The reasoning is now terse, bulleted, and scannable — the style it was fine-tuned to produce.
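To get the two-part layout shown above from raw model output, the reasoning block has to be separated from the answer. The following is a minimal sketch, assuming the text was decoded with special tokens kept and that the reasoning is wrapped in the `<|channel>thought ... <channel|>` markers described in the notes. The helper names `split_thinking` and `render` are hypothetical, and any trailing turn markers are not handled here.

```python
import re

# Marker strings taken from the notes in this README; treat them as an
# assumption and adjust if the chat template uses different tokens.
THINK_RE = re.compile(r"<\|channel>thought\n?(.*?)<channel\|>", re.DOTALL)


def split_thinking(raw: str) -> tuple[str, str]:
    """Split raw decoded text into (thinking, answer).

    If no reasoning block is found, thinking is empty and the whole
    text is returned as the answer.
    """
    match = THINK_RE.search(raw)
    if match is None:
        return "", raw.strip()
    thinking = match.group(1).strip()
    answer = raw[match.end():].strip()
    return thinking, answer


def render(raw: str) -> str:
    """Format raw output in the thinking/answer layout shown above."""
    thinking, answer = split_thinking(raw)
    return f"── thinking ──\n{thinking}\n── answer ──\n{answer}"
```

To use it with the code in the previous section, decode with `skip_special_tokens=False` so the markers survive, then pass the string to `render`.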

Training details

  • Method: LoRA (r=8, alpha=8, dropout=0) on the text language model's attention (q/k/v/o_proj) + MLP (gate/up/down_proj) modules. Vision and audio towers frozen (text-only finetune).
  • Trainable params: 12,079,104 (0.236% of 5.1B).
  • Data: 25,614 reasoning rows from Jackrong/GLM-5.1-Reasoning-1M-Cleaned (main subset). The verbose imd…answer thinking traces were condensed into terse flat bullet lists (via a condenser prompt); the original final answers were kept verbatim.
  • Training format: Gemma4 chat format with thinking ON — <|channel>thought\n...bullets...\n<channel|> then the final answer, <|turn> turn markers, assistant-only loss (user/system tokens masked to -100).
  • Hardware: Intel XPU (Intel Graphics 0xe211, 24 GB), bf16, adamw_torch, gradient checkpointing. No 4-bit / bitsandbytes (no XPU build).
  • Schedule: 1 full epoch, 6400 steps, per-device batch 1 × gradient accumulation 4, lr 2e-4 linear, 5 warmup steps, max_seq_length 1536. ~5.7 h.
  • Final train_loss: 0.795 (loss moving average 1.22 → 0.76, token accuracy 0.74 → 0.79, no OOM).
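The assistant-only loss mentioned in the training format can be illustrated with a short sketch. This is not the training script used for this adapter: the helper name `mask_non_assistant` and the boolean `assistant_mask` input are hypothetical, and how the mask is derived from the turn markers is left out. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions set to it contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_non_assistant(input_ids, assistant_mask):
    """Build labels that train only on assistant tokens.

    input_ids:      token ids for the full conversation
    assistant_mask: same length; True where the token belongs to an
                    assistant turn (thinking block and final answer)
    """
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have equal length")
    return [
        token if is_assistant else IGNORE_INDEX
        for token, is_assistant in zip(input_ids, assistant_mask)
    ]
```

For example, a sequence whose first two tokens come from the user and last two from the assistant yields labels of -100, -100, followed by the two assistant token ids.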

License

Apache-2.0 (adapter weights). The base model unsloth/gemma-4-E2B-it follows Gemma's terms.

