xbruce22
gemma-4-e2b-reasoning-lora
License: apache-2.0
What's in this repo
| File | Why |
|---|---|
| `adapter_model.safetensors` | The trained LoRA weights (12.08M params, ~46 MB) |
| `adapter_config.json` | LoRA config (r=8, alpha=8, target modules) |
| `tokenizer.json`, `tokenizer_config.json`, `chat_template.jinja` | Gemma4 tokenizer + chat template |
| `chat.py` | Ready-to-run interactive chat script (streaming) |
| `README.md` | This file |
This is a LoRA adapter only, not a standalone model. You load the base model (`unsloth/gemma-4-E2B-it`) and apply this adapter on top; see below.
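As a rough illustration of what applying an adapter "on top" of a base model means, and why merging it later does not change the output, here is a minimal pure-Python sketch. It assumes the standard LoRA formulation, where the effective weight is W + (alpha / r) * B A. The tiny matrices and the helper names (`matmul`, `add`, `lora_forward`, `merge`) are made up for illustration and are not taken from this adapter, which uses r=8 and alpha=8 (a scale of 1.0).

```python
# Toy LoRA illustration; matrix sizes and values are made up, not taken from the adapter.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b, scale=1.0):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_forward(W, A, B, x, alpha, r):
    """Unmerged path: base output plus the scaled low-rank update."""
    base = matmul(W, x)
    delta = matmul(B, matmul(A, x))
    return add(base, delta, alpha / r)

def merge(W, A, B, alpha, r):
    """Merged path: fold the low-rank update into the base weight once."""
    return add(W, matmul(B, A), alpha / r)

# 2x2 base weight with a rank-1 adapter (r=1 only to keep the toy small).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]   # shape (r, in_features)
B = [[2.0], [1.0]]  # shape (out_features, r)
x = [[3.0], [1.0]]  # input as a column vector

unmerged = lora_forward(W, A, B, x, alpha=1, r=1)
merged = matmul(merge(W, A, B, alpha=1, r=1), x)
print(unmerged, merged)
```

Both paths give the same result in this toy case. The merged form needs one matrix product per layer instead of three, which is the usual reason to merge before inference.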
Quick start (chat)
```bash
pip install torch transformers peft
python chat.py
```
chat.py auto-detects CUDA / Intel XPU / CPU, loads the base model, applies this adapter, merges it, and starts a streaming chat with thinking ON. In-chat commands: /q quit · /reset clear history · /raw show special-token markers · /think toggle thinking.
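The in-chat commands listed above could be handled by a small dispatcher along the following lines. This is a hypothetical sketch, not the code in `chat.py`: the `ChatState` class, the `handle_command` function, and the assumption that `/raw` is a toggle are all invented here for illustration.

```python
# Hypothetical command handling; chat.py's real implementation is not shown in this README.
from dataclasses import dataclass, field

@dataclass
class ChatState:
    history: list = field(default_factory=list)
    thinking: bool = True   # the README says the chat starts with thinking ON
    raw: bool = False       # assumed toggle for showing special-token markers
    running: bool = True

def handle_command(line: str, state: ChatState) -> bool:
    """Apply an in-chat command; return False if `line` is a normal message."""
    cmd = line.strip()
    if cmd == "/q":
        state.running = False
    elif cmd == "/reset":
        state.history.clear()
    elif cmd == "/raw":
        state.raw = not state.raw
    elif cmd == "/think":
        state.thinking = not state.thinking
    else:
        return False
    return True
```

A chat loop would call `handle_command` on each input line first and only send the line to the model when it returns `False`.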
How to use the LoRA adapter (code)
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel

BASE = "unsloth/gemma-4-E2B-it"
ADAPTER = "xbruce22/gemma-4-e2b-reasoning-lora"

device = "cuda" if torch.cuda.is_available() else (
    "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
)
dtype = torch.float32 if device == "cpu" else torch.bfloat16

base = AutoModelForCausalLM.from_pretrained(BASE, dtype=dtype).to(device)
model = PeftModel.from_pretrained(base, ADAPTER)
# Optional: merge LoRA into the weights for faster inference
model = model.merge_and_unload()
model.eval()

processor = AutoProcessor.from_pretrained(BASE)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write DFS in python, keep short."},
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = processor(text=[text], return_tensors="pt").to(device)

# Text-only: drop multimodal-only fields generate() rejects
for k in list(inputs):
    if "token_type" in k or "pixel" in k or "audio" in k:
        inputs.pop(k)

with torch.inference_mode():
    out = model.generate(
        **inputs, max_new_tokens=1024, do_sample=True,
        temperature=1.0, top_p=0.95, top_k=64,
        pad_token_id=processor.tokenizer.pad_token_id,
    )

gen = out[0][inputs["input_ids"].shape[1]:]
print(processor.decode(gen, skip_special_tokens=True))
```
Notes:
- Pass `enable_thinking=True` to `apply_chat_template` so the template injects `<|think|>` and the model produces the `<|channel>thought ... <channel|>` reasoning block before the answer.
- Recommended Gemma-4 sampling: `temperature=1.0, top_p=0.95, top_k=64`.
- If you don't `merge_and_unload()`, keep using the `PeftModel` directly; both work.
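If you want to separate the reasoning block from the final answer yourself, one option is to decode with `skip_special_tokens=False` and split on the channel markers. The sketch below assumes the markers appear in the decoded text exactly as written in the notes above (`<|channel>thought` to open, `<channel|>` to close); the `split_thinking` helper and the sample string are illustrative only.

```python
import re

# Marker strings follow this README's description of the thinking format;
# that they survive decoding as these literal strings is an assumption.
THOUGHT_RE = re.compile(r"<\|channel>thought\n?(.*?)<channel\|>", re.DOTALL)

def split_thinking(decoded: str):
    """Return (thinking, answer); thinking is None when no block is found."""
    match = THOUGHT_RE.search(decoded)
    if match is None:
        return None, decoded.strip()
    thinking = match.group(1).strip()
    answer = decoded[match.end():].strip()
    return thinking, answer

# Made-up sample text in the described format.
sample = (
    "<|channel>thought\n- user wants DFS\n- keep it short\n<channel|>"
    "def dfs(graph, start): ..."
)
thinking, answer = split_thinking(sample)
print(thinking)
print(answer)
```

Text without a thinking block is returned unchanged as the answer, so the same helper works whether thinking is on or off.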
Expected output style
Prompt: Write DFS in python, keep short.
```markdown
── thinking ──
- User wants a DFS implementation in Python, explicitly requesting it be "short"
- Settled on iterative version using a stack and visited set ...
- Concise version: no classes, just a function — keeps it short while remaining correct
── answer ──
def dfs(graph, start, visited=None):
    ...
```
The reasoning is now terse, bulleted, and scannable — the style it was fine-tuned to produce.
Training details
- Method: LoRA (r=8, alpha=8, dropout=0) on the text language model's attention (`q/k/v/o_proj`) + MLP (`gate/up/down_proj`) modules. Vision and audio towers frozen (text-only finetune).
- Trainable params: 12,079,104 (0.236% of 5.1B).
- Data: 25,614 reasoning rows from `Jackrong/GLM-5.1-Reasoning-1M-Cleaned` (main subset). The verbose `imd…answer` thinking traces were condensed into terse flat bullet lists (via a condenser prompt); the original final answers were kept verbatim.
- Training format: Gemma4 chat format with thinking ON: `<|channel>thought\n...bullets...\n<channel|>` then the final answer, `<|turn>` turn markers, assistant-only loss (user/system tokens masked to -100).
- Hardware: Intel XPU (Intel Graphics 0xe211, 24 GB), bf16, `adamw_torch`, gradient checkpointing. No 4-bit / bitsandbytes (no XPU build).
- Schedule: 1 full epoch, 6400 steps, per-device batch 1 × gradient accumulation 4, lr 2e-4 linear, 5 warmup steps, max_seq_length 1536. ~5.7 h.
- Final train_loss: 0.795 (loss MA 1.22 → 0.76, token accuracy 0.74 → 0.79, no OOM).
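The assistant-only loss mentioned in the training format can be pictured as follows. This is a minimal sketch, not the training code used for this adapter: it assumes each token carries a role tag, and the `mask_labels` helper and the token ids are invented for illustration. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_labels(token_ids, roles):
    """Copy token ids to labels, masking every non-assistant position."""
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        tid if role == "assistant" else IGNORE_INDEX
        for tid, role in zip(token_ids, roles)
    ]

# Made-up token ids with a per-token role tag.
token_ids = [11, 12, 21, 22, 23, 31, 32]
roles = ["system", "system", "user", "user", "user", "assistant", "assistant"]
labels = mask_labels(token_ids, roles)
print(labels)
```

Only the last two positions keep their token ids as labels, so the model is trained to predict the assistant's tokens while the system and user tokens still serve as context.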
License
Apache-2.0 (adapter weights). The base model unsloth/gemma-4-E2B-it follows Gemma's terms.
Model provider: xbruce22
Model tree: base unsloth/gemma-4-E2B-it; adapter: this model
Modalities: input Video, Audio, Text, Image; output Text