xbruce22/gemma-4-e2b-reasoning-lora

What's in this repo

| File | Why |
| --- | --- |
| `adapter_model.safetensors` | The trained LoRA weights (12.08M params, ~46 MB) |
| `adapter_config.json` | LoRA config (r=8, alpha=8, target modules) |
| `tokenizer.json`, `tokenizer_config.json`, `chat_template.jinja` | Gemma4 tokenizer + chat template |
| `chat.py` | Ready-to-run interactive chat script (streaming) |
| `README.md` | This file |
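
As a quick sanity check on the numbers in the table, the sketch below computes the LoRA scaling factor and an approximate adapter file size. It assumes standard LoRA scaling (alpha / r, not a rank-stabilized variant) and 4 bytes per parameter (32-bit storage); neither is stated in this README, and the helper names are hypothetical. The exact parameter count is the one listed under Training details.

```python
# Hypothetical helpers for illustration; they are not part of this repo.

def lora_scaling(r: int, alpha: float) -> float:
    """Standard LoRA multiplies the low-rank update by alpha / r."""
    return alpha / r


def adapter_size_mib(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate weight size in MiB, assuming a fixed width per parameter."""
    return num_params * bytes_per_param / (1024 ** 2)


scaling = lora_scaling(r=8, alpha=8)        # r and alpha from adapter_config.json
size = adapter_size_mib(12_079_104)         # assumption: fp32 storage (4 bytes each)

print(f"LoRA scaling: {scaling}")
print(f"Approximate adapter size: {size:.1f} MiB")
```

Under these assumptions the scaling factor is 1.0 and the size comes out at about 46.1 MiB, in line with the ~46 MB listed above.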

This is a LoRA adapter only, not a standalone model. You load the base model (unsloth/gemma-4-E2B-it) and apply this adapter on top — see below.

Quick start (chat)

```bash
pip install torch transformers peft
python chat.py
```

chat.py auto-detects CUDA / Intel XPU / CPU, loads the base model, applies this adapter, merges it, and starts a streaming chat with thinking ON. In-chat commands: /q quit · /reset clear history · /raw show special-token markers · /think toggle thinking.
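
For readers who want to build a similar loop, here is a minimal sketch of how the four slash commands could be dispatched. It is not taken from chat.py, which may be organized differently: `ChatState` and `handle_command` are hypothetical names, and treating /raw as a toggle is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ChatState:
    """Hypothetical container for the chat loop's mutable state."""
    history: list = field(default_factory=list)
    thinking: bool = True    # the chat starts with thinking ON
    raw: bool = False        # show special-token markers (assumed to be a toggle)
    running: bool = True


def handle_command(line: str, state: ChatState) -> bool:
    """Apply a slash command to state; return False if line is not a command."""
    cmd = line.strip()
    if cmd == "/q":
        state.running = False
    elif cmd == "/reset":
        state.history.clear()
    elif cmd == "/raw":
        state.raw = not state.raw
    elif cmd == "/think":
        state.thinking = not state.thinking
    else:
        return False
    return True


state = ChatState(history=[{"role": "user", "content": "hi"}])
handle_command("/think", state)    # thinking is now off
handle_command("/reset", state)    # history is now empty
is_command = handle_command("Write DFS in python", state)  # ordinary message
```

A chat loop would call `handle_command` on each input line and only send the line to the model when it returns False.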

How to use the LoRA adapter (code)

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel

BASE = "unsloth/gemma-4-E2B-it"
ADAPTER = "xbruce22/gemma-4-e2b-reasoning-lora"

device = "cuda" if torch.cuda.is_available() else (
         "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu")
dtype = torch.float32 if device == "cpu" else torch.bfloat16

base = AutoModelForCausalLM.from_pretrained(BASE, dtype=dtype).to(device)
model = PeftModel.from_pretrained(base, ADAPTER)
# Optional: merge LoRA into the weights for faster inference
model = model.merge_and_unload()
model.eval()

processor = AutoProcessor.from_pretrained(BASE)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write DFS in python, keep short."},
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)

inputs = processor(text=[text], return_tensors="pt").to(device)
# Text-only: drop multimodal-only fields generate() rejects
for k in list(inputs):
    if "token_type" in k or "pixel" in k or "audio" in k:
        inputs.pop(k)

with torch.inference_mode():
    out = model.generate(
        **inputs, max_new_tokens=1024, do_sample=True,
        temperature=1.0, top_p=0.95, top_k=64,
        pad_token_id=processor.tokenizer.pad_token_id)

gen = out[0][inputs["input_ids"].shape[1]:]
print(processor.decode(gen, skip_special_tokens=True))
```

Notes:

- Pass enable_thinking=True to apply_chat_template so the template injects <|think|> and the model produces the <|channel>thought ... <channel|> reasoning block before the answer.
- Recommended Gemma-4 sampling: temperature=1.0, top_p=0.95, top_k=64.
- Calling merge_and_unload() is optional. If you skip it, keep using the PeftModel directly; both work.
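
If you want the reasoning and the final answer as separate strings, the sketch below splits raw decoded text on the <|channel>thought ... <channel|> markers described in the notes. It assumes the markers are still present in the text (for example, decoded with skip_special_tokens=False, which is an assumption about how these tokens are registered); `split_thinking` is a hypothetical helper, not part of this repo.

```python
import re

# Matches the reasoning block between the markers named in the notes.
THOUGHT_RE = re.compile(r"<\|channel>thought\n?(.*?)<channel\|>", re.DOTALL)


def split_thinking(raw: str) -> tuple[str, str]:
    """Return (thinking, answer); thinking is '' when no block is found."""
    match = THOUGHT_RE.search(raw)
    if match is None:
        return "", raw.strip()
    thinking = match.group(1).strip()
    answer = raw[match.end():].strip()
    return thinking, answer


# Made-up sample text for illustration, not real model output.
sample = (
    "<|channel>thought\n- user wants DFS\n- keep it short\n<channel|>"
    "def dfs(graph, start): ..."
)
thinking, answer = split_thinking(sample)
print("thinking:", thinking)
print("answer:", answer)
```

Text without a reasoning block is returned unchanged as the answer, so the helper is safe to call whether thinking is on or off.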

Expected output style

Prompt: Write DFS in python, keep short.

```markdown
── thinking ──
- User wants a DFS implementation in Python, explicitly requesting it be "short"
- Settled on iterative version using a stack and visited set ...
- Concise version: no classes, just a function — keeps it short while remaining correct
── answer ──
def dfs(graph, start, visited=None):
    ...
```

With the adapter applied, the reasoning is terse, bulleted, and scannable, which is the style it was fine-tuned to produce.

Training details

- Method: LoRA (r=8, alpha=8, dropout=0) on the text language model's attention (q/k/v/o_proj) + MLP (gate/up/down_proj) modules. Vision and audio towers frozen (text-only finetune).
- Trainable params: 12,079,104 (0.236% of 5.1B).
- Data: 25,614 reasoning rows from Jackrong/GLM-5.1-Reasoning-1M-Cleaned (main subset). The verbose imd…answer thinking traces were condensed into terse flat bullet lists (via a condenser prompt); the original final answers were kept verbatim.
- Training format: Gemma4 chat format with thinking ON, i.e. <|channel>thought\n...bullets...\n<channel|> then the final answer, <|turn> turn markers, assistant-only loss (user/system tokens masked to -100).
- Hardware: Intel XPU (Intel Graphics 0xe211, 24 GB), bf16, adamw_torch, gradient checkpointing. No 4-bit / bitsandbytes (no XPU build).
- Schedule: 1 full epoch, 6400 steps, per-device batch 1 × gradient accumulation 4, lr 2e-4 linear, 5 warmup steps, max_seq_length 1536. ~5.7 h.
- Final train_loss: 0.795 (loss MA 1.22 → 0.76, token accuracy 0.74 → 0.79, no OOM).
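
To illustrate what "assistant-only loss" means in practice, here is a minimal sketch of label masking on plain Python lists. It is not the training code used for this adapter: the function name, the boolean mask input, and the token ids are all made up for illustration. The only fixed point is that -100 is the default ignore_index of PyTorch's cross-entropy loss, so positions labeled -100 contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_non_assistant(input_ids: list[int], assistant_mask: list[bool]) -> list[int]:
    """Copy input_ids into labels, replacing non-assistant positions with -100."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [
        token if is_assistant else IGNORE_INDEX
        for token, is_assistant in zip(input_ids, assistant_mask)
    ]


# Toy example with made-up token ids: the first three positions are the
# system/user part of the conversation, the last three the assistant reply.
input_ids = [11, 12, 13, 21, 22, 23]
assistant_mask = [False, False, False, True, True, True]

labels = mask_non_assistant(input_ids, assistant_mask)
print(labels)  # [-100, -100, -100, 21, 22, 23]
```

In a real pipeline the mask would come from locating the assistant turns in the tokenized chat template, and the reasoning block plus final answer would both fall inside the unmasked region.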

License

Apache-2.0 (adapter weights). The base model unsloth/gemma-4-E2B-it follows Gemma's terms.
