License: apache-2.0
Role in the system
| Role | Model | Job |
|---|---|---|
| 🔔 Buzz | YUGOROU/quiz-buzz-reg-1.2bjp-merged (LFM2.5-1.2B + regression head) | Reads the question char-by-char, buzzes when conf ≥ θ (~9 ms/char). |
| 🧠 Answer (this model) | gemma-4-26B-A4B SFT | From the partial question at buzz time, <think>…</think> reasoning → answer. |
Total ≈ 27.2B params (≤ 32B), built for the HF Build Small Hackathon.
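The hand-off between the two models in the table can be sketched roughly as below. This is only an illustration, not the project's code: `find_buzz_position`, `toy_confidence`, and the threshold value are hypothetical stand-ins, and a real setup would query the buzz model's regression head instead of the toy function.

```python
from typing import Callable, Optional


def find_buzz_position(
    question: str,
    confidence_fn: Callable[[str], float],
    theta: float,
) -> Optional[int]:
    """Return the first character count at which confidence reaches theta, else None."""
    for n in range(1, len(question) + 1):
        if confidence_fn(question[:n]) >= theta:
            return n
    return None


def toy_confidence(prefix: str) -> float:
    # Hypothetical stand-in for the buzz model: grows linearly with prefix length.
    return min(1.0, len(prefix) / 20)


# "The capital of Japan is Tokyo, but where is the capital of the United States?"
question = "日本の首都は東京ですが、アメリカの首都はどこでしょう?"
pos = find_buzz_position(question, toy_confidence, theta=0.75)

# The answer model would then receive only this prefix, not the full question.
prefix = question[:pos] if pos is not None else question
```

A later buzz gives the answer model a longer prefix, which is the trade-off that θ controls.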
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "YUGOROU/quiz-main-gemma-merged"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map="auto")

# Partial question at buzz time:
# "The capital of Japan is Tokyo, but the capital of the United States is"
prefix = "日本の首都は東京ですが、アメリカの首都は"
# Prompt text: "Buzzer quiz (as of character N):" followed by the partial question
msgs = [{"role": "user", "content": f"早押しクイズ({len(prefix)}文字目時点):\n{prefix}"}]
ids = tok.apply_chat_template(
    msgs, enable_thinking=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(
    ids,
    max_new_tokens=320,
    do_sample=False,
    eos_token_id=[1, 106],  # gemma-4 closes the turn with <turn|>=106, not only <eos>=1
)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
# <think> … </think>ワシントンD.C.
```
Important: gemma-4 ends an assistant turn with <turn|> (id 106). If you stop only on <eos> (id 1), the model will keep hallucinating new turns. Always include 106 in your stop set (vLLM: --stop-token-ids 1 106).
<think> reasoning is required; disabling it collapses accuracy.
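Where a serving stack cannot be given extra stop ids, one possible fallback is to cut the generated ids at the first stop token after the fact, then separate the reasoning from the answer. The sketch below assumes the decoded text keeps literal <think>…</think> tags, as in the usage example; `truncate_at_stop` and `split_think_answer` are hypothetical helper names, not part of transformers or vLLM.

```python
import re
from typing import List, Sequence, Tuple

STOP_IDS = (1, 106)  # <eos> and <turn|>, the ids given in the note above


def truncate_at_stop(token_ids: Sequence[int], stop_ids: Sequence[int] = STOP_IDS) -> List[int]:
    """Keep tokens up to, but not including, the first stop id."""
    kept: List[int] = []
    for token_id in token_ids:
        if token_id in stop_ids:
            break
        kept.append(token_id)
    return kept


_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think_answer(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no think block is found."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()
```

Post-hoc truncation still pays for the extra generated tokens, so stopping on both ids during generation remains the better option when it is available.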
Training
- Base: unsloth/gemma-4-26B-A4B (MoE, 26B total / 4B active), gemma-4-thinking chat template.
- SFT (Unsloth bf16 LoRA, merged to 16-bit) on a quiz-grammar corpus built from AI王 / JAQKET: user = partial question at the statistically-decidable buzz position (S-buzz), assistant = <think>{reasoning}</think>{answer} with adaptive think budget by difficulty.
- Full-question QA ≈ 76%; at the buzz position ≈ 62–74% depending on threshold θ (later buzz → higher accuracy).
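The user/assistant pair format described above might be assembled along these lines. This is a sketch under assumptions: it reuses the user prompt wording from the Usage section (the card does not confirm that training used exactly this wording), and `build_sft_example`, the buzz position, and the sample reasoning text are hypothetical.

```python
from typing import Dict, List


def build_sft_example(
    question: str, buzz_pos: int, reasoning: str, answer: str
) -> List[Dict[str, str]]:
    """Build one chat-format training pair from a question cut at buzz_pos characters."""
    prefix = question[:buzz_pos]
    # Assumption: same prompt wording as the inference example in Usage.
    user = f"早押しクイズ({len(prefix)}文字目時点):\n{prefix}"
    assistant = f"<think>{reasoning}</think>{answer}"
    return [
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]


example = build_sft_example(
    question="日本の首都は東京ですが、アメリカの首都はどこでしょう?",
    buzz_pos=20,  # hypothetical S-buzz position
    # Hypothetical reasoning: "The contrast structure means the US capital is asked for."
    reasoning="対比の構文から、問われているのはアメリカの首都。",
    answer="ワシントンD.C.",
)
```

An adaptive think budget would presumably vary the length of `reasoning` with question difficulty; that logic is not shown here because the card does not describe it.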
Attribution & license
This model is a fine-tune of Google Gemma 4, which Google releases under the Apache License 2.0. The model weights are therefore distributed under Apache 2.0.
Training data derived from AI王 (Project AIO) / JAQKET. Quiz questions © abc/EQIDEN実行委員会 / 株式会社キュービック / クイズ法人カプリティオ. Non-commercial research use only. No dataset redistribution — only model weights and inference code are released.
Model provider
YUGOROU
Model tree
Base: unsloth/gemma-4-26B-A4B
Fine-tuned: this model
Modalities
Input: Text, Image
Output: Text