yuxinlu1
gemma-4-12B-coder-fable5-composer2.5-v1
License: apache-2.0
What this repo is for
This repo holds the un-quantized master weights (model.safetensors, bf16). Use it to:
- Roll your own quants: make custom GGUF / MLX / AWQ / GPTQ builds from full precision.
- Fine-tune further: it's a clean base for your own LoRA / continued training.
- Run it in transformers (needs a recent build with gemma4_unified support).
Just want to run it? You don't need this repo: grab a ready-made quant from the GGUF repo (runs in ~4.5 GB of VRAM / unified memory in LM Studio, Ollama, llama.cpp, Jan…). This master is for builders.
Announcements
v2 is almost here! Initial training of v2 is done and it's in benchmarking + final QA. So many of you flagged the agentic behavior, so this round I significantly grew the dataset (especially agentic data). v2 is focused on agentic + coding. Targeting a release this Friday or Saturday (US Pacific).
Context length is 256K. This master ships with the corrected max_position_embeddings = 262144 (256K). The well-known upstream Gemma 4 metadata bug (config.json once said 131072) is already fixed here, so anything you quantize/convert from these weights inherits the full 256K. Thanks to the community member who spotted it!
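If you want to confirm the context length yourself before converting, a small check along these lines may help. It is only a sketch: the helper names are made up for illustration, and looking inside a nested text_config block is an assumption about how the field might be laid out, not something this README states.

```python
import json
from typing import Optional

EXPECTED_CONTEXT = 262144  # 256K tokens, the corrected value stated above


def find_max_positions(config: dict) -> Optional[int]:
    """Return max_position_embeddings from the top level or, as an assumed
    fallback, from a nested "text_config" block."""
    if "max_position_embeddings" in config:
        return config["max_position_embeddings"]
    nested = config.get("text_config")
    if isinstance(nested, dict):
        return nested.get("max_position_embeddings")
    return None


def has_full_context(config_json: str) -> bool:
    """True if the given config.json text advertises the full 256K window."""
    return find_max_positions(json.loads(config_json)) == EXPECTED_CONTEXT
```

Typical use would be has_full_context(open("config.json").read()) on a local copy of the repo, once before quantizing and once on the converted model's metadata if your tool exposes it.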
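Run it in transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1"

# Load the tokenizer and the bf16 master weights, placed automatically on available devices.
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map="auto")

msgs = [{"role": "user", "content": "Write a Python function to check if a string is a valid IPv4 address."}]

# return_dict=True gives input_ids plus attention_mask regardless of the library's default.
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
```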
Thinking mode: it thinks in Gemma's native thought channel before answering (keep enable_thinking=true; the default chat template handles it). Recommended sampling: temp 1.0, top_p 0.95, top_k 64; for coding you can also go greedy (temp 0) for more deterministic solutions. Needs a recent transformers that knows the gemma4_unified architecture.
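The sampling advice above maps onto transformers generate() arguments roughly as follows. This is a sketch, and generation_kwargs is a hypothetical helper, not part of this repo. One detail worth knowing: in transformers, "temp 0" is expressed as do_sample=False, because a temperature of 0 is rejected when sampling is enabled.

```python
def generation_kwargs(deterministic: bool = False, max_new_tokens: int = 1024) -> dict:
    """Build keyword arguments for model.generate().

    deterministic=False -> the recommended sampling settings from this README.
    deterministic=True  -> greedy decoding (the "temp 0" option).
    """
    if deterministic:
        # Greedy decoding: no sampling parameters are needed.
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    return {
        "do_sample": True,
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,
        "max_new_tokens": max_new_tokens,
    }
```

Pass the result as extra keyword arguments to model.generate, for example **generation_kwargs(deterministic=True) when you want repeatable coding answers.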
Ready-made GGUF quants
All from the GGUF repo:
| Quant | Size | Vibe |
|---|---|---|
| Q2_K | 4.5 GB | tiniest; runs almost anywhere |
| Q3_K_M | 5.7 GB | great for 8 GB VRAM |
| Q4_K_M | 6.87 GB | the sweet spot (recommended) |
| Q6_K | 9.11 GB | near-lossless |
| Q8_0 | 11.8 GB | basically full quality |
Note: GGUF needs a recent llama.cpp. This is the gemma4_unified architecture, and older builds won't load it.
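To turn the table into a quick decision, here is a rough picker that returns the largest quant whose file fits a memory budget. The file sizes come from the table; the 1.5 GB headroom for KV cache and runtime overhead is an assumption for illustration (long contexts need more), and pick_quant is a hypothetical helper.

```python
from typing import Optional

# File sizes in GB, copied from the table above.
QUANT_SIZES_GB = {
    "Q2_K": 4.5,
    "Q3_K_M": 5.7,
    "Q4_K_M": 6.87,
    "Q6_K": 9.11,
    "Q8_0": 11.8,
}


def pick_quant(memory_gb: float, headroom_gb: float = 1.5) -> Optional[str]:
    """Return the largest quant that fits in memory_gb minus headroom_gb.

    headroom_gb is an assumed allowance for KV cache and runtime overhead,
    not a measured figure. Returns None when nothing fits.
    """
    budget = memory_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None
```

With the default headroom, an 8 GB card lands on Q3_K_M, which matches the table's hint; lower the headroom if you only need short contexts.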
Optional: free speed with MTP (lossless)
There's a tiny Gemma 4 MTP draft model in the MTP/ folder of my main reasoning repo. It's the stock Gemma 4 drafter, so it pairs with any Gemma 4 12B quant, including these coder quants, for lossless speculative decoding (byte-for-byte identical output, just faster). Because it's trained on base Gemma 4, the hit-rate on this fine-tune is a bit lower than on vanilla Gemma 4, but it's free and has no downside. Add three flags (--model-draft, --spec-type draft-mtp, --n-gpu-layers-draft); see the main repo for the full command.
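Why can a draft model speed things up without changing the output? The toy sketch below shows the idea for greedy decoding only: the draft proposes a few tokens, the target checks them, and every emitted token is the target's own choice. The two "models" here are made-up arithmetic functions, not Gemma or the MTP drafter, and a real engine verifies a whole proposal in one batched forward pass instead of one call per token.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(seq: List[int]) -> int:
    """Toy stand-in for the big model's greedy next-token choice."""
    return (sum(seq) * 31 + len(seq)) % 50


def draft_next(seq: List[int]) -> int:
    """Toy stand-in for the drafter: agrees with the target most of the time."""
    tok = target_next(seq)
    return (tok + 1) % 50 if len(seq) % 3 == 0 else tok


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding with the target model only."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], float]:
    """Greedy speculative decoding; returns (new tokens, draft acceptance rate)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    accepted = proposed = 0
    while len(seq) < goal:
        # 1) The draft cheaply proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        proposed += len(proposal)
        # 2) The target verifies them in order. The emitted token is always
        #    the target's, so a wrong guess costs speed, never correctness.
        for tok in proposal:
            expected = target(seq)
            seq.append(expected)
            if tok != expected:
                break  # discard the rest of this proposal
            accepted += 1
    rate = accepted / proposed if proposed else 0.0
    return seq[len(prompt):], rate
```

The acceptance rate is what the "hit-rate" remark above refers to: a drafter that agrees less often still produces the same text, it just saves fewer target steps.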
Training data (the interesting part)
A distillation of two complementary chain-of-thought sources over verifiable Python coding tasks (algorithmic / function-level problems with deterministic tests):
- Main: Composer 2.5 real CoT. Genuine model-authored reasoning traces; each solution was run against the task's tests and only passing ones were kept. The reasoning you learn from leads to code that actually works.
- Aux: Fable 5 redo. The problems where Composer 2.5 got it wrong, handed to Fable 5 to re-derive a fresh, self-consistent CoT and a correct solution, again gated on passing the tests. Recovers the hard cases the main teacher missed. These are synthetic (rationalized) CoT and are tagged separately.
Real CoT for solid coverage + synthetic "second-attempt" CoT to patch the failures, all verified by execution before training.
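The two-stage, execution-gated filtering described above could look something like this. It is a minimal sketch and not the actual pipeline used for this model: Sample, passes_tests, build_dataset, and the teacher callables are all hypothetical names, and a real setup would run untrusted code in a sandboxed subprocess with a timeout rather than with exec.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A teacher maps a task prompt to (chain_of_thought, solution_code).
Teacher = Callable[[str], Tuple[str, str]]


@dataclass
class Sample:
    task_id: str
    cot: str
    solution: str
    source: str  # "real" (main teacher) or "synthetic" (aux redo)


def passes_tests(solution: str, tests: str) -> bool:
    """Run the solution and then its tests; any exception counts as a failure.

    exec() is used only to keep the sketch short. Do not run untrusted
    model output this way outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(solution, namespace)
        exec(tests, namespace)
    except Exception:
        return False
    return True


def build_dataset(
    tasks: Dict[str, Tuple[str, str]], main_teacher: Teacher, aux_teacher: Teacher
) -> List[Sample]:
    """tasks maps task_id -> (prompt, test_code). Keeps only verified samples."""
    kept: List[Sample] = []
    for task_id, (prompt, tests) in tasks.items():
        cot, code = main_teacher(prompt)
        if passes_tests(code, tests):
            kept.append(Sample(task_id, cot, code, "real"))
            continue
        # The main teacher failed: give the aux teacher a second attempt.
        cot, code = aux_teacher(prompt)
        if passes_tests(code, tests):
            kept.append(Sample(task_id, cot, code, "synthetic"))
        # If both fail, the task contributes nothing to the training set.
    return kept
```

Keeping the source tag on each sample is what lets the real and rationalized traces be tracked separately later on.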
Good to know
- Reduced refusals: task-focused training with no safety hedging, so it refuses less than the base model. It is not safety-aligned, so add your own guardrails for production. Use responsibly.
- Specialized for Python / algorithmic coding; general-knowledge facts/numbers should still be double-checked.
- English-centric.
Base & License
- License: Apache 2.0. Gemma 4 is released by Google under Apache 2.0 (unlike the older Gemma 1/2/3 terms), so this fine-tune is Apache 2.0 too: free to use, modify, and redistribute.
- Base model: google/gemma-4-12B-it.
- Personal/hobby project, shared as-is, no warranty. Have fun, and happy hacking!
Modalities
- Input: Video, Audio, Text, Image
- Output: Text