yuxinlu1

gemma-4-12B-coder-fable5-composer2.5-v1

README

License: apache-2.0

🎯 What this repo is for

This repo holds the un-quantized master weights (model.safetensors, bf16). Use it to:

  • 🔧 Roll your own quants: make custom GGUF / MLX / AWQ / GPTQ builds from full precision.
  • 🧪 Fine-tune further: it's a clean base for your own LoRA / continued training.
  • 🤗 Run it in transformers (needs a recent build with gemma4_unified support).

🏃 Just want to run it? You don't need this repo. Grab a ready-made quant from the GGUF repo → (the smallest one runs in ~4.5 GB of VRAM / unified memory in LM Studio, Ollama, llama.cpp, Jan…). This master is for builders. 💚
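If you do want to roll your own GGUF, the usual llama.cpp route is two steps: convert the Hugging Face checkpoint to a full-precision GGUF, then quantize it. Here is a minimal sketch that only builds the two command lines. The directory names, output file names, and the `gguf_build_commands` helper are placeholders I made up for illustration, and it assumes your llama.cpp checkout is recent enough to know the gemma4_unified architecture.

```python
from typing import List


def gguf_build_commands(model_dir: str, quant: str = "Q4_K_M",
                        llama_cpp_dir: str = "llama.cpp") -> List[List[str]]:
    """Build (but do not run) the convert + quantize commands.

    model_dir and llama_cpp_dir are hypothetical local paths; the output
    file names are arbitrary choices, not something this repo prescribes.
    """
    master = f"{model_dir}/master-bf16.gguf"   # full-precision intermediate
    target = f"{model_dir}/model-{quant}.gguf"  # final quantized file
    convert = [
        "python", f"{llama_cpp_dir}/convert_hf_to_gguf.py", model_dir,
        "--outfile", master, "--outtype", "bf16",
    ]
    quantize = ["llama-quantize", master, target, quant]
    return [convert, quantize]


# To actually run them (assumes llama-quantize is on PATH):
#   import subprocess
#   for cmd in gguf_build_commands("weights"):
#       subprocess.run(cmd, check=True)
```

Keeping the bf16 GGUF around means you can produce several quant levels without re-running the conversion.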


📌 Announcements

🚀 v2 is almost here! Initial training of v2 is done and it's in benchmarking + final QA. So many of you flagged the agentic behavior, so this round I significantly grew the dataset (especially agentic data). v2 is focused on agentic + coding. Targeting a release this Friday or Saturday (US Pacific). 🎉

📣 Context length is 256K. This master ships with the corrected max_position_embeddings = 262144 (256K). The well-known upstream Gemma 4 metadata bug (config.json once said 131072) is already fixed here, so anything you quantize/convert from these weights inherits the full 256K. 💚 Thanks to the community member who spotted it!
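If you want to double-check a downloaded or converted copy, a quick sanity check on config.json could look like the sketch below. It assumes the value sits either at the top level or under a nested `text_config` block; I have not verified which layout gemma4_unified uses, so the helper checks both. The function names are mine.

```python
import json
from typing import Any, Dict, Optional

EXPECTED_CONTEXT = 256 * 1024  # 262144, the corrected value
OLD_BUGGY_CONTEXT = 128 * 1024  # 131072, what config.json once said


def find_max_position_embeddings(config: Dict[str, Any]) -> Optional[int]:
    """Return max_position_embeddings from the top level or from text_config."""
    if "max_position_embeddings" in config:
        return config["max_position_embeddings"]
    text_cfg = config.get("text_config")  # assumed nesting, may not apply
    if isinstance(text_cfg, dict):
        return text_cfg.get("max_position_embeddings")
    return None


def has_full_context(config: Dict[str, Any]) -> bool:
    """True only when the config advertises the full 256K window."""
    return find_max_position_embeddings(config) == EXPECTED_CONTEXT


# Usage with a local file:
#   with open("config.json") as fh:
#       print(has_full_context(json.load(fh)))
```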


🤗 Run it in transformers

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1"
tok = AutoTokenizer.from_pretrained(repo)
# bf16 matches the stored weights; device_map="auto" places layers on the available device(s)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map="auto")

msgs = [{"role": "user", "content": "Write a Python function to check if a string is a valid IPv4 address."}]
# return_dict=True yields input_ids plus attention_mask regardless of the library's default
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, not the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

🧠 Thinking mode: it thinks in Gemma's native thought channel before answering (keep enable_thinking=true, the default chat template handles it). Recommended sampling: temp 1.0, top_p 0.95, top_k 64; for coding you can also go greedy (temp 0) for more deterministic solutions. Needs a recent transformers that knows the gemma4_unified architecture.
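In transformers, those settings map onto keyword arguments of `generate()`. One thing to watch: "temp 0" is normally expressed as `do_sample=False`, because sampling with a temperature of 0 is rejected. A small sketch, where `generation_kwargs` is a hypothetical helper name of mine:

```python
from typing import Any, Dict


def generation_kwargs(greedy: bool = False, max_new_tokens: int = 1024) -> Dict[str, Any]:
    """Return generate() kwargs for the recommended sampling or for greedy decoding."""
    if greedy:
        # Greedy decoding: always pick the most likely token, no sampling knobs
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    return {
        "do_sample": True,
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,
        "max_new_tokens": max_new_tokens,
    }


# Usage, given tokenized chat-template inputs in a dict-like `inputs`:
#   out = model.generate(**inputs, **generation_kwargs(greedy=True))
```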


📦 Ready-made GGUF quants

All from the GGUF repo:

| Quant | Size | Vibe |
|---|---|---|
| 🟢 Q2_K | 4.5 GB | tiniest, runs almost anywhere |
| 🟡 Q3_K_M | 5.7 GB | great for 8 GB VRAM |
| 🔵 Q4_K_M | 6.87 GB | the sweet spot 👌 (recommended) |
| 🟣 Q6_K | 9.11 GB | near-lossless |
| ⚪ Q8_0 | 11.8 GB | basically full quality |

⚠️ GGUF needs a recent llama.cpp: this is the gemma4_unified architecture, and older builds won't load it.
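To pick a quant for a given memory budget, one rough approach is to take the largest file that still leaves some room for the KV cache and runtime buffers. The sizes below are the ones listed above; the 1.5 GB headroom is purely my assumption, and long contexts will need more.

```python
from typing import List, Optional, Tuple

# (name, file size in GB), smallest to largest, as listed for the GGUF repo
QUANTS: List[Tuple[str, float]] = [
    ("Q2_K", 4.5),
    ("Q3_K_M", 5.7),
    ("Q4_K_M", 6.87),
    ("Q6_K", 9.11),
    ("Q8_0", 11.8),
]


def pick_quant(memory_gb: float, headroom_gb: float = 1.5) -> Optional[str]:
    """Return the largest quant whose file fits in memory_gb minus headroom.

    headroom_gb is an assumed allowance for KV cache and buffers, not a
    measured figure. Returns None if nothing fits.
    """
    budget = memory_gb - headroom_gb
    fitting = [name for name, size in QUANTS if size <= budget]
    return fitting[-1] if fitting else None


print(pick_quant(8))   # Q3_K_M
print(pick_quant(12))  # Q6_K
```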


⚡ Optional: free speed with MTP (lossless)

There's a tiny Gemma 4 MTP draft model in my main reasoning repo → MTP/ folder. It's the stock Gemma 4 drafter, so it pairs with any Gemma 4 12B quant, including these coder quants, for lossless speculative decoding (byte-for-byte identical output, just faster). Because it's trained on base Gemma 4, the hit-rate on this fine-tune is a bit lower than on vanilla Gemma 4, but it's free and has no downside. Add three flags (--model-draft, --spec-type draft-mtp, --n-gpu-layers-draft); see the main repo for the full command. 🏎️
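Why a drafter with a lower hit-rate still costs you nothing in quality: the main model checks every drafted token and replaces the first one it disagrees with. Below is a toy greedy illustration of that idea with stand-in "models" (plain functions over token lists). It is not how llama.cpp or the MTP drafter is implemented, and a real engine verifies all drafted tokens in one batched pass instead of one call per token.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    """Stand-in for the main model: deterministic next token."""
    return (31 * sum(ctx) + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Stand-in drafter: agrees with the target except at every 5th position."""
    tok = toy_target(ctx)
    return (tok + 1) % 50 if len(ctx) % 5 == 0 else tok


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], float]:
    seq = list(prompt)
    accepted = proposed = 0
    while len(seq) - len(prompt) < n_new:
        guess: List[int] = []
        for _ in range(k):  # drafter proposes k tokens ahead
            guess.append(draft(seq + guess))
        proposed += k
        for tok in guess:  # target verifies them in order
            want = target(seq)
            if tok == want:
                seq.append(tok)
                accepted += 1
            else:
                seq.append(want)  # first mismatch: keep the target's token, drop the rest
                break
        else:
            seq.append(target(seq))  # all k accepted: the target adds one more
    return seq[len(prompt):][:n_new], accepted / proposed


plain = greedy_decode(toy_target, [1, 2, 3], 20)
fast, hit_rate = speculative_decode(toy_target, toy_draft, [1, 2, 3], 20)
print(fast == plain)  # True: the drafter never changes the output
```

A worse drafter only lowers `hit_rate`, which means fewer tokens accepted per round and less speedup; the emitted tokens are always the target's own.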


📚 Training data (the interesting part)

A distillation of two complementary chain-of-thought sources over verifiable Python coding tasks (algorithmic / function-level problems with deterministic tests):

  • 🥇 Main: Composer 2.5 real CoT. Genuine model-authored reasoning traces; each solution was run against the task's tests and only passing ones were kept, so the reasoning the model learns from leads to code that actually works.
  • 🥈 Aux: Fable 5 redo. The problems Composer 2.5 got wrong were handed to Fable 5 to re-derive a fresh, self-consistent CoT and a correct solution, again gated on passing the tests. This recovers the hard cases the main teacher missed. These are synthetic (rationalized) CoT and are tagged separately.

Real CoT for solid coverage + synthetic "second-attempt" CoT to patch the failures, all verified by execution before training. ✅
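The gating logic above can be sketched roughly as follows. This is an illustration of the idea, not the actual pipeline: the task fields, the `real_cot` / `synthetic_cot` tag names, and the teacher callables are all made up here, and a real setup would run candidates in a sandboxed subprocess with a timeout instead of a bare `exec`.

```python
from typing import Any, Callable, Dict, List, Tuple

Task = Dict[str, str]                      # hypothetical shape: {"id": ..., "tests": ...}
Teacher = Callable[[Task], Tuple[str, str]]  # returns (chain_of_thought, solution_code)


def passes_tests(solution_src: str, test_src: str) -> bool:
    """Run the solution, then the asserts, in a fresh namespace.

    NOTE: exec() is not a sandbox and has no timeout; illustration only.
    """
    namespace: Dict[str, Any] = {}
    try:
        exec(solution_src, namespace)
        exec(test_src, namespace)
    except Exception:
        return False
    return True


def build_dataset(tasks: List[Task], main_teacher: Teacher,
                  aux_teacher: Teacher) -> List[Dict[str, str]]:
    """Keep the main teacher's sample if it passes; otherwise try the aux teacher."""
    kept = []
    for task in tasks:
        for source, teacher in (("real_cot", main_teacher), ("synthetic_cot", aux_teacher)):
            cot, code = teacher(task)
            if passes_tests(code, task["tests"]):
                kept.append({"task": task["id"], "cot": cot, "code": code, "source": source})
                break  # one verified sample per task; tasks both teachers fail are dropped
    return kept
```

Tagging each sample with its source keeps the real and rationalized traces separable later, for example to weight or ablate them independently.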


⚠️ Good to know

  • Reduced refusals: task-focused training with no safety hedging, so it refuses less than the base model. It is not safety-aligned, so add your own guardrails for production. Use responsibly. 🙏
  • Specialized for Python / algorithmic coding; general-knowledge facts/numbers should still be double-checked.
  • English-centric.

📚 Base & License

  • License: Apache 2.0. Gemma 4 is released by Google under Apache 2.0 (unlike the older Gemma 1/2/3 terms), so this fine-tune is Apache 2.0 too: free to use, modify, and redistribute. 🎉
  • Base model: google/gemma-4-12B-it.
  • Personal/hobby project, shared as-is, no warranty. Have fun, and happy hacking! ✨

Model provider: yuxinlu1

Model tree: base google/gemma-4-12B-it, fine-tuned into this model

Modalities: input Video, Audio, Text, Image; output Text
