ArnavKewalram

gemma-4-E2B-coder-v1

License: apache-2.0

Who is this for?

Use this model if you want a capable coding assistant that:

Runs fully offline on a laptop or edge device (4 GB RAM minimum with Q4_K_M)
Requires no GPU — fast CPU inference via Ollama or llama.cpp
Is Apache 2.0 licensed, which permits commercial use
Supports Python, JavaScript, TypeScript, Go, Rust, SQL, Bash, and C++

Not ideal for: very long context tasks (training max was 384 tokens), security-critical code generation, or tasks needing the base model's multimodal capabilities.

| | gemma-4-E2B-coder-v1 | Typical 7B coder |
|---|---|---|
| Size (Q4) | ~3.2 GB | ~4.5 GB |
| Min RAM | 4 GB | 6 GB |
| Runs on CPU | Yes (fast, Griffin arch) | Yes (slow) |
| License | Apache 2.0 | Varies |
| Context at training | 384 tokens | 2K–8K |

Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ArnavKewalram/gemma-4-E2B-coder-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the checkpoint is released in BF16
    device_map="auto",           # let accelerate pick GPU or CPU placement
    trust_remote_code=True,
)

# Format the conversation with the model's chat template.
messages = [{"role": "user", "content": "Write a Python function that checks if a number is prime."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample at a low temperature; no gradients are needed for inference.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, temperature=0.2, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

4-bit quantized (runs on 4 GB VRAM)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load weights in 4-bit via bitsandbytes; matrix multiplies run in BF16.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("ArnavKewalram/gemma-4-E2B-coder-v1", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "ArnavKewalram/gemma-4-E2B-coder-v1",
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
)
```

llama.cpp / Ollama (GGUF — no Python required)

```bash
# Q4_K_M: ~3.2 GB, runs on 4 GB RAM (laptop/desktop/edge)
ollama run hf.co/ArnavKewalram/gemma-4-E2B-coder-v1:Q4_K_M

# llama.cpp (Gemma 4 E-series chat format)
./llama-cli -m gemma-4-E2B-coder-v1-Q4_K_M.gguf \
  -p "<bos><|turn>user\nWrite a binary search in Python<turn|>\n<|turn>model\n" \
  --temp 0.2 -n 512
```
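
Once the model has been pulled with the `ollama run` command above, it can also be called from a script through Ollama's local HTTP API. The following is a minimal sketch, assuming Ollama is serving on its default address (`http://localhost:11434`); the helper names `build_payload` and `generate` are illustrative and not part of this model card.

```python
import json
import urllib.request

# Model tag taken from the `ollama run` command above.
MODEL_TAG = "hf.co/ArnavKewalram/gemma-4-E2B-coder-v1:Q4_K_M"
# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, temperature: float = 0.2, max_tokens: int = 512) -> dict:
    """Build the JSON body for a non-streaming /api/generate request."""
    return {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,  # return a single JSON object instead of chunks
        "options": {"temperature": temperature, "num_predict": max_tokens},
    }


def generate(prompt: str) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("Write a binary search in Python")` should return the completion as a string, provided the server is running and the model has been downloaded. The defaults mirror the `--temp 0.2 -n 512` settings used in the llama.cpp command.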

Available GGUF variants:

| File | Size | Use case |
|---|---|---|
| `gemma-4-E2B-coder-v1-Q4_K_M.gguf` | ~3.2 GB | Best compression; 4 GB RAM minimum |
| `gemma-4-E2B-coder-v1-Q5_K_M.gguf` | ~3.4 GB | Better accuracy, 4 GB RAM minimum |
| `gemma-4-E2B-coder-v1-Q8_0.gguf` | ~4.6 GB | Near-lossless, 6 GB RAM recommended |
| `gemma-4-E2B-coder-v1-F16.gguf` | ~8.6 GB | Full precision BF16, 12 GB RAM |

Note on GGUF sizes: Gemma 4 E2B has an unusually large vocabulary (262,144 tokens vs. ~32K typical). The embedding tables alone account for ~2 GB in the quantized files — larger than a standard Llama 3.2-1B model. Q4_K_M quantizes the embeddings to Q6_K to preserve quality, which explains the larger-than-expected file sizes.
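
To see why vocabulary size dominates here, the rough arithmetic can be sketched as follows. The Q6_K figure of 6.5625 bits per weight comes from the llama.cpp block layout (210 bytes per 256 weights); the embedding width of 2048 is a hypothetical placeholder, since this card does not list the model's actual embedding dimensions.

```python
# llama.cpp Q6_K stores 256 weights in a 210-byte block.
Q6_K_BITS_PER_WEIGHT = 210 * 8 / 256  # 6.5625


def embedding_bytes(vocab_size: int, width: int, bits_per_weight: float) -> float:
    """Approximate storage for one embedding table of shape (vocab_size, width)."""
    return vocab_size * width * bits_per_weight / 8


# Hypothetical width, used only to compare the two vocabulary sizes.
HYPOTHETICAL_WIDTH = 2048

large = embedding_bytes(262_144, HYPOTHETICAL_WIDTH, Q6_K_BITS_PER_WEIGHT)
small = embedding_bytes(32_768, HYPOTHETICAL_WIDTH, Q6_K_BITS_PER_WEIGHT)
print(f"{large / 1e6:.0f} MB vs {small / 1e6:.0f} MB ({large / small:.0f}x)")
```

At any fixed width, a 262,144-token vocabulary makes a table eight times larger than a 32K one. The function covers a single table only, so it is not expected to reproduce the ~2 GB total quoted above.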

Why This Model?

Real code patterns — Magicoder extracts instruction pairs from actual GitHub repositories, not synthetic textbook examples
Griffin architecture — hybrid local-attention + linear recurrent layers gives lower latency than pure-transformer models of the same size
First-mover — no other gemma-4-E2B coding fine-tune exists as of June 2026
Fully merged — released as a complete BF16 checkpoint, no adapter files required

Model Details

| Property | Value |
|---|---|
| Base model | google/gemma-4-E2B-it |
| Total parameters | ~3.9B |
| Architecture | Hybrid Griffin (attention + linear recurrent) |
| Fine-tuning method | QLoRA (4-bit NF4, double quant) |
| LoRA rank / alpha | 16 / 32 |
| LoRA targets | q/k/v/o (attention), gate/up/down (MLP); full-path list targeting, excludes Gemma4ClippableLinear SSM layers |
| Trainable params | 24.2M (0.47% of total) |
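
As a reading aid for the rank and alpha values, here is a short sketch of the standard LoRA arithmetic. It assumes the conventional scaling `alpha / rank` (no rank-stabilized variant), and the 2048 x 2048 layer shape is a hypothetical example rather than a dimension taken from this model.

```python
def lora_scaling(rank: int, alpha: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return alpha / rank


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added to one linear layer: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


# Rank and alpha from the table above.
print(lora_scaling(rank=16, alpha=32))

# Hypothetical 2048 x 2048 projection, for illustration only.
print(lora_params(d_in=2048, d_out=2048, rank=16))
```

With rank 16 and alpha 32 the update is scaled by 2.0, and each adapted layer adds parameters in proportion to the sum of its input and output widths, not their product.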

Training

Trained with TRL SFTTrainer + PEFT LoRA + bitsandbytes on a single consumer GPU.

Griffin architecture note: The Gemma 4 E-series alternates between standard local-attention layers and Griffin linear-recurrent (SSM) layers. The SSM layers use a custom Gemma4ClippableLinear wrapper that is incompatible with PEFT's default module injection. To work around this, LoRA adapters are injected into a pre-filtered list of 205 Linear4bit instances — verified by isinstance check at load time — covering attention projections (q/k/v/o_proj) and MLP layers (gate/up/down_proj) across all 26 layers, while safely skipping all SSM-layer wrappers. After training, adapters are merged into the base weights using PeftModel.merge_and_unload().
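
The pre-filtering step described above can be sketched roughly as below. This is an illustration under stated assumptions, not the author's actual training script: the function name `select_lora_targets`, the stand-in classes, and the module paths in the toy list are all hypothetical.

```python
from typing import Iterable, List, Tuple

# Projection names listed in this card as LoRA targets.
TARGET_LEAF_NAMES = frozenset(
    {"q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"}
)


def select_lora_targets(
    named_modules: Iterable[Tuple[str, object]],
    linear_type: type,
    leaf_names: frozenset = TARGET_LEAF_NAMES,
) -> List[str]:
    """Return full module paths whose leaf name is targeted and whose type matches."""
    targets = []
    for name, module in named_modules:
        leaf = name.rsplit(".", 1)[-1]
        # The isinstance check skips wrapper modules that merely share a name.
        if leaf in leaf_names and isinstance(module, linear_type):
            targets.append(name)
    return targets


# Toy stand-ins so the sketch runs without torch or bitsandbytes.
class FakeLinear4bit:
    pass


class FakeClippableLinear:
    pass


toy_modules = [
    ("layers.0.self_attn.q_proj", FakeLinear4bit()),
    ("layers.0.mlp.gate_proj", FakeLinear4bit()),
    ("layers.1.recurrent.q_proj", FakeClippableLinear()),  # wrapper: skipped
    ("layers.1.norm", object()),                           # not a target name
]
print(select_lora_targets(toy_modules, FakeLinear4bit))
```

On a real model, the same function could presumably be called with `model.named_modules()` and `bitsandbytes.nn.Linear4bit`, and the resulting list of full paths passed as `target_modules` to `peft.LoraConfig`.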

Training curve (logged every 25 steps):

| Step | Loss | Token Accuracy |
|---|---|---|
| 25 | 1.696 | 70.8% |
| 50 | 0.7828 | 79.2% |
| 75 | 0.737 | 80.3% |
| 100 | 0.7311 | 80.4% |
| 125 | 0.6896 | 81.5% |
| 150 | 0.695 | 81.2% |
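
One way to read the loss column is to convert it to perplexity. The sketch below assumes the logged loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated explicitly in this card.

```python
import math

# (step, loss) pairs copied from the training curve table.
TRAINING_LOSS = [
    (25, 1.696),
    (50, 0.7828),
    (75, 0.737),
    (100, 0.7311),
    (125, 0.6896),
    (150, 0.695),
]


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(loss)


for step, loss in TRAINING_LOSS:
    print(f"step {step:>3}: loss {loss:.4f} -> perplexity {perplexity(loss):.2f}")
```

Under that assumption, the final loss of 0.695 corresponds to a perplexity of about 2.0 on the training data.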

Evaluation

HumanEval (pass@1)

Evaluated on the full OpenAI HumanEval benchmark (164 Python problems) using Q4_K_M GGUF via Ollama, raw completion mode (no chat template), temperature 0.2, 512 max tokens.

| Model | Size | HumanEval pass@1 |
|---|---|---|
| gemma-4-E2B-coder-v1 (Q4_K_M) | 3.9B | 34.1% |
| Code Llama | 7B | 33.5% |
| Qwen2.5-Coder | 1.5B | 37.2% |
| Llama 3.2 | 3B | 25.4% |
| Gemma 2 | 2B | 18.7% |

At 34.1% pass@1, the model is competitive with Code Llama 7B (33.5%) at roughly half the parameter count. This is notable given that it was fine-tuned on only 10,000 samples with a 384-token context window; longer-context problems are the primary failure mode.
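
For reference, pass@k is commonly computed with the unbiased estimator from the original HumanEval paper. The sketch below applies it under two assumptions that this card does not state: one sample was generated per problem, and 56 of the 164 problems passed (inferred from 34.1% of 164, which is about 55.9).

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem with n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over all problems in the benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Assumption: 56 solved problems out of 164, one sample each.
solved = [1] * 56 + [0] * 108
score = benchmark_pass_at_k(solved, n=1, k=1)
print(f"{score:.1%}")
```

With a single sample per problem, the estimator reduces to the plain fraction of problems solved.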

Keyword Score

Keyword-based evaluation on 8 coding prompts using Q4_K_M GGUF (CPU inference, llama.cpp b9684, temperature 0.2):

| Prompt | Keywords checked | Score |
|---|---|---|
| Miller-Rabin primality test | `miller`, `witness`, `def is_prime` | 33% |
| Binary search | `mid`, `lo`, `hi`, `def binary_search` | 75% |
| Thread-safe LRU cache | `OrderedDict`, , , | |

Average keyword score: 88.5% (8 prompts).

Keyword scoring checks that expected API and structural elements appear in the output; it is a proxy for code correctness, not a formal benchmark. The Miller-Rabin score (33%) is low because the model wrote a functionally correct implementation using the variable names `a` and `x` rather than the keyword-matched names `miller` and `witness`.
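
A keyword score of this kind can be sketched as a simple substring check. The exact matching rule used for the numbers above is not documented, so the case-insensitive substring match below is an assumption, and `sample_output` is an illustrative stub rather than real model output.

```python
from typing import Sequence


def keyword_score(output: str, keywords: Sequence[str]) -> float:
    """Fraction of keywords that appear in the output (case-insensitive substring match)."""
    text = output.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits / len(keywords)


# Illustrative stub: defines is_prime but uses the names a and x.
sample_output = """
def is_prime(n):
    # loop over bases a and compute x = pow(a, d, n)
    ...
"""

score = keyword_score(sample_output, ["miller", "witness", "def is_prime"])
print(f"{score:.0%}")
```

Only one of the three keywords matches, which gives 33%, the same pattern as the Miller-Rabin row. Substring matching is also permissive: a short keyword such as `lo` would match inside `low` or `lower`, which is another reason to treat the score as a rough proxy.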

Limitations

384-token training maximum: prompt-plus-response sequences longer than this were truncated during fine-tuning, so quality may degrade on longer inputs
Not evaluated on security-sensitive code generation tasks
Inherits biases and knowledge cutoff of google/gemma-4-E2B-it
Text-only; multimodal capabilities of the base model are not fine-tuned here
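
Given the 384-token training window, it may be useful to check prompt length before sending a request. The helper below is a small sketch; `fits_training_window` is a hypothetical name, and the whitespace split is only a stand-in for the model's real tokenizer, which will usually produce more tokens.

```python
from typing import Callable, Sequence

# Training maximum stated in this card.
TRAINING_MAX_TOKENS = 384


def fits_training_window(
    text: str,
    encode: Callable[[str], Sequence],
    max_tokens: int = TRAINING_MAX_TOKENS,
) -> bool:
    """Return True if the encoded text is no longer than the training window."""
    return len(encode(text)) <= max_tokens


# Whitespace split is a rough stand-in for a real tokenizer.
print(fits_training_window("Write a binary search in Python", str.split))
```

With the tokenizer loaded as in the Quick Start, `tokenizer.encode` could be passed as the `encode` argument to get an accurate count. Inputs over the limit may still work, since the limit applies to fine-tuning data rather than inference.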

Citation

```bibtex
@misc{gemma4_2026,
  title={Gemma 4: Open Models Based on Gemini Research and Technology},
  author={Google DeepMind},
  year={2026},
}

@misc{magicoder2023,
  title={Magicoder: Source Code Is All You Need},
  author={Wei, Yuxiang and Wang, Zhe and Liu, Jiawei and Ding, Yifeng and Zhang, Lingming},
  year={2023},
  eprint={2312.02120},
  archivePrefix={arXiv},
}
```

