README

License: mit

What this is

The student model sees only a code diff and writes a haiku review from scratch, acting as an autonomous reviewer that finds and phrases the issue itself. The human reviewer comment is never shown to the model; it was used only to build the teacher's distillation target and to score the output. This framing deliberately makes relevance a strict test.

Status

  • v0.1: shipped as a documented baseline. LoRA SFT on Gemma 4 E4B. Form regressed rather than improved, though the change is not statistically significant. Content metrics (relevance, category) improved and cleared their targets.
  • v1.0: in progress (reasoning-token architecture). Not yet released.

Evaluation

Held-out set of 100 examples (the Python subset of ronantakizawa/codereview-bench), scored by the canonical harness (scripts/score.py) with the same prompt and greedy decoding for both the base and the fine-tuned model (bench-drop discipline: same code, same set, both sides).

| Metric | Base floor | v0.1 | Golden number | Verdict |
|---|---|---|---|---|
| #1 Form (valid 5-7-5) | 14% (14/100) | 8% (8/100) | >= 45% | Miss |
| #3 Relevance (real-pair mean) | 0.321 | 0.394 | >= 0.37 | Hit |
| #2 Category (4-class macro-F1) | 0.319 | 0.458 | hold gap (>= +0.139) | Hit |
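For readers unfamiliar with the Category metric, the following is a minimal sketch of 4-class macro-F1 and of a majority-class baseline. It is not the code in scripts/score.py. The class names are hypothetical placeholders (this card does not list the four categories), and scoring a class with no predictions and no true instances as 0 is an assumption.

```python
from collections import Counter

# Hypothetical class names, for illustration only.
LABELS = ["bug", "style", "perf", "docs"]

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumption: an undefined F1 (empty denominator) counts as 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def majority_baseline(y_true, labels):
    """Macro-F1 of a predictor that always outputs the most common class."""
    majority = Counter(y_true).most_common(1)[0][0]
    return macro_f1(y_true, [majority] * len(y_true), labels)

# Toy data: 5 of 10 rows belong to the majority class.
toy_true = ["bug"] * 5 + ["style"] * 3 + ["perf", "docs"]
print(round(majority_baseline(toy_true, LABELS), 3))  # 0.167
```

Because macro-F1 weights every class equally, a constant predictor scores well below its accuracy, which is why the gap over the majority baseline is the quantity tracked here.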

Details:

  • Form regressed from 14% to 8%. Paired McNemar exact test: p = 0.263 (n_discordant = 20; 13 rows lost validity, 7 gained). The change is not statistically significant; pure SFT on 113 pairs did not transfer the generate-check-revise loop's form skill into the weights.
  • Relevance (metric #3): real-pair mean 0.321 -> 0.394, clearing the 0.37 golden. The gap over the shuffle floor held and widened (+0.192 -> +0.287; shuffle floor 0.129 -> 0.107).
  • Category (metric #2): 4-class macro-F1 0.319 -> 0.458, majority baseline 0.179. The gap over baseline more than doubled (+0.139 -> +0.279), clearing the hold-the-gap commitment.
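The McNemar figure for the form metric can be reproduced from the discordant counts alone. The sketch below assumes the standard two-sided exact (binomial) form of the test; the function name is illustrative and not taken from the harness.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis each discordant pair is equally likely to
    fall on either side, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 13 rows lost validity and 7 gained it (n_discordant = 20).
print(round(mcnemar_exact(13, 7), 3))  # 0.263
```

Only the 20 rows whose validity changed enter the test; the 80 concordant rows carry no information about the direction of the change.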

Golden numbers were committed before any SFT run, so the targets cannot be retrofitted to the results. The form target was the ambitious one (~3.2x the floor); a miss there is reported honestly and is on-thesis for a project whose argument is that the eval mattered more than the model.

Training

  • Base: google/gemma-4-E4B-it
  • Method: LoRA SFT (r=16, lora_alpha=32, lora_dropout=0.05, target_modules="all-linear", CAUSAL_LM), completion-only loss with the prompt tokens masked.
  • Data: 113 (diff, haiku) pairs. Haiku targets generated by the teacher Qwen/Qwen3-30B-A3B-Instruct-2507 via the v2 teacher loop (form check plus a relevance gate).
  • Optimization: effective batch 8 (per_device_train_batch_size=1 x gradient_accumulation_steps=8), learning_rate=2e-4, bf16, 3 epochs (about 45 optimizer steps).
  • Hardware: A100-80GB on Modal.
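Two details in the list above can be illustrated with a short sketch: completion-only loss masking and the optimizer step count. This is not the project's training script. It assumes the common convention that label -100 is ignored by the cross-entropy loss, a single device, and that a partial final accumulation window still counts as one optimizer step. The helper names and token IDs are made up.

```python
import math

IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids to labels, hiding the prompt tokens from the loss."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])

def optimizer_steps(n_examples, per_device_batch, grad_accum, epochs):
    """Optimizer steps on one device, rounding partial windows up."""
    micro_batches = math.ceil(n_examples / per_device_batch)
    return math.ceil(micro_batches / grad_accum) * epochs

# A 2-token prompt followed by a 3-token completion (toy token IDs).
print(mask_prompt_labels([5, 6, 7, 8, 9], prompt_len=2))  # [-100, -100, 7, 8, 9]

# 113 pairs, batch 1, accumulation 8, 3 epochs.
print(optimizer_steps(113, 1, 8, 3))  # 45
```

If the trainer drops the partial window instead, the count is 14 steps per epoch (42 in total), which is consistent with the "about 45" figure.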

Limitations

  • Form regression: v0.1 produces fewer valid 5-7-5 haikus than the base model. The loop can steer the teacher to valid form at inference, but a single SFT pass on 113 pairs did not install that capability in the student.
  • Counter caveats: the harness syllable counter has known limits on snake_case identifiers and unknown acronyms (it does not consult the acronym table inside a snake_case split; see the project's open issues). Identifier-heavy lines have syllable counts that are non-trivial to verify by ear.
  • Scope: research and educational use only. Not for production code review.
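To make the counter caveat concrete, here is a toy illustration of how an identifier's syllable count changes depending on whether an acronym table is consulted for each snake_case part. This is not the harness counter. The vowel-group heuristic, the table entries, and the function names are all assumptions chosen for the example.

```python
import re

# Hypothetical acronym table: syllables when spelled out letter by letter.
ACRONYM_SYLLABLES = {"url": 3, "api": 3, "sql": 3}

def naive_syllables(word):
    """Rough heuristic: count vowel groups, discount a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def identifier_syllables(identifier, use_acronyms):
    """Sum syllables over the snake_case parts of an identifier."""
    total = 0
    for part in identifier.split("_"):
        if not part:
            continue
        if use_acronyms and part.lower() in ACRONYM_SYLLABLES:
            total += ACRONYM_SYLLABLES[part.lower()]
        else:
            total += naive_syllables(part)
    return total

print(identifier_syllables("parse_url", use_acronyms=False))  # 2
print(identifier_syllables("parse_url", use_acronyms=True))   # 4
```

A two-syllable difference on a single identifier is enough to flip a line between valid and invalid, so form scores on identifier-heavy haikus should be read with that margin in mind.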

License

  • LoRA adapter weights in this repo: MIT.
  • Combined model (base Gemma 4 E4B plus this adapter): subject to the Gemma Terms of Use; see google/gemma-4-E4B-it. MIT applies to the adapter, not to the base weights.

Model provider

shahfazal

Model tree

Base

google/gemma-4-E4B-it

Adapter

this model

Modalities

Input

Text, Image

Output

Text
