shahfazal/lgtm-575-gemma4-e4b-v0.1

What this is

The student model (this adapter on Gemma 4 E4B) sees only a code diff and writes a haiku review from scratch. It acts as an autonomous reviewer: it must find and phrase the issue itself. The human reviewer comment is never shown to the model; it was used only to build the teacher's distillation target and to score the output. By design, that framing makes relevance a hard test.

Status

v0.1: shipped as a documented baseline. LoRA SFT on Gemma 4 E4B. Form (valid 5-7-5) regressed from 14% to 8%, though the change is not statistically significant. Content metrics (relevance, category) improved and cleared their targets.
v1.0: in progress (reasoning-token architecture). Not yet released.

Evaluation

Held-out set of 100 examples (Python subset of ronantakizawa/codereview-bench), scored by the canonical harness (scripts/score.py) with the same prompt and greedy decoding for both the base and the fine-tuned model (bench-drop discipline: same code, same set, both sides).

Metric	Base floor	v0.1	Golden number	Verdict
#1 Form (valid 5-7-5)	14% (14/100)	8% (8/100)	>= 45%	Miss
#3 Relevance (real-pair mean)	0.321	0.394	>= 0.37	Hit
#2 Category (4-class macro-F1)	0.319	0.458	hold gap (>= +0.139)	Hit

Details:

Form regressed from 14% to 8%. Paired McNemar exact test: p = 0.263 (n_discordant = 20; 13 rows lost validity, 7 gained). The change is not statistically significant; pure SFT on 113 pairs did not transfer the generate-check-revise loop's form skill into the weights.
Relevance (metric #3): real-pair mean 0.321 -> 0.394, clearing the 0.37 golden. The gap over the shuffle floor held and widened (+0.192 -> +0.287; shuffle floor 0.129 -> 0.107).
Category (metric #2): 4-class macro-F1 0.319 -> 0.458, majority baseline 0.179. The gap over baseline more than doubled (+0.139 -> +0.279), clearing the hold-the-gap commitment.
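
The form p-value above can be checked by hand from the two discordant counts (13 rows lost validity, 7 gained). The following is a minimal sketch of a two-sided exact McNemar test using only the standard library; the function name `mcnemar_exact` is illustrative and is not taken from scripts/score.py, which may compute the statistic differently.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a change
    k = min(b, c)
    # Under H0 each discordant pair is a fair coin flip: X ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the smaller tail for a two-sided test, capped at 1.
    return min(1.0, 2.0 * tail)


lost, gained = 13, 7  # discordant rows reported for the form metric
p_value = mcnemar_exact(lost, gained)
print(f"n_discordant={lost + gained}, p={p_value:.3f}")  # n_discordant=20, p=0.263
```

With only 20 discordant rows, a 13-to-7 split is well within what a fair coin produces, which is why the regression is reported as not significant.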

Golden numbers were committed before any SFT run, so the targets cannot be retrofitted to the results. The form target was the ambitious one (~3.2x the base floor); the miss is reported as a miss, which is on-thesis for a project whose argument is that the eval mattered more than the model.
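
As a sanity check, the verdicts in the table can be recomputed from the reported numbers. This is a sketch that assumes each metric is compared to its golden number with a simple `>=`, and that the category commitment is measured as the gap over the majority baseline; the dictionary keys are illustrative and do not come from the harness.

```python
# Values copied from the evaluation table and details above.
MAJORITY_BASELINE_F1 = 0.179

results = {
    "form": 8 / 100,                               # valid 5-7-5 rate for v0.1
    "relevance": 0.394,                            # real-pair mean
    "category_gap": 0.458 - MAJORITY_BASELINE_F1,  # macro-F1 gap over baseline
}
golden = {"form": 0.45, "relevance": 0.37, "category_gap": 0.139}

# Assumed rule: a metric is a "Hit" when it meets or exceeds its target.
verdicts = {
    name: ("Hit" if results[name] >= golden[name] else "Miss")
    for name in golden
}

# How ambitious the form target was relative to the base floor of 14%.
form_target_vs_floor = round(golden["form"] / 0.14, 1)

for name, verdict in verdicts.items():
    print(f"{name}: {results[name]:.3f} vs {golden[name]:.3f} -> {verdict}")
print(f"form target is {form_target_vs_floor}x the floor")
```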

Training

Base: google/gemma-4-E4B-it
Method: LoRA SFT (r=16, lora_alpha=32, lora_dropout=0.05, target_modules="all-linear", CAUSAL_LM), completion-only loss with the prompt tokens masked.
Data: 113 (diff, haiku) pairs. Haiku targets generated by the teacher Qwen/Qwen3-30B-A3B-Instruct-2507 via the v2 teacher loop (form check plus a relevance gate).
Optimization: effective batch 8 (per_device_train_batch_size=1 x gradient_accumulation_steps=8), learning_rate=2e-4, bf16, 3 epochs (about 45 optimizer steps).
Hardware: A100-80GB on Modal.
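
The "about 45 optimizer steps" figure follows from the batch settings above. The sketch below works through the arithmetic; the `SftSchedule` class is hypothetical, and a single device is an assumption based on the one A100 listed. Whether the final partial accumulation window in each epoch counts as a step depends on the trainer version, so both roundings are shown.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SftSchedule:
    """Hypothetical container for the hyperparameters listed above."""
    num_examples: int = 113
    per_device_train_batch_size: int = 1
    gradient_accumulation_steps: int = 8
    num_epochs: int = 3
    num_devices: int = 1  # assumption: one A100

    @property
    def effective_batch(self) -> int:
        return (self.per_device_train_batch_size
                * self.gradient_accumulation_steps
                * self.num_devices)

    def optimizer_steps(self, round_up: bool = True) -> int:
        """Total optimizer steps, with or without the last partial window."""
        per_epoch = self.num_examples / self.effective_batch
        rounding = math.ceil if round_up else math.floor
        return rounding(per_epoch) * self.num_epochs


schedule = SftSchedule()
print(schedule.effective_batch)          # 8
print(schedule.optimizer_steps(True))    # 45 (15 steps per epoch)
print(schedule.optimizer_steps(False))   # 42 (14 steps per epoch)
```

Either way the run is short: a few dozen weight updates, which fits the limitation noted below that a single SFT pass on 113 pairs did not install the form skill.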

Limitations

Form regression: v0.1 produces fewer valid 5-7-5 haikus than the base model. The loop can steer the teacher to valid form at inference, but a single SFT pass on 113 pairs did not install that capability in the student.
Counter caveats: the harness syllable counter has known limits on snake_case identifiers and unknown acronyms (it does not consult the acronym table inside a snake_case split; see the project's open issues). Identifier-heavy lines have syllable counts that are non-trivial to verify by ear.
Scope: research and educational use only. Not for production code review.

License

LoRA adapter weights in this repo: MIT.
Combined model (base Gemma 4 E4B plus this adapter): subject to the Gemma Terms of Use, see google/gemma-4-E4B-it. MIT applies to the adapter, not to the base weights.
