JonnyJF/laguna-xs2-coding-pruned-9pct API & Inference Endpoint

Files

| File | Purpose |
| --- | --- |
| `adapter_model.safetensors` + `adapter_config.json` | peft LoRA adapter on the shared expert |
| `drops.json` | `{moe_block_idx (0..38): [expert_indices_to_mask]}`; apply BEFORE loading the adapter |
| `prune.py` | `mask_experts()` helper that applies the mask by setting the router bias to -inf |
| `mask_snap.json` | same as `drops.json`, saved by the training script as a sanity check |

Usage

```python
import json

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# prune.py and drops.json ship in this repo; run this from a local copy of it.
from prune import mask_experts

REPO = "poolside-laguna-hackathon/<this-repo>"

tok = AutoTokenizer.from_pretrained("poolside/Laguna-XS.2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "poolside/Laguna-XS.2",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# 1. Mask the cold-tail experts (sets router bias to -inf for dropped experts).
#    JSON object keys are strings, so cast the block indices back to int.
with open("drops.json") as f:
    drops = {int(k): v for k, v in json.load(f).items()}
mask_experts(model, drops)

# 2. Attach the LoRA adapter
model = PeftModel.from_pretrained(model, REPO)
model.eval()
```

Methodology

Skew measurement: forward-hook every `LagunaTopKRouter` (39 in total) and accumulate per-(layer, expert) selection counts and renormalised routing-weight mass over 30 HumanEval prompts × ~200 greedy-generated tokens (~250k routing events).
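The accumulation step above can be sketched in plain Python, without real forward hooks. The `accumulate` function and the event layout are hypothetical, and the sketch assumes "renormalised" means each token's weights are divided by their sum over the selected top-k; in the actual probe the (expert, weight) pairs would come from the router's output.

```python
from collections import defaultdict


def accumulate(events):
    """Accumulate per-(layer, expert) selection counts and routing-weight mass.

    `events` is an iterable of (layer, [(expert, weight), ...]) entries,
    one per token per MoE layer (hypothetical layout for illustration).
    """
    counts = defaultdict(int)
    mass = defaultdict(float)
    for layer, topk in events:
        total = sum(weight for _, weight in topk)
        for expert, weight in topk:
            counts[(layer, expert)] += 1
            # Assumption: mass is the weight renormalised over the selected top-k.
            mass[(layer, expert)] += weight / total
    return counts, mass


# Two toy routing events at layer 0 with top-2 routing.
events = [
    (0, [(3, 0.6), (7, 0.2)]),
    (0, [(3, 0.5), (9, 0.5)]),
]
counts, mass = accumulate(events)
```

Each event contributes exactly 1.0 of mass, so the mass for a layer sums to the number of tokens seen, which makes per-layer shares easy to compare.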
Cold-tail identification: pick the bottom-N% experts by mass per layer. Under DeepSeek-V3-style routing (`sigmoid(logits) + e_score_correction_bias` for load balancing), counts get forced toward uniform but mass still skews. Mass is the truth.
Layer-weighted spec: ranking layers by how many experts are needed to cover 80% of routing mass showed layers 13–17 needing only 52–66 of 256 experts (vs. a median of 87 and a maximum of 126 in early layers). The prune budget is therefore 25% in layers 11–18 and 5% elsewhere.
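A minimal sketch of the selection rule and the resulting budget, in plain Python. The function names are hypothetical, and rounding the per-layer drop count down is an assumption (the text does not say how fractional counts are handled). The layer range is read as MoE block indices 11–18 out of 0..38.

```python
NUM_LAYERS = 39    # MoE blocks, indexed 0..38 as in drops.json
NUM_EXPERTS = 256  # routed experts per block


def prune_fraction(layer):
    """Layer-weighted spec from the text: 25% in layers 11-18, 5% elsewhere."""
    return 0.25 if 11 <= layer <= 18 else 0.05


def cold_tail(mass_per_expert, fraction):
    """Indices of the lowest-mass experts (drop count rounded down: an assumption)."""
    n_drop = int(len(mass_per_expert) * fraction)
    order = sorted(range(len(mass_per_expert)), key=lambda e: mass_per_expert[e])
    return sorted(order[:n_drop])


def experts_for_mass(mass_per_expert, target=0.8):
    """Smallest number of top experts whose combined mass reaches `target`."""
    total = sum(mass_per_expert)
    running = 0.0
    for n, m in enumerate(sorted(mass_per_expert, reverse=True), start=1):
        running += m
        if running >= target * total:
            return n
    return len(mass_per_expert)


# Total budget implied by the spec under the rounding assumption above.
dropped_total = sum(int(NUM_EXPERTS * prune_fraction(layer)) for layer in range(NUM_LAYERS))
reduction = dropped_total / (NUM_LAYERS * NUM_EXPERTS)
print(dropped_total, f"{reduction:.1%}")  # 884 8.9%
```

Under these assumptions the spec drops 64 experts in each of 8 layers and 12 in each of the other 31, which is 884 of 9,984 routed experts and matches the roughly 9% figure in the model name.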
Mask, don't slice (yet): `mask_experts()` sets `e_score_correction_bias` to -inf for dropped experts, so the router's top-k never picks them and the renormalisation across the surviving top-k handles the rest. Tensors keep their shape; slicing is a separate ship step.
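The effect of the -inf bias can be illustrated with a toy router in plain Python. This is not the code in `prune.py`: `route` and `mask` are hypothetical helpers, and the sketch assumes Laguna follows the DeepSeek-V3 pattern in which the bias affects only which experts are selected, while the mixing weights come from the unbiased sigmoid scores.

```python
import math


def route(logits, bias, k):
    """Toy top-k router: select on sigmoid(logit) + bias, weight by sigmoid(logit)."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(logits)), key=lambda e: biased[e], reverse=True)[:k]
    total = sum(scores[e] for e in chosen)
    # Renormalise over the experts that were actually selected.
    return {e: scores[e] / total for e in chosen}


def mask(bias, dropped):
    """Return a copy of `bias` with -inf at every dropped expert index."""
    return [-math.inf if e in dropped else b for e, b in enumerate(bias)]


logits = [2.0, 1.5, -0.5, 0.3]
bias = [0.0] * len(logits)

before = route(logits, bias, k=2)            # selects experts 0 and 1
after = route(logits, mask(bias, {1}), k=2)  # expert 1 masked; expert 3 takes its slot
```

A masked expert can only be selected if fewer than `k` unmasked experts remain in a layer, so the drop list should always leave at least `k` survivors.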
Distillation healing: the frozen, unpruned Laguna model generates teacher completions on 120 MBPP prompts. The mask-pruned model gets a LoRA on the shared expert and is trained via SFT to reproduce those completions (cross-entropy on the completion tokens only).
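Restricting the loss to completion tokens is commonly done by masking the prompt positions in the label sequence. The sketch below shows that convention with made-up token ids; `build_labels` is a hypothetical helper, not the training script's code. It relies on -100 being the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion; mask prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


# Made-up token ids: a 3-token prompt followed by a 2-token teacher completion.
input_ids, labels = build_labels([11, 12, 13], [21, 22])
```

Only the positions that carry a real label contribute to the loss, so the student is pushed toward the teacher's completion without being trained to reproduce the prompt.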

Architectural note

Laguna's routed experts are stored as batched 3D parameter tensors (`gate_up_proj`: `(256, 1024, 2048)`, `down_proj`: `(256, 2048, 512)`) and used via manual matmul, not `nn.Linear`. peft's LoRA can't target raw parameter slices, so the adapter goes on the always-on shared expert (`shared_experts.{gate,up,down}_proj`, which ARE `nn.Linear`). The shared expert sees every token, so a LoRA delta there absorbs the lost routed-expert contribution into the always-on path.

Honest limitations

8.9% routed-expert reduction is modest. DeepSeek-V3 load balancing fights skew effectively; the cold tail is real but not dramatic.
LoRA targets the shared expert, not the routed survivors. Direct healing of the routed survivors needs a custom LoRA wrapper for the batched-tensor storage — left as future work.
No HumanEval / SWE-bench numbers are reported here: the eval pipeline hit a trimming bug mid-hackathon (the model emitted `</assistant>` chat tokens that the original trim heuristic missed). The eval harness is fixed in the source repo; numbers will be uploaded as a follow-up.
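For illustration, a trim step that handles the stray chat token could look like the sketch below. This is not the harness from the source repo: `trim_completion` is a hypothetical helper, and `</assistant>` is the only marker taken from the text above.

```python
def trim_completion(text, stop_markers=("</assistant>",)):
    """Cut `text` at the earliest stop marker, if any marker is present."""
    cut = len(text)
    for marker in stop_markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()


raw = "def add(a, b):\n    return a + b\n</assistant>\nSure, here is another version"
print(trim_completion(raw))
```

Text without any marker passes through unchanged apart from trailing whitespace, so the helper is safe to apply to every completion before scoring.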

Source code

Full hackathon repo (probe, prune, eval, distillation, model card): see the Source code link in this repo's README.

Built by Jonathan Farrow (@JonnyJF) for the 30 May 2026 Poolside × Prime Intellect hackathon.
