README
License: apache-2.0
Files
| File | Purpose |
|---|---|
| `adapter_model.safetensors` + `adapter_config.json` | peft LoRA adapter on the shared expert |
| `drops.json` | `{moe_block_idx (0..38): [expert_indices_to_mask]}`; apply BEFORE loading the adapter |
| `prune.py` | `mask_experts()` helper to apply the mask via router bias = -inf |
| `mask_snap.json` | same as `drops.json`, saved by the training script as a sanity check |
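The `drops.json` layout described in the table can be sanity-checked before it is handed to `mask_experts()`. The following is a minimal sketch, not part of this repo: `load_drops` is a hypothetical helper, the example content is made up, and the bounds assume 39 MoE blocks (indices 0..38) with 256 routed experts each, as stated elsewhere in this README.

```python
import json

NUM_MOE_BLOCKS = 39  # block indices 0..38, per the Files table
NUM_EXPERTS = 256    # routed experts per block, per the Architectural note


def load_drops(text: str) -> dict[int, list[int]]:
    """Parse drops.json content and check that every index is in range."""
    drops = {int(k): list(v) for k, v in json.loads(text).items()}
    for block, experts in drops.items():
        if not 0 <= block < NUM_MOE_BLOCKS:
            raise ValueError(f"block index {block} outside 0..{NUM_MOE_BLOCKS - 1}")
        if len(set(experts)) != len(experts):
            raise ValueError(f"block {block} lists an expert more than once")
        bad = [e for e in experts if not 0 <= e < NUM_EXPERTS]
        if bad:
            raise ValueError(f"block {block} has out-of-range experts: {bad}")
    return drops


# Made-up example content, not the real drops.json from this repo.
example = '{"0": [3, 17], "12": [5, 200, 255]}'
drops = load_drops(example)
print(drops)
```

JSON object keys are always strings, which is why the keys are converted to `int` on load.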
Usage
```python
import json

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

from prune import mask_experts  # included in this repo

REPO = "poolside-laguna-hackathon/<this-repo>"

tok = AutoTokenizer.from_pretrained("poolside/Laguna-XS.2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "poolside/Laguna-XS.2",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# 1. Mask the cold-tail experts (sets router bias to -inf for dropped experts)
with open("drops.json") as f:
    drops = {int(k): v for k, v in json.load(f).items()}
mask_experts(model, drops)

# 2. Attach the LoRA adapter
model = PeftModel.from_pretrained(model, REPO)
model.eval()
```
Methodology
- Skew measurement: forward-hook every `LagunaTopKRouter` (39 of them). Accumulate per-(layer, expert) selection counts and renormalised routing-weight mass over 30 HumanEval prompts × ~200 greedy-generated tokens (~250k routing events).
- Cold-tail identification: pick bottom-N% experts by mass per layer. Under DeepSeek-V3-style routing (`sigmoid(logits) + e_score_correction_bias` for load balancing) counts get forced toward uniform but mass still skews. Mass is the truth.
- Layer-weighted spec: the per-layer 80%-mass leaderboard showed layers 13–17 needing only 52–66 of 256 experts for 80% mass (vs. median 87, max 126 in early layers). So: 25% prune in layers 11–18, 5% elsewhere.
- Mask, don't slice (yet): `mask_experts()` sets `e_score_correction_bias` to `-inf` for dropped experts. The router's top-k will never pick them; the renormalisation across the surviving top-k handles the rest. Tensors keep their shape; slicing is a separate ship step.
- Distillation healing: frozen full Laguna generates teacher completions on 120 MBPP prompts. The pruned-via-mask model gets a LoRA on the shared expert and is trained to match the teacher's completions via SFT (cross-entropy on the completion tokens only).
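The measure, pick, and mask steps above can be illustrated on a toy router. This is a sketch under stated assumptions, not the code in `prune.py`: the sizes (16 experts, top-4, 500 tokens) and the random logits are made up, and it assumes the DeepSeek-V3 convention in which the bias only influences which experts are selected while the routing weights come from the unbiased sigmoid scores, renormalised over the selected top-k.

```python
import math
import random


def route(logits, bias, top_k):
    """Select top_k experts by sigmoid(logit) + bias; weight them by the
    unbiased sigmoid scores, renormalised to sum to 1."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(logits)), key=lambda e: biased[e], reverse=True)[:top_k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def accumulate(events, num_experts):
    """Per-expert selection counts and renormalised routing-weight mass."""
    counts = [0] * num_experts
    mass = [0.0] * num_experts
    for weights in events:
        for expert, weight in weights.items():
            counts[expert] += 1
            mass[expert] += weight
    return counts, mass


def cold_tail(mass, fraction):
    """Indices of the bottom `fraction` of experts by accumulated mass."""
    n_drop = int(len(mass) * fraction)
    return sorted(sorted(range(len(mass)), key=lambda e: mass[e])[:n_drop])


def mask(bias, dropped):
    """Copy of `bias` with -inf at the dropped experts."""
    dropped = set(dropped)
    return [-math.inf if e in dropped else b for e, b in enumerate(bias)]


# Toy setup: sizes and logits are illustrative, not Laguna's.
random.seed(0)
NUM_EXPERTS, TOP_K, NUM_TOKENS = 16, 4, 500
offsets = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
tokens = [[random.gauss(o, 1.0) for o in offsets] for _ in range(NUM_TOKENS)]

bias = [0.0] * NUM_EXPERTS
counts, mass = accumulate([route(t, bias, TOP_K) for t in tokens], NUM_EXPERTS)
dropped = cold_tail(mass, 0.25)
masked_bias = mask(bias, dropped)
after = [route(t, masked_bias, TOP_K) for t in tokens]
print("dropped experts:", dropped)
```

Because `sigmoid(x) + (-inf)` is `-inf`, a masked expert always sorts last and is never among the top-k as long as at least `top_k` experts survive. The weights of the survivors are renormalised in the same call, so no tensor needs to change shape.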
Architectural note
Laguna's routed experts are stored as batched 3D parameter tensors
(gate_up_proj: (256, 1024, 2048), down_proj: (256, 2048, 512)) and used
via manual matmul, not nn.Linear. peft's LoRA can't target raw
parameter slices, so the adapter goes on the always-on shared expert
(shared_experts.{gate,up,down}_proj, which ARE nn.Linear). The shared
expert sees every token, so a LoRA delta there absorbs the lost routed-expert
contribution into the always-on path.
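One way to build the adapter's target list is to filter module names by the shared-expert suffixes named above. This is a minimal sketch: `shared_expert_targets` is a hypothetical helper, and the module paths in `names` are invented for illustration, so the real ones should be read from `model.named_modules()`.

```python
SHARED_EXPERT_SUFFIXES = (
    "shared_experts.gate_proj",
    "shared_experts.up_proj",
    "shared_experts.down_proj",
)


def shared_expert_targets(module_names):
    """Keep only the module names that end in a shared-expert projection."""
    return [name for name in module_names if name.endswith(SHARED_EXPERT_SUFFIXES)]


# Hypothetical module paths for one layer; check model.named_modules()
# before relying on any of these names.
names = [
    "model.layers.12.mlp.gate",
    "model.layers.12.mlp.experts",
    "model.layers.12.mlp.shared_experts.gate_proj",
    "model.layers.12.mlp.shared_experts.up_proj",
    "model.layers.12.mlp.shared_experts.down_proj",
]
targets = shared_expert_targets(names)
print(targets)
```

The resulting list can be passed as `target_modules` to peft's `LoraConfig`, which accepts a list of module names. The batched routed-expert tensors never appear in it because they are parameters of a single module rather than separate `nn.Linear` layers.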
Honest limitations
- 8.9% routed-expert reduction is modest. DeepSeek-V3 load balancing fights skew effectively; the cold tail is real but not dramatic.
- LoRA targets the shared expert, not the routed survivors. Direct healing of the routed survivors needs a custom LoRA wrapper for the batched-tensor storage — left as future work.
- No HumanEval / SWE-bench numbers reported here: eval pipeline ran into trim-heuristic issues mid-hackathon (model emitted `</assistant>` chat tokens that the original trim missed). The eval harness is fixed in the source repo; numbers will be uploaded as a follow-up.
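The 8.9% figure in the first limitation is consistent with the layer-weighted spec from the Methodology section. The check below is a sketch that assumes layers are indexed 0..38, that "layers 11–18" is inclusive, and that the per-layer drop count is rounded down; the actual prune script may round differently.

```python
NUM_LAYERS = 39
NUM_EXPERTS = 256
HEAVY_LAYERS = range(11, 19)  # layers 11..18 inclusive (assumed 0-indexed)
HEAVY_FRACTION, LIGHT_FRACTION = 0.25, 0.05

# int() rounds down: 64 experts in a heavy layer, 12 in every other layer.
dropped = sum(
    int(NUM_EXPERTS * (HEAVY_FRACTION if layer in HEAVY_LAYERS else LIGHT_FRACTION))
    for layer in range(NUM_LAYERS)
)
total = NUM_LAYERS * NUM_EXPERTS
print(f"{dropped} of {total} routed experts masked ({100 * dropped / total:.2f}%)")
```

Under these assumptions, 8 layers drop 64 experts each and 31 layers drop 12 each, which gives 884 of 9984 routed experts, or about 8.9%.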
Source code
Full hackathon repo (probe, prune, eval, distillation, model card): see the
Source code link in this repo's README.
Built by Jonathan Farrow (@JonnyJF) for the 30 May 2026 Poolside × Prime Intellect hackathon.