README

License: apache-2.0

Files

| File | Purpose |
|---|---|
| adapter_model.safetensors + adapter_config.json | peft LoRA adapter on the shared expert |
| drops.json | {moe_block_idx (0..38): [expert_indices_to_mask]}; apply BEFORE loading the adapter |
| prune.py | mask_experts() helper to apply the mask via router bias = -inf |
| mask_snap.json | same as drops.json, saved by the training script as a sanity check |

Usage

python

import json, torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from prune import mask_experts  # included in this repo

REPO = "poolside-laguna-hackathon/<this-repo>"

tok = AutoTokenizer.from_pretrained("poolside/Laguna-XS.2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "poolside/Laguna-XS.2",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# 1. Mask the cold-tail experts (sets router bias to -inf for dropped experts).
#    drops.json and prune.py must be available locally, e.g. downloaded from REPO.
with open("drops.json") as f:
    drops = {int(k): v for k, v in json.load(f).items()}  # JSON keys are strings
mask_experts(model, drops)

# 2. Attach the LoRA adapter
model = PeftModel.from_pretrained(model, REPO)
model.eval()
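
To make the masking step concrete, here is a minimal pure-Python sketch of top-k routing with a selection bias. It is not the code in prune.py. The function names route_top_k and mask_bias are hypothetical, and the sketch assumes the DeepSeek-V3 convention that the bias only affects which experts are selected, while the mixing weights come from the uncorrected sigmoid scores.

```python
import math

def route_top_k(logits, bias, k):
    """Pick k experts by sigmoid(logit) + bias; weight them by renormalised sigmoid scores."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Selection uses the bias-corrected score, so a -inf bias can never win a top-k slot.
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)[:k]
    # Assumption: mixing weights use the uncorrected scores, renormalised over the chosen experts.
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def mask_bias(bias, dropped):
    """Return a copy of the bias with -inf at every dropped expert index."""
    out = list(bias)
    for i in dropped:
        out[i] = float("-inf")
    return out

logits = [2.0, 1.5, 0.1, -0.3, 1.0, -1.2]   # toy router logits for 6 experts
bias = [0.0] * len(logits)

before = route_top_k(logits, bias, k=2)
after = route_top_k(logits, mask_bias(bias, dropped=[1]), k=2)
print(sorted(before))  # experts 0 and 1 win without a mask
print(sorted(after))   # expert 1 is masked, so expert 4 takes its slot
```

The mask only works while at least k experts per layer keep a finite bias, which holds comfortably at the prune fractions used here.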

Methodology

  1. Skew measurement: forward-hook every LagunaTopKRouter (39 of them). Accumulate per-(layer, expert) selection counts and renormalised routing-weight mass over 30 HumanEval prompts × ~200 greedy-generated tokens (~250k routing events).
  2. Cold-tail identification: pick the bottom-N% of experts by mass in each layer. Under DeepSeek-V3-style routing (sigmoid(logits) + e_score_correction_bias for load balancing), selection counts get forced toward uniform but mass still skews. Mass is the truth.
  3. Layer-weighted spec: the per-layer 80%-mass leaderboard showed layers 13–17 needing only 52–66 of 256 experts to cover 80% of the mass (vs. median 87, max 126 in early layers). So: 25% prune in layers 11–18, 5% elsewhere.
  4. Mask, don't slice (yet): mask_experts() sets e_score_correction_bias to -inf for dropped experts. The router's top-k will never pick them, and the renormalisation across the surviving top-k handles the rest. Tensors keep their shape; slicing is a separate ship step.
  5. Distillation healing: the frozen full Laguna generates teacher completions on 120 MBPP prompts. The pruned-via-mask model gets a LoRA on the shared expert and is trained to match the teacher's completions via SFT (cross-entropy on the completion tokens only).
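
Steps 2 and 3 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the hackathon code. The helper names (experts_for_mass, cold_tail, build_drops) are hypothetical, the mass table is synthetic, drop counts are rounded down, and the layer numbers 11–18 are treated as 0-based moe_block_idx values.

```python
def experts_for_mass(mass, target=0.80):
    """Smallest number of experts (heaviest first) whose mass reaches `target` of the layer total."""
    total = sum(mass)
    running = 0.0
    for n, m in enumerate(sorted(mass, reverse=True), start=1):
        running += m
        if running >= target * total:
            return n
    return len(mass)

def cold_tail(mass, frac):
    """Indices of the bottom `frac` of experts by mass (count rounded down: an assumption)."""
    n_drop = int(len(mass) * frac)
    order = sorted(range(len(mass)), key=lambda i: mass[i])
    return sorted(order[:n_drop])

def build_drops(mass_by_layer, heavy_layers=range(11, 19), heavy=0.25, light=0.05):
    """Layer-weighted spec: `heavy` prune fraction inside heavy_layers, `light` elsewhere."""
    return {
        layer: cold_tail(mass, heavy if layer in heavy_layers else light)
        for layer, mass in enumerate(mass_by_layer)
    }

# Synthetic, skewed mass table: 39 layers x 256 experts (NOT measured data).
mass_by_layer = [[1.0 / (1 + e) for e in range(256)] for _ in range(39)]

drops = build_drops(mass_by_layer)
n_dropped = sum(len(v) for v in drops.values())
print(experts_for_mass(mass_by_layer[0]))  # experts needed for 80% mass in the toy layer
print(len(drops[12]), len(drops[0]))       # per-layer drop counts: heavy vs. light layers
print(n_dropped, n_dropped / (39 * 256))   # total masked experts and overall fraction
```

With rounding down, the spec masks 64 experts in each of the 8 heavy layers and 12 in each of the other 31, so 884 of 9,984 routed experts, or about 8.9%. The resulting dict has the same {moe_block_idx: [expert_indices]} shape that drops.json is described as having.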

Architectural note

Laguna's routed experts are stored as batched 3D parameter tensors (gate_up_proj: (256, 1024, 2048), down_proj: (256, 2048, 512)) and used via manual matmul, not nn.Linear. peft's LoRA can't target raw parameter slices, so the adapter goes on the always-on shared expert (shared_experts.{gate,up,down}_proj, which ARE nn.Linear). The shared expert sees every token, so a LoRA delta there absorbs the lost routed-expert contribution into the always-on path.
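
As a rough sense of scale, the shapes quoted above give the per-expert weight count directly. The arithmetic below is a sketch that assumes bf16 storage (2 bytes per parameter, as in the Usage snippet) and counts only the two tensors named here.

```python
from math import prod

gate_up_proj = (256, 1024, 2048)  # (experts, out, in) as quoted above
down_proj = (256, 2048, 512)

# Parameters owned by a single routed expert: one slice of each batched tensor.
params_per_expert = prod(gate_up_proj[1:]) + prod(down_proj[1:])
bytes_per_expert = params_per_expert * 2  # bf16 = 2 bytes per parameter

print(params_per_expert)            # 3145728
print(bytes_per_expert / 2**20)     # 6.0 (MiB per routed expert in bf16)
```

Masking leaves these tensors at full size, so this memory is only recovered once the dropped experts are actually sliced out.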

Honest limitations

  • 8.9% routed-expert reduction is modest. DeepSeek-V3 load balancing fights skew effectively; the cold tail is real but not dramatic.
  • LoRA targets the shared expert, not the routed survivors. Direct healing of the routed survivors needs a custom LoRA wrapper for the batched-tensor storage — left as future work.
  • No HumanEval / SWE-bench numbers reported here — eval pipeline ran into trim-heuristic issues mid-hackathon (model emitted </assistant> chat tokens that the original trim missed). The eval harness is fixed in the source repo; numbers will be uploaded as a follow-up.
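
For illustration only, a trim step that handles the stray chat token could look like the sketch below. It is not the fix in the source repo. The function name trim_completion is hypothetical, and </assistant> is the only marker taken from the text above.

```python
def trim_completion(text, stop_markers=("</assistant>",)):
    """Cut `text` at the earliest stop marker and strip trailing whitespace."""
    cut = len(text)
    for marker in stop_markers:
        pos = text.find(marker)
        if pos != -1:
            cut = min(cut, pos)  # keep only what precedes the first marker found
    return text[:cut].rstrip()

raw = "def add(a, b):\n    return a + b\n</assistant>\nSure, here is another..."
print(trim_completion(raw))
```

Anything after the first marker is discarded, so trailing chatter does not reach the test harness as part of the candidate solution.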

Source code

Full hackathon repo (probe, prune, eval, distillation, model card): see the Source code link in this repo's README.

Built by Jonathan Farrow (@JonnyJF) for the 30 May 2026 Poolside × Prime Intellect hackathon.

Model provider: JonnyJF

Model tree: base poolside/Laguna-XS.2, adapter this model

Modalities: text input, text output
