README
License: apache-2.0
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

m = AutoModelForCausalLM.from_pretrained(
    "poolside-laguna-hackathon/laguna-xs2-dense-stage1",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained(
    "poolside-laguna-hackathon/laguna-xs2-dense-stage1",
    trust_remote_code=True,
)
```
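Once `m` and `tok` are loaded, a plain-text completion might look like the sketch below. It assumes the custom model class supports the standard `generate` method from Transformers, which this card does not confirm. The helper name `complete` and the example prompt are hypothetical.

```python
def complete(model, tokenizer, prompt, max_new_tokens=64):
    """Greedy-decode a continuation of `prompt` and return the decoded text."""
    # Tokenize to PyTorch tensors and move them to the device the model lives on.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # do_sample=False requests greedy decoding, so repeated calls give the same text.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # The first sequence holds the prompt tokens followed by the new tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call, using `m` and `tok` from the loading snippet above (prompt is arbitrary):
# print(complete(m, tok, "def fibonacci(n):"))
```

The helper takes the model and tokenizer as arguments rather than reading globals, so it can be reused with other checkpoints that expose the same interface.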
The dense FFNs (intermediate size 4608) are zero-padded to 8192 so that the stock modeling_laguna.py can load the checkpoint; the padding leaves the outputs numerically identical. Because of the padding, the exported checkpoint reports ≈3.8B parameters, while the true model has ≈3.0B. last.pt (the raw Stage-1 FFN weights) is also in this repo. Footprint: ≈6 GB in bf16 vs ≈67 GB for the 33B MoE (≈11× less weight VRAM).
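To see why zero-padding can leave outputs unchanged, here is a toy sketch in pure Python. It assumes a gated SiLU FFN layout (gate, up, and down projections), which this card does not spell out. Tiny sizes (3 padded to 5) stand in for 4608 and 8192, and all function names are hypothetical.

```python
import math
import random


def silu(v):
    # SiLU activation; silu(0) == 0, which is what makes zero-padding harmless.
    return v / (1.0 + math.exp(-v))


def matvec(weight, x):
    # weight is a list of rows; returns weight @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def gated_ffn(x, w_gate, w_up, w_down):
    # Assumed layout: down( silu(gate(x)) * up(x) ).
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def pad_ffn(w_gate, w_up, w_down, new_inter):
    # Append zero rows to gate/up and zero columns to down.
    d_model = len(w_gate[0])
    extra = new_inter - len(w_gate)
    padded_gate = w_gate + [[0.0] * d_model for _ in range(extra)]
    padded_up = w_up + [[0.0] * d_model for _ in range(extra)]
    padded_down = [row + [0.0] * extra for row in w_down]
    return padded_gate, padded_up, padded_down


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, inter, padded_inter = 4, 3, 5
w_gate = rand_matrix(inter, d_model, rng)
w_up = rand_matrix(inter, d_model, rng)
w_down = rand_matrix(d_model, inter, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

y_small = gated_ffn(x, w_gate, w_up, w_down)
y_padded = gated_ffn(x, *pad_ffn(w_gate, w_up, w_down, padded_inter))
# Padded hidden units are silu(0) * 0 = 0 and meet zero columns in w_down,
# so they add exact zeros to every output sum.
print(y_small == y_padded)
```

Under this assumed layout the padded units only ever add exact zeros to each output, so the comparison is an exact equality rather than a tolerance check.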
See the Stage-2 card for the full method, results, and next steps. Code: https://github.com/postscarcity-inc/laguna-xs.2-dense
Model provider: poolside-laguna-hackathon
Base model: poolside/Laguna-XS.2
Fine-tuned model: this model
Input modality: Text
Output modality: Text