README

License: apache-2.0

Loading

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "poolside-laguna-hackathon/laguna-xs2-dense-stage1"
# trust_remote_code=True lets transformers run the custom Laguna modeling code.
m = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda"
)
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
```

The dense FFNs (intermediate size 4608) are zero-padded to 8192 so that the stock modeling_laguna.py can load the checkpoint; the padding leaves the outputs numerically identical. Because of the padding, the exported checkpoint reports ≈3.8B parameters, while the true model has ≈3.0B. last.pt (the raw Stage-1 FFN weights) is also in this repo. Weight footprint: ≈6 GB in bf16 versus ≈67 GB for the 33B MoE (≈11× less weight VRAM).
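Why zero-padding is harmless can be checked on a toy example. The sketch below assumes a gated FFN of the form down(silu(gate(x)) * up(x)); the card does not spell out the layer layout, so treat that form, the helper names (`gated_ffn`, `zero_pad_ffn`, `weight_gb`) and the tiny sizes as illustrative assumptions. The toy widths 9 and 16 keep the same 4608:8192 ratio. It also reproduces the rough bf16 footprint arithmetic at 2 bytes per parameter.

```python
import math
import random


def silu(v):
    """SiLU activation: v * sigmoid(v). Note silu(0) == 0."""
    return v / (1.0 + math.exp(-v))


def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def gated_ffn(x, w_gate, w_up, w_down):
    """Assumed FFN form: down(silu(gate(x)) * up(x))."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def zero_pad_ffn(w_gate, w_up, w_down, padded_size):
    """Grow the intermediate dimension to padded_size using zero weights."""
    extra = padded_size - len(w_gate)
    d_model = len(w_gate[0])
    return (
        w_gate + [[0.0] * d_model for _ in range(extra)],  # extra zero rows
        w_up + [[0.0] * d_model for _ in range(extra)],    # extra zero rows
        [row + [0.0] * extra for row in w_down],           # extra zero columns
    )


def weight_gb(n_params, bytes_per_param=2):
    """Weight memory in GB (1e9 bytes); bf16 stores 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9


random.seed(0)
d_model, d_ff, d_ff_padded = 8, 9, 16  # toy sizes, same 9:16 ratio as 4608:8192


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


w_gate = rand_matrix(d_ff, d_model)
w_up = rand_matrix(d_ff, d_model)
w_down = rand_matrix(d_model, d_ff)
x = [random.uniform(-1.0, 1.0) for _ in range(d_model)]

y_dense = gated_ffn(x, w_gate, w_up, w_down)
p_gate, p_up, p_down = zero_pad_ffn(w_gate, w_up, w_down, d_ff_padded)
y_padded = gated_ffn(x, p_gate, p_up, p_down)

# Padded hidden units are silu(0) * 0 = 0 and meet zero down-projection weights,
# so they contribute nothing to the output.
max_diff = max(abs(a - b) for a, b in zip(y_dense, y_padded))
print("max abs difference:", max_diff)
print("3.0B params in bf16:", weight_gb(3.0e9), "GB")
print("33B params in bf16:", weight_gb(33e9), "GB")
```

The padded weights still count toward the stored parameter total, which is consistent with the exported checkpoint reporting more parameters than the true model. The 33B figure at 2 bytes per parameter lands slightly below the card's ≈67 GB, presumably because the parameter count is rounded.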

See the Stage-2 card for the full method, results, and next steps. Code: https://github.com/postscarcity-inc/laguna-xs.2-dense

Model provider: poolside-laguna-hackathon

Model tree: base poolside/Laguna-XS.2; fine-tuned: this model

Modalities: text input, text output
