axonlabsai/axon-250m API & Inference Endpoint

Model Details

Parameters: ~362M (F32), marketed as a 250M-class model
Architecture: LlamaForCausalLM (custom reconfiguration)
Hidden size: 960
Layers: 32
Attention heads: 15
KV heads: 5 (GQA)
Intermediate size: 2560
Max context: 8192 tokens
Vocab size: 49,152
Activation: SiLU
Tokenizer: SmolLM2 tokenizer with ChatML formatting (<|im_start|> / <|im_end|>)
License: MIT
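
As a sanity check on the figures above, the parameter count can be reproduced from the listed dimensions. This is a minimal sketch, not code shipped with the model: `llama_param_count` is a hypothetical helper, and it assumes a standard Llama layout with no bias terms and input/output embeddings tied (neither is stated on this card). The head dimension of 64 is the value this card lists.

```python
def llama_param_count(hidden, layers, heads, kv_heads, head_dim,
                      intermediate, vocab, tie_embeddings=True):
    """Count weights in a Llama-style decoder (assumes no bias terms)."""
    embed = vocab * hidden                    # token embedding table
    attn = hidden * heads * head_dim          # q_proj
    attn += 2 * hidden * kv_heads * head_dim  # k_proj and v_proj (GQA)
    attn += heads * head_dim * hidden         # o_proj
    mlp = 3 * hidden * intermediate           # gate_proj, up_proj, down_proj
    norms = 2 * hidden                        # two RMSNorm weights per layer
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    if not tie_embeddings:
        total += vocab * hidden               # separate lm_head (assumption: tied by default)
    return total

total = llama_param_count(hidden=960, layers=32, heads=15, kv_heads=5,
                          head_dim=64, intermediate=2560, vocab=49152)
print(f"{total:,} parameters")             # 361,821,120
print(f"{total * 4 / 1e9:.2f} GB in F32")  # 1.45
```

Under these assumptions the total lands at roughly 362M, which matches the stated count. With untied embeddings the same dimensions would give about 409M.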

Key Differences from Source

Unlike the base SmolLM2-360M, Axon 250M was created through architectural merging and reconfiguration:

Restructured layer count and attention configuration
GQA with 5 KV heads for efficient inference
Custom head dimension of 64
RoPE with theta=100000
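
To make the GQA point concrete, here is a rough estimate of KV-cache size for one sequence at the full context length. It is a sketch under stated assumptions: `kv_cache_bytes` is a hypothetical helper, the cache is assumed to be stored in F32 (4 bytes per element) at batch size 1, and the 15-head figure is a hypothetical full multi-head baseline for comparison, not a configuration of this model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=4):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

MIB = 1024 ** 2
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=5, head_dim=64, seq_len=8192)
mha = kv_cache_bytes(num_layers=32, num_kv_heads=15, head_dim=64, seq_len=8192)

print(f"GQA, 5 KV heads:  {gqa / MIB:.0f} MiB")   # 640 MiB
print(f"MHA, 15 KV heads: {mha / MIB:.0f} MiB")   # 1920 MiB
```

Sharing each KV head across three query heads cuts the cache to a third of the multi-head baseline. A half-precision cache would halve both numbers.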

Usage

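```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "axonlabsai/axon-250m"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Hey, what's up?"}]
# add_generation_prompt=True appends the assistant header so the model writes a reply
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```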
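
For reference, the ChatML prompt that the chat template produces looks roughly like the string built below. This is an approximation for illustration only: `to_chatml` is a hypothetical helper, and the tokenizer's real template may differ (for example by inserting a default system message). Calling `tokenizer.apply_chat_template(messages, tokenize=False)` shows the exact text.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([{"role": "user", "content": "Hey, what's up?"}])
print(prompt)
```

Generation would normally stop when the model emits `<|im_end|>`, which closes the assistant turn.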

Limitations

Not fine-tuned: no task-specific training was performed
Very small model with limited reasoning and factual knowledge
Prone to hallucination and incoherent outputs on complex prompts
Best suited for simple chat and experimentation, not production use
The "250M" branding reflects its model class; the actual parameter count is ~362M

About Axon Labs

Axon Labs builds AI models and tools. This is our tiny model — small enough to run anywhere, dumb enough to be funny.
