
License: apache-2.0

Model Details

  • Base model: veyra-ai/veyra-30m-base-5b-tokens
  • Parameters: approximately 36.2M
  • Language: English
  • Context length: 1024 tokens
  • Architecture: decoder-only causal language model
  • Chat format: ChatML
  • License: Apache 2.0
  • Status: experimental instruct release

Architecture

Veyra-30M is a compact decoder-only transformer-style model.

  • Vocabulary size: 8,192
  • Hidden size: 512
  • Layers: 8
  • Layer pattern: alternating attention and MLP-only blocks
  • Query heads: 8
  • Key/value heads: 2
  • Attention type: grouped-query attention
  • MLP: SwiGLU
  • Normalization: RMSNorm
  • Positional encoding: RoPE
  • Context length: 1024 tokens
  • Tied input/output embeddings
  • KV cache support for generation
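
To make the grouped-query attention figures above concrete, the following sketch estimates the KV cache size for one full-context sequence. The hidden size, head counts, layer count, and context length come from the list above. Three things are assumptions that the model card does not state: that the head dimension is hidden size divided by query heads, that "alternating" means half of the 8 blocks contain attention, and that the cache is stored in 16-bit precision.

```python
# Figures taken from the architecture list.
hidden_size = 512
num_layers = 8
num_query_heads = 8
num_kv_heads = 2
context_length = 1024

# Assumptions (not stated in the model card):
num_attention_layers = num_layers // 2  # "alternating" read as every other block
bytes_per_value = 2                     # fp16 or bf16 cache entries
head_dim = hidden_size // num_query_heads  # conventional head size

# With grouped-query attention, several query heads share one key/value head.
queries_per_kv_head = num_query_heads // num_kv_heads


def kv_cache_bytes(kv_heads: int) -> int:
    """Bytes needed to cache keys and values for one full-context sequence."""
    per_token_per_layer = 2 * kv_heads * head_dim  # one key and one value vector per head
    return per_token_per_layer * num_attention_layers * context_length * bytes_per_value


gqa_bytes = kv_cache_bytes(num_kv_heads)     # grouped-query attention as described
mha_bytes = kv_cache_bytes(num_query_heads)  # hypothetical full multi-head baseline

print(f"head_dim = {head_dim}, query heads per KV head = {queries_per_kv_head}")
print(f"GQA cache: {gqa_bytes / 2**20:.1f} MiB, full multi-head cache: {mha_bytes / 2**20:.1f} MiB")
```

Under these assumptions the cache for a 1024-token sequence comes to about 2 MiB, a quarter of what 8 key/value heads would need.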

Training

This checkpoint was trained with masked ChatML supervised fine-tuning.

  • Base checkpoint: veyra-ai/veyra-30m-base-5b-tokens
  • Selected checkpoint: sft_masked_chatml_0100M.pt
  • SFT tokens: 100M (non-padding)
  • Supervised assistant tokens: approximately 35M
  • Loss masking: system and user turns were used as context but masked from loss; assistant responses were supervised.
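
The loss masking described above can be sketched as follows. This is a toy illustration, not the training code used for this checkpoint: the token IDs are made up, and whether the ChatML wrapper tokens of an assistant turn were supervised is not stated in the model card, so the sketch simply supervises the whole assistant turn. The value -100 is the default ignore_index of torch.nn.CrossEntropyLoss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(turns):
    """Build (input_ids, labels) from a list of (role, token_ids) turns.

    System and user tokens stay in input_ids as context but receive
    IGNORE_INDEX as their label, so they do not contribute to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)  # supervised
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # context only
    return input_ids, labels


# Hypothetical token IDs, for illustration only.
turns = [
    ("system", [1, 10, 11, 2]),
    ("user", [1, 20, 21, 22, 2]),
    ("assistant", [1, 30, 31, 2]),
]

input_ids, labels = build_labels(turns)
supervised = sum(1 for label in labels if label != IGNORE_INDEX)
print(labels)
print(f"{supervised} of {len(labels)} tokens contribute to the loss")
```

The same split explains the two token counts above: all 100M non-padding tokens pass through the model, but only the roughly 35M assistant tokens carry a training signal.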

The released checkpoint was selected from intermediate SFT milestones on qualitative output behavior rather than on lowest training loss alone. Later SFT checkpoints reached lower loss but showed more generic refusal and template behavior.

Intended Use

  • Tiny-model instruction-following experiments
  • Local assistant experiments
  • ChatML behavior testing
  • Simple formatting and JSON experiments
  • Basic Python/helpfulness prompts
  • Studying SFT behavior in very small language models

Not Intended For

  • Production applications
  • Safety-critical use cases
  • Medical, legal, financial, or security advice
  • Reliable factual QA
  • Long-context reasoning
  • Robust arithmetic
  • User-facing deployment without additional safeguards

Known Limitations

This is a very small model (approximately 36.2M parameters) and has significant limitations.

Known failure modes include:

  • Weak variable binding
  • Weak arithmetic
  • Weak multi-turn recall
  • Occasional repetition
  • Confident but incorrect answers
  • Generic refusal or disclaimer behavior
  • Tool-call or reasoning-template contamination
  • Sensitivity to prompt wording
  • Fluent nonsense on unfamiliar prompts

This model is not fully safety-tuned. It may refuse some harmful requests, but refusal behavior is not reliable.

Prompt Format

This model uses ChatML.

text

<|im_start|>system
You are Veyra, a tiny local instruction model. Be concise, useful, casual, and lightly playful. Correct mistakes gently.<|im_end|>
<|im_start|>user
What does a tokenizer do?<|im_end|>
<|im_start|>assistant
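
For experiments without the tokenizer's chat template, the layout above can be reproduced by hand. The helper below is a minimal sketch: build_chatml_prompt is a hypothetical name, and the exact whitespace (a newline after the role and after each <|im_end|>) is inferred from the example above rather than taken from the tokenizer's template, so verify it against tokenizer.apply_chat_template before relying on it.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string.

    Hypothetical helper: whitespace is inferred from the example prompt,
    not read from the tokenizer's chat template.
    """
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are Veyra, a tiny local instruction model."},
    {"role": "user", "content": "What does a tokenizer do?"},
]

prompt = build_chatml_prompt(messages)
print(prompt)
```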

Usage

python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "veyra-ai/veyra-30m-instruct"

# trust_remote_code=True allows loading the custom model code shipped in the model repository.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {
        "role": "system",
        "content": "You are Veyra, a tiny local instruction model. Be concise, useful, casual, and lightly playful.",
    },
    {
        "role": "user",
        "content": "Explain what an API is using a simple analogy.",
    },
]

# Render the conversation as ChatML and leave the assistant turn open.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# The rendered prompt already contains the special tokens, so do not add them again.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=120,
        temperature=0.3,
        top_k=40,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        # Stop on either the end-of-sequence token or the end of the ChatML turn.
        eos_token_id=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|im_end|>")],
        use_cache=True,
    )

# Special tokens are kept so the ChatML structure stays visible in the output.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
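
Because the decoded output above keeps special tokens and includes the prompt, it can help to cut out just the assistant's reply. The helper below is a sketch using plain string operations; extract_assistant_reply is a hypothetical name, and the sample string is made up for illustration rather than real model output.

```python
def extract_assistant_reply(decoded: str) -> str:
    """Return the text of the last assistant turn in a decoded ChatML string."""
    marker = "<|im_start|>assistant"
    start = decoded.rfind(marker)
    if start == -1:
        # No assistant turn found; fall back to the whole text.
        return decoded.strip()
    reply = decoded[start + len(marker):]
    # Cut at the first end-of-turn or end-of-sequence token, if present.
    for stop in ("<|im_end|>", "<|eos|>"):
        end = reply.find(stop)
        if end != -1:
            reply = reply[:end]
    return reply.strip()


# Made-up example of decoded text, for illustration only.
sample = (
    "<|im_start|>user\nWhat does a tokenizer do?<|im_end|>\n"
    "<|im_start|>assistant\nIt splits text into tokens.<|im_end|>"
)
print(extract_assistant_reply(sample))
```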

Example Prompts

The examples below are suggested prompts for trying the model. They are not benchmark results and should not be treated as representative guarantees.

text

Explain what an API is using a simple analogy.

text

Return only JSON for a book with title='Dune', author='Frank Herbert', year=1965.

text

Write a Python function that checks whether a number is even.

text

What does FileNotFoundError usually mean in Python?

text

Given these facts: color=blue, animal=otter, number=17. What animal was mentioned?
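
When trying the JSON prompt above, a small checker makes it easy to see whether a reply is actually valid JSON, since a model of this size may wrap the object in extra text. The sketch below uses only the standard library; parse_json_reply is a hypothetical helper and the sample replies are made up, not model outputs.

```python
import json


def parse_json_reply(reply: str, required_keys: set):
    """Return the parsed object if reply is a JSON object with the required keys, else None."""
    try:
        data = json.loads(reply.strip())
    except json.JSONDecodeError:
        return None  # extra prose or malformed JSON
    if not isinstance(data, dict) or not required_keys.issubset(data):
        return None  # valid JSON, but not the expected shape
    return data


required = {"title", "author", "year"}

# Made-up replies, for illustration only.
clean_reply = '{"title": "Dune", "author": "Frank Herbert", "year": 1965}'
chatty_reply = 'Sure! Here is the JSON: {"title": "Dune"}'

print(parse_json_reply(clean_reply, required))
print(parse_json_reply(chatty_reply, required))
```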

Special Tokens

Important tokenizer special tokens include:

text

<|bos|>
<|eos|>
<|pad|>
<|unk|>
<|im_start|>
<|im_end|>
<|tool_call|>
<|tool_result|>
<|context|>
<|reasoning|>
<|end_reasoning|>
<|answer|>
<|fim_prefix|>
<|fim_middle|>
<|fim_suffix|>

For this checkpoint, standard ChatML with <|im_start|> and <|im_end|> is the recommended format.

Benchmarks

Benchmarks will be added in a later update.

Citation / Attribution

If you use or build on this model, please retain attribution to Veyra AI.

License

Apache 2.0.

Model provider

veyra-ai

Model tree

Base

veyra-ai/Veyra-30M-Base

Fine-tuned

this model

Modalities

Input

Text

Output

Text
