veyra-ai/Veyra-30M-Instruct

Model Details

Base model: veyra-ai/veyra-30m-base-5b-tokens
Parameters: approximately 36.2M
Language: English
Context length: 1024 tokens
Architecture: decoder-only causal language model
Chat format: ChatML
License: Apache 2.0
Status: experimental instruct release

Architecture

Veyra-30M is a compact decoder-only transformer-style model.

Vocabulary size: 8,192
Hidden size: 512
Layers: 8
Layer pattern: alternating attention and MLP-only blocks
Query heads: 8
Key/value heads: 2
Attention type: grouped-query attention
MLP: SwiGLU
Normalization: RMSNorm
Positional encoding: RoPE
Context length: 1024 tokens
Tied input/output embeddings
KV cache support for generation
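
The hyperparameters above imply a few derived quantities that are useful when estimating memory. The sketch below is only arithmetic on the listed values. The `VeyraConfig` name is hypothetical (not the repository's actual config class), the head dimension assumes hidden size divided by query heads, and the count of attention layers (4 of 8) is an assumption read from the "alternating" layer pattern rather than a documented figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VeyraConfig:  # hypothetical name, for illustration only
    vocab_size: int = 8192
    hidden_size: int = 512
    num_layers: int = 8
    num_query_heads: int = 8
    num_kv_heads: int = 2
    context_length: int = 1024
    # Assumption: "alternating" means half of the layers contain attention.
    num_attention_layers: int = 4

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / num_query_heads.
        return self.hidden_size // self.num_query_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Grouped-query attention: several query heads share one K/V head.
        return self.num_query_heads // self.num_kv_heads

    @property
    def embedding_params(self) -> int:
        # Tied input/output embeddings are counted once.
        return self.vocab_size * self.hidden_size

    def kv_cache_bytes(self, bytes_per_value: int = 2) -> int:
        # Keys and values, per attention layer, per K/V head, per position.
        per_token = 2 * self.num_attention_layers * self.num_kv_heads * self.head_dim
        return per_token * self.context_length * bytes_per_value


cfg = VeyraConfig()
print(cfg.head_dim, cfg.queries_per_kv_head, cfg.embedding_params, cfg.kv_cache_bytes())
```

Under these assumptions, each K/V head serves 4 query heads, and a full 1024-token KV cache at 2 bytes per value comes to 2 MiB for a single sequence.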

Training

This checkpoint was trained with masked ChatML supervised fine-tuning.

Base checkpoint: veyra-ai/veyra-30m-base-5b-tokens
Selected checkpoint: sft_masked_chatml_0100M.pt
SFT tokens: 100M (non-padding)
Supervised assistant tokens: approximately 35M
Loss masking: system and user turns were used as context but masked from loss; assistant responses were supervised.
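
The loss masking described above can be sketched as follows. This is a minimal illustration, not the actual training code: the `build_labels` helper and the toy token IDs are hypothetical, and whether the real pipeline also masks the assistant turn's header tokens is not stated here. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_labels(segments):
    """Build (input_ids, labels) from a list of (role, token_ids) pairs.

    System and user tokens stay in the input as context but get
    IGNORE_INDEX as their label; assistant tokens keep their own IDs.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy token IDs, made up for the example.
segments = [
    ("system", [1, 10, 11, 2]),
    ("user", [1, 20, 21, 22, 2]),
    ("assistant", [1, 30, 31, 2]),
]
input_ids, labels = build_labels(segments)
supervised = sum(label != IGNORE_INDEX for label in labels)
print(f"{supervised} of {len(labels)} tokens contribute to the loss")
```

The same ratio applies at corpus scale: with about 35M supervised tokens out of 100M non-padding tokens, roughly 35% of positions contribute to the loss.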

The released checkpoint was selected from intermediate SFT milestones based on qualitative output behavior rather than lowest training loss alone. Later SFT checkpoints reached lower loss but showed more generic refusal and template behavior.

Intended Use

Tiny-model instruction-following experiments
Local assistant experiments
ChatML behavior testing
Simple formatting and JSON experiments
Basic Python/helpfulness prompts
Studying SFT behavior in very small language models

Not Intended For

Production applications
Safety-critical use cases
Medical, legal, financial, or security advice
Reliable factual QA
Long-context reasoning
Robust arithmetic
User-facing deployment without additional safeguards

Known Limitations

This is a 30M-class model (approximately 36.2M parameters) and has significant limitations.

Known failure modes include:

Weak variable binding
Weak arithmetic
Weak multi-turn recall
Occasional repetition
Confident but incorrect answers
Generic refusal or disclaimer behavior
Tool-call or reasoning-template contamination
Sensitivity to prompt wording
Fluent nonsense on unfamiliar prompts
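
Some of these failure modes can be screened for automatically. As one example, the sketch below flags repetition by measuring how many word-level n-grams in an output are repeats. The `repeated_ngram_fraction` helper and any threshold applied to it are hypothetical, not part of this release.

```python
def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Return the share of word n-grams that duplicate an earlier n-gram.

    0.0 means every n-gram is unique; values near 1.0 mean the text
    is mostly a loop.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


clean = "the cat sat on the mat"
looping = "a b c a b c a b c"
print(repeated_ngram_fraction(clean), repeated_ngram_fraction(looping))
```

A score like this only catches verbatim loops; it says nothing about whether an answer is correct.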

This model is not fully safety-tuned. It may refuse some harmful requests, but refusal behavior is not reliable.

Prompt Format

This model uses ChatML.

```text
<|im_start|>system
You are Veyra, a tiny local instruction model. Be concise, useful, casual, and lightly playful. Correct mistakes gently.<|im_end|>
<|im_start|>user
What does a tokenizer do?<|im_end|>
<|im_start|>assistant
```
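
For cases where the tokenizer's chat template is not available, the same layout can be assembled by hand. The `build_chatml_prompt` helper below is a hypothetical sketch that mirrors the sample above; the tokenizer's own template remains authoritative, and it may differ in details such as whitespace or a leading BOS token.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are Veyra, a tiny local instruction model."},
    {"role": "user", "content": "What does a tokenizer do?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```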

Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "veyra-ai/veyra-30m-instruct"

# trust_remote_code=True allows loading custom model code from the repository.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {
        "role": "system",
        "content": "You are Veyra, a tiny local instruction model. Be concise, useful, casual, and lightly playful."
    },
    {
        "role": "user",
        "content": "Explain what an API is using a simple analogy."
    },
]

# Render the messages as a ChatML string that ends with an open assistant turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The rendered prompt already contains the special tokens, so do not add them again.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=120,
        temperature=0.3,
        top_k=40,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        # Stop on either the end-of-sequence token or the end of the ChatML turn.
        eos_token_id=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|im_end|>")],
        use_cache=True,
    )

# outputs[0] holds the prompt followed by the completion; special tokens stay visible.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
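
Decoding `outputs[0]` with `skip_special_tokens=False` yields the whole conversation, prompt included. To keep only the model's reply, the decoded string can be cut at the last assistant turn. The `extract_last_assistant_reply` helper below is a hypothetical sketch that relies only on the ChatML markers and the `<|eos|>` token listed in this card.

```python
ASSISTANT_HEADER = "<|im_start|>assistant\n"
STOP_MARKERS = ("<|im_end|>", "<|eos|>")


def extract_last_assistant_reply(decoded: str) -> str:
    """Return the text of the final assistant turn, or "" if there is none."""
    start = decoded.rfind(ASSISTANT_HEADER)
    if start == -1:
        return ""
    reply = decoded[start + len(ASSISTANT_HEADER):]
    # Cut at the earliest stop marker; generation may also end without one
    # when max_new_tokens is reached.
    for marker in STOP_MARKERS:
        end = reply.find(marker)
        if end != -1:
            reply = reply[:end]
    return reply.strip()


decoded = (
    "<|im_start|>system\nBe brief.<|im_end|>\n"
    "<|im_start|>user\nHi<|im_end|>\n"
    "<|im_start|>assistant\nHello there!<|im_end|>"
)
print(extract_last_assistant_reply(decoded))
```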

Example Prompts

The examples below are suggested prompts for trying the model. They are not benchmark results and should not be treated as representative guarantees.

```text
Explain what an API is using a simple analogy.
```

```text
Return only JSON for a book with title='Dune', author='Frank Herbert', year=1965.
```

```text
Write a Python function that checks whether a number is even.
```

```text
What does FileNotFoundError usually mean in Python?
```

```text
Given these facts: color=blue, animal=otter, number=17. What animal was mentioned?
```
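
Because small models often wrap or mangle structured output, replies to the JSON prompt are worth checking programmatically rather than by eye. The `check_book_json` helper below is a hypothetical sketch; the sample replies are hand-written illustrations, not outputs from this model.

```python
import json

EXPECTED_BOOK = {"title": "Dune", "author": "Frank Herbert", "year": 1965}


def check_book_json(raw: str):
    """Return (ok, reason) for a reply to the 'Return only JSON' prompt."""
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top-level value is not an object"
    for key, value in EXPECTED_BOOK.items():
        if data.get(key) != value:
            return False, f"field {key!r} is missing or wrong"
    return True, "ok"


# Hand-written sample replies, for illustration only.
good = '{"title": "Dune", "author": "Frank Herbert", "year": 1965}'
chatty = 'Sure! Here is the JSON: {"title": "Dune"}'
print(check_book_json(good))
print(check_book_json(chatty)[0])
```

The check is deliberately strict: a year returned as the string `"1965"` or any text around the JSON object counts as a failure.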

Special Tokens

Important tokenizer special tokens include:

```text
<|bos|>
<|eos|>
<|pad|>
<|unk|>
<|im_start|>
<|im_end|>
<|tool_call|>
<|tool_result|>
<|context|>
<|reasoning|>
<|end_reasoning|>
<|answer|>
<|fim_prefix|>
<|fim_middle|>
<|fim_suffix|>
```

For this checkpoint, standard ChatML with <|im_start|> and <|im_end|> is the recommended format.
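
Since tool-call and reasoning-template contamination is a listed failure mode, it can help to scan replies for special tokens that fall outside plain ChatML. The `find_template_tokens` helper below is a hypothetical sketch that uses only the token strings listed above; it should be run on text decoded with `skip_special_tokens=False`, otherwise the markers may already have been stripped.

```python
# Special tokens from the list above that plain ChatML replies should not contain.
NON_CHATML_TOKENS = (
    "<|tool_call|>",
    "<|tool_result|>",
    "<|context|>",
    "<|reasoning|>",
    "<|end_reasoning|>",
    "<|answer|>",
    "<|fim_prefix|>",
    "<|fim_middle|>",
    "<|fim_suffix|>",
)


def find_template_tokens(text: str) -> list:
    """Return the non-ChatML special tokens that appear in text, in list order."""
    return [token for token in NON_CHATML_TOKENS if token in text]


# Hand-written sample reply, for illustration only.
reply = "<|reasoning|>Let me think.<|end_reasoning|><|answer|>An otter.<|im_end|>"
print(find_template_tokens(reply))
```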

Benchmarks

Benchmarks will be added in a later update.

Citation / Attribution

If you use or build on this model, please retain attribution to Veyra AI.

License

Apache 2.0.
