License: apache-2.0
Model Details
- Type: decoder-only causal language model, Llama-compatible architecture
- Parameters: approximately 52.6M
- Context length target: 16k tokens
- Training target: about 15B pretraining tokens plus chat/instruction tuning
- Hardware: 8x NVIDIA RTX 5090 cloud GPUs
- Tokenizer: TinyLlama/Llama-style 32k tokenizer with a Neo50M chat template
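As a rough back-of-the-envelope check on the figures above, the sketch below estimates the raw weight size at a few precisions and the training tokens per parameter. It uses only the parameter and token counts stated in this card; the bytes-per-parameter values are the standard sizes for each dtype, and the estimate ignores activations, KV cache, and file-format overhead.

```python
# Figures taken from the Model Details list above.
PARAMS = 52.6e6        # approximate parameter count
TRAIN_TOKENS = 15e9    # approximate pretraining token target


def weight_megabytes(params, bytes_per_param):
    """Raw weight size in decimal megabytes, ignoring any runtime overhead."""
    return params * bytes_per_param / 1e6


tokens_per_param = TRAIN_TOKENS / PARAMS

for name, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{name}: ~{weight_megabytes(PARAMS, nbytes):.0f} MB")
print(f"tokens per parameter: ~{tokens_per_param:.0f}")
```

This works out to roughly 210 MB in fp32, 105 MB in fp16/bf16, and about 285 training tokens per parameter.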
Intended Uses
- toy/local assistant experiments
- educational training and inference demos
- lightweight generation
- testing HF, GGUF, ONNX, and distributed training pipelines
Limitations
Neo50M is very small. It is not reliable for factual accuracy, has limited reasoning ability, may hallucinate, and should not be used for safety-critical decisions or high-stakes advice.
Transformers Usage
python
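from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "KookiesXy/Neo50M"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

messages = [{"role": "user", "content": "Write a short thank-you note."}]

# Apply the chat template and move input_ids and attention_mask to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# do_sample=True is needed for temperature and top_p to take effect.
out = model.generate(**inputs, max_new_tokens=120, do_sample=True, temperature=0.7, top_p=0.9)

# Decode only the newly generated tokens, not the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))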
GGUF Usage
After downloading a GGUF file:
bash
llama-cli -m neo50m-q4_k_m.gguf -p "User: Write a haiku about GPUs.\nAssistant:"
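The same GGUF file can probably also be driven from Python. The sketch below assumes the separate llama-cpp-python package is installed (it is not mentioned in this card) and reuses the plain-text "User: / Assistant:" prompt from the command above. The `n_ctx=2048`, `max_tokens=64`, and `stop=["User:"]` values are illustrative choices, not settings recommended by the model author.

```python
def format_prompt(user_message):
    """Build the same plain-text prompt as the llama-cli command above."""
    return f"User: {user_message}\nAssistant:"


def generate(model_path, user_message, max_tokens=64):
    """Run one completion against a local GGUF file (assumes llama-cpp-python)."""
    # Imported lazily so format_prompt works without llama-cpp-python installed.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    # Stop when the model starts a new "User:" turn (illustrative stop string).
    result = llm(format_prompt(user_message), max_tokens=max_tokens, stop=["User:"])
    return result["choices"][0]["text"].strip()
```

For example, `generate("neo50m-q4_k_m.gguf", "Write a haiku about GPUs.")` should return the assistant text as a string once the file has been downloaded.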
ONNX Usage
The ONNX export is intended for forward-pass validation and integration experiments. Use ONNX Runtime to load onnx/model.onnx and feed integer input_ids plus attention_mask.
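A minimal sketch of such a forward pass is shown below. The input names `input_ids` and `attention_mask` come from the description above; the int64 dtype, the right-padding with `pad_id=0`, and the assumption that the first graph output holds the logits are typical for Hugging Face style exports but are not confirmed for this model. Some exports also require extra inputs such as position ids or past key values, which this sketch does not supply.

```python
def pad_batch(token_ids, pad_id=0):
    """Right-pad token-id lists to equal length; return (input_ids, attention_mask)."""
    max_len = max(len(seq) for seq in token_ids)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in token_ids]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in token_ids]
    return input_ids, attention_mask


def run_forward(model_path, token_ids, pad_id=0):
    """Run one forward pass with ONNX Runtime and return the first output."""
    # Imported lazily so pad_batch can be used without these packages installed.
    import numpy as np
    import onnxruntime as ort

    input_ids, attention_mask = pad_batch(token_ids, pad_id)
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    feeds = {
        "input_ids": np.asarray(input_ids, dtype=np.int64),
        "attention_mask": np.asarray(attention_mask, dtype=np.int64),
    }
    # Pass only the inputs the exported graph actually declares.
    declared = {inp.name for inp in session.get_inputs()}
    feeds = {name: value for name, value in feeds.items() if name in declared}
    return session.run(None, feeds)[0]  # assumed to be the logits
```

Calling `run_forward("onnx/model.onnx", [[1, 2, 3]])` with token ids produced by the tokenizer should return an array whose last dimension matches the vocabulary size, if the export follows the usual layout.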
Dataset Summary
The training pipeline streams a configurable mixture of FineWeb-Edu, Cosmopedia, Wikipedia-like text, TinyStories, and a small permissive code component. SFT uses OpenHermes-style, UltraChat-style, Alpaca-style, and small refusal/helpfulness examples when available. Dataset availability can change; the exact configs are included with the upload.
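To illustrate what streaming a weighted mixture can look like, here is a small pure-Python sketch. It is not the actual training pipeline: the function name `mix_streams`, the source names, and the weights are all hypothetical, and the real proportions are in the uploaded configs.

```python
import random


def mix_streams(streams, weights, seed=0):
    """Yield (source_name, example) pairs, picking the next source by weight.

    Sources are dropped as they run out; iteration ends when all are exhausted.
    """
    rng = random.Random(seed)
    active = {name: iter(items) for name, items in streams.items()}
    while active:
        names = list(active)
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield name, next(active[name])
        except StopIteration:
            del active[name]


# Hypothetical toy streams and weights, for illustration only.
toy_streams = {
    "fineweb_edu": ["web doc 1", "web doc 2", "web doc 3"],
    "tinystories": ["story 1", "story 2"],
    "code": ["snippet 1"],
}
toy_weights = {"fineweb_edu": 0.6, "tinystories": 0.3, "code": 0.1}

for source, example in mix_streams(toy_streams, toy_weights, seed=0):
    print(source, "->", example)
```

Sampling the source at each step, rather than concatenating datasets, keeps the mixture ratio roughly constant throughout training and works with streams too large to hold in memory.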
Eval Results
Eval artifacts, when present, are uploaded under evals/.
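One way to check whether eval artifacts exist is to list the repository files and filter on the evals/ prefix. The sketch below assumes the huggingface_hub package is installed and that the repository id matches the one used in the Transformers example; the helper names are hypothetical.

```python
def eval_artifacts(repo_files, prefix="evals/"):
    """Return the sorted subset of repository paths that sit under the prefix."""
    return sorted(path for path in repo_files if path.startswith(prefix))


def list_eval_artifacts(repo_id="KookiesXy/Neo50M"):
    """Fetch the repository file list from the Hub and keep only eval artifacts."""
    # Imported lazily so eval_artifacts works without huggingface_hub installed.
    from huggingface_hub import list_repo_files

    return eval_artifacts(list_repo_files(repo_id))
```

An empty list from `list_eval_artifacts()` simply means no eval artifacts have been uploaded yet.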
Model provider: KookiesXy
Modalities: text input, text output