README

License: apache-2.0

Model Details

  • Type: decoder-only causal language model, Llama-compatible architecture
  • Parameters: approximately 52.6M
  • Context length target: 16k tokens
  • Training target: about 15B pretraining tokens plus chat/instruction tuning
  • Hardware: 8x NVIDIA RTX 5090 cloud GPUs
  • Tokenizer: TinyLlama/Llama-style 32k tokenizer with a Neo50M chat template

Intended Uses

  • toy/local assistant experiments
  • educational training and inference demos
  • lightweight generation
  • testing HF, GGUF, ONNX, and distributed training pipelines

Limitations

Neo50M is very small. It is not reliable for factual accuracy, has limited reasoning ability, may hallucinate, and should not be used for safety-critical decisions or high-stakes advice.

Transformers Usage

python

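from transformers import AutoModelForCausalLM, AutoTokenizer
repo_id = "KookiesXy/Neo50M"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
# device_map="auto" requires the accelerate package to be installed
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
messages = [{"role": "user", "content": "Write a short thank-you note."}]
# return_dict=True returns input_ids and attention_mask together
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(model.device)
# do_sample=True is required for temperature and top_p to take effect
out = model.generate(**inputs, max_new_tokens=120, do_sample=True, temperature=0.7, top_p=0.9)
prompt_len = inputs["input_ids"].shape[-1]  # skip the prompt when decoding
print(tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))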

GGUF Usage

After downloading a GGUF file:

bash

llama-cli -m neo50m-q4_k_m.gguf -p "User: Write a haiku about GPUs.\nAssistant:"

ONNX Usage

The ONNX export is intended for forward-pass validation and integration experiments. Use ONNX Runtime to load onnx/model.onnx and feed integer input_ids plus attention_mask.
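A minimal sketch of that forward pass follows. It rests on assumptions not confirmed by the export itself: that both inputs are int64 tensors of shape (batch, sequence), that the first graph output holds the logits, and that the token ids shown are placeholders (in practice they would come from the tokenizer). The helper names build_feed and run_forward are hypothetical.

```python
import os
import numpy as np

def build_feed(token_ids):
    """Build a batch-of-one feed dict from a list of token ids (int64 assumed)."""
    input_ids = np.asarray([token_ids], dtype=np.int64)
    attention_mask = np.ones_like(input_ids)  # no padding, so attend to every token
    return {"input_ids": input_ids, "attention_mask": attention_mask}

def run_forward(model_path, token_ids):
    """Run one forward pass and return the first graph output (assumed to be logits)."""
    import onnxruntime as ort  # imported lazily so build_feed works without it

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    declared = {i.name for i in session.get_inputs()}
    # Keep only inputs the graph declares. If the export needs extra inputs
    # (for example position_ids), session.run will raise and name them.
    feed = {k: v for k, v in build_feed(token_ids).items() if k in declared}
    return session.run(None, feed)[0]

if os.path.exists("onnx/model.onnx"):
    logits = run_forward("onnx/model.onnx", [1, 15043, 3186])  # placeholder ids
    print(logits.shape)
```

Inspecting session.get_inputs() first is a cheap way to confirm the input names and dtypes the export actually expects before wiring it into a larger pipeline.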

Dataset Summary

The training pipeline streams a configurable mixture of FineWeb-Edu, Cosmopedia, Wikipedia-like text, TinyStories, and a small permissively licensed code component. SFT uses OpenHermes-style, UltraChat-style, and Alpaca-style data, plus a small set of refusal/helpfulness examples when available. Dataset availability can change; the exact configs are included with the upload.
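To illustrate what a weighted streaming mixture can look like, here is a small pure-Python sketch. It is not the pipeline's actual code: the function mix_streams, the source names, and the weights are all hypothetical, and the real mixture is defined by the configs shipped with the upload.

```python
import random
from typing import Dict, Iterable, Iterator, Tuple

def mix_streams(
    streams: Dict[str, Iterable[str]],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source, text) pairs, picking the next source by weight.

    Exhausted sources are dropped and sampling continues over the rest,
    so every item from every finite stream is eventually yielded.
    """
    rng = random.Random(seed)
    active = {name: iter(stream) for name, stream in streams.items()}
    while active:
        names = list(active)
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield name, next(active[name])
        except StopIteration:
            del active[name]  # this source is finished

# Hypothetical sources and weights, for illustration only.
demo_streams = {
    "web_edu": ["doc A", "doc B", "doc C"],
    "stories": ["story A", "story B"],
    "code": ["def f(): pass"],
}
demo_weights = {"web_edu": 0.6, "stories": 0.3, "code": 0.1}
for source, text in mix_streams(demo_streams, demo_weights):
    print(source, "->", text)
```

Sampling per example rather than concatenating sources keeps the mixture ratio roughly constant throughout training, which matters when the sources differ greatly in size.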

Eval Results

Eval artifacts, when present, are uploaded under evals/.

Model provider

KookiesXy

Modalities

  • Input: text
  • Output: text
