FlameF0X/TinyMoE-100m-A1K API & Inference Endpoint

Model Details

Architecture: Sparse Mixture of Experts (MoE)
Total Parameters: ~100M
Experts: 8 total, 2 active per token
Context Length: 1024 tokens
Base Architecture: Mistral
License: MIT
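
To make the "8 total, 2 active per token" line concrete, here is a minimal sketch of top-2 expert routing in plain Python. It is an illustration under stated assumptions, not this model's actual router: the function name `top_k_route`, the example logits, and the choice to renormalize the two selected weights so they sum to 1 are all assumptions.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # Softmax over all experts (subtract the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected weights sum to 1 (an assumption of this sketch).
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Hypothetical router scores for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
routes = top_k_route(logits, k=2)
print(routes)
```

Only the experts returned by the router run for that token; their outputs would then be combined using the returned weights.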

Training Data

This model was trained on a mixture of datasets to balance narrative capability and factual grounding:

TinyStories: Used for coherent, creative narrative generation.
WikiText-103: Used for structural language understanding and general knowledge.

Quick Start

You can load this model using the transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download (or load from cache) the tokenizer and model weights.
model_id = "FlameF0X/TinyMoE-100M-A1K"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Tokenize the prompt into PyTorch tensors.
input_text = "Once upon a time,"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate up to 50 new tokens and decode them back to text.
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
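
Since the context length is 1024 tokens, it can help to cap `max_new_tokens` so that prompt plus output stays inside that window. The helper below is a hypothetical sketch (the name `clamp_new_tokens` is not part of transformers), and it assumes you want to stay within the stated context length rather than generate past it.

```python
CONTEXT_LENGTH = 1024  # from Model Details

def clamp_new_tokens(prompt_tokens, requested_new_tokens, context_length=CONTEXT_LENGTH):
    """Largest max_new_tokens that keeps prompt + output within the context window."""
    if prompt_tokens >= context_length:
        raise ValueError("prompt already fills the context window; truncate it first")
    return min(requested_new_tokens, context_length - prompt_tokens)

# A 1000-token prompt leaves room for only 24 new tokens, not the requested 50.
budget = clamp_new_tokens(prompt_tokens=1000, requested_new_tokens=50)
print(budget)
```

With the Quick Start code, the prompt length is `inputs["input_ids"].shape[1]`, so the call could become `max_new_tokens=clamp_new_tokens(inputs["input_ids"].shape[1], 50)`.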

Performance & Usage

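This model is intended for research and edge applications that need low latency and a small memory footprint. Because only 2 of its 8 experts are active for each token, it keeps the full parameter count of the model while spending per-token compute only on the selected experts, which brings inference cost closer to that of a smaller dense model. All experts still have to be loaded in memory.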
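
As a rough illustration of stored versus active expert weights, the sketch below counts expert feed-forward parameters only, using the hidden size, intermediate size, and layer count listed under Training Configuration. It assumes each expert is a gated MLP with three bias-free projections (gate, up, down), which is common for Mistral-style models but is not confirmed here. It deliberately leaves out embeddings, attention, and router weights, so it is not an estimate of the total parameter count.

```python
def expert_ffn_params(hidden_size, intermediate_size, num_layers, num_experts, active_experts):
    """Count expert feed-forward weights: stored in the model vs. used per token."""
    # Assumed gated MLP: gate, up, and down projections, no biases.
    per_expert = 3 * hidden_size * intermediate_size
    stored = per_expert * num_experts * num_layers
    active = per_expert * active_experts * num_layers
    return stored, active

stored, active = expert_ffn_params(
    hidden_size=256, intermediate_size=512, num_layers=8,
    num_experts=8, active_experts=2,
)
print(stored, active, active / stored)
```

Under these assumptions, a token touches one quarter of the expert weights, which is where the compute saving comes from.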

Training Configuration

The model was trained with the following core settings:

Hidden Size: 256
Number of Layers: 8
Intermediate Size: 512
Learning Rate: 5e-4 (Cosine schedule)
Optimizer: AdamW
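
For reference, a cosine schedule with a 5e-4 peak can be sketched as below. The total step count, the absence of warmup, and a minimum learning rate of 0 are assumptions of this sketch; the actual training run may have used different values.

```python
import math

PEAK_LR = 5e-4  # from the configuration above

def cosine_lr(step, total_steps, peak_lr=PEAK_LR, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    # Clamp progress to [0, 1] so out-of-range steps stay at the endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 10_000  # hypothetical; the real step count is not stated
for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, cosine_lr(step, TOTAL_STEPS))
```

The rate starts at the peak, passes through half the peak at the midpoint, and reaches the minimum at the final step.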
