FlameF0X

TinyMoE-100m-2x8


License: mit

Model Details

Architecture: Sparse Mixture of Experts (MoE)
Total Parameters: 99,809,280 (~100M total parameters)
Active Parameters per Token: 22,544,640 (~22.5M active parameters)
Expert Configuration: 8 local experts in total, 2 active experts routed per token ("num_experts_per_tok": 2)
Context Length: 1024 tokens
Base Architecture: Mixtral / Mistral For Causal LM
License: MIT
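
To make the "8 experts, 2 active per token" configuration concrete, here is a minimal pure-Python sketch of top-2 routing for a single token. It assumes the common Mixtral-style scheme (softmax over the router logits, keep the two highest, renormalize their weights); this card does not describe the router internals, and `route_token` and the example logits are hypothetical names and values used only for illustration.

```python
import math

NUM_EXPERTS = 8          # from the model card: 8 local experts
EXPERTS_PER_TOKEN = 2    # from the model card: num_experts_per_tok = 2


def route_token(router_logits, k=EXPERTS_PER_TOKEN):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # Softmax over all experts (shifted by the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize so the selected weights sum to 1.
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


# Hypothetical router logits for a single token, one per expert.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
assert len(logits) == NUM_EXPERTS
selected = route_token(logits)
print(selected)  # experts 1 and 4 are chosen; their weights sum to 1
```

Only the two selected experts run their feed-forward computation for this token; the other six are skipped, which is why the active parameter count is much smaller than the total.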

Parameter Breakdown

Unlike a standard dense model, an MoE model stores more parameters than it uses for any single token: every expert is kept on disk, but only a subset is activated for a given token during a forward pass:

| Component | Total Parameters | Status During Inference |
|---|---|---|
| Embeddings (Input + LM Head) | 24,576,000 | Always Active |
| Attention Blocks (10 Layers) | 4,423,680 | Always Active |
| MoE Routers (10 Layers) | 30,720 | Always Active |
| Experts (8 Total across 10 Layers) | 70,778,880 | 2 of 8 Active per Layer (~17.6M active) |
| Overall Footprint | 99,809,280 | 22,544,640 Active per Token |
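
The arithmetic behind the table can be checked with a short script. This is a sketch that uses only the per-component totals listed above and assumes all experts are the same size, so that the active share is exactly 2 of 8. It does not attempt to reproduce the 22,544,640 active-per-token figure, because the card does not state which components that count includes.

```python
# Per-component totals, copied from the table above.
components = {
    "embeddings": 24_576_000,
    "attention": 4_423_680,
    "routers": 30_720,
    "experts": 70_778_880,
}
NUM_LAYERS = 10
NUM_EXPERTS = 8
EXPERTS_PER_TOKEN = 2

# The overall footprint is the sum of the four components.
total = sum(components.values())

# Assumption: all experts have the same size, so the active share is 2/8.
active_expert_params = components["experts"] * EXPERTS_PER_TOKEN // NUM_EXPERTS
params_per_expert = components["experts"] // (NUM_LAYERS * NUM_EXPERTS)

print(f"total parameters:      {total:,}")
print(f"active expert params:  {active_expert_params:,}")
print(f"params per expert:     {params_per_expert:,}")
```

The sum matches the overall footprint of 99,809,280, and the 2-of-8 expert share comes to 17,694,720 parameters.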

Training Data

This model was trained on a mixture of two datasets, chosen to balance narrative fluency with factual, structurally varied language:

TinyStories: For coherent, creative synthetic narrative generation.
WikiText-103: For general-knowledge text, vocabulary diversity, and structural language understanding.

Quick Start

You can load and experiment with this model using the Hugging Face transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download the tokenizer and model weights from the Hugging Face Hub.
model_id = "FlameF0X/TinyMoE-100M-2x8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Tokenize a short prompt into PyTorch tensors.
input_text = "Once upon a time,"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate up to 50 new tokens and decode them back to text.
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
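
Because the context length is 1024 tokens, longer prompts need a generation budget that keeps prompt plus output within that window. The sketch below is one way to do this; it assumes the 1024-token limit covers the prompt and the generated tokens together, and `clamp_new_tokens` and `generate_within_context` are hypothetical helper names, not part of this model or of transformers.

```python
CONTEXT_LENGTH = 1024  # from the model card


def clamp_new_tokens(prompt_tokens, requested_new_tokens, context_length=CONTEXT_LENGTH):
    """Largest number of new tokens that keeps prompt + output within the context."""
    if prompt_tokens >= context_length:
        raise ValueError("prompt already fills the context window; truncate it first")
    return min(requested_new_tokens, context_length - prompt_tokens)


def generate_within_context(model, tokenizer, prompt, requested_new_tokens=50):
    """Generate text without exceeding the context window (not called here)."""
    # Truncate over-long prompts, leaving at least one position for output.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True,
                       max_length=CONTEXT_LENGTH - 1)
    prompt_tokens = inputs["input_ids"].shape[1]
    budget = clamp_new_tokens(prompt_tokens, requested_new_tokens)
    outputs = model.generate(**inputs, max_new_tokens=budget)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


print(clamp_new_tokens(5, 50))     # 50: a short prompt leaves plenty of room
print(clamp_new_tokens(1000, 50))  # 24: only 24 positions remain
```

With the `model` and `tokenizer` loaded as in the Quick Start, `generate_within_context(model, tokenizer, "Once upon a time,")` would behave like the original example for short prompts.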