adamroberts

tinystories-5090

Result

Model weights: model.safetensors
Training step: 20000 / 20000
Final train loss: 0.785740
Final validation loss: 0.875080
Final throughput: 207135 tokens/s
Final step time: 2531.04 ms
Final reported BF16 MFU: 39.7%
Average iteration time: 2605.014347 ms
Safetensors size: 248,894,656 bytes
Parameter count: 124,475,904
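
As a rough consistency check on the figures above, throughput can be recomputed from the step time and the tokens processed per optimizer step. The sketch below assumes each step consumes exactly the 524,288-token total batch listed under Training; how the reported throughput is actually measured or smoothed is not stated in this card, so a small gap is expected.

```python
# Values copied from this model card.
TOKENS_PER_STEP = 524_288      # "Total desired batch size" (assumed consumed per step)
FINAL_STEP_MS = 2531.04        # "Final step time"
AVG_STEP_MS = 2605.014347      # "Average iteration time"
REPORTED_TOK_PER_S = 207_135   # "Final throughput"


def tokens_per_second(tokens_per_step: int, step_ms: float) -> float:
    """Tokens processed per second, given the duration of one optimizer step."""
    return tokens_per_step / (step_ms / 1000.0)


final_rate = tokens_per_second(TOKENS_PER_STEP, FINAL_STEP_MS)
avg_rate = tokens_per_second(TOKENS_PER_STEP, AVG_STEP_MS)
# Relative difference between the recomputed and the reported throughput.
rel_gap = abs(final_rate - REPORTED_TOK_PER_S) / REPORTED_TOK_PER_S

print(f"from final step time:   {final_rate:,.0f} tokens/s")
print(f"from average step time: {avg_rate:,.0f} tokens/s")
print(f"relative gap to reported figure: {rel_gap:.2e}")
```

The recomputed figure from the final step time lands very close to the reported throughput, while the average iteration time implies a slightly lower rate over the whole run.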

The TinyStories paper reports eval losses of 1.33 to 1.58 for the 768-hidden-size 1- and 2-layer attention-head ablations in Figure 24. This run's validation loss of 0.875080 is lower, but the two numbers are not directly comparable: this is a 12-layer GPT-2-style model with GPT-2 tokenization, a 1024-token context, and a different implementation and training setup.
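
For readers who prefer perplexity or bits per token, the losses can be converted as in the sketch below. It assumes the reported values are mean per-token cross-entropy in nats (natural logarithm), which this card does not state explicitly. The function names are illustrative.

```python
import math

# Losses copied from this model card; assumed to be in nats per token.
TRAIN_LOSS = 0.785740
VAL_LOSS = 0.875080
PAPER_LOSS_RANGE = (1.33, 1.58)  # TinyStories paper figures quoted above


def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy in nats."""
    return math.exp(loss_nats)


def bits_per_token(loss_nats: float) -> float:
    """Convert nats to bits by dividing by ln(2)."""
    return loss_nats / math.log(2)


for name, loss in [("train", TRAIN_LOSS), ("val", VAL_LOSS)]:
    print(f"{name}: ppl={perplexity(loss):.3f}, bits/token={bits_per_token(loss):.3f}")

low, high = PAPER_LOSS_RANGE
print(f"paper range: ppl={perplexity(low):.2f} to {perplexity(high):.2f}")
```

Because per-token loss depends on the tokenizer, these conversions do not remove the caveat above; they only change the unit.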

Architecture

Family: GPT-2-style decoder-only Transformer
Descriptor: d12
Layers: 12
Attention heads: 12
Hidden size: 768
Context length: 1024
Vocabulary size: 50,257
Precision: BF16 weights
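
The parameter count can be reproduced from the architecture values with a short calculation. The sketch below assumes the standard GPT-2 layout (tied input/output embeddings, biases on all linear layers, a 4x MLP expansion) and, as a further assumption not stated in this card, that the training code pads the vocabulary from 50,257 to 50,304 entries. Under those assumptions the padded count matches the reported 124,475,904, while the unpadded count at 2 bytes per BF16 weight falls just under the reported safetensors size.

```python
def gpt2_param_count(n_layer: int, n_embd: int, n_ctx: int, vocab_size: int) -> int:
    """Parameter count for a GPT-2-style model with tied embeddings (assumed layout)."""
    wte = vocab_size * n_embd                      # token embeddings (shared with the LM head)
    wpe = n_ctx * n_embd                           # learned position embeddings
    ln = 2 * n_embd                                # LayerNorm weight + bias
    attn = (n_embd * 3 * n_embd + 3 * n_embd       # fused QKV projection
            + n_embd * n_embd + n_embd)            # attention output projection
    mlp = (n_embd * 4 * n_embd + 4 * n_embd        # MLP up-projection
           + 4 * n_embd * n_embd + n_embd)         # MLP down-projection
    block = ln + attn + ln + mlp
    return wte + wpe + n_layer * block + ln        # trailing ln is the final LayerNorm


PADDED_VOCAB = 50_304  # assumption: vocabulary rounded up to a multiple of 128
padded = gpt2_param_count(12, 768, 1024, PADDED_VOCAB)
unpadded = gpt2_param_count(12, 768, 1024, 50_257)

REPORTED_FILE_BYTES = 248_894_656
weight_bytes = unpadded * 2                        # BF16 stores 2 bytes per parameter
remainder = REPORTED_FILE_BYTES - weight_bytes     # bytes not explained by weights

print(f"padded vocab:   {padded:,} parameters")
print(f"unpadded vocab: {unpadded:,} parameters")
print(f"BF16 weight bytes: {weight_bytes:,}; remainder: {remainder:,}")
```

The remaining bytes would be consistent with the safetensors header, though that reading is an inference from the arithmetic rather than something the card confirms.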

Training

The run used the TinyStories GPT-2 dataset files generated by dev/data/tinystories.py in llm.kittens.

bash
./train_gpt2cu \
    -i "dev/data/tinystories/TinyStories_train.bin" \
    -j "dev/data/tinystories/TinyStories_val.bin" \
    -o "log124M/5090_S" \
    -v 250 -s 20000 -g 144 \
    -h 0 \
    -b 64 -t 1024 -d 524288 \
    -r 0 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 -q 0.0 -u 700 -n 5000 \
    -y 0 \
    -e "d12" \
    -x 20000
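
The batch settings in the command imply a gradient accumulation factor and a total token budget. The sketch below works these out from the values listed under Key settings, assuming a single GPU (the card names one RTX 5090 as the hardware target but does not state the process count) and that every step consumes the full total batch.

```python
# Values from the command and the Key settings list.
micro_batch = 64               # sequences per micro batch
seq_len = 1024                 # tokens per sequence
total_batch_tokens = 524_288   # total desired batch size in tokens
max_steps = 20_000             # optimizer steps
num_gpus = 1                   # assumption: single-GPU run

# Tokens processed in one forward/backward pass across all GPUs.
tokens_per_micro_step = micro_batch * seq_len * num_gpus
if total_batch_tokens % tokens_per_micro_step != 0:
    raise ValueError("total batch must be a multiple of the micro-batch token count")

# Micro steps accumulated before each optimizer update.
grad_accum_steps = total_batch_tokens // tokens_per_micro_step
# Tokens seen over the whole run (with repetition if the dataset is smaller).
total_tokens = max_steps * total_batch_tokens

print(f"tokens per micro step: {tokens_per_micro_step:,}")
print(f"gradient accumulation steps: {grad_accum_steps}")
print(f"total training tokens: {total_tokens:,}")
```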

Key settings:

Hardware target: RTX 5090 / SM120
Micro batch: 64
Sequence length: 1024
Total desired batch size: 524,288 tokens
Max steps: 20,000
Optimizer: AdamW as implemented in llm.kittens
Peak learning rate: 6e-4
Scheduler: cosine
Warmup: 700 steps
Final LR fraction: 0.0
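
The schedule settings above can be illustrated with a small function. This is a minimal sketch of a common linear-warmup plus cosine-decay schedule using the listed values; the exact formula used by llm.kittens (for example, how the warmup step is indexed) is an assumption here, and `learning_rate` is an illustrative name rather than a function from that codebase.

```python
import math


def learning_rate(step: int,
                  peak_lr: float = 6e-4,
                  warmup_steps: int = 700,
                  max_steps: int = 20_000,
                  final_frac: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to final_frac * peak_lr."""
    if step < warmup_steps:
        # Assumed convention: the last warmup step already reaches the peak.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = peak_lr * final_frac
    return min_lr + coeff * (peak_lr - min_lr)


for s in (0, 699, 700, 10_350, 20_000):
    print(f"step {s:>6}: lr = {learning_rate(s):.3e}")
```

With a final LR fraction of 0.0, the rate under this sketch reaches zero at the last step, so the final updates of the run are very small.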

Sample

Prompt/sample emitted at the final checkpoint:

text
Once upon a time, there was a little boy named Timmy. Timmy loved going to school and playing with his friends. One day, Timmy woke up and felt very hot. He asked his mom if his head hurt. His mom said it might be burnt. Timmy's mom recommended they switch their shirts outside so he would feel better.

Timmy went outside and saw his friends playing. He wanted to join them, but he remembered his mom's recommendation. He switched his shirt right away and felt much cooler. Timmy was happy he listened to his mom and his friends.

Later, during recess, Timmy's friend asked him to go on the slide.

Files

model.safetensors: BF16 Transformers weights.
config.json: GPT-2 model configuration.
generation_config.json: default generation settings.
tokenizer.json: GPT-2 tokenizer.
vocab.json and merges.txt: GPT-2 BPE vocabulary files.
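
To check what the weights file contains without loading the model, the safetensors header can be read directly. The sketch below relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names and the local path `model.safetensors` are illustrative, and the file is assumed to have been downloaded already.

```python
import json
import os
import struct


def read_safetensors_header(path: str):
    """Return (header_length_in_bytes, parsed_header_dict) for a safetensors file."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header_len, header


def summarize(header: dict):
    """Return (tensor_count, payload_bytes), ignoring the optional metadata entry."""
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    payload = sum(v["data_offsets"][1] - v["data_offsets"][0] for v in tensors.values())
    return len(tensors), payload


path = "model.safetensors"  # hypothetical local path to the downloaded file
if os.path.exists(path):
    header_len, header = read_safetensors_header(path)
    count, payload = summarize(header)
    print(f"header: {header_len:,} bytes, tensors: {count}, payload: {payload:,} bytes")
    print(f"total: {8 + header_len + payload:,} bytes")
```

The printed total should equal the file size on disk, since the format consists only of the length prefix, the header, and the tensor data.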

Loading

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "adamroberts/tinystories-5090"
tokenizer = AutoTokenizer.from_pretrained(model_id)
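# Load the weights in BF16, matching the precision stored in the checkpoint.
# Older Transformers releases take this argument as `torch_dtype` rather than `dtype`.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)

# Tokenize the prompt and sample up to 80 new tokens without tracking gradients.
inputs = tokenizer("Once upon a time", return_tensors="pt")
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=80, do_sample=True, temperature=0.8)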

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Source implementation: https://github.com/adamdroberts/llm.kittens

TinyStories reference paper: https://arxiv.org/abs/2305.07759
