
License: apache-2.0

Model Details

  • Model type: decoder-only causal language model
  • Architecture: Llama-style Transformer
  • Languages: Indonesian and English
  • Parameters: 556,269,696 in the validated final pretraining checkpoint
  • Context length: 2048 tokens
  • Tokenizer: BPE tokenizer trained for this project
  • Vocabulary size: 64,000 tokens in the current pipeline config
  • Hidden size: 1152
  • Layers: 28
  • Attention heads: 16
  • Key/value heads: 8, using grouped-query attention
  • Feed-forward size: 3072, using SwiGLU-style activation
  • Positional encoding: RoPE
  • Normalization: RMSNorm
  • Precision target: bfloat16
  • Validation checkpoint: checkpoints/pretrain/final
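
The parameter count can be cross-checked from the listed dimensions. The sketch below is a back-of-the-envelope calculation, not the project's own code. It assumes a standard Llama-style layout with no bias terms, two RMSNorm weights per layer plus a final norm, and untied input and output embeddings. The function name `count_params` is hypothetical. Under those assumptions the total matches the reported figure exactly.

```python
def count_params(vocab=64_000, hidden=1152, layers=28, heads=16,
                 kv_heads=8, ffn=3072, tied_embeddings=False):
    """Estimate parameters of a Llama-style decoder (assumes no bias terms)."""
    head_dim = hidden // heads                 # 1152 / 16 = 72
    kv_dim = kv_heads * head_dim               # 8 * 72 = 576 (grouped-query attention)
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q and o, plus k and v projections
    mlp = 3 * hidden * ffn                     # SwiGLU: gate, up, and down projections
    norms = 2 * hidden                         # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    embed = vocab * hidden                     # input embedding table
    lm_head = 0 if tied_embeddings else vocab * hidden  # assumption: untied output head
    final_norm = hidden
    return layers * per_layer + embed + lm_head + final_norm


total = count_params()
print(f"{total:,}")  # 556,269,696
```

With `tied_embeddings=True` the same formula gives 73,728,000 fewer parameters, so the reported count is consistent with a separate output projection.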

Training Pipeline

Bhineka-GPT-500M is produced through the following stages:

  1. Dataset download from public Hugging Face datasets
  2. Rule-based cleaning and quality filtering
  3. Exact and MinHash deduplication
  4. BPE tokenizer training
  5. Binary shard creation
  6. Domain-weighted curriculum sampling
  7. Causal language model pretraining
  8. Supervised fine-tuning on instruction/chat datasets
  9. Direct Preference Optimization
  10. Export to Hugging Face format with safetensors
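
Stage 3 combines exact and MinHash deduplication. The sketch below illustrates the general technique using only the Python standard library. It is not the project's implementation: the shingle size, signature length, similarity threshold, and all function names are illustrative assumptions, and the pairwise comparison would be replaced by locality-sensitive hashing on a real corpus.

```python
import hashlib
import re

NUM_HASHES = 64   # signature length (illustrative choice)
SHINGLE_SIZE = 3  # words per shingle (illustrative choice)


def shingles(text):
    """Return the set of overlapping word n-grams in a document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) <= SHINGLE_SIZE:
        return {" ".join(words)}
    return {" ".join(words[i:i + SHINGLE_SIZE])
            for i in range(len(words) - SHINGLE_SIZE + 1)}


def minhash_signature(text):
    """For each seeded hash function, keep the smallest hash over all shingles."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(NUM_HASHES):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in doc_shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Drop exact duplicates first, then documents similar to one already kept."""
    seen_hashes, kept, kept_sigs = set(), [], []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        seen_hashes.add(digest)
        sig = minhash_signature(doc)
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue  # near duplicate
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

Running the exact-hash pass first is cheap and removes verbatim copies before the more expensive signature computation.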

Training Data

The current project configuration targets a bilingual and technical mixture with approximately 12.5B total pretraining tokens:

| Domain | Approximate Target Tokens | Purpose |
|---|---|---|
| English high-quality web | 8.7B | General knowledge, reasoning, writing |
| Indonesian high-quality web | 2.15B | Indonesian language coverage and local text style |
| Code | 1.05B | Python, JavaScript, Go, SQL, and technical generation |
| Math / academic | 600M | Mathematical and academic text exposure |
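
The token targets above imply a sampling proportion for each domain. The following sketch derives those proportions and draws domains from them. It shows static proportional sampling only. How the project's curriculum changes the weights over training is not described here, and the dictionary keys and function names are hypothetical.

```python
import random

# Approximate target tokens per domain, taken from the table above.
TARGET_TOKENS = {
    "english_web": 8.70e9,
    "indonesian_web": 2.15e9,
    "code": 1.05e9,
    "math_academic": 0.60e9,
}


def mixture_weights(targets):
    """Convert token targets into sampling probabilities that sum to 1."""
    total = sum(targets.values())
    return {name: count / total for name, count in targets.items()}


def sample_domains(weights, n, seed=0):
    """Draw n domain names in proportion to their weights (seeded for repeatability)."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[d] for d in names], k=n)


weights = mixture_weights(TARGET_TOKENS)
for name, weight in weights.items():
    print(f"{name:15s} {weight:.3f}")
# english_web     0.696
# indonesian_web  0.172
# code            0.084
# math_academic   0.048

batch_domains = sample_domains(weights, n=8)
```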

Main pretraining sources include FineWeb, FineWeb-Edu, CulturaX Indonesian, mC4 Indonesian, Indonesian Wikipedia, GitHub code subsets, and OpenWebMath.

Instruction tuning data configured in the project includes Alpaca-style and chat-style datasets such as Alpaca Cleaned, Dolly 15k, Alpaca Indonesian, Alpaca GPT-4 Indonesian, OpenHermes 2.5, and SlimOrca.

Intended Uses

This model is intended for research, experimentation, and application prototyping in Indonesian-English language tasks, including:

  • General chat and instruction following
  • Indonesian and English question answering
  • Indonesian-English translation
  • Summarization and rewriting
  • Technical explanation and drafting
  • Python, JavaScript, Go, and SQL code assistance
  • Markdown and structured response generation

Out-of-Scope Uses

This model should not be used as the sole source of truth for high-stakes decisions, including medical, legal, financial, safety-critical, or emergency contexts. It should also not be used to generate harmful instructions, impersonation, spam, fraud, or privacy-invasive content.

Limitations

  • The model may hallucinate facts, citations, code behavior, or numerical details.
  • Performance may vary across Indonesian dialects, informal registers, and domain-specific terminology.
  • The model can reflect biases and quality issues present in public web, code, math, and instruction datasets.
  • Smaller language models may struggle with long reasoning chains, complex tool use, and strict factuality.
  • The reported validation-loss results cover language-modeling loss only; broader instruction-following, safety, factuality, and downstream task evaluations are still recommended before production use.

Evaluation

Validation loss was measured with scripts/run_validation_loss.py on the final pretraining checkpoint:

  • Checkpoint: checkpoints/pretrain/final
  • Evaluation date: 2026-05-31
  • Device: CUDA
  • Evaluation dtype: float32
  • Context length: 2048 tokens
  • Batch size: 4
  • Tokens evaluated: 53,678,481
  • Batches evaluated: 6,558
  • Tokenizer vocabulary: 64,000
  • Model vocabulary: 64,000
  • Random-loss baseline: 11.0666
  • Parameter check: 556,269,696 trainable parameters, no non-finite values reported

| Domain | Loss | Perplexity | Tokens | Batches |
|---|---|---|---|---|
| Overall | 2.5355 | 12.6227 | 53,678,481 | 6,558 |
| Code | 1.4304 | 4.1804 | 15,121,189 | 1,847 |
| English high-quality web | 3.1551 | 23.4543 | 20,635,807 | 2,521 |
| Indonesian high-quality web | 2.7256 | 15.2651 | 11,481,623 | 1,403 |
| Math | 2.8062 | 16.5462 | 6,439,862 | 787 |
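
The reported figures are related by simple formulas: perplexity is the exponential of the loss, the random-loss baseline is the natural log of the vocabulary size, and the overall loss is the token-weighted mean of the per-domain losses. The sketch below checks these relationships using the numbers from this section. It is a consistency check, not a reproduction of scripts/run_validation_loss.py, and the variable names are hypothetical.

```python
import math

VOCAB_SIZE = 64_000

# (loss, tokens) per domain, copied from the results above.
DOMAINS = {
    "code": (1.4304, 15_121_189),
    "english_web": (3.1551, 20_635_807),
    "indonesian_web": (2.7256, 11_481_623),
    "math": (2.8062, 6_439_862),
}

# A model that predicts uniformly over the vocabulary has loss ln(V).
random_baseline = math.log(VOCAB_SIZE)

# The overall loss is the token-weighted average of the domain losses.
total_tokens = sum(tokens for _, tokens in DOMAINS.values())
overall_loss = sum(loss * tokens for loss, tokens in DOMAINS.values()) / total_tokens

# Perplexity is exp(loss).
perplexity = {name: math.exp(loss) for name, (loss, _) in DOMAINS.items()}

print(f"random baseline: {random_baseline:.4f}")  # 11.0666
print(f"total tokens:    {total_tokens:,}")       # 53,678,481
print(f"overall loss:    {overall_loss:.4f}")     # 2.5355
```

Small differences in the last digit of a recomputed perplexity are expected, because the losses listed here are already rounded to four decimals.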

Benchmark Comparison

The following benchmark table compares Bhineka-GPT with several small open-weight language models in the same approximate parameter range. These numbers should be read as an orientation benchmark rather than a perfectly fair leaderboard comparison, because evaluation harness settings, shot count, prompt format, checkpoint type, tokenizer, and instruction tuning status may differ across sources.

For Bhineka-GPT, ARC, HellaSwag, and WinoGrande were evaluated in 0-shot mode, while GSM8K used 5-shot evaluation.

| Model | Params | ARC | HellaSwag | WinoGrande | GSM8K | Notes |
|---|---|---|---|---|---|---|
| Bhineka-GPT | 556M | 24.83 | 31.58 | 48.86 | 1.90 | 0-shot except GSM8K 5-shot |
| Pythia-410M-deduped | ~410M / 0.5B | 27.90 | 40.04 | 52.09 | 0.00 | Open LLM Leaderboard-style evaluation, mostly few-shot [1] |
| Pythia-1B-deduped | 1B | 29.10 | 49.65 | 53.59 | 1.14 | Larger model, trained with substantially more compute and data [2] |
| TinyLlama-1.1B Chat | 1.1B | 36.09 | 61.10 | 61.25 | - | Pretrained on approximately 3T tokens; target training setup reported as 16×A100 for about 90 days [3] |
| TinyLlama 1.1B variant | 1.1B | 30.29 | 55.12 | 55.80 | 0.53 | Fine-tuned variant, Open LLM Leaderboard-style evaluation [4] |
| Qwen2-0.5B | ~0.5B non-embedding | 61.10 | 49.30 | 74.40 | 36.50 | Much more mature model family; not a fair direct comparison against a from-scratch sub-$100 training experiment [5] |

Interpretation:

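  • Bhineka-GPT scores below every listed model on ARC, HellaSwag, and WinoGrande. Its GSM8K score (1.90) is above the Pythia and TinyLlama variant entries but far below Qwen2-0.5B. It remains a useful research baseline for a from-scratch 556M bilingual Indonesian-English model, especially considering its limited training budget.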
  • Larger or more mature models such as TinyLlama, Pythia-1B, and Qwen2-0.5B benefit from more training tokens, more mature infrastructure, and/or larger-scale optimization.
  • The comparison is most useful for positioning Bhineka-GPT as a lightweight experimental bilingual model, not as a claim of state-of-the-art performance.

The validation-loss results in the Evaluation section measure next-token prediction quality only, and the benchmarks above cover a small set of English-language tasks. Recommended additional evaluations before release include Indonesian and English instruction-following benchmarks, translation quality checks, summarization and factuality tests, code generation tests, and safety testing.

Usage

After export and upload, the model can be loaded with Transformers. Because this project defines a custom bhineka model architecture, the model repository may need to include the custom modeling files and be loaded with trust_remote_code=True.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "BhinekaIntiLabs/bhineka-gpt"

# trust_remote_code=True lets Transformers import the custom modeling files from the repo.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",   # use the dtype stored in the checkpoint config
    device_map="auto",    # requires the accelerate package
    trust_remote_code=True,
)

# Single-turn chat prompt. English: "Explain what data deduplication is
# in language model training."
prompt = "<|user|> Jelaskan apa itu deduplikasi data dalam pelatihan model bahasa.<|sep|><|assistant|>"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    add_special_tokens=False,  # the prompt already contains the special tokens
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.1,
    use_cache=False,  # disables the KV cache during generation
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, not the prompt.
completion = outputs[0, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(completion, skip_special_tokens=True))
```

The exporter saves the model in Hugging Face format with safetensors, tokenizer files, config files, and generation config.
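
The usage snippet formats a single-turn prompt by hand. A small helper can keep that format in one place. The sketch below infers the template from that single example prompt, so the tag names and spacing are assumptions rather than a documented specification, and `build_prompt` is a hypothetical helper. If the exported tokenizer ships a chat template, `tokenizer.apply_chat_template` would be the safer choice.

```python
# Special tokens as they appear in the example prompt (assumed, not documented).
USER_TAG = "<|user|>"
SEP_TAG = "<|sep|>"
ASSISTANT_TAG = "<|assistant|>"


def build_prompt(user_message: str) -> str:
    """Format one user turn and leave the assistant turn open for generation."""
    return f"{USER_TAG} {user_message.strip()}{SEP_TAG}{ASSISTANT_TAG}"


prompt = build_prompt("Jelaskan apa itu deduplikasi data dalam pelatihan model bahasa.")
print(prompt)
```

Because the special tokens are already part of the string, the tokenizer should still be called with `add_special_tokens=False`, as in the usage snippet.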

License

This model card declares the apache-2.0 license in the Hugging Face metadata. Please ensure that all training data usage, code dependencies, and released model artifacts are compatible with this license before publishing.

Citation

If you use this model or pipeline, cite the project repository:

```bibtex
@software{bhineka_llm_500m,
  title  = {Bhineka-GPT-500M},
  author = {Bhineka-GPT contributors},
  year   = {2026},
  note   = {Bilingual Indonesian-English language model training pipeline}
}
```

Model provider: BhinekaIntiLabs

Model tree: base (this model)

Modalities: text input, text output
