BhinekaIntiLabs

bhineka-gpt

License: apache-2.0

Model Details

Model type: decoder-only causal language model
Architecture: Llama-style Transformer
Languages: Indonesian and English
Parameters: 556,269,696 in the validated final pretraining checkpoint
Context length: 2048 tokens
Tokenizer: BPE tokenizer trained for this project
Vocabulary size: 64,000 tokens in the current pipeline config
Hidden size: 1152
Layers: 28
Attention heads: 16
Key/value heads: 8, using grouped-query attention
Feed-forward size: 3072, using SwiGLU-style activation
Positional encoding: RoPE
Normalization: RMSNorm
Precision target: bfloat16
Validation checkpoint: checkpoints/pretrain/final
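
The reported parameter count can be cross-checked from the dimensions listed above. The sketch below assumes a standard Llama-style layout that this card does not spell out: bias-free linear layers, two RMSNorm weight vectors per layer plus a final one, a head dimension equal to hidden size divided by attention heads, and an output projection that is not tied to the input embedding. The function name is illustrative and not part of the project code.

```python
def llama_style_param_count(vocab, hidden, layers, heads, kv_heads, ffn,
                            tied_embeddings=False):
    """Count parameters for a bias-free Llama-style decoder (assumed layout)."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim                # grouped-query attention shrinks K and V
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim   # Q and O, then K and V
    mlp = 3 * hidden * ffn                      # SwiGLU: gate, up, and down projections
    norms = 2 * hidden                          # pre-attention and pre-MLP RMSNorm weights
    per_layer = attention + mlp + norms
    embedding = vocab * hidden
    lm_head = 0 if tied_embeddings else vocab * hidden
    return layers * per_layer + embedding + lm_head + hidden  # last term: final RMSNorm


total = llama_style_param_count(
    vocab=64_000, hidden=1152, layers=28, heads=16, kv_heads=8, ffn=3072
)
print(f"{total:,}")  # 556,269,696
```

Under these assumptions the total matches the 556,269,696 parameters reported for the final pretraining checkpoint, which is consistent with an untied output head.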

Training Pipeline

Bhineka-GPT-500M is produced through the following stages:

Dataset download from public Hugging Face datasets
Rule-based cleaning and quality filtering
Exact and MinHash deduplication
BPE tokenizer training
Binary shard creation
Domain-weighted curriculum sampling
Causal language model pretraining
Supervised fine-tuning on instruction/chat datasets
Direct Preference Optimization
Export to Hugging Face format with safetensors
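
As an illustration of the deduplication stage listed above, the following is a minimal sketch of exact plus MinHash near-duplicate filtering in pure Python. It is not the project's implementation: the shingle size, number of hash functions, similarity threshold, and all function names are assumptions chosen for the example, and the pairwise comparison is only practical for small inputs.

```python
import hashlib


def shingles(text, n=5):
    """Return the set of word n-grams used as the document's features."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash64(feature, seed):
    """Seeded 64-bit hash of one feature (the seed selects the hash function)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(features, num_perm=64):
    """Keep the minimum hash per seed; matching slots estimate Jaccard similarity."""
    return [min(_hash64(f, seed) for f in features) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots on which two documents agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Drop exact duplicates by content hash, then near-duplicates by MinHash."""
    seen_hashes, kept, kept_sigs = set(), [], []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of an earlier document
        seen_hashes.add(digest)
        sig = minhash_signature(shingles(doc))
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue  # near-duplicate of a document that was already kept
        kept.append(doc)
        kept_sigs.append(sig)
    return kept


docs = [
    "Deduplikasi menghapus dokumen yang sama dari korpus pelatihan.",
    "Deduplikasi menghapus dokumen yang sama dari korpus pelatihan.",
    "Tokenizer BPE dilatih setelah korpus dibersihkan dan disaring.",
]
print(len(deduplicate(docs)))
```

In this usage example the exact copy is dropped and the unrelated sentence is kept. At corpus scale, comparing every signature against every kept signature is too slow; production pipelines usually bucket signatures with locality-sensitive hashing instead.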

Training Data

The current project configuration targets a bilingual and technical mixture with approximately 12.5B total pretraining tokens:

| Domain | Approximate Target Tokens | Purpose |
| --- | --- | --- |
| English high-quality web | 8.7B | General knowledge, reasoning, writing |
| Indonesian high-quality web | 2.15B | Indonesian language coverage and local text style |
| Code | 1.05B | Python, JavaScript, Go, SQL, and technical generation |
| Math / academic | 600M | Mathematical and academic text exposure |

Main pretraining sources include FineWeb, FineWeb-Edu, CulturaX Indonesian, mC4 Indonesian, Indonesian Wikipedia, GitHub code subsets, and OpenWebMath.
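
The per-domain token targets listed above can be turned into sampling proportions. The sketch below shows one possible way to do that with static weights; it does not model the curriculum schedule used by the project, which this card does not describe, and the names `TARGET_TOKENS` and `sample_domains` are hypothetical.

```python
import random

# Approximate target tokens per domain, as listed in this section.
TARGET_TOKENS = {
    "English high-quality web": 8_700_000_000,
    "Indonesian high-quality web": 2_150_000_000,
    "Code": 1_050_000_000,
    "Math / academic": 600_000_000,
}

total = sum(TARGET_TOKENS.values())
weights = {domain: tokens / total for domain, tokens in TARGET_TOKENS.items()}

for domain, weight in weights.items():
    print(f"{domain}: {weight:.1%}")


def sample_domains(n, seed=0):
    """Draw n domain labels in proportion to the target token counts."""
    rng = random.Random(seed)
    return rng.choices(list(weights), weights=list(weights.values()), k=n)


print(sample_domains(5))
```

The four targets sum to 12.5B tokens, so English web text makes up roughly 70% of the mixture and Indonesian web text roughly 17%.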

Instruction tuning data configured in the project includes Alpaca-style and chat-style datasets such as Alpaca Cleaned, Dolly 15k, Alpaca Indonesian, Alpaca GPT-4 Indonesian, OpenHermes 2.5, and SlimOrca.

Intended Uses

This model is intended for research, experimentation, and application prototyping in Indonesian-English language tasks, including:

General chat and instruction following
Indonesian and English question answering
Indonesian-English translation
Summarization and rewriting
Technical explanation and drafting
Python, JavaScript, Go, and SQL code assistance
Markdown and structured response generation

Out-of-Scope Uses

This model should not be used as the sole source of truth for high-stakes decisions, including medical, legal, financial, safety-critical, or emergency contexts. It should also not be used to generate harmful instructions, impersonation, spam, fraud, or privacy-invasive content.

Limitations

The model may hallucinate facts, citations, code behavior, or numerical details.
Performance may vary across Indonesian dialects, informal registers, and domain-specific terminology.
The model can reflect biases and quality issues present in public web, code, math, and instruction datasets.
Smaller language models may struggle with long reasoning chains, complex tool use, and strict factuality.
The reported validation-loss results cover language-modeling loss only; broader instruction-following, safety, factuality, and downstream task evaluations are still recommended before production use.

Evaluation

Validation loss was measured with scripts/run_validation_loss.py on the final pretraining checkpoint:

Checkpoint: checkpoints/pretrain/final
Evaluation date: 2026-05-31
Device: CUDA
Evaluation dtype: float32
Context length: 2048 tokens
Batch size: 4
Tokens evaluated: 53,678,481
Batches evaluated: 6,558
Tokenizer vocabulary: 64,000
Model vocabulary: 64,000
Random-loss baseline: 11.0666
Parameter check: 556,269,696 trainable parameters, no non-finite values reported

| Domain | Loss | Perplexity | Tokens | Batches |
| --- | --- | --- | --- | --- |
| Overall | 2.5355 | 12.6227 | 53,678,481 | 6,558 |
| Code | 1.4304 | 4.1804 | 15,121,189 | 1,847 |
| English high-quality web | 3.1551 | 23.4543 | 20,635,807 | 2,521 |
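
The relationship between the reported numbers can be checked with a few lines of arithmetic. The sketch below assumes the loss is the mean per-token cross-entropy in nats, which is consistent with the random-loss baseline equalling the natural log of the vocabulary size; perplexity is then the exponential of the loss.

```python
import math

vocab_size = 64_000
random_loss = math.log(vocab_size)  # loss of a uniform guess over the vocabulary
print(f"random-loss baseline: {random_loss:.4f}")  # 11.0666

# (loss, perplexity) pairs copied from the validation results above.
reported = {
    "Overall": (2.5355, 12.6227),
    "Code": (1.4304, 4.1804),
    "English high-quality web": (3.1551, 23.4543),
}

for domain, (loss, perplexity) in reported.items():
    recomputed = math.exp(loss)
    # Losses are rounded to four decimals, so allow a small relative gap.
    assert math.isclose(recomputed, perplexity, rel_tol=1e-3)
    print(f"{domain}: exp({loss}) = {recomputed:.4f} (reported {perplexity})")
```

All three reported perplexities agree with exp(loss) to within the rounding of the loss values, and every domain sits far below the 11.0666 baseline of a model that guesses uniformly.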

Benchmark Comparison

The following benchmark table compares Bhineka-GPT with small open-weight language models in the same approximate parameter range. Read these numbers as a rough orientation rather than a like-for-like leaderboard comparison, because evaluation harness settings, shot count, prompt format, checkpoint type, tokenizer, and instruction-tuning status may differ across sources.

For Bhineka-GPT, ARC, HellaSwag, and WinoGrande were evaluated in 0-shot mode, while GSM8K used 5-shot evaluation.

| Model | Params | ARC | HellaSwag | WinoGrande | GSM8K | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Bhineka-GPT | 556M | 24.83 | 31.58 | 48.86 | 1.90 | 0-shot except GSM8K 5-shot |
| Pythia-410M-deduped | ~410M / 0.5B | 27.90 | 40.04 | 52.09 | 0.00 | |

Interpretation:

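Bhineka-GPT is a usable research baseline for a from-scratch 556M bilingual Indonesian-English model, especially considering its limited training budget. In the table above it trails Pythia-410M-deduped on ARC, HellaSwag, and WinoGrande and scores higher only on GSM8K (1.90 vs 0.00).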
Larger or more mature models such as TinyLlama, Pythia-1B, and Qwen2-0.5B benefit from more training tokens, more mature infrastructure, and/or larger-scale optimization.
The comparison is most useful for positioning Bhineka-GPT as a lightweight experimental bilingual model, not as a claim of state-of-the-art performance.

The validation-loss results in the Evaluation section measure next-token prediction quality on validation data only. Recommended additional evaluations before release include Indonesian and English instruction-following benchmarks, translation quality checks, summarization and factuality tests, code generation tests, and safety testing.

Usage

After export and upload, the model can be loaded with Transformers. Because this project defines a custom bhineka model architecture, the model repository may need to include the custom modeling files and be loaded with trust_remote_code=True.

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "BhinekaIntiLabs/bhineka-gpt"

# trust_remote_code=True lets Transformers load the custom modeling files from the repository.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",   # use the dtype stored with the checkpoint
    device_map="auto",    # requires the accelerate package
    trust_remote_code=True,
)

# The prompt already contains the chat special tokens, so the tokenizer adds none.
prompt = "<|user|> Jelaskan apa itu deduplikasi data dalam pelatihan model bahasa.<|sep|><|assistant|>"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    add_special_tokens=False,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.1,
    use_cache=False,  # KV cache disabled as in the original example; generation is slower without it
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# Keep only the newly generated tokens by slicing off the prompt.
completion = outputs[0, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(completion, skip_special_tokens=True))
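
The prompt string in the example above follows a single-turn layout: a user marker, the message, a separator, and an assistant marker. The helper below is a small sketch that reproduces exactly that layout; the name `build_prompt` is hypothetical, and multi-turn or system-prompt formats are not documented in this card, so they are not covered.

```python
def build_prompt(user_message: str) -> str:
    """Wrap one user turn in the special tokens used by the usage example."""
    return f"<|user|> {user_message.strip()}<|sep|><|assistant|>"


prompt = build_prompt("Jelaskan apa itu deduplikasi data dalam pelatihan model bahasa.")
print(prompt)
```

Because the special tokens are written into the string, tokenize the result with add_special_tokens=False, as the usage example does.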

The exporter saves the model in Hugging Face format with safetensors, tokenizer files, config files, and generation config.

License

This model card declares the apache-2.0 license in the Hugging Face metadata. Please ensure that all training data usage, code dependencies, and released model artifacts are compatible with this license before publishing.

Citation

If you use this model or pipeline, cite the project repository:

bibtex
@software{bhineka_llm_500m,
  title = {Bhineka-GPT-500M},
  author = {Bhineka-GPT contributors},
  year = {2026},
  note = {Bilingual Indonesian-English language model training pipeline}
}
