Highlights
- 🇰🇷 Korean-specialized, from scratch — Llama-3-style architecture (RoPE, GQA, SwiGLU,
RMSNorm), 128K byte-level BPE tokenizer, trained from random initialization.
- 🥇 Beats the size-matched polyglot-ko-1.3b and the larger Tri-1.9B on HAE-RAE and Belebele-Ko
(5-shot), the two Korean-language benchmarks emphasized here. It trails polyglot-ko-1.3b on
KoBEST (commonsense) and KMMLU, trails Tri-1.9B narrowly on KoBEST, and trails the flagship
EXAONE-4.0-1.2B on all four benchmarks.
- 🔬 A data-centric recipe — the corpus chosen for continued pre-training determines which
capability improves (web text → commonsense, Wikipedia → knowledge).
- 📦 Edge-friendly — 1.26B parameters; runs comfortably on a single consumer GPU.
Benchmark Results
Korean benchmarks via the EleutherAI lm-evaluation-harness, 5-shot, accuracy (%). All models
evaluated under identical settings. Bold = best, underline = second best.
| Benchmark | Jumini-Ko-1.2B (1.26B) | polyglot-ko-1.3b (1.43B) | Tri-1.9B (1.9B) | EXAONE-4.0-1.2B† (1.28B) |
|---|---|---|---|---|
| HAE-RAE (Korean knowledge) | <u>21.9</u> | 18.7 | 18.9 | **30.0** |
| Belebele-Ko (reading) | <u>27.9</u> | 22.4 | 22.9 | **44.7** |
| KMMLU (knowledge) | 24.3 | <u>27.8</u> | 16.6 | **32.6** |
| KoBEST (commonsense) | 49.5 | **55.9** | 50.1 | <u>50.6</u> |
† EXAONE-4.0-1.2B is a strong flagship model trained on vastly more data/compute, shown as
an aspirational reference. Against the open same-tier baselines (polyglot-ko-1.3b, Tri-1.9B),
Jumini leads on the Korean-specific HAE-RAE and Belebele-Ko while being the smallest model.
Jumini also beats polyglot-ko-1.3b on 4 of 5 HAE-RAE subtasks (history, loan-word,
rare-word, standard-nomenclature). It trails polyglot-ko-1.3b on commonsense (KoBEST) and broad
knowledge (KMMLU). Full per-subtask numbers are in the technical report.
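The comparison against the same-tier baselines can be checked directly from the table. The sketch below copies the four rows into a dictionary and computes Jumini's margin (in accuracy points) over each baseline; the `margins` helper and the variable names are illustrative, not part of any released tooling.

```python
# 5-shot accuracy (%) copied from the benchmark table above.
scores = {
    "HAE-RAE": {"Jumini-Ko-1.2B": 21.9, "polyglot-ko-1.3b": 18.7, "Tri-1.9B": 18.9},
    "Belebele-Ko": {"Jumini-Ko-1.2B": 27.9, "polyglot-ko-1.3b": 22.4, "Tri-1.9B": 22.9},
    "KMMLU": {"Jumini-Ko-1.2B": 24.3, "polyglot-ko-1.3b": 27.8, "Tri-1.9B": 16.6},
    "KoBEST": {"Jumini-Ko-1.2B": 49.5, "polyglot-ko-1.3b": 55.9, "Tri-1.9B": 50.1},
}
BASELINES = ("polyglot-ko-1.3b", "Tri-1.9B")


def margins(benchmark: str) -> dict:
    """Return Jumini's lead (positive) or deficit (negative) per baseline."""
    row = scores[benchmark]
    return {b: round(row["Jumini-Ko-1.2B"] - row[b], 1) for b in BASELINES}


for name in scores:
    print(f"{name:12s} {margins(name)}")
```

Positive margins on both baselines appear only for HAE-RAE and Belebele-Ko, which matches the claim above; KMMLU is mixed and KoBEST is negative against both.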
Quickstart
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "properly59/Jumini-Ko-1.2B"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float16, device_map="auto")

# Instruction template: "질문" (question), then "답변" (answer).
# The question reads "What is the capital of South Korea?"
prompt = "### 질문:\n대한민국의 수도는 어디인가요?\n\n### 답변:\n"
# BOS is prepended by hand, so the tokenizer must not add special tokens a second time.
ids = tok(tok.bos_token + prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**ids, max_new_tokens=128, do_sample=True, temperature=0.8,
                     min_p=0.05, repetition_penalty=1.2, no_repeat_ngram_size=3,
                     pad_token_id=tok.pad_token_id)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][ids.input_ids.shape[1]:], skip_special_tokens=True))
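For more than one question, the prompt template from the Quickstart can be wrapped in a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, and the template is copied from the Quickstart string rather than from a published chat template.

```python
def build_prompt(question: str, bos_token: str = "") -> str:
    """Format a single-turn question in the Quickstart's template.

    Pass the tokenizer's BOS token as `bos_token` when tokenizing with
    add_special_tokens=False, as the Quickstart does.
    """
    return f"{bos_token}### 질문:\n{question.strip()}\n\n### 답변:\n"


# Reproduces the Quickstart prompt (without the BOS prefix).
prompt = build_prompt("대한민국의 수도는 어디인가요?")
print(repr(prompt))
```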
Model Details
| | |
|---|---|
| Architecture | Decoder-only Transformer (Llama-3 family) |
| Parameters | 1.26B (hidden 2048, 28 layers, 32 Q / 8 KV heads, SwiGLU 4096) |
| Position encoding | RoPE (θ = 500,000) |
| Tokenizer | Byte-level BPE, 128,000 vocab |
| Context length | 4,096 |
| Precision | bf16 / fp16 |
| License | Apache-2.0 |
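The 1.26B figure can be reproduced from the dimensions in the table. The sketch below assumes a standard Llama-style block with no bias terms, a head dimension of hidden / Q heads, and input and output embeddings that share weights. The weight tying is an assumption, not something stated above; it is the setting under which the count matches 1.26B. The same dimensions also give a rough fp16 memory estimate.

```python
# Dimensions from the Model Details table.
HIDDEN, LAYERS = 2048, 28
Q_HEADS, KV_HEADS = 32, 8
FFN, VOCAB, CONTEXT = 4096, 128_000, 4096

head_dim = HIDDEN // Q_HEADS        # 64
kv_dim = KV_HEADS * head_dim        # 512 (GQA: K and V are narrower than Q)

# Per-layer weights (assumption: no biases).
attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim   # Wq, Wo + Wk, Wv
mlp = 3 * HIDDEN * FFN                             # SwiGLU: gate, up, down
norms = 2 * HIDDEN                                 # two RMSNorm scales
per_layer = attn + mlp + norms

# Assumption: tied embeddings, so the vocabulary matrix is counted once.
embed = VOCAB * HIDDEN
total = LAYERS * per_layer + embed + HIDDEN        # + final RMSNorm

BYTES_FP16 = 2
weight_bytes = total * BYTES_FP16
kv_bytes_per_token = 2 * LAYERS * kv_dim * BYTES_FP16   # K and V in every layer
kv_bytes_full = kv_bytes_per_token * CONTEXT

print(f"parameters:        {total:,}")                      # 1,260,505,088
print(f"fp16 weights:      {weight_bytes / 2**30:.2f} GiB")
print(f"KV cache @ 4,096:  {kv_bytes_full / 2**20:.0f} MiB")  # 224 MiB
```

Under these assumptions the weights plus a full-context KV cache for one sequence come to well under 3 GiB, before activations and framework overhead.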
Training
A three-stage, fully documented pipeline on top of the from-scratch base:
- Continued pre-training on a high-quality Korean mixture (FineWeb-2 kor_Hang, KOREAN-WEBTEXT,
Korean Wikipedia), document-boundary packed.
- Encyclopedic annealing on Korean Wikipedia (LR → 0) — the most token-efficient route to
Korean knowledge.
- Supervised fine-tuning on a 132K permissively licensed Korean instruction mixture
(KoAlpaca, OpenOrca-KO, KOpen-Platypus, KULLM-v2), with completion-only loss and explicit EOS
supervision.
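The phrase "completion-only loss and explicit EOS supervision" can be illustrated with a label-masking sketch. This is one common way to build such examples, not necessarily the exact code used here: prompt positions get the label `-100` (the default `ignore_index` of PyTorch's cross-entropy loss), while the completion tokens and a trailing EOS token are kept as targets. `build_sft_example` and its arguments are hypothetical names.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_sft_example(prompt_ids, completion_ids, eos_id, max_len=4096):
    """Return (input_ids, labels) for one instruction example.

    - Prompt tokens are fed to the model but masked out of the loss.
    - Completion tokens are supervised.
    - EOS is appended and supervised, so the model learns when to stop.
    """
    input_ids = list(prompt_ids) + list(completion_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids) + [eos_id]
    # Note: truncation can cut off the EOS target for over-long examples.
    return input_ids[:max_len], labels[:max_len]


# Toy token ids, for illustration only.
ids, labels = build_sft_example(prompt_ids=[11, 12, 13], completion_ids=[21, 22], eos_id=2)
print(ids)     # [11, 12, 13, 21, 22, 2]
print(labels)  # [-100, -100, -100, 21, 22, 2]
```

Causal language model heads in `transformers` shift the labels internally, so `input_ids` and `labels` are kept aligned position by position as shown.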
All continued-pretraining and instruction data are public corpora used only for post-training;
no external pretrained weights are used. A benchmark decontamination check found that 0.00% of
benchmark items were substantially covered by the instruction data, where "substantially covered"
means that at least 50% of an item's 25-character shingles appear in that data.
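A shingle-coverage check of this kind can be sketched as follows. The 25-character window and the 50% threshold come from the text above; everything else (whitespace normalization, a stride of one character, skipping texts shorter than one shingle, and the function names) is an assumption made for illustration.

```python
def shingles(text: str, k: int = 25) -> set:
    """All overlapping k-character substrings after whitespace normalization."""
    text = " ".join(text.split())
    # Assumption: texts shorter than k characters contribute no shingles.
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def coverage(item: str, train_shingles: set, k: int = 25) -> float:
    """Fraction of the item's shingles that also occur in the training data."""
    item_shingles = shingles(item, k)
    if not item_shingles:
        return 0.0
    return len(item_shingles & train_shingles) / len(item_shingles)


def contamination_rate(benchmark_items, train_texts, k=25, threshold=0.5) -> float:
    """Share of benchmark items whose coverage reaches the threshold."""
    train_shingles = set()
    for text in train_texts:
        train_shingles |= shingles(text, k)
    flagged = sum(coverage(item, train_shingles, k) >= threshold
                  for item in benchmark_items)
    return flagged / len(benchmark_items) if benchmark_items else 0.0


# Toy data: the first item is copied verbatim from the "training" text.
train = ["The capital of South Korea is Seoul, a city on the Han River."]
items = [
    "The capital of South Korea is Seoul, a city on the Han River.",
    "An entirely unrelated benchmark question about rare words?",
]
print(contamination_rate(items, train))  # 0.5
```

For large corpora the shingle set would typically be hashed or sharded rather than held in memory as raw strings.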
Intended Use & Limitations
Intended for Korean text generation, QA, summarization, and research on small-model training.
As a compact model trained from scratch under a constrained budget, its factual accuracy is
limited and it can produce incorrect content; greedy decoding is best paired with a repetition
penalty. It trails much larger / higher-budget Korean models (e.g., EXAONE) on knowledge tasks
and has not undergone safety alignment. Use for research and non-critical applications only.
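For deterministic output, the note on greedy decoding translates into generation settings like the ones below. This is a sketch: the penalty values are carried over from the sampling example in the Quickstart on the assumption that they also suit greedy decoding, and `greedy_kwargs` is an illustrative name.

```python
# Greedy decoding paired with a repetition penalty.
# Assumption: the Quickstart's penalty settings are reused unchanged.
greedy_kwargs = dict(
    max_new_tokens=128,
    do_sample=False,           # greedy: always take the most likely token
    repetition_penalty=1.2,    # discourages repeating earlier tokens
    no_repeat_ngram_size=3,    # forbids repeating any 3-gram verbatim
)

# Usage with the objects defined in the Quickstart:
# out = model.generate(**ids, **greedy_kwargs, pad_token_id=tok.pad_token_id)
print(greedy_kwargs)
```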
Citation
@techreport{jumini2026,
title = {Jumini-Ko-1.2B Technical Report},
author = {Cho, Ju-min},
year = {2026},
note = {https://huggingface.co/properly59/Jumini-Ko-1.2B}
}