
License: apache-2.0

Model Details

| Property | Value |
|---|---|
| Architecture | Mixtral-style MoE (8 experts, top-2 routing) |
| Parameters | 14.83B total / ~7.42B active per token |
| Layers | 24 |
| Hidden size | 4096 |
| Attention heads | 32 (GQA, 8 KV heads) |
| Head dim | 128 |
| Expert intermediate size | 5,632 |
| Experts | 8 total, top-2 per token |
| Context length | 4,096 tokens |
| Vocabulary | 131,074 (131,072 SPM + 2 ChatML special tokens) |
| RoPE theta | 500,000 |
| Sliding window | 512 (alternating layers) |
| Norm | RMSNorm (eps=1e-5) |
| Activation | SiLU |
| Dtype | bfloat16 |
| Languages | Korean (primary), English |
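
As a sanity check on the table above, the total parameter count can be rebuilt from the listed shapes. This is a rough sketch, not the author's accounting: it assumes a standard Mixtral-style layer (bias-free projections, three matrices per expert, two RMSNorm vectors per layer) and, as a further assumption not stated in the card, that the input embedding and output head share one matrix.

```python
# Shapes taken from the Model Details table.
hidden = 4096
layers = 24
n_heads, n_kv_heads, head_dim = 32, 8, 128
expert_inter = 5632
n_experts = 8
vocab = 131_074

# Attention: Q and O are full width; K and V use the 8 KV heads (GQA).
attn = 2 * hidden * n_heads * head_dim + 2 * hidden * n_kv_heads * head_dim
# One gated expert MLP has gate, up and down projections (assumed layout).
expert = 3 * hidden * expert_inter
router = hidden * n_experts
norms = 2 * hidden  # two RMSNorm weight vectors per layer

per_layer = attn + n_experts * expert + router + norms
# Assumption: tied embeddings, so the vocab matrix is counted once.
total = layers * per_layer + vocab * hidden + hidden  # last term: final norm

print(f"{total:,} parameters (~{total / 1e9:.2f}B)")  # ~14.83B
```

Under these assumptions the total lands at about 14.83B, in line with the table. Counting the vocabulary matrix twice (untied embeddings) would give a larger figure.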

Full Training Pipeline

| Stage | Steps | Tokens | Data | Hardware |
|---|---|---|---|---|
| Pretraining Stage 1 | 100,000 | ~50B | Korean + English web corpus | 2× H200 SXM |
| Pretraining Stage 2 | 120,000 | ~13B | Korean + English web corpus (continued) | 2× H200 SXM |
| SFT Epoch 1 | 18,000 | 710M | keural-SFT 1.14M ChatML samples | 2× H200 SXM |
| DPO Round 1 | 6,927 | | 440K Korean preference pairs | 2× H200 SXM |
| SFT Epoch 2 | 29,112 | 7.63B | keural-SFT 710K samples (2nd pass) | 2× H200 SXM |
| SFT Epoch 3 (this checkpoint) | 50,000 / 65,849 | ~18B | 2.35M merged ChatML dataset | 2× H200 SXM |

SFT Epoch 3 Training Details

| Hyperparameter | Value |
|---|---|
| Resumed from | checkpoint_29112 (SFT epoch 2 final) |
| Learning rate | 1e-5 → 1e-6 cosine decay |
| Min learning rate | 1e-6 |
| Current LR at 50K | 2.19e-06 |
| Effective batch size | 64 (4 per GPU × 8 grad accum × 2 GPUs) |
| Max sequence length | 4,096 tokens |
| Weight decay | 0.05 |
| Gradient clipping | 1.0 |
| Optimizer | AdamW |
| Checkpoint step | 50,000 (76.4% of epoch) |
| Total epoch steps | 65,849 |
| Training loss at 50K | ~2.01 |
| Parallelism | FSDP FULL_SHARD (ZeRO-3 equivalent) |
| Precision | bfloat16 + gradient checkpointing |
| Hardware | 2× NVIDIA H200 SXM (139 GiB each) |
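
To make the learning-rate and batch-size rows concrete, here is a minimal sketch of a plain cosine decay from 1e-5 to 1e-6 over the 65,849 steps of the epoch. The function name `cosine_lr` is hypothetical, and the card does not say whether the real schedule includes warmup or a step offset, so this should be read as an approximation of the schedule rather than the training code.

```python
import math

TOTAL_STEPS = 65_849        # total epoch steps from the table
MAX_LR, MIN_LR = 1e-5, 1e-6

def cosine_lr(step, total_steps=TOTAL_STEPS, max_lr=MAX_LR, min_lr=MIN_LR):
    """Plain cosine decay with no warmup (an assumption, see above)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Effective batch size: per-GPU batch x gradient accumulation x GPUs.
effective_batch = 4 * 8 * 2

print(effective_batch)               # 64
print(f"{cosine_lr(50_000):.2e}")    # low 2e-06 range
```

With these assumptions the value at step 50,000 comes out close to, but not exactly equal to, the reported 2.19e-06; a small gap of that kind would be expected if the actual scheduler counts steps differently.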

SFT Epoch 3 Dataset (2,351,212 samples)

| Source | Samples | Language |
|---|---|---|
| OpenHermes-2.5 | 1,001,551 | English |
| SlimOrca | 517,982 | English |
| UltraChat | 193,212 | English |
| OpenOrca | 138,639 | English |
| AIHub multisession sci | 127,868 | Korean |
| AIHub daily conversation | 120,867 | Korean |
| AIHub multisession social | 85,346 | Korean |
| Alpaca | 46,303 | English |
| KoInstruct QA | 45,299 | Korean |
| KoInstruct base | 42,276 | Korean |
| KoAlpaca | 21,091 | Korean |
| AIHub expert QA | 10,778 | Korean |
| Total | 2,351,212 | Korean ~19% / English ~81% |
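
The total and the language split can be checked directly from the per-source counts. The snippet below only re-adds the numbers listed in the table; the dictionary name `sources` is illustrative.

```python
# (samples, language) per source, copied from the dataset table.
sources = {
    "OpenHermes-2.5": (1_001_551, "English"),
    "SlimOrca": (517_982, "English"),
    "UltraChat": (193_212, "English"),
    "OpenOrca": (138_639, "English"),
    "AIHub multisession sci": (127_868, "Korean"),
    "AIHub daily conversation": (120_867, "Korean"),
    "AIHub multisession social": (85_346, "Korean"),
    "Alpaca": (46_303, "English"),
    "KoInstruct QA": (45_299, "Korean"),
    "KoInstruct base": (42_276, "Korean"),
    "KoAlpaca": (21_091, "Korean"),
    "AIHub expert QA": (10_778, "Korean"),
}

total = sum(n for n, _ in sources.values())
korean = sum(n for n, lang in sources.values() if lang == "Korean")
english = total - korean

print(total)                                # 2351212
print(f"Korean  {korean / total:.1%}")      # Korean  19.3%
print(f"English {english / total:.1%}")     # English 80.7%
```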

Chat Format (ChatML)

```markdown
<|im_start|>system
You are a helpful bilingual Korean-English assistant. Always respond in the same language as the user.<|im_end|>
<|im_start|>user
안녕하세요! 파이썬 리스트 정렬 방법을 알려주세요.<|im_end|>
<|im_start|>assistant
```
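
For cases where the prompt has to be assembled by hand, the layout above can be sketched as a small helper. The function `build_chatml_prompt` is hypothetical, and the exact newline handling is inferred from the example; the tokenizer's own chat template remains the authoritative source.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the ChatML layout shown above."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful bilingual Korean-English assistant."},
    {"role": "user", "content": "안녕하세요!"},
])
print(prompt)
```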

How to Use

With vLLM (recommended)

```bash
python -m vllm.entrypoints.openai.api_server \
  --model mkd-hossain/keural-sft3-50k \
  --dtype auto \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.7
```

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="mkd-hossain/keural-sft3-50k",
    messages=[
        {"role": "system", "content": "You are a helpful bilingual Korean-English assistant. Always respond in the same language as the user."},
        {"role": "user", "content": "인공지능이란 무엇인가요?"},
    ],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
```

With transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mkd-hossain/keural-sft3-50k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful bilingual Korean-English assistant."},
    {"role": "user", "content": "파이썬 리스트 정렬 방법을 알려주세요."},
]

# Render the conversation in ChatML and open the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
        eos_token_id=131073,  # chat end-of-turn token, not <eos> (ID 2)
    )

# Decode only the newly generated tokens, then cut at the end-of-turn marker.
response = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
response = response.split("<|im_end|>")[0].strip()
print(response)
```

Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|im_start\|>` | | |
| `<\|im_end\|>` | | |
| `<bos>` | 1 | Beginning of sequence |
| `<eos>` | 2 | End of sequence (not used for chat) |
| `<pad>` | 0 | Padding |

Always set `eos_token_id=131073` for chat generation; do not use ID 2 (`<eos>`).
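
When post-processing raw token IDs outside of `generate`, the same rule can be applied by hand. The helper below is a hypothetical sketch: it stops at the chat end-of-turn ID 131073 and deliberately ignores ID 2, as the note above requires.

```python
CHAT_EOS_ID = 131073  # end-of-turn ID named in this model card

def truncate_at_chat_eos(token_ids, eos_id=CHAT_EOS_ID):
    """Return the tokens that precede the first chat end-of-turn ID."""
    kept = []
    for token_id in token_ids:
        if token_id == eos_id:
            break
        kept.append(token_id)
    return kept

print(truncate_at_chat_eos([17, 42, 131073, 99]))  # [17, 42]
print(truncate_at_chat_eos([17, 2, 42]))           # [17, 2, 42]
```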

Checkpoint Comparison

| Checkpoint | Stage | Steps | Progress |
|---|---|---|---|
| mkd-hossain/keural-pretrained | Pretraining | 120,000 | Base model |
| mkd-hossain/keural-sft-18k | SFT Epoch 1 | 18,000 | Initial instruction tuning |
| mkd-hossain/keural-dpo-final | DPO Round 1 | 6,927 | Alignment |
| mkd-hossain/keural-sft2 | SFT Epoch 2 | 29,112 | 2nd SFT pass |
| mkd-hossain/keural-sft3-40k | SFT Epoch 3 | 40,000 | 60.7% of epoch 3 |
| mkd-hossain/keural-sft3-50k | SFT Epoch 3 | 50,000 | 76.4% of epoch 3 |

Limitations

  • Maximum context is 4,096 tokens.
  • This is an intermediate checkpoint — epoch 3 completes at step 65,849.
  • Not safety-aligned — do not deploy in production without additional safety fine-tuning.
  • DPO round 2 (485,793 pairs) is planned after SFT epoch 3 completes.
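
Because the prompt and the generated tokens share the 4,096-token window, it can help to cap the generation length per request. A minimal sketch, with a hypothetical helper name and the default of 512 new tokens borrowed from the usage examples:

```python
MAX_CONTEXT = 4096  # maximum context length of this checkpoint

def clamp_new_tokens(prompt_tokens, requested=512, max_context=MAX_CONTEXT):
    """Limit the generation length so prompt + output fits in the context window."""
    if prompt_tokens >= max_context:
        raise ValueError("prompt already fills the context window")
    return min(requested, max_context - prompt_tokens)

print(clamp_new_tokens(100))    # 512
print(clamp_new_tokens(4000))   # 96
```

The prompt length would typically come from the tokenizer, for example the length of the `input_ids` produced for the rendered chat prompt.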

License

Apache 2.0

Model provider

mkd-hossain


Modalities

Input

Text

Output

Text
