mayflowergmbh

boldt-dc-1b-german-it-16k

License: apache-2.0

Usage

The tokenizer ships with a registered chat template. Standard transformers usage:

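```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "mayflowergmbh/boldt-dc-1b-german-it-16k"
tok = AutoTokenizer.from_pretrained(model_id)
# Load in bfloat16 on the GPU. Older transformers releases name this argument torch_dtype.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda")

# Render the conversation with the registered chat template and open an assistant turn.
messages = [{"role": "user", "content": "Wie heißt die Hauptstadt von Frankreich?"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; the stop tokens come from the model's generation_config.
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens; special tokens are kept so the stop token stays visible.
print(tok.decode(out[0, inputs.input_ids.shape[-1]:], skip_special_tokens=False))
```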

The model's generation_config already sets eos_token_id = [0, 32003], so generation stops on either <|endoftext|> (id 0, the pretraining EOS) or <|end|> (id 32003, the chat-format EOS). The model emits both in practice, so any custom stopping logic must treat both as end-of-sequence tokens.
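When decoding runs through a custom loop or a serving stack that does not read generation_config, the same stop rule can be applied by hand. The following is a minimal sketch that assumes only the two ids quoted above; `truncate_at_eos` is a hypothetical helper, not part of transformers.

```python
from typing import Iterable, List, Sequence

# Stop ids quoted in this README: 0 = <|endoftext|>, 32003 = <|end|>.
EOS_IDS = (0, 32003)


def truncate_at_eos(token_ids: Sequence[int], eos_ids: Iterable[int] = EOS_IDS) -> List[int]:
    """Return token_ids up to, but not including, the first stop token."""
    stop = set(eos_ids)
    kept = []
    for tid in token_ids:
        if tid in stop:
            break
        kept.append(tid)
    return kept


# Hypothetical generated ids; 17, 42 and 99 are arbitrary placeholder tokens.
print(truncate_at_eos([17, 42, 32003, 99]))  # [17, 42]
print(truncate_at_eos([17, 0, 42]))          # [17]
```

Checking against a set of ids rather than a single `eos_token_id` is the point: a loop that only watches for one of the two would run past the other.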

Chat format

```markdown
<|system|>
{system message}
<|user|>
{user message}
<|assistant|>
{assistant response}
<|end|>
```

<|system|> is optional. Multi-turn appends additional <|user|> / <|assistant|> blocks before the closing <|end|>.
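For readers who want to see the layout assembled outside of transformers, here is a rough sketch of a renderer. `render_chat` is a hypothetical helper, and the exact whitespace (one newline after each role tag and each message) is an assumption read off the layout above; the tokenizer's registered chat template remains the source of truth.

```python
from typing import Dict, List


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render messages in the <|role|> layout described in this README.

    Assumption: one newline after each role tag and after each message body.
    """
    parts = []
    for msg in messages:
        if msg["role"] not in ("system", "user", "assistant"):
            raise ValueError(f"unsupported role: {msg['role']}")
        parts.append(f"<|{msg['role']}|>\n{msg['content']}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|assistant|>\n")
    else:
        # A finished conversation is closed by a single <|end|>.
        parts.append("<|end|>\n")
    return "".join(parts)


print(render_chat([{"role": "user", "content": "Hallo"}], add_generation_prompt=True))
```

For real inference, `tok.apply_chat_template` should be preferred, since any mismatch in whitespace or special tokens can change model behaviour.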

Training

Pipeline:

1. CPT-16K (continued pretraining): YaRN RoPE scaling extends the context from 2048 to 16384 tokens, trained on long German text for ~5000 steps with LoRA r=64 on Q/K/V/O/MLP/embed/lm_head.
2. SFT-16K: instruction tuning on German chat data for 7000 steps at 16K context, using plain transformers + peft (not Unsloth, which had bugs at this scale; see model history).
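The numbers in the pipeline translate into a small amount of configuration. The sketch below is illustrative only: the `rope_scaling` keys follow the Hugging Face convention for YaRN, the LoRA keys mirror the `r` and `target_modules` arguments of `peft.LoraConfig`, and the Llama-style module names are an assumption about the base architecture. Hyperparameters not stated in this README (alpha, dropout, learning rate) are deliberately left out.

```python
# Context lengths from the CPT-16K stage.
ORIGINAL_CTX = 2048
TARGET_CTX = 16384

# YaRN scaling factor = target length / original length.
yarn_factor = TARGET_CTX / ORIGINAL_CTX

# Assumption: Hugging Face style rope_scaling entry for a Llama-like config.
rope_scaling = {
    "rope_type": "yarn",
    "factor": yarn_factor,
    "original_max_position_embeddings": ORIGINAL_CTX,
}

# Assumption: Llama-style module names for Q/K/V/O, MLP, embeddings and lm_head.
lora_settings = {
    "r": 64,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",
    ],
}

print(yarn_factor)  # 8.0
```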

ORPO post-training was investigated but did not produce a measurable improvement on tier-1 benchmarks at this model size and was not included in this release.

Evaluation (lm-evaluation-harness, German tier 1)

| Task | Boldt-DC-1B (base) | This model (v3 SFT) |
|---|---|---|
| arc_de (25-shot) | 0.362 | 0.332 |
| hellaswag_de (10-shot) | 0.504 | 0.466 |
| m_mmlu_de (5-shot) | 0.256 | 0.249 |
| truthfulqa_de_mc2 (0-shot) | 0.373 | 0.415 |
| belebele_deu_Latn (0-shot) | 0.229 | 0.228 |
| mean | 0.345 | |

Instruction tuning gains 4.2 percentage points (pp) on TruthfulQA-de while losing 3.0 pp on ARC-de and 3.8 pp on HellaSwag-de; m_mmlu_de and belebele_deu_Latn each move by less than 1 pp. This trade-off is typical of 1B-class instruct models (cf. SmolLM2-1.7B-Instruct: −8.8 pp ARC, −2.6 pp HellaSwag).
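The deltas quoted above can be rechecked directly from the table. The short script below copies the published scores and computes the per-task change in percentage points along with the column means; nothing in it is new data.

```python
# Scores copied from the evaluation table: (base, this model).
scores = {
    "arc_de": (0.362, 0.332),
    "hellaswag_de": (0.504, 0.466),
    "m_mmlu_de": (0.256, 0.249),
    "truthfulqa_de_mc2": (0.373, 0.415),
    "belebele_deu_Latn": (0.229, 0.228),
}

# Delta in percentage points: (tuned - base) * 100, rounded to one decimal.
deltas_pp = {task: round((sft - base) * 100, 1) for task, (base, sft) in scores.items()}

base_mean = round(sum(b for b, _ in scores.values()) / len(scores), 3)
sft_mean = round(sum(s for _, s in scores.values()) / len(scores), 3)

for task, delta in deltas_pp.items():
    print(f"{task:20s} {delta:+.1f} pp")
print(f"mean: base {base_mean:.3f}, tuned {sft_mean:.3f}")
```

The base mean reproduces the 0.345 listed in the table; the same calculation for the tuned column gives 0.338.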

Known limitations

- Math is weak. At 1.25 B parameters, the model's arithmetic is unreliable.
- Factual recall is limited. As with other 1 B-class models, subtle factual errors are common in long-form answers.
- No tool-use / function-calling training.
- Long-context use beyond ~8 K is untested in this release.
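Because the long-context caveat is easy to overlook in application code, a small guard can surface it at call time. This is only a sketch: `check_context_budget` is a hypothetical helper, and reading "~8 K" as 8192 tokens is an assumption.

```python
import warnings

TESTED_CTX = 8192    # assumption: "~8 K" read as 8192 tokens
TRAINED_CTX = 16384  # context length used in the CPT-16K and SFT-16K stages


def check_context_budget(prompt_tokens: int, max_new_tokens: int) -> int:
    """Return the total token count, warning or raising if it is too long."""
    total = prompt_tokens + max_new_tokens
    if total > TRAINED_CTX:
        raise ValueError(f"{total} tokens exceeds the {TRAINED_CTX}-token training context")
    if total > TESTED_CTX:
        warnings.warn(f"{total} tokens is beyond the ~8 K range tested in this release")
    return total


print(check_context_budget(prompt_tokens=1000, max_new_tokens=200))  # 1200
```

In practice `prompt_tokens` would come from the tokenized prompt, for example `inputs.input_ids.shape[-1]` in the usage snippet.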

License

Apache-2.0 (inherits from the base model).

Model Details

Model Provider

mayflowergmbh

Model Tree

Base

Boldt/Boldt-DC-1B

Fine-tuned

this model

Input Modalities

Text

Output Modalities