Usage
The tokenizer ships with a registered chat template. Standard transformers usage:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "mayflowergmbh/boldt-dc-1b-german-it-16k"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda")
messages = [{"role": "user", "content": "Wie heißt die Hauptstadt von Frankreich?"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tok.decode(out[0, inputs.input_ids.shape[-1]:], skip_special_tokens=False))
The model's generation_config already sets eos_token_id = [0, 32003], so generation stops on either <|endoftext|> (id 0, pretrained EOS) or <|end|> (id 32003, chat-format EOS). The model emits both in practice, so any custom stopping logic must recognise both.
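When token ids are post-processed outside of generate() (for example in a custom decoding loop), the same two-id stop rule has to be applied by hand. The following is a minimal sketch of that rule; `truncate_at_eos` is a hypothetical helper, not part of transformers, and the token ids in the usage lines are toy values rather than real model output.

```python
from typing import Iterable, List

# Stop ids as stated above: 0 = <|endoftext|>, 32003 = <|end|>.
EOS_IDS = frozenset({0, 32003})


def truncate_at_eos(token_ids: Iterable[int], eos_ids=EOS_IDS) -> List[int]:
    """Return the tokens that precede the first EOS id (hypothetical helper)."""
    kept = []
    for tid in token_ids:
        if tid in eos_ids:
            break  # stop on whichever EOS id appears first
        kept.append(tid)
    return kept


# Toy ids for illustration only.
print(truncate_at_eos([17, 42, 32003, 99]))  # [17, 42]
print(truncate_at_eos([17, 0, 42]))          # [17]
```

The trimmed ids can then be passed to tok.decode as usual.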
The chat template renders conversations in the following format:
<|system|>
{system message}
<|user|>
{user message}
<|assistant|>
{assistant response}
<|end|>
The <|system|> block is optional. Multi-turn conversations append additional <|user|> / <|assistant|> blocks before the closing <|end|>.
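To make the layout concrete, here is a rough pure-Python sketch that assembles a conversation in this format. `render_chat` is a hypothetical helper, and the exact newline placement is an assumption read off the layout above; the registered template used by tok.apply_chat_template remains the authoritative source.

```python
from typing import Dict, List

ROLES = ("system", "user", "assistant")


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in the documented layout (whitespace is an assumption)."""
    parts = []
    for msg in messages:
        role = msg["role"]
        if role not in ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        # Each block is a role tag on its own line followed by the content.
        parts.append(f"<|{role}|>\n{msg['content']}\n")
    if add_generation_prompt:
        # Open an assistant block for the model to complete.
        parts.append("<|assistant|>\n")
    else:
        # A finished conversation is closed by a single <|end|>.
        parts.append("<|end|>")
    return "".join(parts)


print(render_chat([{"role": "user", "content": "Hallo"}]))
```

For real prompts, prefer tok.apply_chat_template so that the string matches what the model saw during training.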
Training
Pipeline:
- CPT-16K (continued pretraining): YaRN RoPE 2048→16384 on long German text, ~5000 steps, LoRA r=64 on Q/K/V/O/MLP/embed/lm_head.
- SFT-16K: instruction tuning on German chat data, 7000 steps at 16K context. Plain transformers + peft (not Unsloth, which had bugs at this scale; see model history).
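The context extension in the CPT-16K step implies a RoPE scaling factor of 16384 / 2048 = 8. The sketch below shows that arithmetic and how such a factor is typically expressed in a transformers-style `rope_scaling` dictionary. The key names are an assumption about the config layout and were not copied from this model's config.json.

```python
ORIGINAL_CONTEXT = 2048    # context length before extension (from the pipeline above)
EXTENDED_CONTEXT = 16384   # context length after CPT-16K

# YaRN scaling factor: ratio of the extended to the original context length.
yarn_factor = EXTENDED_CONTEXT / ORIGINAL_CONTEXT

# Assumed transformers-style layout; check the released config.json for the real values.
rope_scaling = {
    "rope_type": "yarn",
    "factor": yarn_factor,
    "original_max_position_embeddings": ORIGINAL_CONTEXT,
}

print(rope_scaling)
```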
ORPO post-training was investigated but did not produce a measurable improvement on tier-1 benchmarks at this model size and was not included in this release.
Evaluation (lm-evaluation-harness, German tier 1)
| Task | Boldt-DC-1B (base) | This model (v3 SFT) |
|---|---|---|
| arc_de (25-shot) | 0.362 | 0.332 |
| hellaswag_de (10-shot) | 0.504 | 0.466 |
| m_mmlu_de (5-shot) | 0.256 | 0.249 |
| truthfulqa_de_mc2 (0-shot) | 0.373 | 0.415 |
| belebele_deu_Latn (0-shot) | 0.229 | 0.228 |
| mean | 0.345 | 0.338 |
Instruction tuning gains 4.2 pp on TruthfulQA-de while losing 3.0 pp on ARC-de and 3.8 pp on HellaSwag-de. This is the expected trade-off for 1B-class instruct models (cf. SmolLM2-1.7B-Instruct: −8.8 pp ARC, −2.6 pp HellaSwag) and is comparable to other published deltas for small instruct models.
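The deltas quoted above can be re-derived from the table. The following short sketch recomputes the per-task differences in percentage points and the column means; the scores are copied from the table, and the helper name `delta_pp` is illustrative.

```python
# (base, SFT) scores copied from the evaluation table.
scores = {
    "arc_de": (0.362, 0.332),
    "hellaswag_de": (0.504, 0.466),
    "m_mmlu_de": (0.256, 0.249),
    "truthfulqa_de_mc2": (0.373, 0.415),
    "belebele_deu_Latn": (0.229, 0.228),
}


def delta_pp(base: float, tuned: float) -> float:
    """Difference tuned - base in percentage points, rounded to one decimal."""
    return round((tuned - base) * 100, 1)


deltas = {task: delta_pp(base, tuned) for task, (base, tuned) in scores.items()}
base_mean = round(sum(b for b, _ in scores.values()) / len(scores), 3)
sft_mean = round(sum(t for _, t in scores.values()) / len(scores), 3)

for task, d in deltas.items():
    print(f"{task}: {d:+.1f} pp")
print(f"mean: base {base_mean:.3f}, SFT {sft_mean:.3f}")
```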
Known limitations
- Math is weak. Arithmetic is unreliable at 1.25 B parameters.
- Factual recall is limited. Like other 1 B-class models, it often makes subtle factual errors in long-form answers.
- No tool-use / function-calling training.
- Long-context use beyond ~8 K is untested in this release.
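Given the long-context caveat above, callers may want to flag requests that leave the tested range. This is a small sketch of such a guard. `context_status` is a hypothetical helper, and reading "~8 K" as exactly 8192 tokens is an assumption.

```python
TESTED_CONTEXT = 8192     # assumption: "~8 K" interpreted as 8192 tokens
TRAINED_CONTEXT = 16384   # 16K training context from the pipeline description


def context_status(prompt_tokens: int, max_new_tokens: int) -> str:
    """Classify a request by its total token budget (hypothetical helper)."""
    total = prompt_tokens + max_new_tokens
    if total > TRAINED_CONTEXT:
        return "exceeds_trained_context"
    if total > TESTED_CONTEXT:
        return "untested_range"
    return "tested_range"


print(context_status(prompt_tokens=1000, max_new_tokens=200))
print(context_status(prompt_tokens=9000, max_new_tokens=200))
```

In the usage example, the prompt length would be inputs.input_ids.shape[-1].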
License
Apache-2.0 (inherits from the base model).