Usage
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "mayflowergmbh/boldt-dc-1b-german-it-16k-dpo"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda")
messages = [{"role": "user", "content": "Erkläre kurz, was eine Funktion in Python ist."}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tok.decode(out[0, inputs.input_ids.shape[-1]:], skip_special_tokens=False))
The model's generation_config sets eos_token_id = [0, 32003], so generation stops on either <|endoftext|> or <|end|>.
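Because the snippet above decodes with skip_special_tokens=False, the stop token itself can appear in the printed text. A minimal sketch of cutting a generated id sequence at the first stop id, assuming only the two ids quoted above; the helper name `truncate_at_eos` is hypothetical and not part of transformers:

```python
# Stop ids as quoted in this card (covering <|endoftext|> and <|end|>).
EOS_IDS = frozenset({0, 32003})


def truncate_at_eos(token_ids, eos_ids=EOS_IDS):
    """Return the ids before the first stop id (hypothetical helper).

    If no stop id occurs, the sequence is returned unchanged.
    """
    for i, tid in enumerate(token_ids):
        if tid in eos_ids:
            return list(token_ids[:i])
    return list(token_ids)


# Toy ids for illustration only; real ids come from model.generate().
generated = [514, 2201, 77, 32003, 0, 0]
print(truncate_at_eos(generated))
```

In the usage snippet, this would be applied to `out[0, inputs.input_ids.shape[-1]:].tolist()` before calling `tok.decode`.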
Recipe
- SFT model (mayflowergmbh/boldt-dc-1b-german-it-16k): plain transformers+peft SFT of Boldt/Boldt-DC-1B, 7000 steps at 16K context.
- DPO checkpoint: TRL DPOTrainer, loss_type="sigmoid", β=0.3, rpo_alpha=0.5 (an NLL anchor on the chosen response, which prevents the "response suppression" failure mode documented in 3D-Properties of DPO, Yan et al. 2024, arXiv:2406.07327). LR 5e-7, 800 steps, LoRA r=32 on QKV/MLP. Dataset: mayflowergmbh/boldt-dc-1b-orpo-onpolicy-de, length-filtered to |chosen|/|rejected| ≤ 3 (54k → 22k pairs).
- SLERP merge via mergekit at t=0.5, dtype=bfloat16, tokenizer_source: union. ~30 seconds on a single A6000.
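To make the DPO step concrete, here is a minimal per-example sketch of a sigmoid DPO loss with an added NLL term on the chosen response. It is an illustration under stated assumptions, not TRL's code: the function names are hypothetical, the inputs are summed sequence log-probabilities, the NLL is length-normalised, and the two terms are combined as `dpo + rpo_alpha * nll`. How TRL weights the two terms has differed between versions, so check the installed release before relying on the exact form.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_sigmoid_with_nll(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
                         chosen_len, beta=0.3, rpo_alpha=0.5):
    """Per-example loss (hypothetical helper, not TRL's implementation).

    pol_* / ref_* are summed log-probabilities of the full response under
    the policy and the frozen reference model; chosen_len is the number of
    tokens in the chosen response.
    """
    # Implicit reward margin between chosen and rejected responses.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    dpo = -log_sigmoid(beta * margin)
    # NLL anchor: keeps the likelihood of the chosen response from collapsing
    # while the margin is being widened.
    nll = -pol_chosen / chosen_len
    return dpo + rpo_alpha * nll


# Policy equal to reference: margin is 0, so the DPO term is log(2).
print(dpo_sigmoid_with_nll(-20.0, -30.0, -20.0, -30.0, chosen_len=10))
```

Without the NLL term, the margin can also grow by lowering the likelihood of both responses, as long as the rejected one drops faster; the anchor penalises that route.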
Why merge: the SFT model preserves more reasoning capacity (commonsense benchmarks regress less than under pure DPO), while the DPO model has slightly better chat-style behaviour. SLERP at t=0.5 recovers both. The LFM2 paper documents the same observation for full-model merging at the 1.2B scale.
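For readers unfamiliar with SLERP, the interpolation can be sketched on plain Python lists. This assumes each weight tensor is treated as a flat vector; mergekit works on tensors and may treat edge cases differently, and the function name `slerp` here is only illustrative.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two equal-length vectors."""
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    lerp = [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    if norm0 < eps or norm1 < eps:
        return lerp  # angle undefined for a zero vector
    # Angle between the two directions, clamped against rounding error.
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    sin_theta = math.sin(theta)
    if abs(sin_theta) < eps:
        return lerp  # (anti-)parallel vectors: fall back to linear interpolation
    w0 = math.sin((1 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


# Midpoint of two orthogonal unit vectors stays on the unit circle.
print(slerp(0.5, [1.0, 0.0], [0.0, 1.0]))
```

For two orthogonal unit vectors, plain averaging would give a vector of norm about 0.71, whereas the SLERP midpoint keeps norm 1. For two checkpoints that share a parent, the angle between corresponding weights is typically small, so the result is usually close to a linear average.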
Evaluation (lm-evaluation-harness, German tier 1)
| Task | base (no FT) | SFT | DPO (pre-merge) | this (merge) |
|---|---|---|---|---|
| arc_de (25-shot) | 0.3618 | 0.3319 | 0.3285 | 0.3353 |
| hellaswag_de (10-shot) | 0.5037 | 0.4655 | 0.4667 | 0.4651 |
| m_mmlu_de (5-shot) | 0.2560 | 0.2488 | 0.2503 | 0.2488 |
The merge is the highest-mean working variant in the SFT/DPO/merge family: +0.25 pp over pure SFT and +0.11 pp over pure DPO. Largest individual gains: on arc_de the merge scores +0.34 pp above SFT and +0.68 pp above the DPO checkpoint, and belebele_deu_Latn adds +0.89 pp over SFT. Per-task deltas are within stderr (~±1.5 pp), so none is significant on its own, but the direction is consistent across tasks and the result agrees with the LFM2 paper's claim that same-model merging recovers task-specific knowledge that preference tuning erodes.
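The per-task deltas for the rows listed in the table can be recomputed with a few lines of Python. This sketch uses only the three rows shown above; the reported tier-1 mean also covers tasks that are not in the table (belebele_deu_Latn, for example), so a mean over these rows is not expected to match it. The variable and function names are illustrative.

```python
# Accuracy values copied from the table rows listed in this card.
scores = {
    "arc_de":       {"sft": 0.3319, "dpo": 0.3285, "merge": 0.3353},
    "hellaswag_de": {"sft": 0.4655, "dpo": 0.4667, "merge": 0.4651},
    "m_mmlu_de":    {"sft": 0.2488, "dpo": 0.2503, "merge": 0.2488},
}


def delta_pp(a, b):
    """Difference a - b in percentage points, rounded to two decimals."""
    return round((a - b) * 100, 2)


# Merge score relative to each parent checkpoint, per task.
deltas = {
    task: {
        "vs_sft": delta_pp(row["merge"], row["sft"]),
        "vs_dpo": delta_pp(row["merge"], row["dpo"]),
    }
    for task, row in scores.items()
}

for task, d in deltas.items():
    print(f"{task}: {d['vs_sft']:+.2f} pp vs SFT, {d['vs_dpo']:+.2f} pp vs DPO")
```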
No public 1B-class German chat model published Q4 2025 – Q1 2026 has been found to meaningfully exceed the Boldt/Boldt-DC-1B base on these tier-1 averages without teacher-model distillation. This release does not close that gap, but it is the highest tier-1 mean among models in this family that also generate coherent German.
Mergekit config
slices:
  - sources:
      - model: mayflowergmbh/boldt-dc-1b-german-it-16k
        layer_range: [0, 16]
      - model: <DPO-tuned checkpoint of the SFT>
        layer_range: [0, 16]
merge_method: slerp
base_model: mayflowergmbh/boldt-dc-1b-german-it-16k
parameters:
  t: 0.5
dtype: bfloat16
tokenizer_source: union
Known limitations
Inherits all limitations of the SFT base: arithmetic is unreliable (1.25 B ceiling), factual recall shows typical small-model errors, there is no tool-use or function-calling training, and long-context use beyond ~8 K is untested.
License
Apache-2.0 (inherits from the base model).