mayflowergmbh

boldt-dc-1b-german-it-16k-dpo


License: apache-2.0

Usage

python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "mayflowergmbh/boldt-dc-1b-german-it-16k-dpo"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda")

messages = [{"role": "user", "content": "Erkläre kurz, was eine Funktion in Python ist."}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tok.decode(out[0, inputs.input_ids.shape[-1]:], skip_special_tokens=False))

The model's generation_config sets eos_token_id = [0, 32003], which covers both <|endoftext|> and <|end|>, so generate() stops on either token.
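
As a rough illustration of what a multi-valued eos_token_id means for decoding, the sketch below cuts a sequence of generated IDs at the first stop token. The helper name truncate_at_eos and the sample IDs are made up for this example; only the stop IDs 0 and 32003 come from the generation config. generate() already stops on any listed ID, so a helper like this would only matter when post-processing raw IDs by hand.

```python
# Stop IDs taken from this model's generation_config.eos_token_id.
EOS_IDS = frozenset({0, 32003})


def truncate_at_eos(token_ids, eos_ids=EOS_IDS):
    """Return token_ids up to, but not including, the first stop token."""
    for i, tok_id in enumerate(token_ids):
        if tok_id in eos_ids:
            return list(token_ids[:i])
    # No stop token found: keep the whole sequence.
    return list(token_ids)


# Hypothetical generated IDs: the reply ends at 32003, followed by padding.
generated = [812, 4471, 95, 32003, 0, 0]
print(truncate_at_eos(generated))
```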

Recipe

1. SFT model (mayflowergmbh/boldt-dc-1b-german-it-16k): plain transformers+peft SFT of Boldt/Boldt-DC-1B, 7000 steps at 16K context.
2. DPO checkpoint: TRL DPOTrainer, loss_type="sigmoid", β=0.3, rpo_alpha=0.5. The rpo_alpha term is an NLL anchor on the chosen response, which prevents the "response suppression" failure mode documented in "3D-Properties of DPO" (Yan et al. 2024, arXiv:2406.07327). LR 5e-7, 800 steps, LoRA r=32 on QKV/MLP. Dataset: mayflowergmbh/boldt-dc-1b-orpo-onpolicy-de, length-filtered to |chosen|/|rejected| ≤ 3 (54k → 22k pairs).
3. SLERP merge of the SFT and DPO checkpoints via mergekit at t=0.5, dtype=bfloat16, tokenizer_source: union. Takes ~30 seconds on a single A6000.
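
The DPO step can be sketched in plain Python for a single preference pair. This is a simplified illustration, not the TRL implementation. The function names are hypothetical, the filter applies the ratio exactly as written above (chosen length divided by rejected length), and lengths are assumed to be token counts. The anchor is added as rpo_alpha times the length-normalized NLL of the chosen response, a weighting that may differ from the TRL version used for this model.

```python
import math


def keep_pair(chosen_len, rejected_len, max_ratio=3.0):
    """Length filter: keep pairs with |chosen| / |rejected| <= max_ratio."""
    return rejected_len > 0 and chosen_len / rejected_len <= max_ratio


def dpo_loss_with_nll_anchor(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                             chosen_len, beta=0.3, rpo_alpha=0.5):
    """Sigmoid DPO loss plus an NLL anchor on the chosen response.

    The first four arguments are summed log-probabilities of the chosen and
    rejected responses under the policy (pi_*) and the frozen reference (ref_*).
    """
    # Implicit reward margin between chosen and rejected, scaled by beta.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    preference_loss = math.log1p(math.exp(-margin))
    # Anchor: average negative log-likelihood per chosen token (assumed form).
    nll_anchor = -pi_chosen / chosen_len
    return preference_loss + rpo_alpha * nll_anchor


# Policy equal to the reference: the preference term is log(2).
print(dpo_loss_with_nll_anchor(-20.0, -30.0, -20.0, -30.0, chosen_len=10))
```

Without the anchor, the preference term can be lowered by pushing both log-probabilities down as long as the rejected one falls faster. The NLL term penalizes that by rewarding a high likelihood for the chosen response directly.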

Why merge: the SFT model preserves more reasoning capacity (commonsense benchmarks regress less than under pure DPO), while the DPO model has slightly better chat-style behaviour. SLERP at t=0.5 recovers both. The LFM2 paper documents the same observation for full-model merging at the 1.2B scale.
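
For intuition, spherical linear interpolation between two weight vectors can be sketched in a few lines of plain Python. This illustrates the formula only and is not mergekit's code: slerp here is a hypothetical helper on flat lists of floats, and the fallback to linear interpolation for near-parallel vectors is an assumption of this sketch.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherically interpolate between vectors v0 and v1 at fraction t."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    # Angle between the two directions, from the normalized dot product.
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    dot = max(-1.0, min(1.0, dot))
    theta = math.acos(dot)
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


# Two orthogonal unit vectors: the midpoint stays on the unit circle.
mid = slerp(0.5, [1.0, 0.0], [0.0, 1.0])
print(mid)
```

A plain average of the same two vectors would give [0.5, 0.5], with a norm of about 0.71. SLERP keeps the interpolated vector at the norm of its endpoints when they have equal norm, which is the usual argument for preferring it over linear averaging when merging two checkpoints.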

Evaluation (lm-evaluation-harness, German tier 1)

Task	base (no FT)	SFT	DPO (pre-merge)	this (merge)
arc_de (25-shot)	0.3618	0.3319	0.3285	0.3353
hellaswag_de (10-shot)	0.5037	0.4655	0.4667	0.4651
m_mmlu_de (5-shot)	0.2560	0.2488	0.2503	0.2488

The merge has the highest mean among the working variants in the SFT/DPO/merge family: +0.25 pp over pure SFT and +0.11 pp over pure DPO. Largest individual gains: arc_de is +0.68 pp above DPO, which more than recovers the 0.34 pp that DPO lost relative to SFT, and belebele_deu_Latn (not listed in the table above) adds +0.89 pp over SFT. Per-task deltas are within stderr (~±1.5 pp), and on the hellaswag_de and m_mmlu_de rows shown the merge is at or slightly below both parents, so the evidence is weak. The improvement in the mean is nonetheless consistent with the LFM2 paper's claim that same-model merging recovers task-specific knowledge that preference tuning erodes.
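
The per-task deltas can be recomputed directly from the three rows listed in the table. The snippet below is only arithmetic on those published numbers. belebele_deu_Latn is omitted because its scores are not listed above, so any mean taken over these three rows need not match the tier-1 means quoted in the text.

```python
# Scores copied from the evaluation table: (SFT, DPO pre-merge, merge).
scores = {
    "arc_de": (0.3319, 0.3285, 0.3353),
    "hellaswag_de": (0.4655, 0.4667, 0.4651),
    "m_mmlu_de": (0.2488, 0.2503, 0.2488),
}


def delta_pp(a, b):
    """Difference a - b in percentage points, rounded to two decimals."""
    return round((a - b) * 100, 2)


# Merge score minus each parent's score, per task.
deltas = {
    task: {"vs_sft": delta_pp(merge, sft), "vs_dpo": delta_pp(merge, dpo)}
    for task, (sft, dpo, merge) in scores.items()
}

for task, d in deltas.items():
    print(f"{task}: {d['vs_sft']:+.2f} pp vs SFT, {d['vs_dpo']:+.2f} pp vs DPO")
```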

No public 1B-class German chat model published Q4 2025 – Q1 2026 has been found to meaningfully exceed the Boldt/Boldt-DC-1B base on these tier-1 averages without teacher-model distillation. This release does not close that gap, but it is the highest tier-1 mean among models in this family that also generate coherent German.

Mergekit config

yaml
slices:
  - sources:
      - model: mayflowergmbh/boldt-dc-1b-german-it-16k        # SFT
        layer_range: [0, 16]
      - model: <DPO-tuned checkpoint of the SFT>
        layer_range: [0, 16]
merge_method: slerp
base_model: mayflowergmbh/boldt-dc-1b-german-it-16k
parameters:
  t: 0.5
dtype: bfloat16
tokenizer_source: union

Known limitations

Inherits all of the SFT base's limits: arithmetic is unreliable (1.25 B ceiling), factual recall shows typical small-model errors, there was no tool-use or function-calling training, and long-context use beyond ~8 K is untested.

License

Apache-2.0 (inherits from the base model).
