Ditto-8B API & Inference Endpoint

Method

Ditto-8B is trained with DITTO, a reinforcement learning method that uses verbal feedback as the learning signal. After each output, the model receives descriptive feedback and produces an improved version; the original output and the improved version are jointly optimized with GRPO (Group Relative Policy Optimization). This distills the verbal guidance into the policy, so no feedback is needed at inference time.
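For readers unfamiliar with GRPO, the sketch below shows its group-relative advantage, where each sampled output is scored against the mean and standard deviation of its own group. It illustrates the generic GRPO normalization only; how DITTO groups original and improved outputs, and how its rewards are computed, is not described here, so the reward values and the function name `group_relative_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize a group's rewards to zero mean and (roughly) unit std.

    This is the standard GRPO-style advantage: (r - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scalar rewards for one prompt (assumed values, not from the paper).
original_rewards = [0.2, 0.4, 0.1, 0.3]   # first-attempt outputs
improved_rewards = [0.6, 0.7, 0.5, 0.8]   # outputs revised after feedback

# Assumption: each set is normalized within its own group.
original_adv = group_relative_advantages(original_rewards)
improved_adv = group_relative_advantages(improved_rewards)

for r, a in zip(original_rewards, original_adv):
    print(f"original reward={r:.1f} advantage={a:+.2f}")
for r, a in zip(improved_rewards, improved_adv):
    print(f"improved reward={r:.1f} advantage={a:+.2f}")
```

Outputs above their group mean get a positive advantage and are reinforced; those below it are pushed down. The `eps` term only guards against division by zero when all rewards in a group are equal.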

Results

Primary metric for each benchmark (higher is better).

Dim	Benchmark	GPT 5.5	Gemini 3.1 Pro	Claude Opus 4.7	Qwen 3.6 Plus	Others*	Qwen3 8B Inst	Ditto-8B
CONV	UserLLM	65.3	67.7	57.6	72.1	44.6	46.0	91.5
CONV	MirrorBench	56.7	48.3	63.7	48.0	45.4	54.0	73.4
CONV	Humanual-Chat	28.2	21.0	22.6	22.2	25.8	24.7	21.0
CONV	SimArena-Doc	83.4	83.0	83.5	82.4	83.5	83.6	84.4
SS	Sotopia-Hard	31.9	27.8	32.4	28.3	31.7	27.7	45.8
COG	Fantom	93.0	93.0	80.0	89.0	70.0	23.0	92.0
COG	Hitom	82.0	86.0	93.0	73.0	56.0	62.0	79.0
COG	Paratomi	99.0	97.0	90.0	94.0	75.0	67.0	95.0
COG	Social-R1	69.0	79.0	67.0	67.0	47.0	54.0	50.0
ROLE	Coser	66.2	62.1	66.5	55.9	30.3	43.5	64.4
ROLE	Lifechoices	91.0	84.0	92.0	79.0	67.0	70.0	70.0
ROLE	Twinvoice	74.0	86.0	83.0	71.0	40.0	42.0	71.0
ROLE	BehaviorChain	95.0	92.0	96.0	85.0	36.0	41.0	44.0
ROLE	SimArena-Math	68.5	71.5	68.7	70.9	70.5	68.9	69.6
ROLE	Mistakes	72.0	73.0	74.0	67.0	56.0	27.0	36.0
ROLE	Humanual-Email	50.1	46.9	50.4	47.9	42.8	43.7	40.8
ROLE	Humanual-News	40.2	42.3	41.3	41.8	33.1	32.5	27.5
ROLE	Humanual-Politics	42.0	32.5	43.5	31.6	34.2	33.2	29.7
EVAL	AlignX	71.2	73.4	71.6	69.8	66.8	68.6	67.4
EVAL	Humanllm	45.7	46.9	44.2	42.7	35.2	34.1	33.1
EVAL	Socsci210	77.2	78.0	77.2	74.5	75.2	73.6	72.5
EVAL	Humanual-Book	57.6	62.4	61.4	58.4	50.2	53.6	53.4
EVAL	Humanual-Opinion	39.8	36.0	46.2	34.2	37.4	37.2	30.3

* Others: best result among other specialized human-simulation models (HumanLM-8B, Sotopia-RL-7B, UserLM-8B, Coser-8B).

Note. The released Ditto-8B is a single generalist distilled from a set of task-specific DITTO experts via rejection sampling on the training set.
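As a reading aid for the table, the following sketch compares the Ditto-8B column with the Qwen3 8B Inst column, benchmark by benchmark. The scores are copied from the table; the helper `compare` and the win/tie/loss framing are illustrative additions, not part of the reported evaluation.

```python
# (benchmark, Qwen3 8B Inst, Ditto-8B), copied from the results table.
SCORES = [
    ("UserLLM", 46.0, 91.5),
    ("MirrorBench", 54.0, 73.4),
    ("Humanual-Chat", 24.7, 21.0),
    ("SimArena-Doc", 83.6, 84.4),
    ("Sotopia-Hard", 27.7, 45.8),
    ("Fantom", 23.0, 92.0),
    ("Hitom", 62.0, 79.0),
    ("Paratomi", 67.0, 95.0),
    ("Social-R1", 54.0, 50.0),
    ("Coser", 43.5, 64.4),
    ("Lifechoices", 70.0, 70.0),
    ("Twinvoice", 42.0, 71.0),
    ("BehaviorChain", 41.0, 44.0),
    ("SimArena-Math", 68.9, 69.6),
    ("Mistakes", 27.0, 36.0),
    ("Humanual-Email", 43.7, 40.8),
    ("Humanual-News", 32.5, 27.5),
    ("Humanual-Politics", 33.2, 29.7),
    ("AlignX", 68.6, 67.4),
    ("Humanllm", 34.1, 33.1),
    ("Socsci210", 73.6, 72.5),
    ("Humanual-Book", 53.6, 53.4),
    ("Humanual-Opinion", 37.2, 30.3),
]


def compare(rows, tol=1e-9):
    """Split benchmarks into those where Ditto-8B is higher, equal, or lower."""
    wins = [name for name, base, ditto in rows if ditto - base > tol]
    ties = [name for name, base, ditto in rows if abs(ditto - base) <= tol]
    losses = [name for name, base, ditto in rows if base - ditto > tol]
    return wins, ties, losses


wins, ties, losses = compare(SCORES)
print(f"higher: {len(wins)}, equal: {len(ties)}, lower: {len(losses)}")

# Benchmark with the largest absolute gain over Qwen3 8B Inst.
best = max(SCORES, key=lambda row: row[2] - row[1])
print(f"largest gain: {best[0]} ({best[2] - best[1]:+.1f})")
```

On these numbers Ditto-8B scores higher on 12 of the 23 benchmarks, ties on 1, and scores lower on 10, with the largest gain on Fantom.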

Usage

python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub.
model_name = "sunweiwei/Ditto-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Render the conversation with the model's chat template, then tokenize it.
messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate, then decode only the new tokens (everything after the prompt).
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Citation

bibtex
@article{sun2026ditto,
  title         = {Reinforcing Human Behavior Simulation via Verbal Feedback},
  author        = {Sun, Weiwei and Zhou, Xuhui and Liu, Jiarui and Du, Weihua and Sun, Haojia and Xie, Yiqing and Ma, Qianou and Chen, Sihao and Wan, Mengting and Yang, Longqi and Zhou, Pei and Wu, Sherry and Welleck, Sean and Neubig, Graham and Yang, Yiming and Sap, Maarten},
  year          = {2026},
  eprint        = {2605.20506},
  archivePrefix = {arXiv},
  url           = {http://arxiv.org/abs/2605.20506}
}
