DJLougen

Ornstein-3.5-9B-V2-Coder-experimental

License: apache-2.0

Benchmarks (GBS STANDARD-200)

Benchmark	Qwen3.5-9B-Base	Ornstein V1.5	Ornstein V2	V2-Coder (exp.)
Overall	0.725	0.850	0.825	0.785
Reasoning	0.68	0.90	1.00	1.00
GPQA	0.36	0.80	1.00	1.00
Coding	0.77	0.80	0.65	0.57

(The V1.5 column is the published reference; the V2 and V2-Coder columns are fresh same-seed runs. Coding = livecodebench + hlce.)
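A quick way to read the table: in the numbers above, each Overall score happens to equal the unweighted mean of the Reasoning and Coding rows. That is an observation from the published figures, not a documented definition of GBS STANDARD-200 scoring. The sketch below checks it and computes the coding change between columns; the helper names are illustrative.

```python
# Scores copied from the benchmark table above.
scores = {
    "Qwen3.5-9B-Base": {"overall": 0.725, "reasoning": 0.68, "coding": 0.77},
    "Ornstein V1.5": {"overall": 0.850, "reasoning": 0.90, "coding": 0.80},
    "Ornstein V2": {"overall": 0.825, "reasoning": 1.00, "coding": 0.65},
    "V2-Coder (exp.)": {"overall": 0.785, "reasoning": 1.00, "coding": 0.57},
}


def implied_overall(row):
    """Unweighted mean of Reasoning and Coding (assumed, not documented)."""
    return (row["reasoning"] + row["coding"]) / 2


def coding_delta(model, baseline):
    """Change in the Coding score of `model` relative to `baseline`."""
    return round(scores[model]["coding"] - scores[baseline]["coding"], 2)


for name, row in scores.items():
    print(f"{name}: reported {row['overall']:.3f}, implied {implied_overall(row):.3f}")

print("V2-Coder vs V2 coding delta:", coding_delta("V2-Coder (exp.)", "Ornstein V2"))
print("V2-Coder vs base coding delta:", coding_delta("V2-Coder (exp.)", "Qwen3.5-9B-Base"))
```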

What was tried

Training used ~1,950 distinct verified problems (MBPP asserts plus code_contests stdin/stdout), a graded partial-credit reward, a KL anchor (beta 0.02), num_generations 8, and 120 GRPO steps.
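As a rough illustration of what a graded partial-credit reward over stdin/stdout problems can look like, here is a minimal sketch that scores a candidate program by the fraction of test cases it passes. It is not the training code used for this model: the function name, the timeout, and the output-matching rule (whitespace-stripped equality) are all assumptions. It also runs the candidate without any sandbox, which is unsafe for real model-generated code.

```python
import subprocess
import sys


def partial_credit_reward(program, cases, timeout_s=5.0):
    """Return the fraction of (stdin, expected_stdout) cases the program passes.

    Hypothetical helper: a graded reward in [0, 1] instead of all-or-nothing.
    """
    if not cases:
        return 0.0
    passed = 0
    for stdin_text, expected in cases:
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            continue  # a timeout counts as a failed case
        if result.returncode == 0 and result.stdout.strip() == expected.strip():
            passed += 1
    return passed / len(cases)


# Toy problem: "print twice the input". The candidate squares instead,
# so it only matches on the inputs where n * n == 2 * n.
candidate = "print(int(input()) ** 2)"
cases = [("2\n", "4"), ("0\n", "0"), ("3\n", "6"), ("1\n", "2")]
print(partial_credit_reward(candidate, cases))
```

A binary reward would give this candidate zero; the graded version credits the two cases it gets right.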
The fixes addressed the prior collapse mechanism (narrow data, no KL anchor, near-binary reward), yet coding still regressed. A likely cause is that the execution-reward signal pulls the policy toward a narrow solution style that hurts held-out LiveCodeBench performance. This remains an open research question; a terminal-bench / SWE-bench agentic setup is the planned next direction.
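One way to see why a near-binary reward can starve GRPO of signal: in the commonly described formulation, each completion's advantage is its reward normalized against the other completions sampled for the same prompt. The sketch below assumes that formulation, with population standard deviation and a small epsilon chosen for illustration; it does not reproduce this model's training code, and the reward values are made up. When all 8 generations in a group receive the same reward, every advantage is zero and that prompt contributes no gradient.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative groups of 8 rewards (num_generations 8), not real training data.
binary_all_fail = [0.0] * 8
graded = [0.0, 0.25, 0.25, 0.5, 0.5, 0.5, 0.75, 1.0]

print(group_relative_advantages(binary_all_fail))
print([round(a, 2) for a in group_relative_advantages(graded)])
```

With the graded group, partially correct completions are ranked against each other, so the policy still gets a learning signal on problems it cannot yet fully solve.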

Retains V2's vision tower + MTP head (full multimodal weights).

Usage

python
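from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "DJLougen/Ornstein-3.5-9B-V2-Coder-experimental"
tok = AutoTokenizer.from_pretrained(model_id)
# Load bfloat16 weights; device_map="auto" requires the accelerate package.
# Older transformers releases take torch_dtype= instead of dtype=.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")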

Support This Work

I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s/B200s and a local DGX Spark. All training compute is self-funded. If my uploads have been useful, consider buying a PhD student a coffee.

Support on Ko-fi