HarleyCooper

Qwen3.6-35B-A3B-Dakota1890-GRPO

License: apache-2.0

Current Status

Training completed on May 27, 2026. This card now includes the audited final run findings and W&B-style result charts.

HF repo: HarleyCooper/Qwen3.6-35B-A3B-Dakota1890-GRPO
Base model: Qwen/Qwen3.6-35B-A3B
Training platform: Thinking Machines Tinker
Method: GRPO-style RL with a custom Dakota grammar verifier
Adapter type: LoRA, rank 32
W&B project: christian-cooper-us/dakota-rl-grammar
Completed full run: owf98569
Reward-channel pilot: d44bra91
Thinking Machines cost: $68.75
Tokens processed: 82.05 million
Final Tinker sampler path: tinker://1f23df9c-5d88-59d9-a7e8-dd4e169ea7d0:train:0/sampler_weights/final
Final Tinker state path: tinker://1f23df9c-5d88-59d9-a7e8-dd4e169ea7d0:train:0/weights/final
Inference adapter weights: adapter_model.safetensors
Adapter config: adapter_config.json

The reward-channel pilot completed before the full run and cost about $0.26. It verified that the repaired environment can emit nonzero pattern_raw and exact_match_raw channels locally and in W&B before scaling up.

Final Run Findings

Figure: Dakota1890 full run dashboard

The full run logged 199 metric rows, ending at training step 198. It cost $68.75 in Thinking Machines credits and processed 82.05 million tokens. The final audit found:

composite reward improved from 0.1664 to 0.2297;
character-overlap reward improved from 0.1424 to 0.4027;
affix reward stayed high and ended at 1.0000;
all-task pattern_raw was nonzero in 186 of 199 logged training rows;
identify_pattern pattern reward reached 0.90625 and was nonzero in 179 of 199 rows;
eval pattern_raw remained nonzero, ending at 0.0586;
exact-match reward stayed at 0.0 throughout the mixed-task run.

The key result is that the repaired pattern channel is live in a full paid Tinker run. Exact match remains a task-design and prompting problem for short answer-only completions, not a reward-plumbing failure.
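The headline numbers can be sanity-checked with a few lines of arithmetic. The sketch below only re-derives ratios from the figures quoted in this section; the function and variable names are illustrative and are not taken from the project's audit scripts.

```python
# Figures copied from the audit findings; names are illustrative only.
TOTAL_ROWS = 199


def relative_gain(start: float, end: float) -> float:
    """Change from start to end, expressed as a fraction of start."""
    return (end - start) / start


def nonzero_share(nonzero_rows: int, total_rows: int = TOTAL_ROWS) -> float:
    """Fraction of logged rows in which a reward channel was nonzero."""
    return nonzero_rows / total_rows


composite_gain = relative_gain(0.1664, 0.2297)
overlap_gain = relative_gain(0.1424, 0.4027)
all_task_pattern_share = nonzero_share(186)
identify_pattern_share = nonzero_share(179)
cost_per_million_tokens = 68.75 / 82.05  # USD per million tokens processed

print(f"composite reward relative gain:   {composite_gain:.1%}")
print(f"character-overlap relative gain:  {overlap_gain:.1%}")
print(f"all-task pattern_raw nonzero:     {all_task_pattern_share:.1%}")
print(f"identify_pattern pattern nonzero: {identify_pattern_share:.1%}")
print(f"cost per million tokens:          ${cost_per_million_tokens:.2f}")
```

Reading the gains as relative rather than absolute makes it easier to see that character overlap moved far more than the composite, which the flat exact-match channel holds down.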

The machine-readable summary and markdown findings are included in analysis/final_run_summary.json and analysis/FINAL_RUN_FINDINGS.md.

Figure: Composite reward progression

Figure: Pattern reward channel

Figure: Reward components

Source Lineage

Dakota1890 is built around historical Dakota language source material rather than generic web text. The local project contains the primary Riggs scan, 440 scanned page images, public visual artifacts used for documentation, extracted grammar rules, extracted dictionary vocabulary, generated RL tasks, and W&B/Tinker run logs.

Source and extraction inventory:

primary source PDF: grammardictionar00riggrich.pdf
JP2 source scans: 440 local page images under Dictionary/grammardictionar00riggrich_jp2
processed page images: 440 JPG conversions under data/processed_images
page-layout manifest: 440 rows, including 345 two-column pages and 84 single-column pages
grammar extraction: pages 1-92, with no missing pages in that intended range
dictionary extraction: pages 95-430, with no missing pages in that intended range
verified dictionary entries: 24,224
median dictionary entries per extracted page: 63
average extraction confidence: about 0.9245
first and last verified headwords: a through zig'zag
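The "no missing pages" claims in the inventory amount to a set difference between the intended page range and the pages actually extracted. A minimal sketch of that check follows; the helper name and the example page lists are hypothetical, and the gap at page 200 is invented purely to show what a failure would look like.

```python
def missing_pages(extracted_pages, first, last):
    """Return the sorted page numbers in [first, last] that were not extracted."""
    expected = set(range(first, last + 1))
    return sorted(expected - set(extracted_pages))


# Hypothetical inputs: a complete grammar range and a dictionary range with one gap.
grammar_pages = list(range(1, 93))
dictionary_pages = [page for page in range(95, 431) if page != 200]

grammar_gaps = missing_pages(grammar_pages, 1, 92)
dictionary_gaps = missing_pages(dictionary_pages, 95, 430)

print("grammar gaps:", grammar_gaps)
print("dictionary gaps:", dictionary_gaps)
```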

The current RL grammar dataset contains:

10,576 total packaged RL tasks
1,497 extracted grammar-rule records
684 grammar rules in the grammar extraction pass
350 interlinear texts in the grammar extraction pass
396 linguistic terms in the grammar extraction pass
33 special-character forms observed in the grammar extraction pass
1,497 rows with pattern-bearing verification metadata
514 rows with affix metadata
median reference answer length of about 4 words

Task family counts in the packaged dataset:

| Task family | Count |
| --- | --- |
| `word_translation` | 2,879 |
| `reverse_translation` | 2,137 |
| `morphology` | 1,934 |
| `identify_pattern` | 1,497 |
| `positive_negative_evidence` | 584 |
| `exception_trigger` | |

The dictionary extraction and synthetic Q&A generation work are adjacent dataset-building tracks. This Tinker RL run does not require OpenAI SFT data or a synthetic dictionary Q&A dump to start; it trains against the existing grammar-task environment and its reward function. The dictionary artifacts remain important for future evaluation, documentation, and broader Dakota lexical coverage.

Visual Project Artifacts

These local project images are included to preserve the visual record of the build alongside the model card:

Grammar source page
Dictionary source page
Training screenshot

What Changed For This Run

Earlier public runs showed metrics/pattern_reward = 0.0 and metrics/exact_match_reward = 0.0 throughout the public W&B surface. The pattern channel was a real plumbing bug: the packaged dataset stores metadata under entry["info"], while the environment had been reading top-level fields such as entry["verification_pattern"], entry["task_type"], and entry["difficulty"].

That has been fixed. The environment now preserves:

task type
difficulty
rule id
hints
verification pattern, including info.pattern
special-character and affix metadata
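The shape of the fix can be sketched as a small metadata reader that looks under entry["info"] first and only then falls back to top-level fields. This is an illustration, not the project's actual environment code: the function name is hypothetical, the key spellings "rule_id", "hints", and "pattern" are assumptions based on the list above, and the example entry is invented.

```python
from typing import Any

# "task_type", "difficulty", and "verification_pattern" are named in the text;
# the remaining key spellings are assumptions.
METADATA_KEYS = (
    "task_type",
    "difficulty",
    "rule_id",
    "hints",
    "verification_pattern",
    "pattern",
)


def read_task_metadata(entry: dict[str, Any]) -> dict[str, Any]:
    """Collect verifier metadata, preferring entry["info"] over top-level fields."""
    info = entry.get("info") or {}
    metadata: dict[str, Any] = {}
    for key in METADATA_KEYS:
        if key in info:
            metadata[key] = info[key]
        elif key in entry:
            metadata[key] = entry[key]
    # Assumed behaviour: info["pattern"] stands in when no explicit
    # verification pattern is stored.
    if "verification_pattern" not in metadata and "pattern" in metadata:
        metadata["verification_pattern"] = metadata["pattern"]
    return metadata


# Invented example entry with metadata nested under "info".
entry = {
    "prompt": "example prompt",
    "info": {"task_type": "identify_pattern", "difficulty": "easy", "pattern": "example-pattern"},
}

old_lookup = entry.get("verification_pattern")  # top-level read, as in the bug
new_lookup = read_task_metadata(entry)["verification_pattern"]
print(old_lookup, new_lookup)
```

With the top-level read the verifier never sees a pattern, so the pattern channel can only ever score zero, which matches the symptom described above.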

The full run now logs the composite reward ledger under namespaces such as env/all/ledger/*, env/task/identify_pattern/ledger/*, and difficulty-specific ledgers. Early full-run metrics already showed nonzero pattern reward on identify_pattern tasks with composite_diff = 0.0, meaning the logged scalar reward matches the reconstructed component ledger.
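One way to picture the namespaced logging is a helper that fans a single reward ledger out into all-task, per-task, and per-difficulty metric keys. The sketch below is an assumption about layout, not the run's logging code: the function name is hypothetical, the env/difficulty/ prefix is a guess at how the difficulty-specific ledgers are named, and a real run would aggregate over a batch rather than log one rollout.

```python
def ledger_metrics(ledger: dict[str, float], task_type: str, difficulty: str) -> dict[str, float]:
    """Fan one reward ledger out into namespaced metric keys."""
    metrics: dict[str, float] = {}
    for name, value in ledger.items():
        metrics[f"env/all/ledger/{name}"] = value
        metrics[f"env/task/{task_type}/ledger/{name}"] = value
        # Assumed prefix for the difficulty-specific ledgers.
        metrics[f"env/difficulty/{difficulty}/ledger/{name}"] = value
    return metrics


# Invented ledger for a single identify_pattern rollout.
ledger = {"pattern_raw": 1.0, "composite_diff": 0.0}
metrics = ledger_metrics(ledger, task_type="identify_pattern", difficulty="easy")

for key in sorted(metrics):
    print(key, metrics[key])
```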

Reward Function

The Dakota grammar verifier uses a composite reward:

| Component | Weight | Purpose |
| --- | --- | --- |
| exact match | 40% | Rewards short answers that exactly match rigid references |
| character overlap | 20% | Rewards lexical and orthographic overlap with the reference |
| pattern match | 15% | Rewards structural matches for rule-bearing tasks |
| affix accuracy | 10% | Rewards required Dakota affix behavior |
| length control | 15% | Penalizes verbose completions when the gold answer is short |

Difficulty multipliers are applied after the component sum. The run logs raw component values, normalized values, weights, weighted contributions, reconstructed composites, final reward scalar, and composite_diff for auditability.
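The weighted sum and the composite_diff audit can be sketched as follows. The weights come from the table above; everything else is an assumption made for illustration: the component key names, component scores in the range 0 to 1, the difficulty multiplier values, and the definition of composite_diff as the absolute gap between the logged scalar and the reward rebuilt from components.

```python
# Weights from the reward table; the key names are illustrative.
WEIGHTS = {
    "exact_match": 0.40,
    "char_overlap": 0.20,
    "pattern": 0.15,
    "affix": 0.10,
    "length": 0.15,
}

# Hypothetical multipliers; the card does not state the real values.
DIFFICULTY_MULTIPLIER = {"easy": 1.0, "hard": 1.5}


def composite_reward(components: dict[str, float], difficulty: str) -> dict:
    """Build a ledger: weighted contributions, their sum, and the final reward."""
    contributions = {name: WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS}
    composite = sum(contributions.values())
    # The multiplier is applied after the component sum, as described above.
    reward = composite * DIFFICULTY_MULTIPLIER[difficulty]
    return {"contributions": contributions, "composite": composite, "reward": reward}


def composite_diff(logged_reward: float, components: dict[str, float], difficulty: str) -> float:
    """Gap between a logged reward and the reward rebuilt from its components."""
    return abs(logged_reward - composite_reward(components, difficulty)["reward"])


# Invented component scores for one rollout.
components = {"exact_match": 0.0, "char_overlap": 0.5, "pattern": 1.0, "affix": 1.0, "length": 1.0}
ledger = composite_reward(components, "easy")
diff = composite_diff(ledger["reward"], components, "easy")
print(ledger["composite"], ledger["reward"], diff)
```

Because exact match carries 40% of the weight, a rollout that scores perfectly on pattern, affix, and length but misses the exact string still cannot exceed 0.60 before the multiplier.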

Exact match is intentionally strict. In the older public 0.6B run, completions were often much longer than the reference answers, while the packaged dataset has a median reference answer length of about four words. That makes exact match a behavioral and prompting problem, not evidence that the exact-match function is dead. The new run caps generation at 128 tokens and logs enough ledger detail to diagnose whether exact-match-sensitive task families are improving.
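A small sketch shows why a verbose completion scores zero on a strict exact-match channel even when it contains the right answer. The normalization (Unicode NFC plus whitespace collapsing) and the length score formula are assumptions chosen for illustration, not the verifier's actual rules, and the strings are English placeholders rather than Dakota data.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace (assumed normalization)."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def exact_match(completion: str, reference: str) -> float:
    """1.0 only when the whole completion equals the reference after normalization."""
    return 1.0 if normalize(completion) == normalize(reference) else 0.0


def length_score(completion: str, reference: str) -> float:
    """1.0 when the completion is no longer than the reference, lower as it grows."""
    reference_words = max(len(normalize(reference).split()), 1)
    completion_words = max(len(normalize(completion).split()), 1)
    return min(1.0, reference_words / completion_words)


# Placeholder strings, not real task data.
reference = "placeholder gold answer"
concise = "  placeholder gold answer "
verbose = "The answer is placeholder gold answer"

print(exact_match(concise, reference), length_score(concise, reference))
print(exact_match(verbose, reference), length_score(verbose, reference))
```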

Training Configuration

The completed full run was launched with:

| Setting | Value |
| --- | --- |
| model | `Qwen/Qwen3.6-35B-A3B` |
| batches planned | 199 |
| batch size | 48 |
| group size | 16 |
| max tokens | 128 |
| temperature | 0.5 |
| learning rate | `4e-5` |
| LoRA rank | 32 |
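The settings above imply a rough rollout budget. The sketch below captures them in a dataclass and derives a few totals, assuming that batch size counts prompts per batch and group size counts sampled completions per prompt. That reading is an assumption, and the class and property names are hypothetical rather than part of any Tinker API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Settings copied from the configuration table; field names are illustrative."""

    model: str = "Qwen/Qwen3.6-35B-A3B"
    batches: int = 199
    batch_size: int = 48
    group_size: int = 16
    max_tokens: int = 128
    temperature: float = 0.5
    learning_rate: float = 4e-5
    lora_rank: int = 32

    @property
    def rollouts_per_batch(self) -> int:
        # Assumes batch_size is prompts per batch and group_size is samples per prompt.
        return self.batch_size * self.group_size

    @property
    def total_rollouts(self) -> int:
        return self.rollouts_per_batch * self.batches

    @property
    def max_sampled_tokens(self) -> int:
        # Upper bound on generated tokens only; prompt tokens are not counted here.
        return self.total_rollouts * self.max_tokens


config = RunConfig()
print(config.rollouts_per_batch, config.total_rollouts, config.max_sampled_tokens)
```

Under these assumptions the generated-token ceiling sits well below the 82.05 million tokens reported for the run, so prompt tokens and any other accounted passes would make up the remainder.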

The local runtime gate passed before launch with current Tinker, Tinker cookbook, W&B, Gemini, tokenizer, and reward-channel smoke checks.

Intended Use

This adapter is intended for research and tool-building around historical Dakota grammar tasks:

grammar-rule drills
short translation and reverse-translation tasks
morphology and affix experiments
verifier-driven RL experiments for low-resource language work
reproducible study of reward components for historical grammar sources

It is not intended as a standalone Dakota language authority, a substitute for community language expertise, or a production translation system.

Limitations And Ethical Notes

The source material is a historical grammar and dictionary published in 1890. It reflects the terminology, analysis, orthography, and colonial-era framing of its time. Outputs from this model can inherit mistakes, omissions, and outdated descriptions from the source extraction process and from the base model.

Dakota language work should be reviewed with appropriate community and linguistic expertise. This repository should be treated as an experimental technical artifact: useful for transparent research, not authoritative for teaching, cultural interpretation, or official translation.

Usage

The Tinker final sampler checkpoint is available now for direct Tinker sampling:

```text
tinker://1f23df9c-5d88-59d9-a7e8-dd4e169ea7d0:train:0/sampler_weights/final
```

The Hugging Face PEFT adapter for this run lives in this repository as adapter_model.safetensors and adapter_config.json. Use it against the base model Qwen/Qwen3.6-35B-A3B with a standard PEFT loading path:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "Qwen/Qwen3.6-35B-A3B"
adapter_name = "HarleyCooper/Qwen3.6-35B-A3B-Dakota1890-GRPO"

# Load the base model, then attach the LoRA adapter from this repository.
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_name)
model.eval()

messages = [
    {"role": "system", "content": "Answer Dakota grammar tasks concisely. Return only the answer."},
    {"role": "user", "content": "Translate 'my elder brother' to Dakota. Return only the answer."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding with a short cap, in line with the short reference answers.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

Citation

Primary source:

Riggs, Stephen Return. 1890. A Dakota-English Dictionary. Contributions to North American Ethnology, Volume VII. Washington: Government Printing Office.

Training and experiment tracking:

Thinking Machines Tinker for the RL training run
W&B for experiment tracking and reward-ledger audit trails
Dakota1890 repository artifacts for extraction, task generation, and verifier code
