Current Status
Training completed on May 27, 2026. This card now includes the audited final run findings and W&B-style result charts.
- HF repo: HarleyCooper/Qwen3.6-35B-A3B-Dakota1890-GRPO
- Base model: Qwen/Qwen3.6-35B-A3B
- Training platform: Thinking Machines Tinker
- Method: GRPO-style RL with a custom Dakota grammar verifier
- Adapter type: LoRA, rank 32
- W&B project: christian-cooper-us/dakota-rl-grammar
- Completed full run: owf98569
- Reward-channel pilot: d44bra91
- Thinking Machines cost: $68.75
- Tokens processed: 82.05 million
- Final Tinker sampler path: tinker://1f23df9c-5d88-59d9-a7e8-dd4e169ea7d0:train:0/sampler_weights/final
- Final Tinker state path: tinker://1f23df9c-5d88-59d9-a7e8-dd4e169ea7d0:train:0/weights/final
- Inference adapter weights: adapter_model.safetensors
- Adapter config: adapter_config.json
The reward-channel pilot completed before the full run and cost about $0.26. It verified that the repaired environment can emit nonzero pattern_raw and exact_match_raw channels locally and in W&B before scaling up.
Final Run Findings

The full run completed 199 metric rows, ending at training step 198. It cost $68.75 in Thinking Machines credits and processed 82.05 million tokens. The final audit found:
- composite reward improved from 0.1664 to 0.2297;
- character-overlap reward improved from 0.1424 to 0.4027;
- affix reward stayed high and ended at 1.0000;
- all-task pattern_raw was nonzero in 186 of 199 logged training rows;
- identify_pattern pattern reward reached 0.90625 and was nonzero in 179 of 199 rows;
- eval pattern_raw remained nonzero, ending at 0.0586;
- exact-match reward stayed at zero throughout the mixed-task run.
The key result is that the repaired pattern channel is live in a full paid Tinker run. Exact match remains a task-design and prompting problem for short answer-only completions, not a reward-plumbing failure.
The machine-readable summary and markdown findings are included in analysis/final_run_summary.json and analysis/FINAL_RUN_FINDINGS.md.
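The headline figures above can be turned into relative gains, coverage fractions, and a cost rate with simple arithmetic. The sketch below is illustrative only: the numbers are copied by hand from this card rather than read from analysis/final_run_summary.json, and the variable and function names are hypothetical.

```python
# Figures copied by hand from the findings above; nothing is read from W&B
# or from analysis/final_run_summary.json.
TOTAL_ROWS = 199
COST_USD = 68.75
TOKENS_MILLIONS = 82.05


def relative_gain(start: float, end: float) -> float:
    """Fractional change from the first logged value to the last."""
    return (end - start) / start


composite_gain = relative_gain(0.1664, 0.2297)
overlap_gain = relative_gain(0.1424, 0.4027)
all_task_pattern_coverage = 186 / TOTAL_ROWS
identify_pattern_coverage = 179 / TOTAL_ROWS
cost_per_million_tokens = COST_USD / TOKENS_MILLIONS

print(f"composite reward gain:        {composite_gain:.1%}")
print(f"character-overlap gain:       {overlap_gain:.1%}")
print(f"all-task pattern coverage:    {all_task_pattern_coverage:.1%}")
print(f"identify_pattern coverage:    {identify_pattern_coverage:.1%}")
print(f"cost per million tokens:      ${cost_per_million_tokens:.2f}")
```

These are first-to-last comparisons, so they say nothing about variance between steps.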



Source Lineage
Dakota1890 is built around historical Dakota language source material rather than generic web text. The local project contains the primary Riggs scan, 440 scanned page images, public visual artifacts used for documentation, extracted grammar rules, extracted dictionary vocabulary, generated RL tasks, and W&B/Tinker run logs.
Source and extraction inventory:
- primary source PDF: grammardictionar00riggrich.pdf
- JP2 source scans: 440 local page images under Dictionary/grammardictionar00riggrich_jp2
- processed page images: 440 JPG conversions under data/processed_images
- page-layout manifest: 440 rows, including 345 two-column pages and 84 single-column pages
- grammar extraction: pages 1-92, with no missing pages in that intended range
- dictionary extraction: pages 95-430, with no missing pages in that intended range
- verified dictionary entries: 24,224
- median dictionary entries per extracted page: 63
- average extraction confidence: about 0.9245
- first and last verified headwords: a through zig'zag
The current RL grammar dataset contains:
- 10,576 total packaged RL tasks
- 1,497 extracted grammar-rule records
- 684 grammar rules in the grammar extraction pass
- 350 interlinear texts in the grammar extraction pass
- 396 linguistic terms in the grammar extraction pass
- 33 special-character forms observed in the grammar extraction pass
- 1,497 rows with pattern-bearing verification metadata
- 514 rows with affix metadata
- median reference answer length of about 4 words
Task family counts in the packaged dataset:
| Task family | Count |
|---|---|
| word_translation | 2,879 |
| reverse_translation | 2,137 |
| morphology | 1,934 |
| identify_pattern | 1,497 |
| positive_negative_evidence | 584 |
| exception_trigger | |
The dictionary extraction and synthetic Q&A generation work are adjacent dataset-building tracks. This Tinker RL run does not require OpenAI SFT data or a synthetic dictionary Q&A dump to start; it trains against the existing grammar-task environment and its reward function. The dictionary artifacts remain important for future evaluation, documentation, and broader Dakota lexical coverage.
Visual Project Artifacts
These local project images are included to preserve the visual record of the build alongside the model card:
| Artifact | Preview |
|---|---|
| Grammar source page |  |
| Dictionary source page |  |
| Training screenshot |  |
What Changed For This Run
Earlier public runs showed metrics/pattern_reward = 0.0 and metrics/exact_match_reward = 0.0 throughout the public W&B surface. The pattern channel was a real plumbing bug: the packaged dataset stores metadata under entry["info"], while the environment had been reading top-level fields such as entry["verification_pattern"], entry["task_type"], and entry["difficulty"].
That has been fixed. The environment now preserves:
- task type
- difficulty
- rule id
- hints
- verification pattern, including info.pattern
- special-character and affix metadata
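The nested-versus-top-level mismatch described above can be sketched as a small metadata reader. This is a minimal illustration, not the repository's environment code: the function name is hypothetical, and the keys rule_id, hints, special_chars, and required_affixes are assumed names for the fields listed above. Only info, task_type, difficulty, verification_pattern, and info.pattern come from this card.

```python
from typing import Any

# Only task_type, difficulty, and verification_pattern are named in this card;
# the remaining keys are assumed names for the other preserved fields.
METADATA_KEYS = (
    "task_type",
    "difficulty",
    "rule_id",
    "hints",
    "verification_pattern",
    "special_chars",
    "required_affixes",
)


def read_task_metadata(entry: dict[str, Any]) -> dict[str, Any]:
    """Collect verifier metadata, preferring entry["info"] over top-level keys."""
    info = entry.get("info") or {}
    meta: dict[str, Any] = {}
    for key in METADATA_KEYS:
        if key in info:
            meta[key] = info[key]
        elif key in entry:
            # Fall back to the old top-level layout if a record still uses it.
            meta[key] = entry[key]
    # Accept info["pattern"] as the verification pattern when no explicit
    # verification_pattern field is present.
    if "verification_pattern" not in meta and "pattern" in info:
        meta["verification_pattern"] = info["pattern"]
    return meta


# Placeholder record; the values are not real dataset content.
example = {
    "prompt": "PLACEHOLDER_PROMPT",
    "info": {"task_type": "identify_pattern", "difficulty": "easy", "pattern": "PLACEHOLDER_PATTERN"},
}
print(read_task_metadata(example))
```

Reading only top-level keys on a record shaped like the example would find no pattern at all, which is consistent with a pattern reward stuck at 0.0.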
The full run now logs the composite reward ledger under namespaces such as env/all/ledger/*, env/task/identify_pattern/ledger/*, and difficulty-specific ledgers. Early full-run metrics already showed nonzero pattern reward on identify_pattern tasks with composite_diff = 0.0, meaning the logged scalar reward matches the reconstructed component ledger.
Reward Function
The Dakota grammar verifier uses a composite reward:
| Component | Weight | Purpose |
|---|---|---|
| exact match | 40% | Rewards short answers that exactly match rigid references |
| character overlap | 20% | Rewards lexical and orthographic overlap with the reference |
| pattern match | 15% | Rewards structural matches for rule-bearing tasks |
| affix accuracy | 10% | Rewards required Dakota affix behavior |
| length control | 15% | Penalizes verbose completions when the gold answer is short |
Difficulty multipliers are applied after the component sum. The run logs raw component values, normalized values, weights, weighted contributions, reconstructed composites, final reward scalar, and composite_diff for auditability.
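The weighted sum and the composite_diff audit can be sketched as follows. The weights are taken from the table above, but everything else is an assumption made for illustration: the component key names, component values lying in [0, 1], the multiplier values (this card does not list them), and the function names.

```python
# Weights from the reward table above. Key names are assumed for illustration.
WEIGHTS = {
    "exact_match": 0.40,
    "char_overlap": 0.20,
    "pattern": 0.15,
    "affix": 0.10,
    "length": 0.15,
}

# Hypothetical multipliers; the real values are not stated in this card.
DIFFICULTY_MULTIPLIER = {"easy": 1.0, "hard": 1.2}


def reconstruct_composite(components: dict[str, float], difficulty: str) -> float:
    """Weighted component sum, with the difficulty multiplier applied afterwards."""
    weighted = {name: WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS}
    return sum(weighted.values()) * DIFFICULTY_MULTIPLIER[difficulty]


def composite_diff(logged_reward: float, components: dict[str, float], difficulty: str) -> float:
    """Gap between the logged scalar reward and the ledger reconstruction."""
    return logged_reward - reconstruct_composite(components, difficulty)


# Made-up component values, chosen only to exercise the arithmetic.
ledger = {"exact_match": 0.0, "char_overlap": 0.5, "pattern": 1.0, "affix": 1.0, "length": 1.0}
print(reconstruct_composite(ledger, "easy"))
print(composite_diff(0.5, ledger, "easy"))
```

A composite_diff of 0.0 means the scalar passed to the optimizer is fully explained by the logged components.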
Exact match is intentionally strict. In the older public 0.6B run, completions were often much longer than the reference answers, while the packaged dataset has a median reference answer length of about four words. That makes exact match a behavioral and prompting problem, not evidence that the exact-match function is dead. The new run keeps max generation short and logs enough ledger detail to diagnose whether exact-match-sensitive task families are improving.
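To illustrate why verbose completions score zero under a strict comparison, here is a minimal sketch. It assumes that exact match compares whole strings after trimming whitespace and applying Unicode NFC normalization; the verifier's actual normalization is not described in this card, and the strings are placeholders rather than Dakota data.

```python
import unicodedata


def strict_exact_match(completion: str, reference: str) -> float:
    """Return 1.0 only when the trimmed, NFC-normalized strings are identical."""

    def canon(text: str) -> str:
        return unicodedata.normalize("NFC", text).strip()

    return 1.0 if canon(completion) == canon(reference) else 0.0


# Placeholder tokens standing in for a four-word reference answer.
reference = "w1 w2 w3 w4"
concise = "w1 w2 w3 w4"
verbose = "The answer is w1 w2 w3 w4, which follows the rule."

print(strict_exact_match(concise, reference))
print(strict_exact_match(verbose, reference))
print(len(verbose.split()) / len(reference.split()))
```

The verbose completion contains the reference verbatim and still fails, which is the behavior the paragraph above attributes to prompting rather than to the reward function.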
Training Configuration
The full run was launched with:
| Setting | Value |
|---|---|
| model | Qwen/Qwen3.6-35B-A3B |
| batches planned | 199 |
| batch size | 48 |
| group size | 16 |
| max tokens | 128 |
| temperature | 0.5 |
| learning rate | 4e-5 |
| LoRA rank | 32 |
The local runtime gate passed before launch with current Tinker, Tinker cookbook, W&B, Gemini, tokenizer, and reward-channel smoke checks.
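The group size setting matters because GRPO-style methods score each completion relative to the other completions sampled for the same prompt. The sketch below shows that general idea only, assuming mean-centering with optional standard-deviation scaling; the exact estimator used in this run is not specified in this card, and the function name is hypothetical. If batch size counts prompts (also an assumption), each batch would sample 48 × 16 = 768 completions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_std=True, eps=1e-6):
    """Center each reward on its group mean, optionally scaling by the group std."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered
    scale = pstdev(rewards) + eps  # eps avoids division by zero for flat groups
    return [c / scale for c in centered]


# Made-up rewards for one small group.
print(group_relative_advantages([0.0, 0.5, 1.0, 0.5], normalize_std=False))

# A group of 16 identical rewards yields all-zero advantages.
print(group_relative_advantages([0.0] * 16))
```

Under this kind of estimator, a reward component that is identical across every completion in a group contributes no learning signal for that group.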
Intended Use
This adapter is intended for research and tool-building around historical Dakota grammar tasks:
- grammar-rule drills
- short translation and reverse-translation tasks
- morphology and affix experiments
- verifier-driven RL experiments for low-resource language work
- reproducible study of reward components for historical grammar sources
It is not intended as a standalone Dakota language authority, a substitute for community language expertise, or a production translation system.
Limitations And Ethical Notes
The source material is a historical grammar and dictionary published in 1890. It reflects the terminology, analysis, orthography, and colonial-era framing of its time. Outputs from this model can inherit mistakes, omissions, and outdated descriptions from the source extraction process and from the base model.
Dakota language work should be reviewed with appropriate community and linguistic expertise. This repository should be treated as an experimental technical artifact: useful for transparent research, not authoritative for teaching, cultural interpretation, or official translation.
Usage
The Tinker final sampler checkpoint is available now for direct Tinker sampling:
tinker://1f23df9c-5d88-59d9-a7e8-dd4e169ea7d0:train:0/sampler_weights/final
The Hugging Face PEFT adapter for this run lives in this repository, HarleyCooper/Qwen3.6-35B-A3B-Dakota1890-GRPO.
Use it against the base model Qwen/Qwen3.6-35B-A3B with a standard PEFT loading path:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "Qwen/Qwen3.6-35B-A3B"
adapter_name = "HarleyCooper/Qwen3.6-35B-A3B-Dakota1890-GRPO"

# Load the base model first, then attach the LoRA adapter from this repository.
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_name)

messages = [
    {"role": "system", "content": "Answer Dakota grammar tasks concisely. Return only the answer."},
    {"role": "user", "content": "Translate 'my elder brother' to Dakota. Return only the answer."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding with a short generation budget.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
Citation
Primary source:
Riggs, Stephen Return. 1890. A Dakota-English Dictionary. Contributions to North American Ethnology, Volume VII. Washington: Government Printing Office.
Training and experiment tracking:
- Thinking Machines Tinker for the RL training run
- W&B for experiment tracking and reward-ledger audit trails
- Dakota1890 repository artifacts for extraction, task generation, and verifier code