🦥 Unsloth Patch Details
This model was built and optimized with Unsloth (2x faster) to complete native tokenizer coverage of the Pashto alphabet.
- Developed by: nassimjp
- License: apache-2.0
- Finetuned from model: Qwen/Qwen3.5-9B
- Architecture Base: Qwen3.5 (4-bit quantized)
🎯 Project Overview & Core Motivation
The original Qwen3.5 tokenizer represents 42 Pashto alphabet characters as single tokens. However, 7 Pashto-specific characters were missing from its vocabulary, forcing the tokenizer to fall back to multi-token byte sequences (byte-level BPE fragmentation).
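As a rough illustration of why a missing character fragments: a byte-level BPE tokenizer with no entry for a character falls back to that character's UTF-8 bytes. The sketch below uses only the Python standard library and loads no tokenizer. It checks the five affected characters listed in the validation section of this card and shows that each occupies two UTF-8 bytes, so the worst case is two tokens per character. The helper name `utf8_byte_count` is illustrative, not part of any library.

```python
# Illustrative only: counts UTF-8 bytes, i.e. the worst-case number of tokens
# a byte-level BPE tokenizer needs for a character it has no entry for.
PASHTO_CHARS = ["ټ", "ډ", "ږ", "ڼ", "څ"]  # characters named in this card


def utf8_byte_count(ch: str) -> int:
    """Return how many bytes `ch` occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for ch in PASHTO_CHARS:
    raw = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch}: {utf8_byte_count(ch)} bytes -> {raw.hex(' ')}")
```

The actual number of tokens the original tokenizer emits depends on its merge rules, so treat the byte count as an upper bound rather than a measurement.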
This patch adds only those 7 missing characters to the vocabulary via the Unsloth wrapper. The approach gives 100% native coverage of the 49-character Pashto alphabet without disturbing the existing multilingual merge rules or vocabulary weights.
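The idea of an additive patch can be sketched with a toy vocabulary. This is not the Unsloth or Qwen code: `extend_vocab` is a hypothetical helper, the two-entry vocabulary is made up, and initializing new embedding rows to the mean of the existing rows is an assumption for this sketch, since the card does not state how the new rows were initialized.

```python
# Toy model of additive vocabulary extension (not the Unsloth or Qwen code).
from statistics import fmean


def extend_vocab(vocab, embeddings, new_tokens):
    """Append unseen tokens with fresh IDs; existing IDs and rows stay untouched.

    New rows start as the mean of the existing rows. That choice is an
    assumption made for this sketch, not a documented detail of the patch.
    """
    vocab = dict(vocab)                          # copy so the input is not mutated
    embeddings = [row[:] for row in embeddings]  # same for the embedding rows
    mean_row = [fmean(col) for col in zip(*embeddings)]
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)              # assumes IDs are 0..n-1 with no gaps
            embeddings.append(mean_row[:])
    return vocab, embeddings


old_vocab = {"a": 0, "b": 1}
old_emb = [[0.0, 2.0], [2.0, 4.0]]
new_vocab, new_emb = extend_vocab(old_vocab, old_emb, ["ټ", "ډ"])
print(new_vocab)    # existing IDs 0 and 1 are preserved
print(new_emb[2:])  # the two appended rows
```

In Hugging Face Transformers the analogous steps are `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; whether this patch used exactly those calls is not stated here.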
🧩 The Fixed 7 Native Characters:
📊 Pashto Alphabet Coverage Matrix
| Tokenizer Profile | Native Pashto Characters Supported | Missing Native Alphabet Entries | Evaluation Status |
|---|---|---|---|
| Original Qwen3.5 | 42 | 7 | ⚠️ Subword Fragmentation |
| Qwen3.5-Pashto-Base (Ours) | 49 | 0 | ✅ 100% Alphabet Integrity |
🧪 Scientific Validation & Empirical Tests
Phase 1: Single Token Isolation Assertions
Each of the 7 newly added characters now encodes to exactly 1 token instead of fragmenting into several byte-level tokens:
ټ → 1 token (ID 248077)
ډ → 1 token (ID 248078)
ږ → 1 token (ID 248079)
ڼ → 1 token (ID 248080)
څ → 1 token (ID )
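A check of this kind can be sketched as a small helper that takes any `encode` callable and reports which characters do not map to exactly one token. The helper `find_fragmented` and the stand-in `toy_encode` are hypothetical and exist only so the example runs without downloading the model; the two IDs in `TOY_VOCAB` are copied from the list above.

```python
from typing import Callable, Dict, Iterable, List


def find_fragmented(encode: Callable[[str], List[int]],
                    chars: Iterable[str]) -> Dict[str, int]:
    """Return {char: token_count} for every char that is not exactly 1 token."""
    counts = {ch: len(encode(ch)) for ch in chars}
    return {ch: n for ch, n in counts.items() if n != 1}


# Hypothetical stand-in encoder: one ID for known characters,
# otherwise one "token" per raw UTF-8 byte.
TOY_VOCAB = {"ټ": 248077, "ډ": 248078}


def toy_encode(text: str) -> List[int]:
    ids: List[int] = []
    for ch in text:
        if ch in TOY_VOCAB:
            ids.append(TOY_VOCAB[ch])
        else:
            ids.extend(ch.encode("utf-8"))
    return ids


print(find_fragmented(toy_encode, ["ټ", "ډ"]))  # no fragmented characters
print(find_fragmented(toy_encode, ["ټ", "ږ"]))  # "ږ" falls back to 2 bytes
```

With a real Hugging Face tokenizer, one would pass something like `lambda s: tokenizer.encode(s, add_special_tokens=False)` as `encode` and expect an empty result for all 7 characters.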
Phase 2: Sequence Efficiency Compression Benchmarking
We evaluated token consumption using a standard native Pashto sample phrase:
"زه نن ښوونځي ته ځم ځکه چې هلته کتابونه لولم."
- Original Qwen Tokenizer Count: 28 tokens (bloated by byte-fallback tokens such as Ú and ģ)
- Modified Pashto Tokenizer Count: 24 tokens
- Net Performance Gain: −4 tokens (~14.3% shorter sequence)
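The percentage above follows directly from the two counts. A minimal sketch of the arithmetic, with `compression_gain` as an illustrative helper name:

```python
def compression_gain(original_tokens: int, patched_tokens: int) -> float:
    """Fraction of tokens saved relative to the original token count."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return (original_tokens - patched_tokens) / original_tokens


saved = 28 - 24
gain = compression_gain(28, 24)
print(f"{saved} tokens saved, {gain:.1%} shorter")  # 4 tokens saved, 14.3% shorter
```

This is a single-sentence measurement, so the figure should not be read as a corpus-level average.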
💻 Local Training Implementation
This is a Base Model intended for Continued Pretraining (CPT) or fine-tuning on large Pashto corpora. To load this repository and make the new embeddings trainable for downstream training, start from the following Unsloth setup:
```python
import torch
from unsloth import FastLanguageModel

model_name = "nassimjp/qwen3.5-9b-4bit-pashto-base"
max_seq_length = 4096

# Load the 4-bit base model together with the patched tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = torch.bfloat16,
    load_in_4bit = True,
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    lora_alpha = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout = 0,
    bias = "none",
    # Train the embedding and output layers in full so the new tokens can learn.
    modules_to_save = ["embed_tokens", "lm_head"],
)
```
⚠️ Intended Usage Notice
This repository hosts a foundational Base Model, not a conversational assistant or instruction-tuned checkpoint. It requires Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF) before it can be used in chat applications.