nassimjp

qwen3.5-9b-4bit-pashto-base


🦥 Unsloth Patch Details

This model was prepared with Unsloth (2x faster) so that every character of the Pashto alphabet has its own native token.

Developed by: nassimjp
License: apache-2.0
Finetuned from model: Qwen/Qwen3.5-9B
Architecture Base: Qwen3.5 (4-bit quantized)

🎯 Project Overview & Core Motivation

The original Qwen3.5 tokenizer encodes 42 characters of the Pashto alphabet as single tokens. The remaining 7 Pashto-specific characters were missing from its vocabulary, so the tokenizer fell back to splitting each of them into multiple byte-level tokens (byte-level BPE fragmentation).

This patch adds only those 7 missing characters to the vocabulary through the Unsloth wrapper and initializes their embeddings. The result is full single-token coverage of the Pashto alphabet (49 of 49 characters) without changing the existing multilingual merge rules or vocabulary weights.
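The card does not say how the new embedding rows are initialized. The sketch below assumes, purely for illustration, that each new row is set to the mean of the existing rows, which is one common choice. It shows the property the patch relies on: appending rows leaves every existing row, and therefore every existing token ID, untouched. The function name `extend_embeddings` and the toy 2-dimensional matrix are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def extend_embeddings(matrix: Matrix, n_new: int) -> Matrix:
    """Return a copy of `matrix` with `n_new` rows appended.

    Assumption (not stated in the model card): each new row is the
    column-wise mean of the existing rows. Existing rows are copied
    unchanged, so old token IDs keep their vectors.
    """
    if not matrix:
        raise ValueError("matrix must contain at least one row")
    n_rows = len(matrix)
    dim = len(matrix[0])
    mean_row = [sum(row[j] for row in matrix) / n_rows for j in range(dim)]
    extended = [list(row) for row in matrix]
    extended.extend(list(mean_row) for _ in range(n_new))
    return extended


# Toy example: 2 existing tokens, 7 appended rows for the new characters.
old = [[1.0, 2.0], [3.0, 4.0]]
new = extend_embeddings(old, n_new=7)
print(len(new))        # 9 rows: 2 original + 7 appended
print(new[:2] == old)  # True: the original rows are unchanged
```

In a real workflow the resizing would be handled by the model library rather than by hand; this only illustrates why existing vocabulary weights are not disturbed.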

🧩 The Fixed 7 Native Characters:

```text
ټ  ډ  ږ  ڼ  څ  ځ  ښ
```
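To see why these characters fragment without a dedicated token, it helps to look at their UTF-8 bytes. The sketch below assumes the GPT-2 style byte-to-character table that byte-level BPE tokenizers commonly use; that assumption is consistent with the symbols Ú and ģ quoted in the benchmark section, but it is not confirmed by the card. The helper names `bytes_to_unicode` and `byte_symbols` are illustrative.

```python
def bytes_to_unicode() -> dict:
    """Map each byte value to a printable character (GPT-2 style table)."""
    printable = (
        list(range(33, 127))     # '!' .. '~'
        + list(range(161, 173))  # '¡' .. '¬'
        + list(range(174, 256))  # '®' .. 'ÿ'
    )
    table = {b: chr(b) for b in printable}
    offset = 0
    for b in range(256):
        if b not in table:
            # Non-printable bytes are shifted into the range starting at U+0100.
            table[b] = chr(256 + offset)
            offset += 1
    return table


PASHTO_CHARS = ["ټ", "ډ", "ږ", "ڼ", "څ", "ځ", "ښ"]
TABLE = bytes_to_unicode()


def byte_symbols(ch: str) -> str:
    """Render the UTF-8 bytes of `ch` as byte-level BPE symbols."""
    return "".join(TABLE[b] for b in ch.encode("utf-8"))


for ch in PASHTO_CHARS:
    raw = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(raw)} bytes, symbols {byte_symbols(ch)}")

print(byte_symbols("ځ"))  # Úģ
```

Each of the 7 characters occupies two bytes in UTF-8, so without a vocabulary entry or a learned merge it can cost up to two tokens.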

📊 Pashto Alphabet Coverage Matrix

| Tokenizer Profile | Native Pashto Characters Supported | Missing Native Alphabet Entries | Evaluation Status |
| --- | --- | --- | --- |
| Original Qwen3.5 | 42 | 7 | ⚠️ Subword Fragmentation |
| Qwen3.5-Pashto-Base (Ours) | 49 | 0 | ✅ 100% Alphabet Integrity |

🧪 Scientific Validation & Empirical Tests

Phase 1: Single Token Isolation Assertions

Each of the 7 newly added characters now encodes to exactly one token instead of being split into a multi-byte sequence:

ټ $\to$ 1 token (ID 248077)
ډ $\to$ 1 token (ID 248078)
ږ $\to$ 1 token (ID 248079)
ڼ $\to$ 1 token (ID 248080)
څ $\to$ 1 token (ID )
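A check of this kind can be sketched as a small helper that accepts any `encode` callable and reports the characters that do not map to exactly one ID. The helper name `check_single_token` and the stand-in encoder are hypothetical; the toy vocabulary contains only the four IDs listed above and is not the real tokenizer.

```python
from typing import Callable, Dict, Iterable, List


def check_single_token(
    encode: Callable[[str], List[int]], chars: Iterable[str]
) -> Dict[str, List[int]]:
    """Return {char: ids} for every char that does NOT encode to exactly one ID."""
    failures = {}
    for ch in chars:
        ids = encode(ch)
        if len(ids) != 1:
            failures[ch] = ids
    return failures


# Stand-in encoder for demonstration only (IDs taken from the list above).
TOY_VOCAB = {"ټ": 248077, "ډ": 248078, "ږ": 248079, "ڼ": 248080}


def toy_encode(text: str) -> List[int]:
    ids: List[int] = []
    for ch in text:
        if ch in TOY_VOCAB:
            ids.append(TOY_VOCAB[ch])
        else:
            ids.extend(ch.encode("utf-8"))  # byte fallback: one ID per byte
    return ids


print(check_single_token(toy_encode, TOY_VOCAB))  # {}
print(check_single_token(toy_encode, ["é"]))      # {'é': [195, 169]}
```

With a Hugging Face tokenizer, `encode` could be `lambda s: tokenizer.encode(s, add_special_tokens=False)`, so that special tokens do not inflate the count.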

Phase 2: Sequence Efficiency Compression Benchmarking

We evaluated token consumption using a standard native Pashto sample phrase:

```text
"زه نن ښوونځي ته ځم ځکه چې هلته کتابونه لولم."
```

Original Qwen Tokenizer Count: 28 tokens (inflated by byte-level fallback symbols such as Ú and ģ)
Modified Pashto Tokenizer Count: 24 tokens
Net Change: −4 tokens (4/28 ≈ 14.3% fewer tokens for the same text)
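The reported numbers can be sanity-checked with a few lines of arithmetic. The sketch below counts how often the 7 added characters occur in the sample phrase and recomputes the relative reduction from the two reported token counts. It assumes each occurrence previously cost two byte-level tokens and now costs one; real BPE merges may behave differently, so treat this as a consistency check rather than a reproduction of the benchmark.

```python
PHRASE = "زه نن ښوونځي ته ځم ځکه چې هلته کتابونه لولم."
ADDED_CHARS = {"ټ", "ډ", "ږ", "ڼ", "څ", "ځ", "ښ"}

ORIGINAL_TOKENS = 28  # reported for the original tokenizer
PATCHED_TOKENS = 24   # reported for the patched tokenizer

# Occurrences of the newly added characters in the sample phrase.
occurrences = sum(1 for ch in PHRASE if ch in ADDED_CHARS)

saved = ORIGINAL_TOKENS - PATCHED_TOKENS
reduction = saved / ORIGINAL_TOKENS

print(occurrences)          # 4
print(saved)                # 4
print(f"{reduction:.1%}")   # 14.3%
```

The phrase contains ښ once and ځ three times, so one saved token per occurrence accounts for the full drop from 28 to 24.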

💻 Local Training Implementation

This is a base model intended for continued pretraining (CPT) or fine-tuning on large Pashto corpora. To download this repository and train the new embeddings in downstream runs, load the model with Unsloth as follows:

```python
import torch
from unsloth import FastLanguageModel

model_name = "nassimjp/qwen3.5-9b-4bit-pashto-base"
max_seq_length = 4096

# Load the 4-bit checkpoint together with its patched tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = torch.bfloat16, # Optimized for Ampere/Ada Lovelace architectures
    load_in_4bit = True,
)

# Configure LoRA adapters while explicitly training and saving the new token weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    lora_alpha = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout = 0,
    bias = "none",
    # These layers are trained in full (not frozen) and saved with the adapter,
    # which is what lets the 7 injected token rows be learned and persisted
    modules_to_save = ["embed_tokens", "lm_head"],
)
```

⚠️ Intended Usage Notice

This repository hosts a foundational base model, not a conversational assistant or instruction-tuned checkpoint. It requires Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF) before it is suitable for chat applications.

Model provider: nassimjp
Model tree: Base Qwen/Qwen3.5-9B; Quantized: this model
Modalities: Input: Video, Text, Image; Output: Text


