
License: apache-2.0

✨ Key Features

  • Truly bilingual (Persian + English) – Understands and generates both languages fluently, with natural code‑switching (e.g., “بگو hello world به انگلیسی”, i.e. “say hello world in English”).
  • Ultra‑lightweight – Only 1.9B parameters: about 3.8 GB of weights in FP16 and roughly half that in INT8. The INT8 version runs on devices with 4 GB of RAM.
  • Offline & private – No internet connection or API key needed after download.
  • Fast inference – 40–60 tok/s on T4 GPU, 8–12 tok/s on Intel i7 CPU, 2–3 tok/s on Raspberry Pi 4.
  • Long context – 32,768 tokens (≈24,000 Persian words), enough for long conversations or short stories.
  • Iranian‑made, Apache 2.0 – Free for commercial and personal use, with full transparency.
  • Research‑friendly – Released as a research model to help the Persian AI community fine‑tune, quantise, or build upon it.

🎯 Suitable Use Cases

  • Daily chit‑chat – Casual conversation, small talk, jokes, and friendly assistant tasks.
  • Simple Q&A – Answering general knowledge questions (e.g., “پایتخت فرانسه کجاست؟” / “What is the capital of France?”).
  • Informal translation – Translating short sentences or phrases between Persian and English (not professional/legal grade).
  • Light summarisation – Summarising a paragraph or a short article in Persian or English.
  • Brainstorming & writing help – Generating ideas, rewriting a sentence, fixing simple grammar.
  • Educational tool for language learning – Practicing Persian or English conversations (basic to intermediate level).
  • Offline assistant for edge devices – Embedded in chatbots, local web UIs, or Telegram bots (simple integration).

❌ Not suitable for:

  • Code generation, debugging, or programming assistance.
  • Complex mathematical reasoning or multi‑step logic.
  • Professional translation (e.g., legal, medical).
  • Long document processing (>32k tokens).
  • Any task requiring up‑to‑date information after mid‑2024.

📊 Evaluation & Performance Metrics

We evaluated Neura‑FA‑EN‑1.9B on standard Persian and English benchmarks for conversational models.

| Dataset | Metric | Score | Note |
| --- | --- | --- | --- |
| ParsiMMLU (5‑shot) | Accuracy | 48.7% | General knowledge in Persian |
| PersianQA | Exact Match | 56.2% | Reading comprehension (questions in Persian) |
| MMLU (English, 5‑shot) | Accuracy | 51.3% | General knowledge in English |
| XNLI (fa) | Accuracy | 62.1% | Natural language inference (Persian) |
| XNLI (en) | Accuracy | 68.5% | Natural language inference (English) |
| Perplexity (fa‑wikitext) | PPL | 18.3 | Fluency on Persian texts |

Interpretation: On Persian tasks, the model performs on par with larger multilingual models (e.g., XLM‑R 3B) while having roughly 40% fewer parameters. For English, it stays competitive with dedicated 1.5B models.
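
The Exact Match score for PersianQA depends on how answers are normalised before they are compared, and this card does not say which normalisation was used. The sketch below is therefore an assumption, not the evaluation code behind the table: it maps Arabic kaf and yeh to their Persian forms, drops zero‑width non‑joiners, and collapses whitespace. The names `normalise_fa` and `exact_match_score` are illustrative.

```python
import re

# Assumed normalisation (not taken from the card):
# Arabic yeh -> Persian yeh, Arabic kaf -> Persian kaf, drop zero-width non-joiner.
_CHAR_MAP = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9", "\u200c": ""})


def normalise_fa(text: str) -> str:
    """Normalise a Persian answer string before comparison."""
    text = text.translate(_CHAR_MAP)
    return re.sub(r"\s+", " ", text).strip()


def exact_match_score(predictions, references) -> float:
    """Percentage of predictions equal to their reference after normalisation."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalise_fa(p) == normalise_fa(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    preds = ["تهران", "پاریس"]
    refs = [" تهران ", "لندن"]
    print(exact_match_score(preds, refs))
```

A stricter or looser normalisation would shift the score, so results are only comparable when both sides use the same rules.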


📈 Comparison with Similar‑Sized Models

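| Model | Params | Persian MMLU | English MMLU | VRAM (FP16) | Speed (tok/s, T4) | License |
| --- | --- | --- | --- | --- | --- | --- |
| Neura‑FA‑EN‑1.9B | 1.9B | 48.7% | 51.3% | ~3.8 GB | 48 | Apache 2.0 |
| Arian‑2B (Persian) | 2.0B | 44.2% | 28.7% | ~4.0 GB | 45 | Apache 2.0 |
| Phi‑2 (2.7B, English‑only) | 2.7B | N/A | 57.8% | ~5.4 GB | 40 | MIT |
| Gemma‑2B (English‑only) | 2.0B | N/A | 52.6% | ~4.0 GB | 52 | Gemma |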

Key points: Among the models compared here, Neura‑FA‑EN‑1.9B is the only one with strong scores on both Persian and English MMLU.
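
The VRAM (FP16) column follows from simple arithmetic: each FP16 parameter takes 2 bytes, so weight memory in GB is roughly the parameter count in billions times two. The sketch below reproduces the column under that assumption. It counts weights only and ignores activations and the KV cache; `weight_memory_gb` is an illustrative helper name.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (10^9 bytes) for a given precision.

    bytes_per_param: 2 for FP16, 1 for INT8. Weights only, no runtime overhead.
    """
    return params_billions * bytes_per_param


if __name__ == "__main__":
    models = [
        ("Neura-FA-EN-1.9B", 1.9),
        ("Arian-2B", 2.0),
        ("Phi-2", 2.7),
        ("Gemma-2B", 2.0),
    ]
    for name, params in models:
        print(f"{name}: ~{weight_memory_gb(params):.1f} GB in FP16")
```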


🧪 Technical Details & Training Process

The model uses the Qwen2 architecture but reuses no existing weights: it was trained from scratch by Neuracoder.

Architecture

  • Layers: 28 decoder‑only layers.
  • Attention: Grouped Query Attention (GQA) – 12 query heads, 2 key/value heads.
  • Activation: SwiGLU.
  • Context length: 32,768 tokens.
  • Embedding size: 2048.
  • Intermediate size: 5632.
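
Grouped Query Attention matters mostly for the key/value cache: with 12 query heads sharing 2 key/value heads, the cache is six times smaller than it would be with one key/value head per query head. The sketch below estimates the cache size at the full 32,768‑token context. The per‑head dimension is not stated in this card, so `head_dim=128` is an assumption made purely for illustration, as are the function names.

```python
def query_heads_per_kv_head(n_query_heads: int = 12, n_kv_heads: int = 2) -> int:
    """Number of query heads that share each key/value head under GQA."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must be divisible by key/value heads")
    return n_query_heads // n_kv_heads


def kv_cache_bytes(
    seq_len: int,
    n_layers: int = 28,
    n_kv_heads: int = 2,
    head_dim: int = 128,       # ASSUMPTION: not given in the model card
    bytes_per_value: int = 2,  # FP16
) -> int:
    """Size of the KV cache for one sequence: K and V tensors for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    gqa = kv_cache_bytes(32_768)
    mha = kv_cache_bytes(32_768, n_kv_heads=12)  # hypothetical non-GQA variant
    print(f"GQA (2 KV heads): {gqa / 2**30:.3f} GiB")
    print(f"12 KV heads:      {mha / 2**30:.3f} GiB")
    print(f"Query heads per KV head: {query_heads_per_kv_head()}")
```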

Pre‑training

  • Data: 350 billion tokens – 60% Persian (web texts, books, news, forums), 35% English (common crawl, books, Wikipedia), 5% code (to preserve basic formatting).
  • Duration: 18 days on 8× NVIDIA A100 (80GB) using DeepSpeed ZeRO‑3.
  • Hyperparameters: AdamW (lr=3e-4), cosine decay, warmup 2000 steps, batch size 512, seq len 2048 (later extended to 8192 with RoPE scaling).
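
The learning‑rate schedule described above can be sketched as a linear warmup followed by cosine decay. Several details are assumptions because the card does not state them: the batch size is taken to mean 512 sequences of 2048 tokens, the step count is derived from the 350 billion token budget at that sequence length, warmup is linear, and the rate decays to zero. `lr_at` is an illustrative name, not the training code.

```python
import math

PEAK_LR = 3e-4
WARMUP_STEPS = 2000
TOKENS_PER_STEP = 512 * 2048  # ASSUMPTION: batch size counted in sequences
TOTAL_STEPS = 350_000_000_000 // TOKENS_PER_STEP  # ASSUMPTION: derived from the token budget


def lr_at(step: int, total_steps: int = TOTAL_STEPS, min_lr: float = 0.0) -> float:
    """Learning rate at a 0-indexed optimiser step."""
    if step < WARMUP_STEPS:
        # Linear warmup: reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to min_lr over the remaining steps.
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS))
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(f"tokens per step: {TOKENS_PER_STEP:,}")
    print(f"total steps:     {TOTAL_STEPS:,}")
    for step in (0, 1999, TOTAL_STEPS // 2, TOTAL_STEPS - 1):
        print(f"step {step:>7,}: lr = {lr_at(step):.3e}")
```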

Supervised Fine‑Tuning (SFT)

  • Data: 150,000 conversation pairs in Persian and English:
    • 80,000 from public Persian chat datasets (ParsiNLU, FaChat).
    • 50,000 from translated and cleaned ShareGPT data.
    • 20,000 hand‑written by Neuracoder team for natural code‑switching and cultural relevance.
  • Format: {"system": "You are a helpful assistant.", "user": "...", "assistant": "..."}
  • Hyperparameters: 3 epochs, lr=1e-5, batch size 128; LoRA (rank 32) first, followed by full fine‑tuning of the last 6 layers.
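
Records in the format shown above map naturally onto the role/content message lists that chat templates expect. The sketch below shows one way to do the conversion. It assumes one JSON object per record and the usual `system` / `user` / `assistant` role names; `record_to_messages` is a hypothetical helper, not part of the released training code.

```python
import json


def record_to_messages(record: dict) -> list:
    """Convert one SFT record into a list of chat messages."""
    messages = []
    system = record.get("system")
    if system:  # the system prompt is treated as optional
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": record["user"]})
    messages.append({"role": "assistant", "content": record["assistant"]})
    return messages


if __name__ == "__main__":
    line = (
        '{"system": "You are a helpful assistant.", '
        '"user": "Hello!", "assistant": "Hi! How can I help?"}'
    )
    for message in record_to_messages(json.loads(line)):
        print(message)
```

The resulting list can be passed to `tokenizer.apply_chat_template`, as in the usage examples in this card.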

Validation

  • Checkpoints were evaluated every 500 steps on held‑out Persian and English test sets.
  • Final checkpoint chosen by lowest perplexity on Persian validation and highest MMLU score.
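
Perplexity is the exponential of the mean per‑token negative log‑likelihood, so lower is better. The sketch below computes it and then picks a checkpoint. The card does not say how the two criteria are combined when they disagree, so the rule here (lowest Persian perplexity first, higher MMLU as tie‑breaker) is an assumption, and the checkpoint values are made‑up placeholders rather than real training logs.

```python
import math


def perplexity(token_nlls) -> float:
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def pick_checkpoint(checkpoints):
    """ASSUMED rule: lowest Persian perplexity, ties broken by higher MMLU."""
    return min(checkpoints, key=lambda c: (c["fa_ppl"], -c["mmlu"]))


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    candidates = [
        {"step": 500, "fa_ppl": 21.0, "mmlu": 0.47},
        {"step": 1000, "fa_ppl": 19.5, "mmlu": 0.49},
        {"step": 1500, "fa_ppl": 19.5, "mmlu": 0.50},
    ]
    print(pick_checkpoint(candidates)["step"])
    print(perplexity([math.log(4.0)] * 3))
```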

⚡ Inference Speed & Hardware Requirements

| Hardware | Weight format | Avg. speed (generating 256 tokens) | Memory usage |
| --- | --- | --- | --- |
| NVIDIA A100 (40GB) | FP16 | 78 tok/s | 4.1 GB |
| NVIDIA T4 (16GB) | FP16 | 48 tok/s | 3.9 GB |
| NVIDIA T4 (16GB) | INT8 | 55 tok/s | 2.3 GB |
| NVIDIA GTX 1060 (6GB) | FP16 | 28 tok/s | 3.9 GB |
| CPU (Intel i7-12700K) | INT8 | 9 tok/s | 2.1 GB |
| Raspberry Pi 4 (4GB) | INT8 (ONNX) | 2–3 tok/s | 1.6 GB |

Recommendation: Use FP16 on any GPU with 6+ GB of VRAM. For CPU or low‑memory devices, use the INT8 quantised version (available separately).
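
The recommendation and the throughput table can be turned into a small planning helper. The sketch below encodes the 6 GB rule and estimates how long a reply would take at the measured speeds. The throughput values are copied from the table; the function names and the dictionary layout are illustrative, and real latency also depends on prompt length and batch size.

```python
# Tokens per second, copied from the table above.
THROUGHPUT = {
    ("A100", "FP16"): 78.0,
    ("T4", "FP16"): 48.0,
    ("T4", "INT8"): 55.0,
    ("GTX 1060", "FP16"): 28.0,
    ("i7-12700K", "INT8"): 9.0,
}


def pick_weight_format(gpu_vram_gb=None) -> str:
    """FP16 for GPUs with at least 6 GB of VRAM, INT8 otherwise (or when no GPU)."""
    if gpu_vram_gb is not None and gpu_vram_gb >= 6:
        return "FP16"
    return "INT8"


def estimated_seconds(hardware: str, weight_format: str, new_tokens: int = 256) -> float:
    """Rough generation time for `new_tokens` at the measured throughput."""
    return new_tokens / THROUGHPUT[(hardware, weight_format)]


if __name__ == "__main__":
    print(pick_weight_format(16))   # T4-class GPU
    print(pick_weight_format(None)) # CPU only
    print(f"{estimated_seconds('T4', 'FP16'):.1f} s for 256 tokens on a T4")
```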


🚀 Usage Guide

Installation

```bash
pip install transformers torch accelerate sentencepiece
```

Example 1: Basic Persian conversation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "neuracoder/neura-fa-en-1.9b"

# Load the tokenizer and FP16 weights; device_map="auto" relies on `accelerate`.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# "What do you think is the best way to learn English?"
prompt = "به نظرت بهترین راه برای یادگیری زبان انگلیسی چیه؟"
messages = [{"role": "user", "content": prompt}]

# Apply the chat template and append the marker that starts the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

Example 2: Mixed Persian‑English query

```python
# Reuses `tokenizer` and `model` from Example 1.
# "Write an English sentence that conveys the meaning of 'the sun is shining'."
prompt = "یه جمله انگلیسی بنویس که معنی 'خورشید می‌تابد' رو برسونه"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# temperature only takes effect when sampling is enabled.
outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

Example 3: Simple summarisation (English)

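```python
# Reuses `tokenizer` and `model` from Example 1.
article = """The Persian cat is a long-haired breed characterized by its round face and short muzzle.
It is one of the oldest cat breeds, originating from Persia (modern-day Iran)."""
prompt = f"Summarise the following text in one sentence:\n\n{article}"
# Use the chat template here as well, as in the other examples.
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# A low temperature keeps the summary close to the source; it requires do_sample=True.
outputs = model.generate(**inputs, max_new_tokens=80, temperature=0.3, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```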

Download the model directly

```bash
git lfs install
git clone https://huggingface.co/neuracoder/neura-fa-en-1.9b
```

Or via Python:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="neuracoder/neura-fa-en-1.9b", local_dir="./neura-fa-en-1.9b")
```

⚠️ Limitations

  • Not a code model – Cannot write or debug programs reliably.
  • Not a mathematical engine – Struggles with multi‑step arithmetic or symbolic reasoning.
  • Knowledge cutoff – Mid‑2024. Unaware of very recent events or new APIs.
  • Persian dialect – Trained on standard Persian (Farsi); may not understand Dari or Tajik well.
  • Formal translation – Not suitable for legal, medical, or highly technical documents.
  • Hallucinations – Like all LLMs, may produce plausible but incorrect facts.
  • Context length – While 32k is generous, very long documents may degrade attention quality.
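
Because the prompt and the generated reply share the same 32,768‑token window, it helps to check the budget before calling `generate`. The sketch below does that check and also offers a rough token estimate for Persian text, using this card's own figure of about 24,000 Persian words per 32,768 tokens. That ratio is only a heuristic and the helper names are illustrative; an exact count requires running the tokenizer.

```python
CONTEXT_LENGTH = 32_768
# From this card's estimate: 32,768 tokens is roughly 24,000 Persian words.
TOKENS_PER_24K_WORDS = 32_768


def fits_in_context(prompt_tokens: int, max_new_tokens: int,
                    context_length: int = CONTEXT_LENGTH) -> bool:
    """True if the prompt plus the requested generation fits in the window."""
    return prompt_tokens + max_new_tokens <= context_length


def rough_token_estimate(persian_word_count: int) -> int:
    """Heuristic token count for Persian text, rounded up (integer arithmetic)."""
    return -(-persian_word_count * TOKENS_PER_24K_WORDS // 24_000)


if __name__ == "__main__":
    words = 20_000
    tokens = rough_token_estimate(words)
    print(f"{words:,} words is roughly {tokens:,} tokens")
    print("fits with 512 new tokens:", fits_in_context(tokens, 512))
```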

🗺️ Roadmap

  • Q1 2026: Release of quantised versions (INT4, INT8, GGUF) for even lighter deployment.
  • Q2 2026: Neura‑FA‑EN‑3B – 3.5B parameters, expanded Persian vocabulary, improved reasoning.
  • Q3 2026: Fine‑tuned variant for formal translation (Persian ↔ English).
  • Ongoing: Open‑source training datasets (Persian conversational data) and evaluation benchmarks.

🤝 Contribute

This model is free and open‑source. You can help by:

  • Reporting bugs or suggesting improvements in the Discussions tab.
  • Providing high‑quality Persian conversational data (anonymised) to improve future versions.
  • Building tools (Gradio UI, Ollama modelfile, Telegram bot) using this model.
  • Financial sponsorship – Contact the Neuracoder team.
  • Spreading the word – Every user helps the Persian AI community grow.

📜 License

Apache License 2.0 – You may freely use, modify, distribute, and even sell this model as part of your product, provided you include the original license and copyright notice and mark any files you have modified.


📞 Contact


Made with ❤️ in Iran – the neuracoder team
Democratising conversational AI for Persian speakers: fast, local, and free for everyone.

Model provider: neuracoder

Model tree: base (this model)

Modalities: text input, text output