azherali

Aqal-1.0-8B-Instruct


README

License: apache-2.0

Quick start

python
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Maximum context length; Unsloth handles RoPE scaling internally
dtype = None  # None auto-detects; use torch.float16 on Tesla T4/V100, torch.bfloat16 on Ampere or newer
load_in_4bit = False  # Set to True for 4-bit quantization to reduce memory usage
load_in_8bit = False  # Set to True for 8-bit quantization to reduce memory usage

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="azherali/Aqal-1.0-8B-Instruct",  # Hugging Face repo ID of the model to load
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    load_in_8bit=load_in_8bit,
    # token = "YOUR_HF_TOKEN", # HF Token for gated models
)
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

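# The Urdu prompt below reads: "Five children shared 20 chocolates equally.
# How many chocolates will each child get?"
messages = [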
    {
        "role": "user",
        "content": "پانچ بچوں نے 20 چاکلیٹس برابر بانٹیں۔ ہر بچے کو کتنی چاکلیٹس ملیں گی؟",
    }
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # Required so the model starts an assistant reply
)

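from transformers import TextStreamer

_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens=256,  # Assumed cap on reply length; raise for longer outputs
    do_sample=True,  # Make sure sampling is on so temperature/top_p/top_k take effect
    temperature=0.6,
    top_p=0.95,
    top_k=20,  # For non-thinking mode
    streamer=TextStreamer(tokenizer, skip_prompt=True),  # Print tokens as they are generated
)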
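
The `dtype` comment in the quick start says to use Float16 on Tesla T4 and V100 and Bfloat16 on Ampere or newer GPUs. If you would rather set `dtype` explicitly than rely on `None`, that rule can be sketched as a small helper. The function names `pick_dtype_name` and `detect_dtype_name` are hypothetical and not part of Unsloth; the sketch assumes that Ampere-class GPUs report a CUDA compute-capability major version of 8 or higher (T4 is 7.5, V100 is 7.0).

```python
from typing import Optional


def pick_dtype_name(cc_major: int) -> str:
    """Map a CUDA compute-capability major version to a dtype name.

    Hypothetical helper: major version 8 or higher (Ampere and newer)
    gets bfloat16; older cards such as T4 (7.5) and V100 (7.0) get float16.
    """
    return "bfloat16" if cc_major >= 8 else "float16"


def detect_dtype_name() -> Optional[str]:
    """Return a dtype name for GPU 0, or None when PyTorch or CUDA is missing.

    Returning None mirrors the quick start's `dtype = None`, which leaves
    the choice to automatic detection.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, _minor = torch.cuda.get_device_capability(0)
    return pick_dtype_name(major)
```

To use it with the quick start, something like `name = detect_dtype_name()` followed by `dtype = getattr(torch, name) if name else None` should work before calling `FastLanguageModel.from_pretrained`.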