README

License: apache-2.0

Overview

This checkpoint is a 4-bit (NF4) quantization of Qwen2.5-32B-Instruct, produced with BitsAndBytes and evaluated with standard text similarity metrics.


Model Architecture

| Attribute | Value |
| --- | --- |
| Model class | Qwen2ForCausalLM |
| Number of parameters | 17,161,065,472 |
| Hidden size | 5120 |
| Number of layers | 64 |
| Attention heads | 40 |
| Vocabulary size | 152064 |
| Compute dtype | bfloat16 |
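
The reported parameter count is well below 32 billion, which is most likely because 4-bit weights are packed two per `uint8` storage element. The sketch below is a rough arithmetic check of that reading, not something stated by the card authors. The key/value head count (8) and MLP intermediate size (27648) are assumptions taken from the upstream Qwen2.5-32B-Instruct configuration, as are the choices that only the decoder linear layers are packed and that the embedding and output head stay unquantized.

```python
# Values from the Model Architecture table.
HIDDEN_SIZE = 5120
NUM_LAYERS = 64
NUM_HEADS = 40
VOCAB_SIZE = 152064

# Assumptions (not in this card): upstream Qwen2.5-32B-Instruct config values.
NUM_KV_HEADS = 8
INTERMEDIATE_SIZE = 27648


def count_parameters(packed: bool) -> int:
    """Count tensor elements; packed=True halves the 4-bit linear weights."""
    head_dim = HIDDEN_SIZE // NUM_HEADS
    kv_dim = NUM_KV_HEADS * head_dim
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attention = 2 * HIDDEN_SIZE * HIDDEN_SIZE + 2 * HIDDEN_SIZE * kv_dim
    # gate_proj, up_proj and down_proj.
    mlp = 3 * HIDDEN_SIZE * INTERMEDIATE_SIZE
    linear = attention + mlp
    if packed:
        linear //= 2  # two 4-bit weights per uint8 element
    biases = HIDDEN_SIZE + 2 * kv_dim  # biases on q, k, v only
    norms = 2 * HIDDEN_SIZE  # two RMSNorm weights per layer
    per_layer = linear + biases + norms
    # Input embedding plus an untied output head, both assumed unquantized.
    embeddings = 2 * VOCAB_SIZE * HIDDEN_SIZE
    final_norm = HIDDEN_SIZE
    return NUM_LAYERS * per_layer + embeddings + final_norm


packed_count = count_parameters(packed=True)
logical_count = count_parameters(packed=False)
print(f"packed:  {packed_count:,}")   # packed:  17,161,065,472
print(f"logical: {logical_count:,}")
```

Under these assumptions the packed count matches the table exactly, so the figure appears to describe stored tensor elements rather than logical weights.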

Quantization Configuration

json

{
  "quant_method": "bitsandbytes",
  "_load_in_8bit": false,
  "_load_in_4bit": true,
  "llm_int8_threshold": 6.0,
  "llm_int8_skip_modules": null,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "uint8",
  "load_in_4bit": true,
  "load_in_8bit": false
}
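
To reuse these settings elsewhere, the serialized configuration can be reduced to plain keyword arguments. The sketch below is one possible way to do that with the standard library only; `to_bnb_kwargs` is a hypothetical helper, not part of any library, and it simply drops the `quant_method` marker and the underscore-prefixed duplicate keys.

```python
import json

# The configuration exactly as listed in this card.
QUANT_CONFIG_JSON = """
{
  "quant_method": "bitsandbytes",
  "_load_in_8bit": false,
  "_load_in_4bit": true,
  "llm_int8_threshold": 6.0,
  "llm_int8_skip_modules": null,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "uint8",
  "load_in_4bit": true,
  "load_in_8bit": false
}
"""


def to_bnb_kwargs(config: dict) -> dict:
    """Keep only constructor-style keys (hypothetical helper)."""
    return {
        key: value
        for key, value in config.items()
        if key != "quant_method" and not key.startswith("_")
    }


kwargs = to_bnb_kwargs(json.loads(QUANT_CONFIG_JSON))
for key in sorted(kwargs):
    print(f"{key} = {kwargs[key]!r}")
```

With a recent `transformers` release installed, these keys should correspond to the arguments of `BitsAndBytesConfig`, so `BitsAndBytesConfig(**kwargs)` is a plausible way to apply the same scheme when quantizing the base model yourself. Treat that as an assumption to check against the installed version.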

Intended Use

  • Research and experimentation.
  • Instruction-following tasks in resource-constrained environments.
  • Demonstrations of quantized model capabilities.

Limitations

  • May reproduce biases from the original model.
  • Quantization may reduce generation diversity and factual accuracy.
  • Not intended for production without additional evaluation.

Usage

python

from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository ID of the quantized checkpoint on the Hugging Face Hub
model_id = "pbhappliedsystems/Qwen2.5-32B-Instruct-4bit-20260527_122210"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the weights on the available device(s)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain the concept of reinforcement learning."
# Move the inputs to the model's device instead of hard-coding "cuda"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Generation Settings

This model produces its best results when sampling with:

  • temperature: 0.3
  • top_p: 0.9
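
The settings above only take effect when sampling is enabled. The sketch below shows one way to pass them to `generate`. `SAMPLING_KWARGS` and `sample_reply` are hypothetical names introduced for illustration, and `model` and `tokenizer` are assumed to be objects loaded as in the Usage section.

```python
# Sampling settings recommended in this card. do_sample=True is set
# explicitly so that temperature and top_p are actually used.
SAMPLING_KWARGS = {"do_sample": True, "temperature": 0.3, "top_p": 0.9}


def sample_reply(model, tokenizer, prompt, max_new_tokens=256):
    """Generate one reply with the recommended sampling settings."""
    # Tokenize and move the tensors to wherever the model was placed.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=max_new_tokens, **SAMPLING_KWARGS
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

For example, `sample_reply(model, tokenizer, "Explain the concept of reinforcement learning.")` would return the decoded text. Because sampling is stochastic, repeated calls can return different outputs.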

Model Files Metadata

| Filename | Size (bytes) | SHA-256 |
| --- | --- | --- |
| model-00001-of-00004.safetensors | 4,933,190,348 | b2a0e8a735e99b3a59bb3139541c444808aff3793a28c314c0f02bf17a00b5f7 |
| model-00002-of-00004.safetensors | 4,958,587,236 | fd4b028d13261c8da0e29ed57b95189d666f62f3e8d4ab232c17c4e4e131543a |
| model-00003-of-00004.safetensors | 4,999,136,184 | 0446d1c6da46a5daea91bed161fd62f2f48a658d879f58a14b7ab5528eb66935 |
| model-00004-of-00004.safetensors | 4,324,534,021 | 39002c4ed64520809793fb2b2023caf9bdbf0914feb4786d553c418139457018 |
| quant_config.json | 426 | 1bd2332861a3d1a8f387a9d04a1432b5bb57dec1a112ab6cfe594f67c5e66823 |
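
The sizes and digests listed above can be used to check a local download. The following is a minimal sketch using only the standard library. `sha256_of` and `verify` are hypothetical helper names, and the directory passed to `verify` is assumed to contain the files under their listed names.

```python
import hashlib
from pathlib import Path

# Filename -> (size in bytes, SHA-256 digest), copied from the file list.
EXPECTED = {
    "model-00001-of-00004.safetensors": (
        4933190348,
        "b2a0e8a735e99b3a59bb3139541c444808aff3793a28c314c0f02bf17a00b5f7",
    ),
    "model-00002-of-00004.safetensors": (
        4958587236,
        "fd4b028d13261c8da0e29ed57b95189d666f62f3e8d4ab232c17c4e4e131543a",
    ),
    "model-00003-of-00004.safetensors": (
        4999136184,
        "0446d1c6da46a5daea91bed161fd62f2f48a658d879f58a14b7ab5528eb66935",
    ),
    "model-00004-of-00004.safetensors": (
        4324534021,
        "39002c4ed64520809793fb2b2023caf9bdbf0914feb4786d553c418139457018",
    ),
    "quant_config.json": (
        426,
        "1bd2332861a3d1a8f387a9d04a1432b5bb57dec1a112ab6cfe594f67c5e66823",
    ),
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large shards never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory):
    """Return {filename: True/False} for each listed file in `directory`."""
    results = {}
    for name, (size, digest) in EXPECTED.items():
        path = Path(directory) / name
        # Check the cheap size first; only hash files whose size matches.
        results[name] = (
            path.is_file()
            and path.stat().st_size == size
            and sha256_of(path) == digest
        )
    return results


total_bytes = sum(size for size, _ in EXPECTED.values())
print(f"{total_bytes:,} bytes")  # 19,215,448,215 bytes
```

Calling `verify("path/to/download")` returns `False` for any file that is missing, truncated, or altered. The total comes to roughly 19.2 GB on disk.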

Notes

  • Produced on 2026-05-27T12:33:55.921152.
  • Quantized automatically using BitsAndBytes.
  • Intended primarily for research and experimentation.

Citation

Qwen2.5-32B-Instruct

Qwen2.5 Technical Report

License

This model is distributed under the Apache-2.0 license, consistent with the original Qwen2.5-32B-Instruct model.

Model Card Authors

This quantized model was prepared by PBH Applied Systems.

Model provider

pbhappliedsystems

Modalities

Input

Text

Output

Text
