License: apache-2.0
Overview
This checkpoint was quantized using BitsAndBytes and evaluated with standard text similarity metrics.
Model Architecture
| Attribute | Value |
|---|---|
| Model class | Qwen2ForCausalLM |
| Number of parameters | 17,161,065,472 |
| Hidden size | 5120 |
| Number of layers | 64 |
| Attention heads | 40 |
| Vocabulary size | 152064 |
| Compute dtype | bfloat16 |
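As a quick sanity check on the table above, the per-head dimension can be derived from the hidden size and the number of attention heads. This is a minimal sketch: the function and constant names are illustrative, the values are copied from the table, and it assumes the hidden size is split evenly across heads.

```python
# Values copied from the Model Architecture table.
HIDDEN_SIZE = 5120
NUM_ATTENTION_HEADS = 40


def head_dim(hidden_size: int, num_heads: int) -> int:
    """Return the per-head dimension, assuming an even split across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


print(head_dim(HIDDEN_SIZE, NUM_ATTENTION_HEADS))  # 128
```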
Quantization Configuration
```json
{
  "quant_method": "bitsandbytes",
  "_load_in_8bit": false,
  "_load_in_4bit": true,
  "llm_int8_threshold": 6.0,
  "llm_int8_skip_modules": null,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "uint8",
  "load_in_4bit": true,
  "load_in_8bit": false
}
```
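To reproduce these settings on a base checkpoint, the serialized config can be reduced to its public fields. The sketch below uses only the standard library; `to_bnb_kwargs` and `NON_ARGUMENT_KEYS` are hypothetical names, and treating underscore-prefixed keys as private mirrors of the public flags is an assumption. The resulting dictionary should be usable as keyword arguments to `transformers.BitsAndBytesConfig`, but verify that against your installed version.

```python
import json

# The quantization config as published in this card.
QUANT_CONFIG_JSON = """
{"quant_method": "bitsandbytes", "_load_in_8bit": false, "_load_in_4bit": true,
 "llm_int8_threshold": 6.0, "llm_int8_skip_modules": null,
 "llm_int8_enable_fp32_cpu_offload": false, "llm_int8_has_fp16_weight": false,
 "bnb_4bit_quant_type": "nf4", "bnb_4bit_use_double_quant": true,
 "bnb_4bit_compute_dtype": "bfloat16", "bnb_4bit_quant_storage": "uint8",
 "load_in_4bit": true, "load_in_8bit": false}
"""

# Keys that describe the serialized file rather than constructor arguments
# (assumption, not confirmed by this card).
NON_ARGUMENT_KEYS = {"quant_method"}


def to_bnb_kwargs(config: dict) -> dict:
    """Drop `quant_method` and underscore-prefixed keys, keep the rest."""
    return {
        key: value
        for key, value in config.items()
        if not key.startswith("_") and key not in NON_ARGUMENT_KEYS
    }


kwargs = to_bnb_kwargs(json.loads(QUANT_CONFIG_JSON))
print(kwargs["bnb_4bit_quant_type"], kwargs["bnb_4bit_compute_dtype"])

# With transformers installed, one could then try:
#   from transformers import BitsAndBytesConfig
#   bnb_config = BitsAndBytesConfig(**kwargs)
```

Loading this pre-quantized checkpoint does not require passing such a config explicitly; the sketch is only relevant when quantizing a base model with the same settings.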
Intended Use
- Research and experimentation.
- Instruction-following tasks in resource-constrained environments.
- Demonstrations of quantized model capabilities.
Limitations
- May reproduce biases from the original model.
- Quantization may reduce generation diversity and factual accuracy.
- Not intended for production without additional evaluation.
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "pbhappliedsystems/Qwen2.5-32B-Instruct-4bit-20260527_122210"

# Load the tokenizer and the pre-quantized 4-bit checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain the concept of reinforcement learning."
# Move the inputs to the device that device_map="auto" placed the model on.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Generation Settings
This model produces the best results when sampling with:
- temperature: 0.3
- top_p: 0.9
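The settings above can be collected into keyword arguments for `model.generate`. This is a minimal sketch: `build_generation_kwargs` is a hypothetical helper, and `do_sample=True` is added because `transformers` only applies `temperature` and `top_p` when sampling is enabled. The `max_new_tokens` default mirrors the Usage example.

```python
def build_generation_kwargs(max_new_tokens: int = 256) -> dict:
    """Return sampling arguments matching the recommended settings."""
    return {
        "do_sample": True,    # required for temperature/top_p to take effect
        "temperature": 0.3,   # recommended in this card
        "top_p": 0.9,         # recommended in this card
        "max_new_tokens": max_new_tokens,
    }


gen_kwargs = build_generation_kwargs()
print(gen_kwargs)

# With `model` and `inputs` from the Usage section:
#   outputs = model.generate(**inputs, **gen_kwargs)
```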
Model Files Metadata
| Filename | Size (bytes) | SHA-256 |
|---|---|---|
| model-00001-of-00004.safetensors | 4,933,190,348 | b2a0e8a735e99b3a59bb3139541c444808aff3793a28c314c0f02bf17a00b5f7 |
| model-00002-of-00004.safetensors | 4,958,587,236 | fd4b028d13261c8da0e29ed57b95189d666f62f3e8d4ab232c17c4e4e131543a |
| model-00003-of-00004.safetensors | 4,999,136,184 | 0446d1c6da46a5daea91bed161fd62f2f48a658d879f58a14b7ab5528eb66935 |
| model-00004-of-00004.safetensors | 4,324,534,021 | 39002c4ed64520809793fb2b2023caf9bdbf0914feb4786d553c418139457018 |
| quant_config.json | 426 | 1bd2332861a3d1a8f387a9d04a1432b5bb57dec1a112ab6cfe594f67c5e66823 |
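The digests in the table can be used to check a local download. The following is a minimal sketch using only the standard library; `sha256_of` and `verify_directory` are hypothetical helper names, and the directory holding the downloaded files is an assumption you supply.

```python
import hashlib
from pathlib import Path

# Expected SHA-256 digests, copied from the table above.
EXPECTED_SHA256 = {
    "model-00001-of-00004.safetensors": "b2a0e8a735e99b3a59bb3139541c444808aff3793a28c314c0f02bf17a00b5f7",
    "model-00002-of-00004.safetensors": "fd4b028d13261c8da0e29ed57b95189d666f62f3e8d4ab232c17c4e4e131543a",
    "model-00003-of-00004.safetensors": "0446d1c6da46a5daea91bed161fd62f2f48a658d879f58a14b7ab5528eb66935",
    "model-00004-of-00004.safetensors": "39002c4ed64520809793fb2b2023caf9bdbf0914feb4786d553c418139457018",
    "quant_config.json": "1bd2332861a3d1a8f387a9d04a1432b5bb57dec1a112ab6cfe594f67c5e66823",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte shards are never held in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_directory(directory: Path, expected: dict = EXPECTED_SHA256) -> dict:
    """Map each filename to True (match), False (mismatch), or None (missing)."""
    results = {}
    for name, want in expected.items():
        path = directory / name
        results[name] = (sha256_of(path) == want) if path.is_file() else None
    return results
```

Calling `verify_directory(Path("path/to/download"))` returns one entry per file; any `False` or `None` value suggests the file should be downloaded again.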
Notes
- Produced on 2026-05-27T12:33:55.921152.
- Quantized automatically using BitsAndBytes.
- Intended primarily for research and experimentation.
Citation
License
This model is distributed under the apache-2.0 license, consistent with the original base model, Qwen2.5-32B-Instruct.
Model Card Authors
This quantized model was prepared by PBH Applied Systems.