patrickamadeus
Qwen2.5-VL-3B-Instruct-distill
README
License: apache-2.0
Install
bash
pip install -U transformers accelerate qwen-vl-utils
Load
python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "patrickamadeus/Qwen2.5-VL-3B-Instruct-distill"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
Text-only Inference
python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain what this model is useful for in one sentence."},
        ],
    }
]

# Render the chat template to a prompt string, then tokenize it.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=64)
# Keep only the newly generated tokens by slicing off the prompt.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]

response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(response)
Expected output: a short natural-language answer, for example a one-sentence description of the model's use.
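The slicing step in the snippet above is there because `model.generate` returns each prompt followed by its new tokens, so decoding the raw output would repeat the prompt. A minimal sketch of that step on plain Python lists, with made-up token ids and a hypothetical helper name (`strip_prompt_tokens` is not part of any library):

```python
def strip_prompt_tokens(input_ids_batch, generated_ids_batch):
    """Drop the echoed prompt from each generated sequence."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, generated_ids_batch)
    ]


# Toy token ids standing in for a padded batch of two prompts (0 = padding).
prompt_batch = [[0, 11, 12], [21, 22, 23]]
# Each generated row is its prompt row followed by the new tokens.
generated_batch = [[0, 11, 12, 101, 102], [21, 22, 23, 201, 202]]

new_tokens = strip_prompt_tokens(prompt_batch, generated_batch)
print(new_tokens)  # [[101, 102], [201, 202]]
```

Because the batch is padded to a common length, every prompt row has the same length and the slice removes padding and prompt together.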
Image + Text Inference
python
from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Collect the image (and video) inputs referenced in the messages.
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Keep only the newly generated tokens by slicing off the prompt.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]

response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(response)
Expected output: a concise image description, typically mentioning the major objects and scene.
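The `messages` structure is the only part of the image example that changes between requests, so it can help to build it with a small function. The sketch below uses a hypothetical helper, `build_image_message`, which is not part of `transformers` or `qwen-vl-utils`. The `file://` path is a placeholder; it assumes `process_vision_info` accepts local file URIs as well as HTTP URLs, as the upstream Qwen2.5-VL examples show.

```python
def build_image_message(image_ref, prompt):
    """Build a single-turn user message pairing one image with a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Placeholder path: replace with a real image location before running inference.
messages = build_image_message("file:///path/to/your/image.jpg", "Describe this image.")
print(messages[0]["content"][1]["text"])  # Describe this image.
```

The resulting `messages` list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the example above.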
Source
- Base model: Qwen/Qwen2.5-VL-3B-Instruct
- Converted checkpoint: patrickamadeus/qwen2_5vl-distill-full-no-bridge-fixed-1000
Model provider: patrickamadeus
Model tree: base Qwen/Qwen2.5-VL-3B-Instruct; fine-tuned: this model
Modalities: Text and Image input; Text output