patrickamadeus

Qwen2.5-VL-3B-Instruct-distill

README

License: apache-2.0

Install

bash

pip install -U transformers accelerate qwen-vl-utils
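After installing, it can help to confirm which versions were actually picked up, since the `Qwen2_5_VLForConditionalGeneration` class used below exists only in recent `transformers` releases. The following is a minimal sketch using the standard library; the helper name `installed_version` is an illustrative assumption, not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names taken from the pip command above.
for name in ("transformers", "accelerate", "qwen-vl-utils"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```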

Load

python

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "patrickamadeus/Qwen2.5-VL-3B-Instruct-distill"

# torch_dtype="auto" picks the dtype from the checkpoint;
# device_map="auto" lets accelerate place the weights on the available devices.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

Text-only Inference

python

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain what this model is useful for in one sentence."},
        ],
    }
]

# Render the chat template to a prompt string, then tokenize it.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=64)

# generate() returns prompt + completion; keep only the newly generated tokens.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(response)

Expected output: a short natural-language answer, for example a one-sentence description of the model's use.
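The list comprehension in the snippet above exists because `generate` returns each sequence with the prompt tokens still at the front, so the prompt has to be sliced off before decoding. A minimal sketch of that step in isolation follows; `trim_prompt_tokens` is a hypothetical helper name, and plain Python lists with made-up token IDs stand in for the tensors.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt from each generated sequence.

    Both arguments are batches (a sequence of token-ID sequences), matching
    inputs.input_ids and the output of model.generate in the snippet above.
    """
    return [
        output_ids[len(prompt_ids):]
        for prompt_ids, output_ids in zip(input_ids, generated_ids)
    ]


# Toy batch with made-up token IDs: two prompts of length 3,
# each followed by two "generated" tokens.
prompts = [[101, 7, 8], [101, 9, 10]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]
print(trim_prompt_tokens(prompts, outputs))  # [[42, 43], [44, 45]]
```

The same slicing works on tensors, because indexing a row with `row[n:]` behaves the same way as it does on a list.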

Image + Text Inference

python

from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template, then load the images/videos referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)

# generate() returns prompt + completion; keep only the newly generated tokens.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(response)

Expected output: a concise image description, typically mentioning the major objects and scene.
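The text-only and image + text examples differ only in the `content` list of the user message. As a sketch of how the two cases could share one code path, the hypothetical helper `build_messages` below (not part of `transformers` or `qwen-vl-utils`) builds the same message structure used in both snippets.

```python
def build_messages(prompt, image=None):
    """Build a single-turn chat in the format used by the examples above.

    prompt: the user's text.
    image:  optional image reference (for example a URL); omitted for text-only use.
    """
    content = []
    if image is not None:
        # The image entry goes before the text, as in the example above.
        content.append({"type": "image", "image": image})
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


text_only = build_messages("Explain what this model is useful for in one sentence.")
with_image = build_messages(
    "Describe this image.",
    image="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
)
print(len(text_only[0]["content"]), len(with_image[0]["content"]))  # 1 2
```

Either result can be passed to `processor.apply_chat_template` in place of the hand-written `messages` list.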

Source

  • Base model: Qwen/Qwen2.5-VL-3B-Instruct
  • Converted checkpoint: patrickamadeus/qwen2_5vl-distill-full-no-bridge-fixed-1000

Modalities

  • Input: Text, Image
  • Output: Text
