License: apache-2.0
Model Performance
Multimodal performance

Pure text performance

Quickstart
Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers.
Qwen3-VL support has been merged into the latest Hugging Face transformers source. Because the release that includes it (4.57.0) is not yet available, we advise building from source with the following command:
```bash
pip install git+https://github.com/huggingface/transformers
# pip install transformers==4.57.0 # currently, V4.57.0 is not released
```
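To confirm that an environment has a recent enough transformers build, a small version check can help. The sketch below assumes that 4.57.0 is the first release line that includes Qwen3-VL, which is what the commented-out pip command suggests but does not state outright. The helper names `parse_version` and `has_qwen3_vl_support` are hypothetical and not part of any library.

```python
import re

# Assumption: 4.57.0 is the first release line that ships Qwen3-VL,
# as the commented-out pip command suggests.
MIN_VERSION = (4, 57, 0)


def parse_version(version: str) -> tuple:
    """Return the leading numeric components, e.g. '4.57.0.dev0' -> (4, 57, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def has_qwen3_vl_support(version: str) -> bool:
    """Hypothetical helper: compare a version string against MIN_VERSION."""
    return parse_version(version) >= MIN_VERSION


# Typical use, once transformers is installed:
#   import transformers
#   print(has_qwen3_vl_support(transformers.__version__))
```

A source build usually reports a development version string such as `4.57.0.dev0`, which this check accepts because only the numeric prefix is compared.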
Using 🤗 Transformers to Chat
The following snippet shows how to chat with the model using transformers:
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# default: Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen3-VL-8B-Instruct",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
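The trimming step in the snippet above is easy to overlook: `generate` returns the prompt tokens followed by the newly generated ones, so the prompt prefix is sliced off before decoding. The following minimal sketch reproduces that slicing with plain Python lists. The token ids are made up for illustration, and `trim_prompt` is a hypothetical helper name.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence.

    Mirrors the list comprehension used in the snippet above: each output
    row starts with its input row, so slicing at len(input) keeps only
    the newly generated tokens.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token ids: two prompts of three tokens each,
# each followed by two generated tokens.
input_ids = [[101, 7, 8], [101, 9, 10]]
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]

new_tokens = trim_prompt(input_ids, generated_ids)
print(new_tokens)  # [[42, 43], [44, 45]]
```

Without this step, the decoded text would repeat the prompt before the model's answer.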
Generation Hyperparameters
VL
```bash
export greedy='false'
export top_p=0.8
export top_k=20
export temperature=0.7
export repetition_penalty=1.0
export presence_penalty=1.5
export out_seq_length=16384
```
Text
```bash
export greedy='false'
export top_p=1.0
export top_k=40
export repetition_penalty=1.0
export presence_penalty=2.0
export temperature=1.0
export out_seq_length=32768
```
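The exported variables above do not say how they map onto a particular inference API. As a rough sketch, they could be translated into keyword arguments for `model.generate` in transformers as shown below. Two assumptions are made here: `out_seq_length` is treated as a cap on newly generated tokens (`max_new_tokens`), and `presence_penalty` is left out because, to our knowledge, `generate` has no argument of that name (it is more commonly offered by OpenAI-compatible serving stacks). The helper `to_generate_kwargs` is hypothetical.

```python
def to_generate_kwargs(settings: dict) -> dict:
    """Translate the exported settings into kwargs for `model.generate`.

    Assumptions (not confirmed by the model card):
    - out_seq_length is interpreted as max_new_tokens;
    - presence_penalty is dropped, since it has no same-named
      argument in transformers' generate.
    """
    return {
        "do_sample": settings["greedy"].lower() != "true",
        "top_p": float(settings["top_p"]),
        "top_k": int(settings["top_k"]),
        "temperature": float(settings["temperature"]),
        "repetition_penalty": float(settings["repetition_penalty"]),
        "max_new_tokens": int(settings["out_seq_length"]),
    }


# Values copied from the VL and Text settings above, kept as strings
# because environment variables are always strings.
VL_SETTINGS = {
    "greedy": "false", "top_p": "0.8", "top_k": "20", "temperature": "0.7",
    "repetition_penalty": "1.0", "presence_penalty": "1.5",
    "out_seq_length": "16384",
}
TEXT_SETTINGS = {
    "greedy": "false", "top_p": "1.0", "top_k": "40", "temperature": "1.0",
    "repetition_penalty": "1.0", "presence_penalty": "2.0",
    "out_seq_length": "32768",
}

vl_kwargs = to_generate_kwargs(VL_SETTINGS)
text_kwargs = to_generate_kwargs(TEXT_SETTINGS)
print(vl_kwargs)
```

The resulting dictionary could then be passed as `model.generate(**inputs, **vl_kwargs)`. Note that the quickstart uses `max_new_tokens=128`, so a 16384-token cap may run far longer.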
[ more to come ]
Model provider: gdang333
Model tree
Base: coder3101/Qwen3-VL-8B-Instruct-heretic
Fine-tuned: this model
Modalities
Input: Text, Image
Output: Text