mokau
Zero-To-CAD-Qwen3-VL-2B
License: apache-2.0
Related Resources
| Resource | Link |
|---|---|
| 📄 Paper | Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data |
| 📦 Zero-to-CAD 1M (full dataset) | ADSKAILab/Zero-To-CAD-1m |
| 📦 Zero-to-CAD 100K (curated subset) | ADSKAILab/Zero-To-CAD-100k |
| 🤖 Fine-tuned Model (this model) | You are here |
| 🗂️ Collection | ADSKAILab/Zero-To-CAD |
Model Description
This model is a full fine-tune of Qwen3-VL-2B-Instruct. It takes 8 rendered views of a 3D shape (4 front and 4 rear, each 256×256) and generates executable CadQuery Python code that reproduces the geometry.
The model was trained entirely on synthetic data from Zero-to-CAD 1M (979,633 training samples) — no real-world CAD files were used.
Key Results
| Benchmark | Success Rate | Mean IoU | Median IoU | P90 IoU |
|---|---|---|---|---|
| Zero-to-CAD test | 82.1% | 0.747 | 0.847 | 0.999 |
| ABC (out-of-distribution) | 61.0% | 0.377 | 0.303 | 0.854 |
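The columns above can be derived from per-sample scores along the following lines. This is a minimal sketch, not the authors' evaluation code: `summarize` and `percentile` are hypothetical helpers, the nearest-rank percentile is one of several common definitions, and the sketch assumes the IoU statistics cover successful generations only. The card does not state that convention, but the base-model row of the baseline comparison (6.6% success with a 0.184 mean IoU) could not occur if failures were scored as zero.

```python
import math
import statistics
from typing import Dict, List, Optional


def percentile(sorted_values: List[float], q: float) -> float:
    """Nearest-rank percentile of an ascending, non-empty list, q in (0, 100]."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(ious: List[Optional[float]]) -> Dict[str, float]:
    """Aggregate per-sample IoUs; None marks a generation that failed to execute."""
    if not ious:
        raise ValueError("need at least one sample")
    # Assumption: IoU statistics are taken over successful generations only.
    scored = sorted(v for v in ious if v is not None)
    summary = {"success_rate": len(scored) / len(ious)}
    if scored:
        summary["mean_iou"] = statistics.fmean(scored)
        summary["median_iou"] = statistics.median(scored)
        summary["p90_iou"] = percentile(scored, 90)
    else:
        nan = float("nan")
        summary.update(mean_iou=nan, median_iou=nan, p90_iou=nan)
    return summary


# Made-up scores for illustration only; these are not benchmark values.
example = [0.9, None, 0.5, 1.0, 0.7, None, 0.8, 0.95, 0.6, 0.85]
print(summarize(example))
```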
Comparison with Baselines
| Model | Zero-to-CAD Success | Zero-to-CAD Mean IoU | ABC Success | ABC Mean IoU |
|---|---|---|---|---|
| This model | 82.1% | 0.747 | 61.0% | 0.377 |
| GPT-5.2 High | 72.2% | 0.485 | 66.2% | 0.344 |
| GPT-5.2 Medium | 71.1% | 0.495 | 62.6% | 0.346 |
| Qwen3-VL-2B (base) | 6.6% | 0.184 | 5.4% | 0.131 |
Quick Start
Inference
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from datasets import load_dataset
from PIL import Image
import io

model_name = "ADSKAILab/Zero-To-CAD-Qwen3-VL-2B"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Load 8 rendered views from the dataset
ds = load_dataset("ADSKAILab/Zero-To-CAD-1m", split="train", streaming=True)
sample = next(iter(ds))
views = [
    Image.open(io.BytesIO(sample[f"image_{i}"])) if isinstance(sample[f"image_{i}"], bytes)
    else sample[f"image_{i}"]
    for i in range(8)
]

# Or load 8 views from local files:
# views = [Image.open(f"view_{i}.png") for i in range(8)]

messages = [
    {
        "role": "system",
        "content": "You are a CAD code assistant. Given multiple rendered views of a 3D shape, generate clean, well-structured CadQuery Python code that accurately reproduces the geometry.",
    },
    {
        "role": "user",
        "content": [
            *[{"type": "image", "image": view} for view in views],
            {"type": "text", "text": "Generate CadQuery code for this shape."},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=views, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
output_text = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(output_text)
```
Execute the generated code
```python
import cadquery as cq

exec(output_text)
# `result` contains the reconstructed CadQuery solid

# Export
cq.exporters.export(result, "output.step")
cq.exporters.export(result, "output.stl")
```
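Calling `exec` directly on the generated text runs it in the caller's own namespace. The sketch below shows one way to isolate that step. `strip_code_fence` and `run_generated_code` are hypothetical helpers, not part of CadQuery or Transformers, and the fence stripping assumes the model might wrap its answer in a Markdown code fence, which the card does not state.

```python
import re
from typing import Any, Dict

FENCE_MARK = "`" * 3  # a Markdown code-fence marker, built here to keep this block readable
FENCE = re.compile(
    r"^" + FENCE_MARK + r"[\w+-]*[ \t]*\n(.*?)\n" + FENCE_MARK + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(text: str) -> str:
    """Return the body of a single Markdown code fence, or the stripped text unchanged."""
    stripped = text.strip()
    match = FENCE.match(stripped)
    return match.group(1) if match else stripped


def run_generated_code(text: str, result_name: str = "result") -> Any:
    """Execute generated code in a fresh namespace and return the object named `result`."""
    namespace: Dict[str, Any] = {"__name__": "__generated__"}
    # exec runs arbitrary code: use it only on output you are willing to trust,
    # or move this call into a sandboxed subprocess.
    exec(strip_code_fence(text), namespace)
    if result_name not in namespace:
        raise KeyError(f"generated code did not define {result_name!r}")
    return namespace[result_name]
```

With this in place, `result = run_generated_code(output_text)` would stand in for the bare `exec(output_text)` call, and the export calls stay the same.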
Training Details
| Hyperparameter | Value |
|---|---|
| Base model | Qwen3-VL-2B-Instruct |
| Training mode | Full fine-tuning |
| Max sequence length | 4,096 tokens |
| Optimizer | AdamW |
| Learning rate | 1 × 10⁻⁴ |
| Weight decay | 0.0 |
| LR scheduler | Cosine |
| Warmup ratio | 0.03 |
| Attention dropout | 0.1 |
| GPUs | 16 × NVIDIA H100 80GB |
| Per-GPU batch size | 1 |
| Effective batch size | 16 |
| Epochs | 3 |
| Precision | bfloat16 |
| Distributed strategy | DDP |
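The table's values fix the step counts and the learning-rate curve. This is a rough sketch under stated assumptions: no gradient accumulation (consistent with 16 GPUs × 1 per-GPU batch = 16 effective), the last partial batch of each epoch is kept, warmup is linear, and the cosine decays to zero. The card specifies none of these details, and `lr_at` is a hypothetical helper.

```python
import math

TRAIN_SAMPLES = 979_633   # training samples, from the model description
NUM_GPUS = 16
PER_GPU_BATCH = 1
GRAD_ACCUM = 1            # assumption: implied by 16 x 1 = 16 effective batch size
EPOCHS = 3
PEAK_LR = 1e-4
WARMUP_RATIO = 0.03

effective_batch = NUM_GPUS * PER_GPU_BATCH * GRAD_ACCUM
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)  # assumption: keep last partial batch
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch {effective_batch}, {total_steps} steps, {warmup_steps} warmup steps")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>7}: lr = {lr_at(step):.2e}")
```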
Evaluation Protocol
- Metric: Voxelized IoU at 64³ resolution between generated and ground-truth solids
- Rotational alignment: Maximum IoU over 45° rotation increments
- Success rate: Percentage of generations producing valid, executable CadQuery code
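The metric described above can be illustrated with a small pure-Python sketch. It is not the authors' evaluation code. It assumes shapes are given as point-membership functions normalised to fit inside the cube [-1, 1]³, and that the 45° increments are rotations about a single axis (z here). The card does not say which axes are searched, and all function names are hypothetical.

```python
import math
from itertools import product
from typing import Callable, FrozenSet, Tuple

Voxel = Tuple[int, int, int]
Occupancy = Callable[[float, float, float], bool]


def voxelize(inside: Occupancy, res: int = 64, angle_deg: float = 0.0) -> FrozenSet[Voxel]:
    """Occupied voxels of a shape on a res^3 grid over [-1, 1]^3.

    The shape is rotated by angle_deg about the z-axis before sampling.
    """
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    centres = [-1.0 + (2 * i + 1) / res for i in range(res)]
    filled = set()
    for i, j, k in product(range(res), repeat=3):
        x, y, z = centres[i], centres[j], centres[k]
        # Rotating the shape by +angle equals rotating the query point by -angle.
        if inside(c * x + s * y, -s * x + c * y, z):
            filled.add((i, j, k))
    return frozenset(filled)


def iou(a: FrozenSet[Voxel], b: FrozenSet[Voxel]) -> float:
    """Intersection over union of two voxel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def best_iou(pred: Occupancy, target: Occupancy, res: int = 64, step_deg: int = 45) -> float:
    """Maximum IoU over rotations of the prediction in step_deg increments."""
    target_voxels = voxelize(target, res)
    return max(
        iou(voxelize(pred, res, angle), target_voxels)
        for angle in range(0, 360, step_deg)
    )


def bar_along_x(x: float, y: float, z: float) -> bool:
    return abs(x) <= 0.5 and abs(y) <= 0.25 and abs(z) <= 0.25


def bar_along_y(x: float, y: float, z: float) -> bool:
    return abs(x) <= 0.25 and abs(y) <= 0.5 and abs(z) <= 0.25


# A coarse grid keeps the demo fast; the protocol above uses res=64.
print("unaligned IoU:", iou(voxelize(bar_along_x, 16), voxelize(bar_along_y, 16)))
print("best IoU over 45-degree steps:", best_iou(bar_along_x, bar_along_y, res=16))
```

The two bars differ only by a 90° turn, so the rotation search recovers a full match that a single unaligned comparison would miss.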
Intended Uses
- Image-to-CAD reconstruction — reconstruct editable parametric CAD from rendered views
- Research baseline — starting point for Image-to-Sequence CAD generation research
- Integration — combine with rendering pipelines for end-to-end 3D reconstruction
Limitations
- Trained on synthetic data only; may struggle with photorealistic or noisy inputs
- Expects 8 clean rendered views at 256×256 — other configurations are untested
- Outputs CadQuery code only; other CAD formats require post-processing
- Complex multi-part assemblies may exceed the 4,096 token context window
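A generation cut off at the token limit usually stops mid-statement, so a cheap syntax check before execution can flag it. This is a minimal sketch: `check_generated_code` is a hypothetical helper, and it relies on the convention from the Quick Start that the generated script assigns its solid to a top-level variable named `result`. It does not catch annotated or nested assignments.

```python
import ast


def check_generated_code(code: str, result_name: str = "result") -> str:
    """Classify generated code as 'ok', 'syntax_error', or 'missing_result'."""
    try:
        tree = ast.parse(code)  # parses only; nothing is imported or executed
    except SyntaxError:
        # Typical for output truncated at max_new_tokens.
        return "syntax_error"
    assigned = {
        target.id
        for node in tree.body
        if isinstance(node, ast.Assign)
        for target in node.targets
        if isinstance(target, ast.Name)
    }
    return "ok" if result_name in assigned else "missing_result"


complete = 'import cadquery as cq\nresult = cq.Workplane("XY").box(1, 2, 3)\n'
truncated = 'import cadquery as cq\nresult = cq.Workplane("XY").box(1, 2,'
print(check_generated_code(complete))
print(check_generated_code(truncated))
```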
Citation
If you use this model, please cite:
```bibtex
@misc{ataei2026zerotocadagenticsynthesisinterpretable,
  title={Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data},
  author={Mohammadmehdi Ataei and Farzaneh Askari and Kamal Rahimi Malekshan and Pradeep Kumar Jayaraman},
  year={2026},
  eprint={2604.24479},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.24479}
}
```
License
This model is released under the Apache License 2.0.
Model provider
mokau
Model tree
- Base: Qwen/Qwen3-VL-2B-Instruct
- Fine-tuned: this model
Modalities
- Input: Text, Image
- Output: Text