License: MIT
Model description
The model translates a 256×256 semiconductor SEM image into a NetDSL-L2 program — a Domain-Specific Language describing Manhattan-routed circuit layouts as a sequence of CANVAS, WIRE, and VIA commands. Rendering the generated DSL reproduces the input geometry as a binary mask, enabling controlled augmentation, parameter editing, and downstream metrology.
- Base model: Qwen/Qwen3-VL-8B-Instruct
- Fine-tuning: full SFT (vision encoder, multimodal projector, and language model are all trainable)
- Training data: 18,900 synthetic (image, DSL) pairs generated by the DSL renderer in src/dsl2_dataset_v3.py; topology mixture of Vertical Stripes, Horizontal Lines, and Manhattan layouts
- Optimization: 3 epochs, batch size 8 × grad-accum 12 (effective 96), LR 2.0e-5 cosine + 10% warmup, weight decay 0.01, pure BF16 + gradient checkpointing
- Hardware: single NVIDIA H200 (141 GB)
- Final training loss: 0.2832
- Chat template: qwen3_vl_nothink (no reasoning tokens emitted)
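The optimization settings above imply a certain number of optimizer steps and a learning-rate curve. The following is a minimal sketch of that arithmetic, not the training script. It assumes all 18,900 pairs are used for training, that a partial final batch still counts as one step, that warmup is linear, and that the cosine decays to zero. The function name `lr_at` is hypothetical.

```python
import math

# Values taken from the list above.
NUM_SAMPLES = 18_900
BATCH_SIZE = 8
GRAD_ACCUM = 12
EPOCHS = 3
PEAK_LR = 2.0e-5
WARMUP_RATIO = 0.10

effective_batch = BATCH_SIZE * GRAD_ACCUM                    # 8 x 12
steps_per_epoch = math.ceil(NUM_SAMPLES / effective_batch)   # assumption: partial batch kept
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step (assumed: linear warmup, cosine to zero)."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the run has 197 steps per epoch, 591 steps in total, and 59 warmup steps.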
Intended use
Input a binarized (global threshold 100) SEM image of a circuit pattern; the model emits NetDSL-L2 code. Render the code with src/pattern_dsl.py from the companion repo to obtain a reconstructed binary mask.
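The binarization step can be sketched as follows. This is an illustration rather than the preprocessing code from the companion repo: the helper names `binarize` and `load_binarized` are hypothetical, and the polarity (pixels strictly above the threshold become white) is an assumption, since the card only states the threshold value.

```python
import numpy as np


def binarize(gray, threshold=100):
    """Global threshold: 255 where the pixel is above `threshold`, else 0.

    Assumption: strict '>' comparison and bright-is-foreground polarity.
    """
    arr = np.asarray(gray, dtype=np.uint8)
    return np.where(arr > threshold, 255, 0).astype(np.uint8)


def load_binarized(path, threshold=100):
    """Load a grayscale SEM image and return a binarized RGB PIL image."""
    from PIL import Image  # imported here so `binarize` works without Pillow

    gray = Image.open(path).convert("L")
    return Image.fromarray(binarize(gray, threshold)).convert("RGB")
```

The image returned by `load_binarized` can be passed to the processor in place of a pre-thresholded file.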
Evaluation results (MIIC, 1034 test images)
Mean ± std over executable outputs. Binary input (proposed) ↔ Raw input (baseline):
| Metric | Raw | Binary (ours) |
|---|---|---|
| IoU | 0.2865 ± 0.0802 | 0.3619 ± 0.0882 |
| Dice coefficient | 0.4393 ± 0.0980 | 0.5256 ± 0.0912 |
| BF1 @ 2 px | 0.4054 ± 0.1106 | 0.4412 ± 0.1098 |
| SkF1 @ 1 px | 0.1276 ± 0.0960 | 0.1746 ± 0.1145 |
| ASSD | 4.7768 ± 3.4596 | 4.1327 ± 1.2757 |
Executable rate: 957/1034 (binary) and 1019/1034 (raw).
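For orientation, the two overlap metrics and the executable rates can be computed as in the sketch below. It uses the standard definitions of IoU and Dice on binary masks and is not the evaluation code behind the table; the function name `iou_and_dice` and the convention of returning 1.0 when both masks are empty are assumptions. Because the means are taken over executable outputs only, the two columns average over different subsets (957 and 1019 images).

```python
import numpy as np


def iou_and_dice(pred, target):
    """IoU and Dice coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Assumption: two empty masks count as a perfect match.
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)


# Executable rates reported above, as percentages.
for name, ok in (("binary", 957), ("raw", 1019)):
    print(f"{name}: {ok}/1034 = {ok / 1034:.1%}")
```

The loop prints roughly 92.6% for binary input and 98.5% for raw input.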
Usage
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import torch

REPO = "utsubo12/qwen3-vl-8b-netdsl-l2"
model = AutoModelForImageTextToText.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, device_map="cuda"
)
processor = AutoProcessor.from_pretrained(REPO)

# Real SEM image, globally thresholded at 100 (binary preprocessing).
image = Image.open("test_normal_00169_binary.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Reconstruct this circuit pattern in NetDSL-L2."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt",
    tokenize=True, return_dict=True,
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
dsl_code = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(dsl_code)
```
Render the predicted DSL with the helpers in the companion repo:
```python
from src.pattern_dsl import parse_and_render

mask = parse_and_render(dsl_code, canvas=(256, 256))
```
NetDSL-L2 example
```markdown
CANVAS 256 256
WIRE 0 12 8 0 0 0 0 H 256
WIRE 12 0 10 0 0 30 14 V 60
VIA 35 40 6
```
WIRE(x0, y0, w_base, l_s, w_s, l_e, w_e, segments) describes a wire with optional dogbone end-caps; segments is a chain of H <length> / V <length> relative moves. See §III-A of the paper.
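To make the WIRE layout concrete, here is a minimal parsing sketch. It is not the parser in src/pattern_dsl.py, and the grammar is inferred only from the signature and example in this card. It assumes one command per line, integer arguments, positive lengths moving toward increasing x (H) or y (V), and (x0, y0) as the start of the chain of moves. VIA arguments are kept as raw integers because their meaning is not specified here. The names `Wire`, `walk_moves`, and `parse_program` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Wire:
    x0: int
    y0: int
    w_base: int
    l_s: int
    w_s: int
    l_e: int
    w_e: int
    segments: list  # [("H" or "V", length), ...]

    def walk_moves(self):
        """Points visited by applying the relative H/V moves from (x0, y0)."""
        x, y = self.x0, self.y0
        points = [(x, y)]
        for axis, length in self.segments:
            if axis == "H":
                x += length
            else:
                y += length
            points.append((x, y))
        return points


def parse_program(text):
    """Split a NetDSL-L2 string into (canvas, wires, vias)."""
    canvas, wires, vias = None, [], []
    for line in text.strip().splitlines():
        tokens = line.split()
        if not tokens:
            continue
        op, args = tokens[0], tokens[1:]
        if op == "CANVAS":
            canvas = (int(args[0]), int(args[1]))
        elif op == "WIRE":
            head, tail = args[:7], args[7:]
            bad_axis = any(a not in ("H", "V") for a in tail[0::2])
            if len(head) < 7 or len(tail) % 2 or bad_axis:
                raise ValueError(f"malformed WIRE: {line!r}")
            segments = [(tail[i], int(tail[i + 1])) for i in range(0, len(tail), 2)]
            wires.append(Wire(*(int(a) for a in head), segments))
        elif op == "VIA":
            vias.append(tuple(int(a) for a in args))
        else:
            raise ValueError(f"unknown command: {op!r}")
    return canvas, wires, vias


example = """CANVAS 256 256
WIRE 0 12 8 0 0 0 0 H 256
WIRE 12 0 10 0 0 30 14 V 60
VIA 35 40 6"""
canvas, wires, vias = parse_program(example)
print(canvas, [w.walk_moves() for w in wires], vias)
```

A structured form like this is one way to edit parameters (for example `w_base`) before re-serializing and rendering, which is the kind of controlled augmentation the model description mentions.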
Limitations
- Trained only on synthetic Manhattan-style layouts; non-Manhattan or analog layouts are out of distribution.
- At inference time, real SEM images must be binarized (global threshold ≈ 100) to obtain the reported numbers; raw grayscale input significantly degrades quality.
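- The model produces NetDSL-L2 strings up to ~2048 tokens. Very dense layouts may be truncated. The Usage example sets max_new_tokens=1024, so longer programs are cut off unless that limit is raised.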
- Reconstruction quality decreases as pattern complexity (compressed DSL length) grows; see Fig. 4 of the paper.
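Truncation can be detected before rendering by checking whether generation used its whole token budget without emitting an end-of-sequence token. The helper below is a sketch for a single unpadded sequence; the name `hit_token_budget` is hypothetical, and which end-of-sequence ids apply is an assumption to verify against the model's generation config.

```python
def hit_token_budget(generated_ids, eos_token_ids, max_new_tokens):
    """True if the sequence filled the budget and never emitted an end token."""
    ids = list(generated_ids)
    ended = any(token in eos_token_ids for token in ids)
    return len(ids) >= max_new_tokens and not ended


# Hypothetical wiring with the Usage snippet (not executed here):
#   new_ids = out[0, inputs["input_ids"].shape[1]:].tolist()
#   eos = model.generation_config.eos_token_id   # may be an int or a list
#   eos_ids = set(eos) if isinstance(eos, (list, tuple)) else {eos}
#   if hit_token_budget(new_ids, eos_ids, 1024):
#       print("output likely truncated; raise max_new_tokens")
```

A truncated program may fail to parse or may render only part of the layout, so it is worth flagging such outputs instead of scoring them silently.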
Citation
```bibtex
@inproceedings{ohtsubo2026bridging,
  author    = {Ohtsubo, Yusuke and Dohi, Kota and Yawata, Koichiro and
               Takeshita, Koki and Sasaki, Tatsuya},
  title     = {Bridging the Sim-to-Real Gap in Semiconductor Visual Program
               Synthesis via Input Binarization},
  booktitle = {Proceedings of the 34th European Signal Processing Conference (EUSIPCO)},
  year      = {2026},
  publisher = {EURASIP},
  note      = {Accepted; final citation/DOI to be updated upon publication}
}
```
License
MIT for both the code and these weights.
The base model Qwen3-VL-8B-Instruct is subject to its own license; review it before use.
Contact
Yusuke Ohtsubo — yusuke.ohtsubo.nb@hitachi.com