utsubo12

qwen3-vl-8b-netdsl-l2

Model description

The model translates a 256×256 semiconductor SEM image into a NetDSL-L2 program. NetDSL-L2 is a domain-specific language (DSL) that describes Manhattan-routed circuit layouts as a sequence of CANVAS, WIRE, and VIA commands. Rendering the generated DSL reproduces the input geometry as a binary mask, which enables controlled augmentation, parameter editing, and downstream metrology.

Base model: Qwen/Qwen3-VL-8B-Instruct
Fine-tuning: full SFT (vision encoder, multimodal projector, and language model are all trainable)
Training data: 18,900 synthetic (image, DSL) pairs generated by the DSL renderer in src/dsl2_dataset_v3.py; topology mixture of Vertical Stripes, Horizontal Lines, and Manhattan layouts
Optimization: 3 epochs, batch size 8 × grad-accum 12 (effective 96), LR 2.0e-5 cosine + 10% warmup, weight decay 0.01, pure BF16 + gradient checkpointing
Hardware: single NVIDIA H200 (141 GB)
Final training loss: 0.2832
Chat template: qwen3_vl_nothink (no reasoning tokens emitted)
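
To make the optimization settings concrete, the sketch below works out the step counts and the learning-rate curve they imply. It rests on assumptions the card does not confirm: that all 18,900 pairs are used for training, that the last partial batch is kept, and that the schedule is linear warmup followed by cosine decay to zero. The function name `lr_at` is illustrative.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 2.0e-5,
          warmup_ratio: float = 0.10) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


num_pairs = 18_900                 # assumption: every pair is a training sample
effective_batch = 8 * 12           # batch size x grad-accum = 96
steps_per_epoch = math.ceil(num_pairs / effective_batch)  # keeps the last partial batch
total_steps = 3 * steps_per_epoch  # 3 epochs
warmup_steps = int(total_steps * 0.10)
```

Under these assumptions there are 197 optimizer steps per epoch and 591 in total, of which the first 59 are warmup.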

Intended use

Input a binarized (global threshold 100) SEM image of a circuit pattern; the model emits NetDSL-L2 code. Render the code with src/pattern_dsl.py from the companion repo to obtain a reconstructed binary mask.
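
A minimal sketch of the binarization step, assuming an 8-bit grayscale image held in a NumPy array. The card gives only the threshold value, so the strict `>` comparison and the mapping of bright pixels to 255 are assumptions; the helper names `binarize` and `to_rgb` are illustrative and not part of the companion repo.

```python
import numpy as np


def binarize(gray: np.ndarray, threshold: int = 100) -> np.ndarray:
    """Global threshold: pixels above `threshold` become 255, the rest 0."""
    if gray.ndim != 2:
        raise ValueError("expected a single-channel (H, W) image")
    # Assumption: strict '>' and bright-is-foreground polarity.
    return np.where(gray > threshold, 255, 0).astype(np.uint8)


def to_rgb(mask: np.ndarray) -> np.ndarray:
    """Replicate the binary mask over three channels for an RGB input."""
    return np.repeat(mask[:, :, None], 3, axis=2)


# Tiny synthetic example: values at or below 100 map to 0.
gray = np.array([[0, 100], [101, 255]], dtype=np.uint8)
binary = binarize(gray)
```

With Pillow, a grayscale array can be obtained via `np.asarray(Image.open(path).convert("L"))`, and `Image.fromarray(to_rgb(binary))` turns the result back into an image for the processor.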

Evaluation results (MIIC, 1034 test images)

Mean ± std over executable outputs, comparing binarized input (proposed) with raw grayscale input (baseline). Higher is better for every metric except ASSD, where lower is better.

| Metric | Raw | Binary (ours) |
|---|---|---|
| IoU | 0.2865 ± 0.0802 | 0.3619 ± 0.0882 |
| Dice coefficient | 0.4393 ± 0.0980 | 0.5256 ± 0.0912 |
| BF1 @ 2 px | 0.4054 ± 0.1106 | 0.4412 ± 0.1098 |
| SkF1 @ 1 px | 0.1276 ± 0.0960 | 0.1746 ± 0.1145 |
| ASSD | 4.7768 ± 3.4596 | 4.1327 ± 1.2757 |

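Executable rate: 957/1034 (92.6%) for binary input and 1019/1034 (98.5%) for raw input. Because the metrics are averaged over executable outputs only, the two columns are computed over different subsets of the test set.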
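
For reference, the two region-overlap metrics can be computed from a rendered mask and a ground-truth mask as sketched below. These are the standard definitions; the exact evaluation code behind the reported numbers is not shown in this card, and returning 1.0 when both masks are empty is an assumed convention.

```python
import numpy as np


def iou(pred, target) -> float:
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks match perfectly
    return float(np.logical_and(pred, target).sum() / union)


def dice(pred, target) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # same assumed convention as above
    return float(2 * np.logical_and(pred, target).sum() / denom)


pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
scores = (iou(pred, target), dice(pred, target))
```

In the toy example the masks share one pixel out of three in the union, so IoU is 1/3 and Dice is 0.5.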

Usage

python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import torch

REPO = "utsubo12/qwen3-vl-8b-netdsl-l2"

model = AutoModelForImageTextToText.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, device_map="cuda"
)
processor = AutoProcessor.from_pretrained(REPO)

# Real SEM image, globally thresholded at 100 (binary preprocessing).
image = Image.open("test_normal_00169_binary.png").convert("RGB")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text",
         "text": "Reconstruct this circuit pattern in NetDSL-L2."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt",
    tokenize=True, return_dict=True,
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
dsl_code = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(dsl_code)

Render the predicted DSL with the helpers in the companion repo:

python
from src.pattern_dsl import parse_and_render
mask = parse_and_render(dsl_code, canvas=(256, 256))
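
Since the executable rate reported above is below 100%, some generated programs will not render. The sketch below wraps an arbitrary renderer so that a failure yields `None` instead of aborting a batch. It assumes the renderer signals an invalid program by raising an exception, which this card does not confirm for `parse_and_render`; `try_render`, `executable_rate`, and the stand-in `fake_render` are illustrative names.

```python
from typing import Callable, Optional, Sequence, TypeVar

Mask = TypeVar("Mask")


def try_render(render: Callable[[str], Mask], dsl_code: str) -> Optional[Mask]:
    """Return the rendered mask, or None if the program fails to execute."""
    try:
        return render(dsl_code)
    except Exception:  # assumption: bad programs make the renderer raise
        return None


def executable_rate(render: Callable[[str], Mask],
                    programs: Sequence[str]) -> float:
    """Fraction of programs that render without raising."""
    if not programs:
        raise ValueError("no programs given")
    rendered = [try_render(render, p) for p in programs]
    return sum(m is not None for m in rendered) / len(programs)


def fake_render(dsl_code: str) -> str:
    """Stand-in renderer for this example; not the companion-repo function."""
    if not dsl_code.startswith("CANVAS"):
        raise ValueError("program must start with CANVAS")
    return "mask"


rate = executable_rate(fake_render, ["CANVAS 256 256", "not a program"])
```

With the real helper, the callable could be `lambda code: parse_and_render(code, canvas=(256, 256))`.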

NetDSL-L2 example

markdown
CANVAS 256 256
WIRE   0 12 8  0 0  0 0  H 256
WIRE   12 0 10  0 0  30 14  V 60
VIA    35 40 6

WIRE(x0, y0, w_base, l_s, w_s, l_e, w_e, segments) describes a wire with optional dogbone end-caps; segments is a chain of H <length> / V <length> relative moves. See §III-A of the paper.
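
To illustrate how the positional fields line up with the WIRE signature, here is a small parsing sketch. It is not the companion repo's parser: the `Wire` dataclass and both functions are hypothetical, and the centerline walk assumes that `H` moves along +x and `V` along +y from `(x0, y0)`. VIA is left out because its fields are not described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Wire:
    x0: int
    y0: int
    w_base: int
    l_s: int
    w_s: int
    l_e: int
    w_e: int
    segments: List[Tuple[str, int]]


def parse_wire(line: str) -> Wire:
    """Split a WIRE line into its seven leading integers and its moves."""
    tokens = line.split()
    if len(tokens) < 10 or tokens[0] != "WIRE" or (len(tokens) - 8) % 2:
        raise ValueError(f"malformed WIRE line: {line!r}")
    head = [int(t) for t in tokens[1:8]]
    rest = tokens[8:]
    segments = [(d, int(n)) for d, n in zip(rest[0::2], rest[1::2])]
    if any(d not in ("H", "V") for d, _ in segments):
        raise ValueError(f"unknown move direction in: {line!r}")
    return Wire(*head, segments=segments)


def centerline(wire: Wire) -> List[Tuple[int, int]]:
    """Walk the relative moves (assumed: H is +x, V is +y)."""
    x, y = wire.x0, wire.y0
    points = [(x, y)]
    for direction, length in wire.segments:
        if direction == "H":
            x += length
        else:
            y += length
        points.append((x, y))
    return points


wire = parse_wire("WIRE   12 0 10  0 0  30 14  V 60")
```

For the second WIRE line of the example this gives `w_base=10`, `l_e=30`, `w_e=14`, and a centerline from `(12, 0)` to `(12, 60)` under the stated direction assumption.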

Limitations

Trained only on synthetic Manhattan-style layouts; non-Manhattan or analog layouts are out of distribution.
At inference time, real SEM images must be binarized (global threshold ≈ 100) to obtain the reported numbers; raw grayscale input lowers quality on every reported metric (for example, IoU 0.2865 vs. 0.3619 in the evaluation above).
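The model produces NetDSL-L2 strings of up to ~2048 tokens, so very dense layouts may be truncated. The usage example sets max_new_tokens=1024, which cuts generation off earlier still; raise it for dense layouts.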
Reconstruction quality decreases as pattern complexity (compressed DSL length) grows; see Fig. 4 of the paper.
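
One way to notice truncation is to check whether generation stopped at the token budget without emitting an end-of-sequence token. The helper below is a sketch under the assumption of batch size 1 (so the output carries no padding); `hit_token_limit` is an illustrative name.

```python
from typing import Iterable, Sequence, Union


def hit_token_limit(new_token_ids: Sequence[int],
                    eos_token_id: Union[int, Iterable[int]],
                    max_new_tokens: int) -> bool:
    """True if generation used the whole budget and did not end on EOS."""
    if isinstance(eos_token_id, int):
        eos_ids = {eos_token_id}
    else:
        eos_ids = set(eos_token_id)  # some configs list several EOS ids
    if not new_token_ids:
        return False
    return (len(new_token_ids) >= max_new_tokens
            and new_token_ids[-1] not in eos_ids)


# Toy ids: a budget of 3 tokens, EOS id 2.
truncated = hit_token_limit([5, 6, 7], eos_token_id=2, max_new_tokens=3)
finished = hit_token_limit([5, 6, 2], eos_token_id=2, max_new_tokens=3)
```

With the usage code above, the inputs would be `out[0, inputs["input_ids"].shape[1]:].tolist()`, `model.generation_config.eos_token_id`, and the `max_new_tokens` value passed to `generate`.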

Citation

bibtex
@inproceedings{ohtsubo2026bridging,
  author    = {Ohtsubo, Yusuke and Dohi, Kota and Yawata, Koichiro and
               Takeshita, Koki and Sasaki, Tatsuya},
  title     = {Bridging the Sim-to-Real Gap in Semiconductor Visual Program
               Synthesis via Input Binarization},
  booktitle = {Proceedings of the 34th European Signal Processing Conference (EUSIPCO)},
  year      = {2026},
  publisher = {EURASIP},
  note      = {Accepted; final citation/DOI to be updated upon publication}
}

License

MIT for both the code and these weights.

The base model, Qwen3-VL-8B-Instruct, is subject to its own license; please review it before use.

Contact

Yusuke Ohtsubo — yusuke.ohtsubo.nb@hitachi.com

Model provider

utsubo12

Model tree

Base: Qwen/Qwen3-VL-8B-Instruct
Fine-tuned: this model

Modalities

Input: Text, Image
Output: Text
