Noahsabb

spec2rtl-qwen32b-lora-rl-v2

README

License: apache-2.0

Benchmark Results

Evaluated on CVDP cid003 — 78 RTL natural-language-spec-to-code problems, scored with the full cocotb simulation harness (functional correctness, not just syntax).

System	Overall	Easy (41)	Medium (37)
Base Qwen2.5-Coder-32B-Instruct	14.10% (11/78)	21.95%	5.41%
+ SFT fine-tuning	19.23% (15/78)	24.39%	13.51%
+ RL GRPO v2 (this adapter)	29.49% (23/78)	36.59%	21.62%
+ Agentic loop v10 (Qwen+Sonnet reflector)	53.85% (42/78)	70.73%	35.14%
Final system (agentic v10+v11 cherry-pick)	58.97% (46/78)	75.61%	40.54%
Claude Sonnet 4.6 standalone (baseline)	55.13% (43/78)	—	—

The final agentic system, with this adapter as the Generator, beats standalone Claude Sonnet 4.6 by 3.84pp (46 vs. 43 of 78 problems).
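The table's figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes each overall pass rate from its pass count and recovers the per-difficulty pass counts from the bucket percentages. The row keys are labels chosen here for illustration; all numbers are copied from the table.

```python
# Cross-check of the benchmark table: 78 problems = 41 easy + 37 medium.
EASY, MEDIUM = 41, 37
TOTAL = EASY + MEDIUM

# (passed, overall %, easy %, medium %) as reported in the table.
ROWS = {
    "base": (11, 14.10, 21.95, 5.41),
    "sft": (15, 19.23, 24.39, 13.51),
    "rl_v2": (23, 29.49, 36.59, 21.62),
    "agentic_v10": (42, 53.85, 70.73, 35.14),
    "final": (46, 58.97, 75.61, 40.54),
}

def pass_rate(passed: int, total: int = TOTAL) -> float:
    """Pass rate in percent, rounded to two decimals like the table."""
    return round(100 * passed / total, 2)

def bucket_counts(easy_pct: float, medium_pct: float) -> tuple:
    """Recover integer pass counts from the per-bucket percentages."""
    return round(easy_pct * EASY / 100), round(medium_pct * MEDIUM / 100)

for name, (passed, overall, easy_pct, medium_pct) in ROWS.items():
    easy_n, medium_n = bucket_counts(easy_pct, medium_pct)
    print(f"{name:12s} {pass_rate(passed):6.2f}%  easy {easy_n}/{EASY}  medium {medium_n}/{MEDIUM}")

# Margin over the standalone Sonnet baseline (43/78), using the rounded rates.
lead_pp = round(pass_rate(46) - pass_rate(43), 2)
```

The 3.84pp margin corresponds to three problems out of 78.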

Model Details

Base model: Qwen/Qwen2.5-Coder-32B-Instruct
Adapter type: LoRA (via PEFT)
LoRA rank: r=16, alpha=32, dropout=0.05
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable parameters: 134,217,728 / 32,898,094,080 (0.408%)
Adapter size: ~513 MB
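The trainable-parameter count follows from the LoRA rank and the shapes of the targeted projections: each adapted linear layer of shape (d_in, d_out) adds r × (d_in + d_out) weights. The sketch below reproduces the figure above. The architecture dimensions (hidden size 5120, MLP size 27648, 64 layers, 8 key/value heads of dimension 128) are not stated in this card; they are assumed from the base model's published configuration.

```python
# LoRA adds two matrices per adapted layer: A (r x d_in) and B (d_out x r).
R = 16

# Assumed Qwen2.5-Coder-32B dimensions (not stated in this card).
HIDDEN = 5120
INTERMEDIATE = 27648
LAYERS = 64
KV_DIM = 8 * 128  # 8 key/value heads, head dimension 128

# (d_in, d_out) for every target module in one transformer layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in: int, d_out: int, r: int = R) -> int:
    """Number of trainable weights LoRA adds to one linear layer."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out) for d_in, d_out in SHAPES.values())
trainable = per_layer * LAYERS

# Total reported in the card (base weights plus adapter weights).
reported_total = 32_898_094_080
fraction_pct = round(100 * trainable / reported_total, 3)

print(f"{trainable:,} trainable parameters ({fraction_pct}%)")
```

With alpha=32 and r=16, the LoRA update is scaled by alpha / r = 2.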

Training Pipeline

Stage 1 — SFT (separate adapter, not in this repo):

Dataset: 13,568 examples built from shailja/Verilog_GitHub (~7,500 validated Verilog modules)
Task types: spec-to-RTL (8,128), editing (4,015), debugging (1,425)
Config: QLoRA r=32, α=64, 5 epochs, lr=1e-4, seq_len=4096
Infrastructure: 1× H100 80GB, ~21h wall time

Stage 2 — GRPO RL (this adapter):

Starting point: SFT adapter merged into base weights, then a fresh r=16 LoRA adapter trained on top
Reward: tiered iverilog compile signal — hard fail 0.0, soft fail (malformed) 0.2, clean compile 1.0
Config: G=2 completions, max_new_tokens=256, lr=5e-6, 3 epochs
Infrastructure: 1× H100 80GB, ~5.5h wall time
Training compile rate: 7–10%, so the compile reward is far from saturated and still carries signal (the task is not trivially solved)
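To make the reward tiers and the G=2 setting concrete, here is a minimal sketch of how tiered rewards could turn into group-relative advantages. The tier values come from the list above. The normalisation (subtract the group mean, divide by the population standard deviation) is the common GRPO formulation and is an assumption here, as are the names `REWARDS` and `group_advantages`; the actual training code may differ.

```python
from statistics import mean, pstdev
from typing import List

# Reward tiers as listed above; the dictionary keys are illustrative labels.
REWARDS = {"hard_fail": 0.0, "soft_fail": 0.2, "clean": 1.0}

def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise rewards within one group of completions for the same prompt.

    Assumption: population standard deviation with a small epsilon; some
    implementations use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# G=2: one completion compiles cleanly, the other fails hard.
mixed = group_advantages([REWARDS["clean"], REWARDS["hard_fail"]])

# G=2: both completions fail hard, so neither is preferred over the other.
tied = group_advantages([REWARDS["hard_fail"], REWARDS["hard_fail"]])

print(mixed, tied)
```

Under this formulation a group only contributes a learning signal when its two completions land in different tiers; a tied group yields zero advantage for both.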

Agentic Loop (for full system results)

This adapter serves as the Generator in a Reflector–Generator loop:

Generator (this adapter) produces initial Verilog from spec
Compiler (iverilog) checks syntax → Reflector (Claude Sonnet 4.6) diagnoses errors → Generator repairs
Simulator (cocotb harness) checks functional correctness → Reflector diagnoses → Generator repairs
Loop runs up to 3 compile iterations + 4 cocotb iterations

The +24.36pp improvement from RL v2 (29.49%) to agentic v10 (53.85%) comes from the Reflector providing structured, testbench-aware diagnosis at each iteration.
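The control flow described above can be sketched as two bounded repair phases. This is an illustration, not the project's implementation: the function `run_loop` and its callable arguments (`generate`, `compile_check`, `simulate`, `reflect`) are hypothetical stand-ins for the adapter, iverilog, the cocotb harness, and the Sonnet Reflector. Running the compile phase once before the simulation phase is also an assumption.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: a check returns (passed, log text).
Check = Callable[[str], Tuple[bool, str]]
Generate = Callable[[str, Optional[str]], str]   # (spec, feedback) -> RTL
Reflect = Callable[[str, str, str], str]         # (spec, rtl, log) -> feedback

def run_loop(
    spec: str,
    generate: Generate,
    compile_check: Check,
    simulate: Check,
    reflect: Reflect,
    max_compile_iters: int = 3,
    max_sim_iters: int = 4,
) -> Tuple[str, bool]:
    """Return the last RTL candidate and whether it passed simulation."""
    rtl = generate(spec, None)  # initial single-shot attempt

    # Phase 1: repair syntax errors, up to max_compile_iters checks.
    for _ in range(max_compile_iters):
        ok, log = compile_check(rtl)
        if ok:
            break
        rtl = generate(spec, reflect(spec, rtl, log))

    # Phase 2: repair functional failures, up to max_sim_iters checks.
    for _ in range(max_sim_iters):
        ok, log = simulate(rtl)
        if ok:
            return rtl, True
        rtl = generate(spec, reflect(spec, rtl, log))

    # Budget exhausted; the final repair is returned unverified.
    return rtl, False
```

Passing the stages in as callables keeps the sketch independent of any particular model client or simulator, and makes the iteration budget easy to exercise with stubs.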

How to Use

Load adapter for inference

python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model_id = "Qwen/Qwen2.5-Coder-32B-Instruct"
adapter_id = "Noahsabb/spec2rtl-qwen32b-lora-rl-v2"

# Load base model in bf16 (requires ~65GB VRAM — fits a single H100 or A100 80GB)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)  # tokenizer is included in adapter repo
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(model, adapter_id)
model = model.merge_and_unload()   # merge LoRA into base for faster inference
model = model.to("cuda:0")
model.eval()

Generate Verilog from a specification

python
spec = """
## Specification

Design a synchronous 4-bit up-counter with active-high reset.
- Inputs: clk (clock), rst (synchronous reset, active high), en (count enable)
- Outputs: count [3:0] (counter value)
- Behavior: On rising clock edge, if rst is high, count resets to 0.
  If en is high and rst is low, count increments by 1, wrapping from 15 to 0.
"""

prompt = f"Generate synthesizable Verilog RTL for the following specification.\n\n{spec}"
messages = [{"role": "user", "content": prompt}]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to("cuda:0")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(generated)
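The generated text may wrap the RTL in a markdown fence or surround it with commentary, so a post-processing step is usually needed before compiling. The helpers below are a minimal sketch under that assumption: `extract_verilog` and `compiles` are hypothetical names, and the extraction rule (first `module ... endmodule` span, preferring fenced blocks) is one reasonable choice rather than the project's own.

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Optional

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable
FENCE_RE = re.compile(
    FENCE + r"(?:verilog|systemverilog|sv|v)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)
MODULE_RE = re.compile(r"\bmodule\b.*?\bendmodule\b", re.DOTALL)

def extract_verilog(text: str) -> Optional[str]:
    """Return the first module...endmodule span, preferring fenced blocks."""
    for chunk in FENCE_RE.findall(text) + [text]:
        match = MODULE_RE.search(chunk)
        if match:
            return match.group(0).strip()
    return None

def compiles(rtl: str) -> Optional[bool]:
    """Compile with iverilog; returns None when iverilog is not installed."""
    if shutil.which("iverilog") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "dut.v"
        src.write_text(rtl)
        result = subprocess.run(
            ["iverilog", "-o", str(Path(tmp) / "dut.out"), str(src)],
            capture_output=True,
            text=True,
        )
        return result.returncode == 0
```

Typical use would be `rtl = extract_verilog(generated)` followed by `compiles(rtl)` when `rtl` is not `None`.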

Memory-efficient inference (if VRAM is limited)

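Skipping merge_and_unload() and using the adapter unmerged avoids the one-time merge cost, but it does not reduce memory: the unmerged model uses slightly more memory during inference. On GPUs with less than ~65GB VRAM the bf16 base weights do not fit on a single device either way. The usual options are quantized loading (for example 4-bit via bitsandbytes) or sharding across several GPUs with device_map="auto"; neither configuration was evaluated for this adapter.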
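A rough weights-only estimate shows where the ~65GB figure comes from. The sketch multiplies the parameter total reported in Model Details (which includes the LoRA weights) by the bytes per parameter. The int8 and 4-bit rows are hypothetical configurations included for comparison, and the estimate ignores the KV cache, activations, and any quantization overhead.

```python
# Weights-only memory estimate; real usage is higher (KV cache, activations).
TOTAL_PARAMS = 32_898_094_080  # total reported in Model Details

# Bytes per parameter for each storage format (int8/4bit are hypothetical here).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "4bit": 0.5}

def weight_gb(precision: str) -> float:
    """Approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"{precision:5s} ~{weight_gb(precision):.1f} GB")
```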

Limitations

Single-shot pass rate is 29.49%: the adapter is designed for use in an agentic loop, not standalone generation. Single-shot results are well below the agentic system's 58.97%.
Training reward is compile-only: the RL reward checks whether the output compiles under iverilog, not whether it is functionally correct. The model learns to produce compilable Verilog, not necessarily correct Verilog.
Complex multi-bug problems still fail: problems involving precise timing, multi-cycle FSM coordination, or ambiguous specs need targeted feedback from the Reflector.
Max 256 tokens during RL training: the RL stage used a short max_new_tokens for compute reasons. Inference with longer outputs (up to 2048 tokens) works but lies outside the RL training distribution.

Citation

This adapter was developed as part of a course project (CS153, Stanford University) implementing NVIDIA's ACE-RTL system at academic scale.

bibtex
@misc{spec2rtl2026,
  author = {Sabbavarapu, Noah},
  title  = {Spec2RTL: Fine-tuned Qwen2.5-Coder-32B + Agentic Self-Correction for Verilog RTL Generation},
  year   = {2026},
  url    = {https://github.com/Noahsabb/spec2RTL}
}

Related work:

ACE-RTL: arXiv:2602.10218
CVDP Benchmark: arXiv:2506.14074
Qwen2.5-Coder: arXiv:2409.12186
