Barath

minicpmv4-floorplan-lora

README

License: apache-2.0

Results (held-out FloorPlanCAD, detection F1 @ IoU 0.5, greedy decoding)

| Model | JSON valid | Precision | Recall | F1 |
|---|---|---|---|---|
| MiniCPM-V-4 zero-shot | 32% | 0.027 | 0.010 | 0.015 |
| LoRA, LLM-only (300 steps, 800 samples) | 16% | 0.0 | 0.0 | 0.0 |
| LoRA + vision tuning (800 steps, 3.3k samples) | 40% | 0.439 | 0.060 | 0.105 |

The decisive factor was unfreezing the vision tower. LLM-only LoRA learned the output format but stayed image-blind, repeating dataset priors with high confidence. With vision tuning, precision rose about 16x over zero-shot (0.027 to 0.439), which indicates that the predicted boxes are grounded in the drawing. Recall remains the open weakness: element lists are long, and outputs sometimes truncate before the JSON closes.
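The card does not state how predictions were matched to ground truth, so the following is only a minimal sketch of detection precision, recall, and F1 at an IoU threshold of 0.5. It assumes greedy one-to-one matching between boxes of the same type; the function names `iou` and `detection_prf` are hypothetical and not taken from the author's evaluation code.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_prf(preds, golds, thresh=0.5):
    """Greedy one-to-one matching (an assumption); returns (precision, recall, f1)."""
    matched = set()  # indices of ground-truth boxes already claimed
    tp = 0
    for p in preds:
        best_j, best_iou = None, thresh
        for j, g in enumerate(golds):
            if j in matched or g["type"] != p["type"]:
                continue
            score = iou(p["bbox"], g["bbox"])
            if score >= best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

Under this rule a prediction counts as a true positive only if it claims a not-yet-matched ground-truth box of the same type, so duplicate predictions of one element lower precision rather than inflating recall.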

Training

  • Base: openbmb/MiniCPM-V-4, trained with the official MiniCPM-V finetune harness (llm_type ChatML). Note: the harness's qwen2 target-span detection needs a patch for V-4's tokenizer, because it locates spans by token-id lookup of '<|im_start|>' and 'assistant', and those ids exist only in Qwen2's vocab.
  • LoRA on LLM attention projections (q/k/v/o) + full vision-tower tuning
  • 3,281 train / held-out eval from FloorPlanCAD (FiftyOne detections converted to conversation JSON)
  • 800 steps, effective batch 8, lr 1e-5 cosine, bf16, single NVIDIA L4 (Modal), ~3.5 h
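The conversion from FiftyOne detections to conversation JSON is not shown in this card. The sketch below illustrates one way it could look, assuming FiftyOne's relative `[x, y, width, height]` box format and the `[x1, y1, x2, y2]` integer schema in `[0, 1000]` from the prompt under Usage. The function names, the `<image>` placeholder, and the `id`/`image`/`conversations` record layout are assumptions modeled on the MiniCPM-V finetune data format, not the author's actual script.

```python
import json


def to_xyxy_1000(rel_box):
    """[x, y, w, h] in [0, 1] -> integer [x1, y1, x2, y2] clamped to [0, 1000]."""
    x, y, w, h = rel_box
    corners = (x, y, x + w, y + h)
    return [min(1000, max(0, round(v * 1000))) for v in corners]


def to_conversation(sample_id, image_path, detections, prompt):
    """Build one training record; `detections` is a list of (label, rel_box) pairs."""
    elements = [
        {"type": label, "bbox": to_xyxy_1000(box)} for label, box in detections
    ]
    return {
        "id": str(sample_id),
        "image": image_path,
        "conversations": [
            # Assumed layout: image placeholder followed by the instruction.
            {"role": "user", "content": "<image>\n" + prompt},
            {"role": "assistant", "content": json.dumps({"elements": elements})},
        ],
    }
```

Clamping after rounding keeps boxes that slightly overshoot the image border inside the coordinate range the prompt promises.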

Usage

python

import torch
from peft import PeftModel
from PIL import Image
from transformers import AutoModel, AutoTokenizer

base = "openbmb/MiniCPM-V-4"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModel.from_pretrained(base, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16, device_map="cuda")

# Attach the adapter, then merge it into the base weights for inference.
model = PeftModel.from_pretrained(model, "Barath/minicpmv4-floorplan-lora")
model = model.merge_and_unload().eval()

img = Image.open("floorplan.png").convert("RGB")
prompt = ('Detect the architectural elements in this floor plan (walls, doors, '
          'windows, stairs, fixtures, furniture). Return only JSON: '
          '{"elements": [{"type": str, "bbox": [x1, y1, x2, y2]}]} with integer '
          'coordinates normalized to [0, 1000].')

# sampling=False selects greedy decoding, as in the reported results.
out = model.chat(msgs=[{"role": "user", "content": [img, prompt]}],
                 tokenizer=tokenizer, sampling=False,
                 max_new_tokens=1500, repetition_penalty=1.1)
print(out)

Limitations

  • Recall is low (0.06): dense plans with dozens of elements are only partially enumerated, and long outputs may truncate mid-JSON. Use a truncation-tolerant parser in production.
  • Trained on CAD-style monochrome drawings (FloorPlanCAD); photographed or hand-drawn plans are out of distribution.
  • Coordinates are model estimates, not measurements.
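A truncation-tolerant parser, as recommended in the first limitation, could be sketched as follows. This is one possible approach and not part of the released model: it scans the output for complete flat `{...}` objects and keeps those that match the element schema from the prompt, so elements emitted before a cut-off are still recovered. The function name `parse_elements` is hypothetical, and the sketch assumes element objects contain no nested braces (a type string containing `{` or `}` would be missed).

```python
import json
import re

# Innermost {...} objects; elements in the prompt schema contain no nested braces.
_FLAT_OBJECT = re.compile(r"\{[^{}]*\}")


def parse_elements(text):
    """Return every complete element found in `text`, even if the JSON is cut off."""
    elements = []
    for match in _FLAT_OBJECT.finditer(text):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # malformed fragment; skip it
        bbox = obj.get("bbox")
        if (
            isinstance(obj.get("type"), str)
            and isinstance(bbox, list)
            and len(bbox) == 4
            and all(
                isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in bbox
            )
        ):
            elements.append({"type": obj["type"], "bbox": bbox})
    return elements
```

For example, calling `parse_elements(out)` on the string returned by `model.chat` yields a list of validated elements whether or not the closing `]}` was generated; an incomplete trailing element is dropped.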

Trained for the Hugging Face Build Small Hackathon 2026.

Model tree

  • Base: openbmb/MiniCPM-V-4
  • Adapter: this model

Modalities

  • Input: Video, Text, Image
  • Output: Text
