Barath
minicpmv4-floorplan-lora
License: apache-2.0

Results (held-out FloorPlanCAD, detection F1 @ IoU 0.5, greedy decoding)

| Model | JSON valid | Precision | Recall | F1 |
|---|---|---|---|---|
| MiniCPM-V-4 zero-shot | 32% | 0.027 | 0.010 | 0.015 |
| LoRA, LLM-only (300 steps, 800 samples) | 16% | 0.0 | 0.0 | 0.0 |
| LoRA + vision tuning (800 steps, 3.3k samples) | 40% | 0.439 | 0.060 | 0.105 |
The decisive factor was unfreezing the vision tower: LLM-only LoRA learned the output format but stayed image-blind (high-confidence repetition of dataset priors). With vision tuning, precision rose 16x over zero-shot — the model genuinely grounds boxes in the drawing. Recall remains the open weakness (long element lists; outputs sometimes truncate before the JSON closes).
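
The evaluation script itself is not included in this card. As a rough sketch of how detection precision, recall, and F1 at IoU 0.5 are commonly computed, the snippet below matches each predicted box to at most one unmatched ground-truth box. The greedy matching order, the requirement that `type` labels agree, and the names `iou` and `detection_prf` are assumptions for illustration, not the harness used for the table above.

```python
from typing import Dict, List, Sequence, Tuple


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two axis-aligned [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_prf(
    preds: List[Dict], gts: List[Dict], iou_thr: float = 0.5
) -> Tuple[float, float, float]:
    """Greedy one-to-one matching; returns (precision, recall, F1).

    Assumption: a prediction counts as a true positive only if its
    "type" equals the ground-truth "type" and IoU >= iou_thr.
    """
    matched = set()  # indices of ground-truth boxes already claimed
    tp = 0
    for p in preds:
        best_j, best_iou = -1, iou_thr
        for j, g in enumerate(gts):
            if j in matched or g["type"] != p["type"]:
                continue
            score = iou(p["bbox"], g["bbox"])
            if score >= best_iou:
                best_j, best_iou = j, score
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1
```

Under this definition, a model that emits a few well-placed boxes and then stops scores high precision and low recall, which is the pattern in the last table row.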
Training
- Base: openbmb/MiniCPM-V-4, official MiniCPM-V finetune harness (llm_type ChatML; note: the harness's qwen2 target-span detection needs a patch for V-4's tokenizer, because it locates spans by token-id lookup of '<|im_start|>' / 'assistant', and those token ids only exist in Qwen2's vocab)
- LoRA on LLM attention projections (q/k/v/o) + full vision-tower tuning
- 3,281 train / held-out eval from FloorPlanCAD (FiftyOne detections converted to conversation JSON)
- 800 steps, effective batch 8, lr 1e-5 cosine, bf16, single NVIDIA L4 (Modal), ~3.5 h
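
The conversion script is not part of this card. A minimal sketch of the detections-to-conversation step could look like the following, assuming FiftyOne-style boxes given as relative `[x, y, width, height]` in [0, 1] and a record layout with `id`, `image`, and `conversations` fields containing an `<image>` placeholder. The function names, the plain-dict input, and the exact record layout are assumptions for illustration.

```python
import json
from typing import Dict, List, Sequence


def to_xyxy_1000(rel_box: Sequence[float]) -> List[int]:
    """Relative [x, y, w, h] in [0, 1] -> integer [x1, y1, x2, y2] in [0, 1000]."""
    x, y, w, h = rel_box
    corners = [x, y, x + w, y + h]
    # Round to the nearest integer and clamp to the target range.
    return [min(1000, max(0, round(c * 1000))) for c in corners]


def to_conversation(
    sample_id: str, image_path: str, detections: List[Dict], prompt: str
) -> Dict:
    """Build one training record (hypothetical layout, see lead-in)."""
    elements = [
        {"type": d["label"], "bbox": to_xyxy_1000(d["bounding_box"])}
        for d in detections
    ]
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            {"role": "user", "content": "<image>\n" + prompt},
            {"role": "assistant", "content": json.dumps({"elements": elements})},
        ],
    }
```

Keeping the assistant turn as compact JSON keeps targets short, which matters given that long element lists are where outputs tend to truncate.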
Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base = "openbmb/MiniCPM-V-4"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModel.from_pretrained(base, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16, device_map="cuda")
model = PeftModel.from_pretrained(model, "Barath/minicpmv4-floorplan-lora")
model = model.merge_and_unload().eval()

from PIL import Image

img = Image.open("floorplan.png").convert("RGB")
prompt = ('Detect the architectural elements in this floor plan (walls, doors, '
          'windows, stairs, fixtures, furniture). Return only JSON: '
          '{"elements": [{"type": str, "bbox": [x1, y1, x2, y2]}]} with integer '
          'coordinates normalized to [0, 1000].')
out = model.chat(msgs=[{"role": "user", "content": [img, prompt]}],
                 tokenizer=tokenizer, sampling=False,
                 max_new_tokens=1500, repetition_penalty=1.1)
print(out)
```
Limitations
- Recall is low (0.06): dense plans with dozens of elements are only partially enumerated, and long outputs may truncate mid-JSON. Use a truncation-tolerant parser in production.
- Trained on CAD-style monochrome drawings (FloorPlanCAD); photographed or hand-drawn plans are out of distribution.
- Coordinates are model estimates, not measurements.
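
One possible shape for the truncation-tolerant parser mentioned above is sketched here. It walks the `elements` array one object at a time with the standard library's `json.JSONDecoder.raw_decode` and keeps every element that decoded completely before the output was cut off. The function name `parse_elements` and the validation rules (string `type`, four numeric `bbox` values) are assumptions for illustration.

```python
import json
from typing import Any, Dict, List


def _is_element(obj: Any) -> bool:
    """True if obj looks like {"type": str, "bbox": [4 numbers]}."""
    if not isinstance(obj, dict) or not isinstance(obj.get("type"), str):
        return False
    bbox = obj.get("bbox")
    return (
        isinstance(bbox, list)
        and len(bbox) == 4
        and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in bbox)
    )


def parse_elements(text: str) -> List[Dict[str, Any]]:
    """Recover all complete elements from possibly truncated model output."""
    decoder = json.JSONDecoder()
    key = text.find('"elements"')
    start = text.find("[", key) if key != -1 else -1
    if start == -1:
        return []
    elements: List[Dict[str, Any]] = []
    pos = start + 1
    while pos < len(text):
        # Skip separators between array items.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text) or text[pos] != "{":
            break  # end of array, or something that is not an object
        try:
            obj, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            break  # the current object was cut off mid-way
        if _is_element(obj):
            elements.append(obj)
    return elements
```

For example, `parse_elements(out)` on an output that stops in the middle of the third element would return the first two elements rather than raising.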
Trained for the Hugging Face Build Small Hackathon 2026.
Modalities
- Input: Video, Text, Image
- Output: Text