Thamed-Chowdhury

eg-arsa-qwen3vl-8b-lora

License: apache-2.0

Results (held-out, location-disjoint test set, n = 3,447)

Model / setting	Risk QWK	Exact-risk accuracy
Zero-shot base (Qwen3-VL-8B-Instruct)	0.077 [0.044, 0.108]	0.544
EG-ARSA (this adapter)	0.482 [0.454, 0.510]	0.717

Fine-tuning lifts ordinal risk agreement by +0.40 QWK (0.077 to 0.482), and the bootstrap CIs do not overlap. Under a blind human-expert evaluation, EG-ARSA is risk-correct 81% of the time, versus 58% for Gemini-2.5-flash and 42% for the 31B teacher when run leakage-free; fully automated risk accuracy reproduces the same ranking (0.74 / 0.59 / 0.36). Per-class F1 at the raw operating point is Low 0.04 / Medium 0.67 / High 0.77, and residual errors are predominantly between adjacent risk levels. See the reports/ folder and the paper for the full evaluation, including the multi-model comparison and the human-eval rubric.
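
For readers unfamiliar with the headline metric, QWK (quadratic weighted kappa) measures agreement between predicted and reference ordinal labels, penalising a disagreement by the squared distance between the two levels. The following is a minimal pure-Python sketch of the standard formula, not the repository's evaluation code; the integer encoding Low=0, Medium=1, High=2 is an assumption made for illustration.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=3):
    """Quadratic weighted kappa for integer labels in [0, n_classes)."""
    n = n_classes
    total = len(y_true)

    # Observed confusion matrix: rows = reference, columns = prediction.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    row_sums = [sum(observed[i]) for i in range(n)]
    col_sums = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty: 0 on the diagonal, 1 for the farthest pair.
            weight = (i - j) ** 2 / (n - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two marginals.
            weighted_expected += weight * row_sums[i] * col_sums[j] / total

    return 1.0 - weighted_observed / weighted_expected


# Assumed encoding: Low=0, Medium=1, High=2.
reference = [0, 1, 2, 2]
predicted = [0, 1, 2, 1]  # one adjacent-level error (High predicted as Medium)
kappa = quadratic_weighted_kappa(reference, predicted)
```

Because the penalty grows with the squared distance, a High/Medium confusion costs a quarter of a High/Low confusion in the three-class case. scikit-learn's `cohen_kappa_score(y_true, y_pred, weights="quadratic")` computes the same quantity.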

Usage

python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from PIL import Image

BASE = "Qwen/Qwen3-VL-8B-Instruct"
LORA = "Thamed-Chowdhury/eg-arsa-qwen3vl-8b-lora"

proc  = AutoProcessor.from_pretrained(BASE, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(BASE, dtype=torch.bfloat16, device_map="cuda")
model = PeftModel.from_pretrained(model, LORA).eval()

SYSTEM = ("You are a road-safety auditor applying the LGED (Local Government Engineering "
          "Department) 12-category visual audit methodology to road imagery in Bangladesh.")
# The canonical leakage-free instruction is FULL_AUDIT_INSTRUCTION in
# prompts/finetune_prompts.py (shipped with the dataset and the code repo).
INSTRUCTION = "Audit this road image for safety hazards. Return the structured JSON audit."

img = Image.open("road.jpg").convert("RGB")
msgs = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM}]},
    {"role": "user",   "content": [{"type": "image", "image": img},
                                   {"type": "text", "text": INSTRUCTION}]},
]
text   = proc.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
inputs = proc(text=[text], images=[img], return_tensors="pt").to("cuda")
out    = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
print(proc.tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

For exact parity with the paper, use the canonical FULL_AUDIT_INSTRUCTION (full road scene) / SINGLE_HAZARD_INSTRUCTION from prompts/finetune_prompts.py and 1024-px native resolution. See the code repo for the wrapped inference helper (apps/streetview_infer.py) and the evaluation pipeline.
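
The usage snippet prints the raw decoded string. A minimal sketch of turning that string into a Python dictionary is shown here, assuming the model emits a single JSON object that may be surrounded by extra text or a markdown fence. The helper name `extract_json_audit` and the `risk_level` key in the example are hypothetical; the actual audit schema is defined in the code repo.

```python
import json


def extract_json_audit(decoded: str) -> dict:
    """Parse the outermost {...} span of a decoded model response."""
    start = decoded.find("{")
    end = decoded.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    # json.loads raises json.JSONDecodeError if the span is not valid JSON,
    # for example when generation was cut off by max_new_tokens.
    return json.loads(decoded[start:end + 1])


# Hypothetical response text; the key name is illustrative only.
example = 'Here is the audit: {"risk_level": "High"}'
audit = extract_json_audit(example)
```

Catching the parse error and retrying with a larger `max_new_tokens` is one simple way to handle truncated outputs.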

Training

Base: Qwen3-VL-8B-Instruct, vision encoder frozen.
LoRA: r=16, α=32, dropout 0.05, targets = LM attention q/k/v/o_proj.
Precision/memory: bf16, gradient checkpointing, effective batch 16 (micro 2 × accum 8).
Resolution: 1024 px (native max dimension, selected by a zero-shot resolution probe).
Objective: per-task normalized cross-entropy, weights hazard 1.0 / risk 1.0 / recommendation 0.5, with per-record loss masking by tasks_available.
Imbalance: train-only logit adjustment (τ=1) on the risk-token logits (priors Low 225 / Med 6,340 / High 9,517); raw logits at inference, with an optional post-hoc operating-point offset fit on validation.
Schedule: LR 1e-4 cosine, 3% warmup, 2 epochs, early-stop on validation QWK.
Data: BD-ARSA (train 16,082 records).
Compute: a single NVIDIA A100 40 GB, ≈ 6.3 h.
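
The imbalance handling above can be illustrated with a small sketch of logit adjustment: during training, τ·log(prior) is added to each risk-class logit before the cross-entropy, and the raw logits are used at inference. The class counts and τ=1 come from the list above; restricting the softmax to three risk classes and the function names are simplifying assumptions, and the author's implementation over risk-token logits may differ.

```python
import math

# Training-set class counts from the model card (sum = 16,082).
COUNTS = {"Low": 225, "Medium": 6340, "High": 9517}
TAU = 1.0


def log_priors(counts):
    """Log of the empirical class frequencies."""
    total = sum(counts.values())
    return {k: math.log(v / total) for k, v in counts.items()}


def adjust_logits(logits, log_prior, tau=TAU):
    """Train-time adjustment: z_k + tau * log(prior_k)."""
    return {k: logits[k] + tau * log_prior[k] for k in logits}


def cross_entropy(logits, target):
    """Softmax cross-entropy via a numerically stable log-sum-exp."""
    m = max(logits.values())
    lse = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return lse - logits[target]


priors = log_priors(COUNTS)
raw = {"Low": 0.0, "Medium": 0.0, "High": 0.0}  # uninformative raw logits

# Training loss uses adjusted logits; inference would use `raw` directly.
loss_low = cross_entropy(adjust_logits(raw, priors), "Low")
loss_high = cross_entropy(adjust_logits(raw, priors), "High")
```

With equal raw logits, the adjusted loss is much larger when the target is the rare Low class than when it is High, so training pushes the raw Low logit up to compensate for its small prior.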

Intended use & limitations

Intended use. Proactive, low-cost screening of rural/suburban (LGED-class) road imagery to surface visible safety hazards and an interpretable risk rating where formal Road Safety Audits are unaffordable. It is a decision-support tool, not a replacement for a formal multidisciplinary RSA.

Limitations / out of scope.

The model audits a single street-view image, so full road geometry and the non-visual extremes of skid_resistance (friction) and drainage (wet-weather behaviour) are recoverable only in obvious cases. All 12 categories are reported, but these two score lowest.
Targets the rural/suburban LGED road class. National highways (RHD) and dense city streets (City Corporations) fall under other jurisdictions and are future work.

Data provenance & terms

The training supervision was produced via Expert-Grounded Distillation. Two points of provenance matter for downstream use:

Street-view imagery (© Google). Part of the training data consists of Google Street View imagery, which is not redistributed with this model or its dataset. For these records the BD-ARSA dataset ships only the annotations and the panorama IDs/coordinates; users re-fetch the images themselves via the official Street View Static API under the Google Maps Platform Terms of Service. The model weights do not contain any imagery.
Gemma-generated supervision. The street-view audit labels were generated by Google's Gemma model. Under the Gemma Terms of Use, a model trained on Gemma outputs is a "Model Derivative", so use of this model is additionally subject to the Gemma Terms of Use and the Gemma Prohibited Use Policy.
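
As an illustration of the re-fetch step for street-view imagery, the sketch below builds a Street View Static API request URL from a panorama ID. The endpoint and the `size`, `pano`, and `key` parameters belong to the public API; the helper name, the placeholder values, and the chosen `size` are assumptions, and the dataset's own field names and fetch settings should be taken from the code repo.

```python
from urllib.parse import urlencode

STREETVIEW_ENDPOINT = "https://maps.googleapis.com/maps/api/streetview"


def streetview_url(pano_id: str, api_key: str, size: str = "640x640") -> str:
    """Build a Street View Static API URL for a single panorama ID."""
    params = {
        "size": size,     # requested image size, "{width}x{height}" (assumed value)
        "pano": pano_id,  # panorama ID, as shipped with the annotations
        "key": api_key,   # the user's own Google Maps Platform API key
    }
    return f"{STREETVIEW_ENDPOINT}?{urlencode(params)}"


# Placeholder values for illustration only.
url = streetview_url("PANO_ID_PLACEHOLDER", "YOUR_API_KEY")
```

The resulting URL can be requested with any HTTP client. Any view parameters beyond the panorama ID, such as `heading`, `pitch`, or `fov`, would need to match whatever was used when the annotations were produced.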

License

The LoRA adapter weights and code in this repository are released under Apache-2.0. The base model Qwen/Qwen3-VL-8B-Instruct is governed by its own license; only the adapter is redistributed here. Because the model was distilled in part from Gemma-generated supervision (see Data provenance & terms above), use of the model is also subject to the Gemma Terms of Use and Prohibited Use Policy. The BD-ARSA dataset's annotations are released under CC BY 4.0 (the street-view imagery is not redistributed).

Citation

bibtex
@article{chowdhury_egarsa,
  title   = {EG-ARSA: An Expert-Grounded Open Dataset and Model for Visual Road Safety
             Auditing in Low-Resource Settings},
  author  = {Chowdhury, Md Thamed Bin Zaman and Hossain, Moazzem},
  year    = {2026},
  note    = {Preprint. Code: https://github.com/Thamed-Chowdhury/EG-ARSA}
}

Acknowledgements

The expert ground truth derives from on-site Road Safety Audits conducted by faculty of the Accident Research Institute (ARI), Bangladesh University of Engineering and Technology (BUET), commissioned by the Local Government Engineering Department (LGED) under the World Bank–financed Second Rural Transport Improvement Project (RTIP-II, Additional Financing; P166295).
