SAVANT-scene-description-lora API & Inference Endpoint

Model Description

LoRA adapter for Qwen/Qwen2.5-VL-7B-Instruct, fine-tuned to generate structured scene descriptions from driving scene images. This is Phase 1 of the SAVANT (Semantic Anomaly Verification/Analysis Toolkit) two-phase pipeline.

Given a front-camera image, the model produces a structured JSON description across four semantic layers:

Street layer: geometry, topology, surface condition, lane markings
Infrastructure layer: traffic lights, signs, cones, barriers, construction sites
Movable objects layer: vehicles, pedestrians, other dynamic objects
Environmental layer: weather, visibility, lighting conditions

Training Details

Base model: Qwen/Qwen2.5-VL-7B-Instruct
Method: LoRA (Low-Rank Adaptation)
Dataset: 4,260 samples with structured scene descriptions
Epochs: 3
Learning rate: 1e-4 (cosine schedule)
Precision: bfloat16 with Flash Attention 2

LoRA Configuration

Table with columns: Parameter, Value
Parameter	Value
Rank (r)	16
Alpha	32
Dropout	0.05
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, fc1, fc2, qkv, mlp.0, mlp.2

Usage

python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "u94fmn391j/SAVANT-scene-description-lora")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

Limitations

Trained on the CODA dataset; generalization to other driving domains not evaluated
Single-frame analysis only (no temporal context)

Citation

bibtex
@article{brusnicki2025savant,
  title={Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning},
  author={Brusnicki, Roberto and Pop, David and Gao, Yuan and Piccinini, Mattia and Betz, Johannes},
  journal={arXiv preprint arXiv:2510.18034},
  year={2025}
}

Model Description

Given a front-camera image, the model produces a structured JSON description across four semantic layers:

Street layer: geometry, topology, surface condition, lane markings
Infrastructure layer: traffic lights, signs, cones, barriers, construction sites
Movable objects layer: vehicles, pedestrians, other dynamic objects
Environmental layer: weather, visibility, lighting conditions

Training Details

Base model: Qwen/Qwen2.5-VL-7B-Instruct
Method: LoRA (Low-Rank Adaptation)
Dataset: 4,260 samples with structured scene descriptions
Epochs: 3
Learning rate: 1e-4 (cosine schedule)
Precision: bfloat16 with Flash Attention 2

LoRA Configuration

Table with columns: Parameter, Value
Parameter	Value
Rank (r)	16
Alpha	32
Dropout	0.05
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, fc1, fc2, qkv, mlp.0, mlp.2

Usage

python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "u94fmn391j/SAVANT-scene-description-lora")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

Limitations

Trained on the CODA dataset; generalization to other driving domains not evaluated
Single-frame analysis only (no temporal context)

Citation

bibtex
@article{brusnicki2025savant,
  title={Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning},
  author={Brusnicki, Roberto and Pop, David and Gao, Yuan and Piccinini, Mattia and Betz, Johannes},
  journal={arXiv preprint arXiv:2510.18034},
  year={2025}
}

SAVANT-scene-description-lora

README

Model Description

Training Details

LoRA Configuration

Usage

Limitations

Citation

Explore FriendliAI today

README

Model Description

Training Details

LoRA Configuration

Usage

Limitations

Citation