SAVANT-multimodal-evaluation-lora API & Inference Endpoint

Model Description

LoRA adapter for Qwen/Qwen2.5-VL-7B-Instruct, fine-tuned for anomaly evaluation using both the driving scene image and a structured scene description. This is Phase 2 of the SAVANT two-phase pipeline.

The model receives:

The original front-camera image
A structured scene description (generated by the Phase 1 model)

And outputs a binary anomaly classification with detailed reasoning.

Pipeline Performance

When used as part of the full SAVANT pipeline (Phase 1 + Phase 2), evaluated on a balanced test set of 1,020 driving scene images:

Table with columns: Metric, Value
Metric	Value
Accuracy	83.7%
Precision	85.1%
Recall	81.8%
F1-Score	83.4%

Training Details

Base model: Qwen/Qwen2.5-VL-7B-Instruct
Method: LoRA (Low-Rank Adaptation)
Dataset: 4,260 samples with image + scene description + anomaly labels
Epochs: 3
Learning rate: 1e-4 (cosine schedule)
Precision: bfloat16 with Flash Attention 2

LoRA Configuration

Table with columns: Parameter, Value
Parameter	Value
Rank (r)	16
Alpha	32
Dropout	0.05
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, fc1, fc2, qkv, mlp.0, mlp.2

Usage

python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "u94fmn391j/SAVANT-multimodal-evaluation-lora")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

Limitations

Trained on the CODA dataset; generalization to other driving domains not evaluated
Single-frame analysis only (no temporal context)
Pipeline performance depends on the quality of the Phase 1 scene description

Model Description

The model receives:

The original front-camera image
A structured scene description (generated by the Phase 1 model)

And outputs a binary anomaly classification with detailed reasoning.

Pipeline Performance

When used as part of the full SAVANT pipeline (Phase 1 + Phase 2), evaluated on a balanced test set of 1,020 driving scene images:

Table with columns: Metric, Value
Metric	Value
Accuracy	83.7%
Precision	85.1%
Recall	81.8%
F1-Score	83.4%

Training Details

Base model: Qwen/Qwen2.5-VL-7B-Instruct
Method: LoRA (Low-Rank Adaptation)
Dataset: 4,260 samples with image + scene description + anomaly labels
Epochs: 3
Learning rate: 1e-4 (cosine schedule)
Precision: bfloat16 with Flash Attention 2

LoRA Configuration

Table with columns: Parameter, Value
Parameter	Value
Rank (r)	16
Alpha	32
Dropout	0.05
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, fc1, fc2, qkv, mlp.0, mlp.2

Usage

python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "u94fmn391j/SAVANT-multimodal-evaluation-lora")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

Limitations

Trained on the CODA dataset; generalization to other driving domains not evaluated
Single-frame analysis only (no temporal context)
Pipeline performance depends on the quality of the Phase 1 scene description

SAVANT-multimodal-evaluation-lora

README

Model Description

Pipeline Performance

Training Details

LoRA Configuration

Usage

Limitations

Explore FriendliAI today

README

Model Description

Pipeline Performance

Training Details

LoRA Configuration

Usage

Limitations