vectionlabs

Salience-1-9B

License: apache-2.0

Abstract

Salience 1 (9B) is a dense, 9-billion-parameter vision-language model built for hard, practical work: writing and debugging real code, driving tools and agents, multi-step mathematical reasoning, and visual understanding over images and video — inside a single model with a context window of up to 1M tokens.

It is the successor to Maestro1-9B, engineered around a single goal: improve the capabilities users ask for most, namely code and agentic tool use, without giving up the deep reasoning, vision, and million-token context the family is known for.

It is designed for people who care less about chat pleasantries and more about whether the model can do the thing: ship the function, find the bug, call the right tool, read the diagram, finish the proof.

Highlights

Code & agentic first. Built with a coding/DevOps donor on top of a reasoning core; tuned to produce runnable code and well-formed tool calls.
Reasoning that shows its work. Structured, inspectable chains of thought for math, logic, and code.
Genuinely multimodal. Images and video are first-class inputs, not bolted-on captioning.
Long context. Up to 1M tokens via interleaved multimodal RoPE — whole repos, long papers, or long videos in a single prompt.
Fast on modest hardware. Runs on 2× T4 in sharded fp16, or on a single T4 in 4-bit, without requiring a GGUF conversion; supports lossless speculative decoding and hybrid-thinking latency control.
Open weights. Apache-2.0, transformers-native, single-file deployment.

Model overview

Parameters	9B (dense)
Modalities	text, image, video → text
Context window	up to 1,000,000 tokens (interleaved multimodal RoPE)
Precision	bfloat16 master weights
Architecture	Qwen3-VL (Qwen3-8B language model, 36 layers) + native vision encoder
License	Apache-2.0
Library	🤗 `transformers` (`AutoModelForImageTextToText`)

Architecture & capabilities

Salience 1 is a dense Qwen3-VL model: a 36-layer Qwen3-8B language model coupled to a native vision encoder, with interleaved multimodal RoPE extending the context window from 256K up to 1M tokens.

Its capability profile is built around three pillars:

Code & agentic execution — runnable code, repo-scale edits, and well-formed tool calls.
Deep reasoning — structured, inspectable chains of thought for math and logic.
Multimodal perception — images and video as first-class inputs, not bolted-on captioning.

The vision pathway and long-context behavior are preserved end to end, so the same reasoning that solves an olympiad problem also reads a chart, a UI screenshot, or a short clip.

Intended use

Salience 1 targets technical assistance, coding agents, and research:

Code generation, explanation, debugging, review, and repo-scale tasks.
Agentic / tool-using workflows that emit structured calls.
Step-by-step math and quantitative reasoning.
Visual question answering and document/diagram/chart understanding.
Video understanding over short clips, and long-document / long-context analysis.

It is not intended for high-stakes decisions without human review, nor as a source of truth for medical, legal, or financial advice.

Benchmarks

All results use a single reproducible evaluation harness with greedy decoding and the per-benchmark settings listed in each table; the Maestro1-9B column is run under the identical protocol for a like-for-like comparison.

Reasoning, math & code

Benchmark	Setting	Maestro1-9B	Salience-1-9B
GSM8K	0-shot CoT, exact match	—	—
MATH-500	0-shot CoT, exact match	—	—
HumanEval	0-shot, pass@1	—	—
MBPP	3-shot, pass@1	—	—
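
The harness's actual answer-extraction logic is not reproduced in this card. As a minimal sketch of what "0-shot CoT, exact match" and "pass@1" scoring can look like, assuming the final answer is the last number in the completion (the function names and the extraction rule here are illustrative assumptions, not the project's code):

```python
import re
from typing import Optional, Sequence


def extract_final_number(completion: str) -> Optional[str]:
    """Return the last number in a completion, normalized for comparison."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    if not matches:
        return None
    number = matches[-1].replace(",", "")  # "1,250" -> "1250"
    if "." in number:
        # Drop trailing zeros so "42.0" and "42" compare equal.
        number = number.rstrip("0").rstrip(".")
    return number


def exact_match(completion: str, reference: str) -> bool:
    """True when the extracted prediction equals the extracted reference."""
    predicted = extract_final_number(completion)
    return predicted is not None and predicted == extract_final_number(reference)


def exact_match_accuracy(completions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of completions whose final number matches the reference."""
    hits = sum(exact_match(c, r) for c, r in zip(completions, references))
    return hits / len(references)


def pass_at_1(passed: Sequence[bool]) -> float:
    """With one greedy sample per problem, pass@1 is the share that pass the tests."""
    return sum(passed) / len(passed)


if __name__ == "__main__":
    completions = ["She sells 9 eggs at 2 dollars each, so 9 * 2 = 18.", "The total is 1,250.0"]
    references = ["18", "1250"]
    print(exact_match_accuracy(completions, references))
    print(pass_at_1([True, False, True, True]))
```

A real harness usually adds benchmark-specific rules (for example boxed answers on MATH-500), so scores from this sketch are not comparable to published numbers.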

Multimodal

Benchmark	Setting	Maestro1-9B	Salience-1-9B
MMMU (val)	0-shot	—	—
MathVista (testmini)	0-shot	—	—
DocVQA (val)	0-shot, ANLS	—	—
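
For readers unfamiliar with the DocVQA metric, ANLS (Average Normalized Levenshtein Similarity) can be sketched as follows. This assumes the commonly used threshold of 0.5 and case-insensitive, whitespace-stripped comparison; the harness used for this card may normalize differently.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls_score(prediction: str, answers: Sequence[str], threshold: float = 0.5) -> float:
    """Best similarity against any accepted answer; 0 once the distance reaches the threshold."""
    pred = prediction.strip().lower()
    best = 0.0
    for answer in answers:
        gold = answer.strip().lower()
        longest = max(len(pred), len(gold))
        distance = levenshtein(pred, gold) / longest if longest else 0.0
        similarity = 1.0 - distance if distance < threshold else 0.0
        best = max(best, similarity)
    return best


def anls(predictions: Sequence[str], references: Sequence[Sequence[str]]) -> float:
    """Mean per-question score over the dataset."""
    scores = [anls_score(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(anls(["Invoice 1234", "hello"], [["invoice 1234"], ["world!!"]]))
```

The threshold makes the metric forgiving of small OCR-style slips while giving no credit to answers that are mostly wrong.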

The evaluation protocol, prompts, and answer-extraction logic are fixed and reproducible end-to-end.

Quickstart

```python
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "vectionlabs/Salience-1-9B"
proc = AutoProcessor.from_pretrained(model_id)
# Load bf16 weights; device_map="auto" places them on the available device(s).
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

# One user turn: the image comes before the question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/diagram.png"},
        {"type": "text", "text": "Explain what this diagram proves, step by step."},
    ],
}]

# Render the chat template to a prompt string, then collect the image/video inputs.
text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
imgs, vids = process_vision_info(messages)
inputs = proc(text=[text], images=imgs, videos=vids, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
print(proc.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```

Text-only works the same way with a plain {"type": "text", ...} message.
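
To keep text-only and multimodal requests on one code path, a small helper can assemble the `messages` list. This is a sketch, not part of the model's API: `build_messages` is a hypothetical name, and the `{"type": "video", "video": ...}` entry assumes the same content schema that `qwen_vl_utils` accepts for images.

```python
from typing import Any, Optional


def build_messages(prompt: str,
                   image: Optional[str] = None,
                   video: Optional[str] = None) -> list[dict[str, Any]]:
    """Build a single-turn chat with any media placed before the text."""
    content: list[dict[str, Any]] = []
    if image is not None:
        content.append({"type": "image", "image": image})
    if video is not None:
        content.append({"type": "video", "video": video})  # assumed schema
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    # Text-only request: the content list holds just the text entry.
    print(build_messages("Write a binary search in Go."))
    # Image request: the image entry precedes the question.
    print(build_messages("What does this chart show?", image="https://example.com/chart.png"))
```

The returned list can be passed to `proc.apply_chat_template` and `process_vision_info` exactly as in the Quickstart.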

Speed & efficiency

Salience 1 is built to be fast in production, not just accurate:

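Speculative decoding delivers a 1.5–2.5× speedup on code and structured text with no change to the output distribution (identical outputs under greedy decoding): a lightweight draft model proposes tokens and the main model verifies them in a single forward pass. Supported natively in transformers (assistant_model=) and in vLLM (--speculative-config in current releases; older releases used --speculative-model).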
Adaptive thinking. Append /no_think for instant direct answers, or /think to unlock deep step-by-step reasoning on hard math and multi-step agentic planning — you spend latency only when the task is worth it.
Runs on consumer hardware. 4-bit quantization brings the full model onto a single consumer GPU; bf16/fp16 serves comfortably on one modern accelerator with room for long context.
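
Why speculative decoding leaves greedy outputs unchanged is easiest to see in a toy version. The sketch below uses two stand-in "models" (plain functions that map a token sequence to the next token, both invented for illustration) rather than real networks; in a real system the verification loop is one batched forward pass of the target model.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def greedy_decode(model: NextToken, prompt: Sequence[int], max_new_tokens: int) -> List[int]:
    """Reference decoding: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: Sequence[int],
                       max_new_tokens: int, num_draft_tokens: int = 4) -> List[int]:
    """Draft proposes a block of tokens; target keeps the longest agreeing prefix."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft model guesses several tokens ahead.
        proposal: List[int] = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each guess. Only the target's own token is ever kept,
        #    so the result cannot differ from plain greedy decoding.
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)
            if guess != expected:
                break  # first disagreement: keep the correction, discard the rest
        else:
            accepted.append(target(tokens + accepted))  # all guesses right: bonus token
        tokens.extend(accepted)
    return tokens[:limit]


def target_model(tokens: Sequence[int]) -> int:
    return (3 * tokens[-1] + 1) % 11  # toy stand-in for the full model


def draft_model(tokens: Sequence[int]) -> int:
    guess = (3 * tokens[-1] + 1) % 11
    return guess if tokens[-1] % 4 else (guess + 1) % 11  # wrong on some inputs


if __name__ == "__main__":
    fast = speculative_decode(target_model, draft_model, [2], max_new_tokens=20)
    slow = greedy_decode(target_model, [2], max_new_tokens=20)
    print(fast == slow)
```

The speedup comes from how often the draft agrees with the target, which is why predictable text such as code and structured output benefits most.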

Prompting tips

Code: specify language, constraints ("no external libraries"), and the exact I/O contract.
Agentic / tools: give the tool schema and ask for the call as strict JSON.
Math/logic: ask it to reason step by step; it is tuned to externalize its work.
Vision: put the image/video before the question in the message content.
Sampling (Qwen3 family): thinking → temperature=0.6, top_p=0.95, top_k=20; direct answers → temperature=0.7, top_p=0.8, top_k=20.
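
The sampling presets above pair naturally with the /think and /no_think switches described under Speed & efficiency. A minimal sketch, assuming the switch is appended to the end of the user prompt; `prepare_request` and `SAMPLING_PRESETS` are hypothetical names, and `do_sample=True` is added here because transformers ignores the sampling parameters otherwise.

```python
from typing import Any, Dict, Tuple

# Values copied from the sampling tip above; do_sample is an addition of this sketch.
SAMPLING_PRESETS: Dict[str, Dict[str, Any]] = {
    "thinking": {"do_sample": True, "temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "direct": {"do_sample": True, "temperature": 0.7, "top_p": 0.8, "top_k": 20},
}


def prepare_request(prompt: str, thinking: bool) -> Tuple[str, Dict[str, Any]]:
    """Append the mode switch and return the matching generation kwargs."""
    switch = "/think" if thinking else "/no_think"
    preset = SAMPLING_PRESETS["thinking" if thinking else "direct"]
    return f"{prompt.rstrip()} {switch}", dict(preset)  # copy so callers can mutate it


if __name__ == "__main__":
    text, gen_kwargs = prepare_request("Prove that the sum of two odd numbers is even.", thinking=True)
    print(text)
    print(gen_kwargs)
    # The kwargs can then be forwarded, e.g. model.generate(**inputs, **gen_kwargs, max_new_tokens=1024)
```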

Deployment

Single-GPU: loads in bf16/fp16 with device_map="auto" on one modern accelerator; 4-bit quantization fits the model on a single consumer GPU.
Serving: integrates with standard transformers generation and vision-capable serving stacks such as vLLM (with optional speculative decoding) for high-throughput production use.
Quantized formats: GGUF and other community quantizations are supported.
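
The hardware claims follow from simple arithmetic on weight size. The sketch below is a rough, weights-only estimate under stated assumptions: a nominal 9 billion parameters, 16 GB per NVIDIA T4, and 1 GB = 10^9 bytes. It ignores the KV cache, activations, and quantization metadata, all of which add to real memory use.

```python
import math

NUM_PARAMS = 9_000_000_000  # nominal "9B"; the exact count may differ
T4_MEMORY_GB = 16           # memory on one NVIDIA T4
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gb(num_params: int, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus(required_gb: float, gpu_gb: float) -> int:
    """Smallest number of equal GPUs whose combined memory holds the weights."""
    return math.ceil(required_gb / gpu_gb)


if __name__ == "__main__":
    for precision, width in BYTES_PER_PARAM.items():
        size = weight_memory_gb(NUM_PARAMS, width)
        print(f"{precision}: {size:.1f} GB of weights, at least {min_gpus(size, T4_MEMORY_GB)} x T4")
```

Under these assumptions fp16 weights come to about 18 GB, which exceeds one T4 and explains the two-GPU sharded setup, while 4-bit weights come to about 4.5 GB and leave headroom on a single card.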

Limitations & responsible use

Salience 1 can be confidently wrong. Verify mathematical and factual claims.
Generated code may be insecure or incorrect — review before running, never execute untrusted output.
Long-context and long-video inputs increase latency and memory substantially.
It inherits the licenses, biases, and failure modes of all source models. Do not use it for surveillance, manipulation, or any use that violates applicable law or the Apache-2.0 terms.
No audio modality.
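
In agentic use, the advice to review output before acting on it applies to tool calls as well as code. A minimal sketch of validating a strict-JSON tool call before dispatching it follows; the `{"name": ..., "arguments": ...}` shape and the tools in `ALLOWED_TOOLS` are assumptions made for illustration, not a schema defined by this model card.

```python
import json
from typing import Any, Dict

# Hypothetical allowlist: tool name -> required argument names.
ALLOWED_TOOLS = {"get_weather": {"city"}, "run_tests": {"path"}}


def parse_tool_call(raw: str) -> Dict[str, Any]:
    """Parse model output as a tool call; raise ValueError on anything unexpected."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    arguments = call.get("arguments")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(arguments, dict):
        raise ValueError("arguments must be a JSON object")
    if set(arguments) != ALLOWED_TOOLS[name]:
        raise ValueError(f"expected arguments {sorted(ALLOWED_TOOLS[name])}, got {sorted(arguments)}")
    return {"name": name, "arguments": arguments}


if __name__ == "__main__":
    print(parse_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
    try:
        parse_tool_call('{"name": "delete_everything", "arguments": {}}')
    except ValueError as err:
        print("rejected:", err)
```

Rejecting unknown tools and unexpected arguments up front keeps a malformed or adversarially prompted call from reaching anything with side effects.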

Citation

```bibtex
@misc{vectionlabs2026salience1,
  title  = {Salience 1 (9B): A Multimodal Reasoning and Coding Model},
  author = {Vection Labs},
  year   = {2026},
  url    = {https://huggingface.co/vectionlabs/Salience-1-9B}
}
```
