vectionlabs

Maestro1-9B

License: apache-2.0

Abstract

Maestro1-9B is a dense, 9-billion-parameter vision-language model built for hard problems: multi-step mathematical proof, competitive-programming-grade code synthesis, and visual reasoning over images and video — within a single model and a single context window of up to 1M tokens.

It is designed for users who care less about chat pleasantries and more about whether the model can actually solve the thing: derive the bound, find the bug, read the diagram, finish the proof. Maestro1-9B pairs an explicit step-by-step reasoning mode with native multimodal perception, so the same chain of thought that solves a math olympiad problem can also reason about a chart, a UI screenshot, or a short clip.

Highlights

Reasoning-first. Produces structured, inspectable chains of thought for math, logic, and code.
Genuinely multimodal. Images and video are first-class inputs, not bolted-on captioning.
Long context. Up to 1M tokens via interleaved multimodal RoPE — whole codebases, long papers, or long videos in a single prompt.
Open weights. Apache-2.0, transformers-native, single-file deployment.
9B dense. Runs on a single modern accelerator; no mixture-of-experts routing to manage.

Model overview

Parameters	9B (dense)
Modalities	text, image, video → text
Context window	up to 1,000,000 tokens (interleaved multimodal RoPE)
Precision	bfloat16
Architecture	decoder-only transformer LM + native vision encoder
License	Apache-2.0
Library	🤗 `transformers` (`AutoModelForImageTextToText`)

Intended use

Maestro1-9B targets technical assistance and research:

Step-by-step math and quantitative reasoning.
Code generation, explanation, debugging, and review.
Visual question answering and document/diagram/chart understanding.
Video understanding over short clips.
Long-document and long-context analysis.

It is not intended for high-stakes decisions without human review, nor as a source of truth for medical, legal, or financial advice.

Benchmarks

Reasoning, math & code

Benchmark	Setting	Maestro1-9B
GSM8K	0-shot CoT, exact match	—
MATH-500	0-shot CoT, exact match	—
AIME 2024	0-shot, pass@1	—
HumanEval	0-shot, pass@1	—
MBPP	3-shot, pass@1	—
MMLU	0-shot	—
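
The pass@1 setting in the table can be read through the standard unbiased pass@k estimator: draw n samples per problem, count the c that pass the tests, and average 1 - C(n-c, k)/C(n, k) over problems. The sketch below is illustrative only. The card does not say which evaluation harness was used, and the function names and tallies are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every k-subset contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over problems; each item is (samples drawn, samples passed)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical tallies for three problems with one sample each (not real results).
print(benchmark_pass_at_k([(1, 1), (1, 0), (1, 1)], k=1))
```

With a single sample per problem, pass@1 reduces to the fraction of problems solved on the first attempt.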

Multimodal

Benchmark	Setting	Maestro1-9B
MMMU (val)	0-shot	—
MathVista (testmini)	0-shot	—
DocVQA (val)	0-shot, ANLS	—
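
For readers unfamiliar with the ANLS metric listed for DocVQA, here is a minimal sketch of Average Normalized Levenshtein Similarity: each prediction scores 1 minus its normalized edit distance to the closest reference answer, and scores whose distance reaches the 0.5 threshold are set to 0. Lowercasing and whitespace stripping are an assumed normalization, the function names are hypothetical, and this is not the evaluation code behind the table.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def anls(predictions: list[str], references: list[list[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any reference."""
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores)


# Hypothetical prediction with one wrong digit, scored against one reference.
print(anls(["invoice 1234"], [["invoice 1235"]]))
```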

Quickstart

```python
import torch
from qwen_vl_utils import process_vision_info  # separate package: pip install qwen-vl-utils
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "vectionlabs/Maestro1-9B"
proc = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

# The image comes before the question in the message content.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/diagram.png"},
        {"type": "text", "text": "Explain what this diagram proves, step by step."},
    ],
}]

# Render the chat template, then load the images/videos the messages reference.
text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
imgs, vids = process_vision_info(messages)
inputs = proc(text=[text], images=imgs, videos=vids, return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens (skip the prompt).
out = model.generate(**inputs, max_new_tokens=1024)
print(proc.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```

Text-only works the same way with a plain {"type": "text", ...} message.
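
As a minimal sketch of the text-only case, the helper below wraps a prompt in the same message structure the Quickstart uses. The helper name `build_text_messages` and the sample prompt are hypothetical and not part of any library.

```python
def build_text_messages(prompt: str) -> list[dict]:
    """Wrap a plain prompt in the chat structure used in the Quickstart."""
    return [{
        "role": "user",
        "content": [{"type": "text", "text": prompt}],
    }]


messages = build_text_messages("Prove that the sum of two even integers is even.")
print(messages)
```

The resulting `messages` list can then go through the same `apply_chat_template` and `generate` steps as in the Quickstart.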

Prompting tips

For math/logic, ask the model to reason step by step; it is tuned to externalize its work.
For code, specify language, constraints ("no external libraries"), and the exact I/O contract.
For vision, put the image/video before the question in the message content.
Lower the temperature (0.2–0.7) for more consistent reasoning; raise it for brainstorming.
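
The tips above can be sketched as a few small helpers. All function names here are hypothetical. The `{"type": "video", "video": ...}` entry assumes the same message convention as the image entry in the Quickstart. The brainstorming temperature of 0.9 is an illustrative choice, since the card only says to raise it. `do_sample`, `temperature`, and `max_new_tokens` are standard `generate` arguments in transformers.

```python
def build_media_messages(media_type: str, source: str, question: str) -> list[dict]:
    """Build a user message with the image or video placed before the question."""
    if media_type not in ("image", "video"):
        raise ValueError("media_type must be 'image' or 'video'")
    return [{
        "role": "user",
        "content": [
            {"type": media_type, media_type: source},  # media first
            {"type": "text", "text": question},        # question second
        ],
    }]


def build_code_prompt(language: str, task: str, constraints: list[str],
                      io_contract: str) -> str:
    """State the language, the constraints, and the exact I/O contract."""
    lines = [f"Language: {language}", f"Task: {task}", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines.append(f"I/O contract: {io_contract}")
    return "\n".join(lines)


def generation_kwargs(task: str) -> dict:
    """Lower temperature for reasoning; a higher (assumed) value for brainstorming."""
    temperature = 0.2 if task == "reasoning" else 0.9
    return {"do_sample": True, "temperature": temperature, "max_new_tokens": 1024}


messages = build_media_messages(
    "image", "https://example.com/diagram.png", "What does this chart show?"
)
prompt = build_code_prompt(
    "Python", "Reverse a singly linked list.", ["no external libraries"],
    "reverse(head) takes the head node and returns the new head",
)
print(messages)
print(prompt)
print(generation_kwargs("reasoning"))
```

The returned dictionary can be unpacked into the Quickstart call as `model.generate(**inputs, **generation_kwargs("reasoning"))`, dropping the separate `max_new_tokens` argument used there.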

Deployment

GPU: a single 24–80 GB GPU in bf16 (device_map="auto").
Serving: compatible with standard transformers generation; for high throughput, use a vision-capable serving stack.
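
The lower end of the GPU range follows from the weight footprint: 9 billion parameters at 2 bytes each (bfloat16) come to about 18 GB before any activations or KV cache. A back-of-the-envelope sketch, using decimal gigabytes and a hypothetical helper name:

```python
def weight_memory_gb(n_params: int, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in decimal GB (bfloat16 is 2 bytes/param)."""
    return n_params * bytes_per_param / 1e9


weights = weight_memory_gb(9_000_000_000)
for gpu_gb in (24, 80):
    # What remains must hold activations, the KV cache, and framework overhead.
    print(f"{gpu_gb} GB GPU: {weights:.0f} GB weights, "
          f"{gpu_gb - weights:.0f} GB left for everything else")
```

The remaining headroom is what bounds usable context length in practice, so long-context and long-video workloads favor the larger cards.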

Limitations & responsible use

Maestro1-9B can be confidently wrong. Verify mathematical and factual claims.
Generated code may be insecure or incorrect. Review it before running, and never execute untrusted output.
Long-context and long-video inputs increase latency and memory substantially.
It inherits the biases and failure modes of large web-trained models. Do not use it for surveillance, manipulation, or any use that violates applicable law or the Apache-2.0 terms.
No audio modality.

Citation

```bibtex
@misc{vectionlabs2026maestro1,
  title  = {Maestro1-9B: A Multimodal Reasoning Model},
  author = {Vection Labs},
  year   = {2026},
  url    = {https://huggingface.co/vectionlabs/Maestro1-9B}
}
```
