README

License: apache-2.0

Abstract

Maestro1-9B is a dense, 9-billion-parameter vision-language model built for hard problems: multi-step mathematical proof, competitive-programming-grade code synthesis, and visual reasoning over images and video — within a single model and a single context window of up to 1M tokens.

It is designed for users who care less about chat pleasantries and more about whether the model can actually solve the thing: derive the bound, find the bug, read the diagram, finish the proof. Maestro1-9B pairs an explicit step-by-step reasoning mode with native multimodal perception, so the same chain of thought that solves a math olympiad problem can also reason about a chart, a UI screenshot, or a short clip.

Highlights

  • Reasoning-first. Produces structured, inspectable chains of thought for math, logic, and code.
  • Genuinely multimodal. Images and video are first-class inputs, not bolted-on captioning.
  • Long context. Up to 1M tokens via interleaved multimodal RoPE — whole codebases, long papers, or long videos in a single prompt.
  • Open weights. Apache-2.0, transformers-native, single-file deployment.
  • 9B dense. Runs on a single modern accelerator; no mixture-of-experts routing to manage.

Model overview

Parameters: 9B (dense)
Modalities: text, image, video → text
Context window: up to 1,000,000 tokens (interleaved multimodal RoPE)
Precision: bfloat16
Architecture: decoder-only transformer LM + native vision encoder
License: Apache-2.0
Library: 🤗 transformers (AutoModelForImageTextToText)

Intended use

Maestro1-9B targets technical assistance and research:

  • Step-by-step math and quantitative reasoning.
  • Code generation, explanation, debugging, and review.
  • Visual question answering and document/diagram/chart understanding.
  • Video understanding over short clips.
  • Long-document and long-context analysis.

It is not intended for high-stakes decisions without human review, nor as a source of truth for medical, legal, or financial advice.

Benchmarks

Reasoning, math & code

Benchmark | Setting | Maestro1-9B
GSM8K | 0-shot CoT, exact match |
MATH-500 | 0-shot CoT, exact match |
AIME 2024 | 0-shot, pass@1 |
HumanEval | 0-shot, pass@1 |
MBPP | 3-shot, pass@1 |
MMLU | 0-shot |
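
The pass@1 settings above are not defined further in this card. As a hedged illustration only, the sketch below shows the commonly used unbiased pass@k estimator, which takes n sampled solutions per problem of which c pass the tests. Whether this card's numbers would be computed this way is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: attempt budget.
    Returns the probability that at least one of k samples drawn
    without replacement from the n is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k draws include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k=1 the estimate reduces to the fraction of passing samples, c / n.
print(round(pass_at_k(n=10, c=3, k=1), 6))
```

A benchmark-level score would then be the mean of this value over all problems.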

Multimodal

Benchmark | Setting | Maestro1-9B
MMMU (val) | 0-shot |
MathVista (testmini) | 0-shot |
DocVQA (val) | 0-shot, ANLS |
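
For readers unfamiliar with the ANLS setting listed for DocVQA, here is a minimal sketch of Average Normalized Levenshtein Similarity as it is usually defined: each prediction is scored against its closest reference answer, and matches whose normalized edit distance reaches the threshold (0.5 by convention) score zero. The lowercasing and whitespace stripping are assumptions about normalization, not something this card specifies.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def anls(predictions: list[str], references: list[list[str]], tau: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any reference."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need one non-empty list of references per prediction")
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)


# One character differs out of four, so the similarity is 1 - 1/4.
print(anls(["abcd"], [["abcf"]]))  # 0.75
```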

Quickstart

python

import torch
from qwen_vl_utils import process_vision_info  # separate package: qwen-vl-utils
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "vectionlabs/Maestro1-9B"

# Load the processor (tokenizer + image/video preprocessing) and the bf16 model.
# Older transformers releases spell the dtype argument `torch_dtype`.
proc = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

# One user turn: the image comes first, then the question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/diagram.png"},
        {"type": "text", "text": "Explain what this diagram proves, step by step."},
    ],
}]

# Render the chat prompt as text, then collect the image/video inputs it refers to.
text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
imgs, vids = process_vision_info(messages)
inputs = proc(text=[text], images=imgs, videos=vids, return_tensors="pt").to(model.device)

# Generate, then decode only the new tokens by slicing off the prompt.
out = model.generate(**inputs, max_new_tokens=1024)
print(proc.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])

Text-only works the same way with a plain {"type": "text", ...} message.
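
As a minimal sketch of the text-only case, assuming the same message schema as the Quickstart: the helper name `text_only_messages` is hypothetical and is not part of transformers or the model repository.

```python
def text_only_messages(prompt: str) -> list[dict]:
    """Wrap a plain prompt in the chat-message structure used in the Quickstart."""
    return [{
        "role": "user",
        "content": [{"type": "text", "text": prompt}],
    }]


messages = text_only_messages("Prove that the sum of two even integers is even.")
print(messages[0]["content"][0]["type"])  # text
```

The resulting list can be passed to `proc.apply_chat_template` in place of the image-bearing `messages` from the Quickstart.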

Prompting tips

  • For math/logic, ask the model to reason step by step; it is tuned to externalize its work.
  • For code, specify language, constraints ("no external libraries"), and the exact I/O contract.
  • For vision, put the image/video before the question in the message content.
  • Use a lower temperature (0.2–0.7) for more consistent reasoning; raise it for brainstorming.
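
The tips on media ordering and temperature can be sketched as two small helpers. Both names (`build_messages`, `sampling_kwargs`) are hypothetical. The `{"type": "video", "video": ...}` entry assumes the same schema that `qwen_vl_utils` reads in the Quickstart, and the brainstorming temperature of 1.0 is an illustrative choice rather than a documented recommendation.

```python
from typing import Optional


def build_messages(question: str,
                   image: Optional[str] = None,
                   video: Optional[str] = None) -> list[dict]:
    """Build one user turn with any image/video placed before the question."""
    content = []
    if image is not None:
        content.append({"type": "image", "image": image})
    if video is not None:
        content.append({"type": "video", "video": video})  # assumed schema
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


def sampling_kwargs(mode: str = "reasoning") -> dict:
    """Return generate() keyword arguments for the given mode."""
    # 0.2 is the low end of the range suggested above; 1.0 is an assumption.
    temperature = {"reasoning": 0.2, "brainstorm": 1.0}[mode]
    return {"do_sample": True, "temperature": temperature}


msgs = build_messages("What trend does this chart show?",
                      image="https://example.com/chart.png")
print([part["type"] for part in msgs[0]["content"]])  # ['image', 'text']
```

The keyword arguments would be passed along with the inputs, for example `model.generate(**inputs, max_new_tokens=1024, **sampling_kwargs("reasoning"))`.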

Deployment

  • GPU: a single 24–80 GB GPU in bf16 (device_map="auto").
  • Serving: compatible with standard transformers generation; for high throughput use a vision-capable serving stack.
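
A rough back-of-the-envelope check on the GPU guidance above, assuming "9B" means about 9e9 parameters and that bf16 stores each parameter in 2 bytes. It counts weights only: the KV cache, activations, and vision inputs add to it, and this card gives no figures for those.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# 9e9 parameters (assumed) at 2 bytes each in bf16.
print(weight_memory_gb(9e9))  # 18.0
```

At roughly 18 GB for weights, a 24 GB card leaves only a few gigabytes for everything else, which is consistent with long-context workloads calling for the larger end of the stated range.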

Limitations & responsible use

  • Maestro1-9B can be confidently wrong. Verify mathematical and factual claims.
  • Generated code may be insecure or incorrect: review it before running, and never execute untrusted output.
  • Long-context and long-video inputs increase latency and memory substantially.
  • It inherits the biases and failure modes of large web-trained models. Do not use it for surveillance, manipulation, or any use that violates applicable law or the Apache-2.0 terms.
  • No audio modality.
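
One way to begin reviewing generated code without executing it is a static pass over its syntax tree. The sketch below is a coarse first filter under stated assumptions: the module and call lists are illustrative, the name `static_review` is hypothetical, and an empty result does not mean the code is safe.

```python
import ast

# Illustrative lists only; a real review policy would be broader.
RISKY_MODULES = {"os", "subprocess", "shutil", "socket"}
RISKY_CALLS = {"eval", "exec", "compile", "__import__"}


def static_review(source: str) -> list[str]:
    """Parse (never run) the source and report a few obvious red flags."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error: {err.msg} (line {err.lineno})"]

    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            names = []
        findings += [f"imports {name}" for name in names if name in RISKY_MODULES]

        # Flag direct calls such as eval(...) or exec(...).
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in RISKY_CALLS):
            findings.append(f"calls {node.func.id}()")
    return findings


print(static_review("import os\nos.system('ls')"))  # ['imports os']
```

Anything flagged here, and anything not flagged, still needs human review before it is run.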

Citation

bibtex

@misc{vectionlabs2026maestro1,
  title  = {Maestro1-9B: A Multimodal Reasoning Model},
  author = {Vection Labs},
  year   = {2026},
  url    = {https://huggingface.co/vectionlabs/Maestro1-9B}
}

Model provider: vectionlabs

Model tree: Base (this model)

Modalities

Input: Text, Image
Output: Text
