License: apache-2.0
Abstract
Maestro1-9B is a dense, 9-billion-parameter vision-language model built for hard problems: multi-step mathematical proof, competitive-programming-grade code synthesis, and visual reasoning over images and video — within a single model and a single context window of up to 1M tokens.
It is designed for users who care less about chat pleasantries and more about whether the model can actually solve the thing: derive the bound, find the bug, read the diagram, finish the proof. Maestro1-9B pairs an explicit step-by-step reasoning mode with native multimodal perception, so the same chain of thought that solves a math olympiad problem can also reason about a chart, a UI screenshot, or a short clip.
Highlights
- Reasoning-first. Produces structured, inspectable chains of thought for math, logic, and code.
- Genuinely multimodal. Images and video are first-class inputs, not bolted-on captioning.
- Long context. Up to 1M tokens via interleaved multimodal RoPE — whole codebases, long papers, or long videos in a single prompt.
- Open weights. Apache-2.0, transformers-native, single-file deployment.
- 9B dense. Runs on a single modern accelerator; no mixture-of-experts routing to manage.
Model overview
| Property | Value |
|---|---|
| Parameters | 9B (dense) |
| Modalities | text, image, video → text |
| Context window | up to 1,000,000 tokens (interleaved multimodal RoPE) |
| Precision | bfloat16 |
| Architecture | decoder-only transformer LM + native vision encoder |
| License | Apache-2.0 |
| Library | 🤗 transformers (AutoModelForImageTextToText) |
Intended use
Maestro1-9B targets technical assistance and research:
- Step-by-step math and quantitative reasoning.
- Code generation, explanation, debugging, and review.
- Visual question answering and document/diagram/chart understanding.
- Video understanding over short clips.
- Long-document and long-context analysis.
It is not intended for high-stakes decisions without human review, nor as a source of truth for medical, legal, or financial advice.
Benchmarks
Reasoning, math & code
| Benchmark | Setting | Maestro1-9B |
|---|---|---|
| GSM8K | 0-shot CoT, exact match | — |
| MATH-500 | 0-shot CoT, exact match | — |
| AIME 2024 | 0-shot, pass@1 | — |
| HumanEval | 0-shot, pass@1 | — |
| MBPP | 3-shot, pass@1 | — |
| MMLU | 0-shot | — |
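The card does not say how these settings are scored. For reference, pass@1 and exact match are commonly defined as in the sketch below. The function names and the whitespace-only normalization in `exact_match` are assumptions for illustration, not the evaluation harness used for Maestro1-9B.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c passed the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def exact_match(prediction: str, reference: str) -> bool:
    """Compare final answers after trimming whitespace (assumed normalization)."""
    return prediction.strip() == reference.strip()
```

For k = 1 the estimate reduces to c / n, the fraction of sampled solutions that pass.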
Multimodal
| Benchmark | Setting | Maestro1-9B |
|---|---|---|
| MMMU (val) | 0-shot | — |
| MathVista (testmini) | 0-shot | — |
| DocVQA (val) | 0-shot, ANLS | — |
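ANLS (Average Normalized Levenshtein Similarity) is the usual DocVQA metric: each answer scores one minus its normalized edit distance to the closest reference, or zero once that distance reaches 0.5. The sketch below follows that common definition. The function names are hypothetical, and the card does not confirm that this exact procedure was used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any reference."""
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

Here `references` holds a list of acceptable answers per question, so a prediction is credited against whichever reference it matches best.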
Quickstart
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info  # separate package: pip install qwen-vl-utils

model_id = "vectionlabs/Maestro1-9B"
proc = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

# Image first, then the question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/diagram.png"},
        {"type": "text", "text": "Explain what this diagram proves, step by step."},
    ],
}]

# Render the chat prompt and load the referenced images/videos.
text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
imgs, vids = process_vision_info(messages)
inputs = proc(text=[text], images=imgs, videos=vids, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
print(proc.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
Text-only works the same way with a plain {"type": "text", ...} message.
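A text-only request might look like the sketch below. The helper name `text_message` is hypothetical; the message layout is the Quickstart one with the image entry dropped.

```python
def text_message(prompt: str) -> list:
    """Build a single-turn, text-only chat in the Quickstart message format."""
    return [{"role": "user", "content": [{"type": "text", "text": prompt}]}]


messages = text_message("Prove that the sum of two even integers is even.")
# `messages` can then go through proc.apply_chat_template(...) as in the Quickstart.
```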
Prompting tips
- For math/logic, ask the model to reason step by step; it is tuned to externalize its work.
- For code, specify language, constraints ("no external libraries"), and the exact I/O contract.
- For vision, put the image/video before the question in the message content.
- Use a lower temperature (0.2–0.7) for more repeatable reasoning; raise it for brainstorming.
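The ordering and temperature tips can be captured in two small helpers, sketched below. Everything here is illustrative: `build_messages`, `sampling_kwargs`, and the preset values in `TEMPERATURE` are assumptions, and the `{"type": "video", ...}` entry mirrors the image entry from the Quickstart rather than anything the card specifies.

```python
from typing import Optional

# Assumed presets: 0.2 sits in the suggested 0.2–0.7 range; 1.0 is a guess for brainstorming.
TEMPERATURE = {"math": 0.2, "code": 0.2, "brainstorm": 1.0}


def build_messages(question: str, image: Optional[str] = None,
                   video: Optional[str] = None) -> list:
    """Place any image/video entries before the question text."""
    content = []
    if image is not None:
        content.append({"type": "image", "image": image})
    if video is not None:
        content.append({"type": "video", "video": video})  # assumed entry format
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


def sampling_kwargs(task: str) -> dict:
    """Keyword arguments for model.generate() with sampling enabled."""
    return {"do_sample": True, "temperature": TEMPERATURE[task]}
```

The result of `sampling_kwargs("math")` can be unpacked into the Quickstart call, for example `model.generate(**inputs, max_new_tokens=1024, **sampling_kwargs("math"))`.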
Deployment
- GPU: a single 24–80 GB GPU in bf16 (device_map="auto").
- Serving: compatible with standard transformers generation; for high throughput use a vision-capable serving stack.
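A rough way to see where the 24 GB lower bound comes from is to count the weights alone. The sketch below assumes exactly 9e9 parameters at 2 bytes each (bf16) and ignores activations, the KV cache, and the vision encoder's working memory, so actual usage will be higher, especially at long context. The function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


weights_gb = weight_memory_gb(9e9)  # nominal 9B parameters in bf16
headroom_24gb = 24 - weights_gb     # what a 24 GB card has left for everything else
```

Under these assumptions the weights take about 18 GB, leaving roughly 6 GB on a 24 GB card, which is why long prompts and video inputs push toward the larger end of the range.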
Limitations & responsible use
- Maestro1-9B can be confidently wrong. Verify mathematical and factual claims.
- Generated code may be insecure or incorrect. Review it before running, and never execute untrusted output.
- Long-context and long-video inputs increase latency and memory substantially.
- It inherits the biases and failure modes of large web-trained models. Do not use it for surveillance, manipulation, or any use that violates applicable law or the Apache-2.0 terms.
- No audio modality.
Citation
```bibtex
@misc{vectionlabs2026maestro1,
  title  = {Maestro1-9B: A Multimodal Reasoning Model},
  author = {Vection Labs},
  year   = {2026},
  url    = {https://huggingface.co/vectionlabs/Maestro1-9B}
}
```