Stanford-CongLab

LabHorizon-Model

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

🔎 Overview

This repository releases the LabHorizon Qwen3.6 LoRA adapter trained from Qwen/Qwen3.6-35B-A3B on the 6,000-sample LabHorizon training split. The model is optimized for Protocol-Aligned Action Prediction:

Level 1: connect multi-view laboratory assets and historical actions to the gold next action.
Level 2: produce a structured long-horizon experimental action sequence from context, constraints, available inputs, and an action pool.

This model repository is the model-side companion to the LabHorizon code and dataset releases. The GitHub repository is the full project entry point; the two dataset cards describe Level 1 and Level 2 data; this card focuses on the trained Qwen3.6 adapter, its files, training signal, evaluation result, and loading instructions.

📰 News

2026-06-03: Released the LabHorizon LoRA adapter weights and reproducibility files on Hugging Face.
2026-06-03: Updated the public LabHorizon leaderboards with Claude Opus 4.8 and MiniMax M3 direct-prompting evaluations.

✨ Highlights

📦 Datasets

The adapter is trained on the same public LabHorizon train split described by the two dataset cards. The evaluation results below use the same v20260510-repaired test split as the GitHub README and the dataset READMEs.

Table with columns: Level, Hugging Face Dataset, Input, Target, Metric
Level	Hugging Face Dataset	Input	Target	Metric
Level 1	LabHorizon-3D-Asset-Perception	Three asset views, historical actions, candidate next actions	Gold next action	Next-action accuracy
Level 2	LabHorizon Protocol-Aligned Planning	Context, goal, constraints, available inputs, action pool	Gold experimental action sequence	L2 Action Sequence Similarity, L2 Parameter Accuracy

📦 Model

🧾 Model Card

Table with columns: Field, Value
Field	Value
Base model	`Qwen/Qwen3.6-35B-A3B`
Adapter type	LoRA / PEFT adapter
Training data	6,000 LabHorizon train samples
Level 1 training split	3,000 multimodal laboratory 3D asset samples
Level 2 training split	3,000 text-only protocol-aligned planning samples
Main task	Protocol-aligned laboratory action prediction
Main metrics	Level 1 Next Action Accuracy; L2 Action Sequence Similarity and L2 Parameter Accuracy
Intended loading mode

The released weights are an adapter, not the base model. Users must load them with the corresponding Qwen3.6-35B-A3B base model.

📁 Files

Table with columns: File, Meaning
File	Meaning
`adapter_model.safetensors`	LoRA adapter weights.
`adapter_config.json`	PEFT adapter configuration.
`tokenizer.json`, `tokenizer_config.json`, `chat_template.jinja`	Tokenizer and chat template files used for training/evaluation.
`processor_config.json`	Processor configuration.
`train_results.json`, ,

📏 Evaluation

LabHorizon uses the same evaluation contracts across direct-prompting models, the trained adapter, and the trained+agents setting.

Table with columns: Level, Output format, Metric
Level	Output format	Metric
Level 1	Reasoning followed by a final next action	Next Action Accuracy
Level 2	Structured action sequence parsed by Python AST	L2 Action Sequence Similarity, L2 Parameter Accuracy, L2 Final Score

For Level 1, the evaluator maps the final next action back to the candidate list. For Level 2, the evaluator parses action names, keyword parameters, assigned intermediate variables, and dependency references with Python AST. This model card reports the same metrics as the GitHub and dataset READMEs.

🏆 Leaderboard

The tables below report direct-prompting baselines on the same test split used for the trained model comparison. The full code and evaluation scripts are maintained in the LabHorizon GitHub repository.

🔬 Level 1: 3D Asset Perception

Table with columns: Rank, Model, Next Action Accuracy
Rank	Model	Next Action Accuracy
🥇	Grok 4.3	0.555
🥈	Kimi K2.6	0.550
🥉	GPT-5.5	0.535
4	GPT-5.4	0.520
5	Claude Opus 4.8	0.515
6	MiniMax M3	0.510

🧪 Level 2: Protocol-Aligned Planning

Table with columns: Rank, Model, L2 Final Score, L2 Action Sequence Similarity, L2 Parameter Accuracy
Rank	Model	L2 Final Score	L2 Action Sequence Similarity	L2 Parameter Accuracy
🥇	Gemini 3.1 Pro	0.3263	0.3195	0.3331
🥈	Grok 4.3	0.3244	0.3339	0.3148
🥉	Kimi K2.6	0.3150	0.2845	0.3456

🧬 Training Data and Setup

The adapter is trained on the public LabHorizon training split:

Table with columns: Component, Size, Role
Component	Size	Role
Level 1 train	3,000	Multi-view laboratory asset perception and next-action prediction
Level 2 train	3,000	Protocol-aligned long-horizon experimental action-sequence planning
Total train	6,000	Unified supervised fine-tuning data for laboratory action prediction

The training data are converted into Qwen chat format and then into the LLaMA-Factory ShareGPT-VL-style format. Level 1 keeps the three asset images and candidate next actions; Level 2 uses text-only context, constraints, available inputs, action pool, and gold experimental action sequence.

Main training settings:

Table with columns: Setting, Value
Setting	Value
LoRA rank / alpha / dropout	`32 / 64 / 0.10`
Learning rate	`1.0e-4`
Scheduler	Cosine
Warmup ratio	`0.10`
Cutoff length	`4096`
Image max pixels	`501760`

🧠 Training Result

The table compares direct-prompting SOTA/baseline systems, the base Qwen model, and the trained+agents system evaluated on the same LabHorizon test splits.

Table with columns: System, Level 1 Next Action Accuracy, L2 Action Sequence Similarity, L2 Parameter Accuracy, L2 Final Score
System	Level 1 Next Action Accuracy	L2 Action Sequence Similarity	L2 Parameter Accuracy	L2 Final Score
Grok 4.3	0.555	0.3339	0.3148	0.3244
Gemini 3.1 Pro	0.465	0.3195	0.3331	0.3263
GPT-5.5	0.535	0.2092	0.2459	0.2276

Agent setting: Qwen3.6-35B-A3B(trained) is used as Actor, and Gemini 3.1 Pro is used as Simulator/Selector. The Simulator/Selector choice is the current setting and has not been exhaustively ablated.

The trained adapter improves both levels over the direct Qwen3.6-35B-A3B baseline. Level 1 improves from 0.475 to 0.635, indicating better laboratory asset-to-action alignment. L2 Final Score improves from 0.2534 to 0.4100, indicating better action ordering, parameter retention, and dependency tracking. The trained+agents setting further improves consistency by selecting candidates with stronger symbolic protocol-state validity.

🤖 Actor-Simulator-Selector Agent

The trained+agents result uses this adapter as the Actor and combines it with a separate Simulator/Selector model. The agent is not a physical simulator and does not execute wet-lab actions. It samples candidate next actions or action sequences, checks symbolic protocol-state consistency, and selects the most consistent candidate.

The trained Actor reads the same task inputs used by the public datasets: multi-view asset images, historical actions, and candidate next actions for Level 1, or wet experiment context, constraints, available inputs, and an action pool for Level 2. The Simulator builds current and target symbolic protocol states and predicts candidate reagent/instrument state transitions. The Selector compares the candidate-state pairs and returns the selected action prediction, which is evaluated with Level 1 next-action accuracy or Level 2 AST-based action-sequence and parameter metrics.

Agent setting: Qwen3.6-35B-A3B(trained) is used as Actor, and Gemini 3.1 Pro is used as Simulator/Selector. This Simulator/Selector choice is the current setting and has not been exhaustively ablated.

🚀 Quick Start

Load Adapter

python
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel

base_id = "Qwen/Qwen3.6-35B-A3B"
adapter_id = "Stanford-CongLab/LabHorizon-Model"

processor = AutoProcessor.from_pretrained(adapter_id, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, adapter_id)

Evaluate with LabHorizon

Use the public code repository for evaluation and agent workflows:

bash
git clone https://github.com/Stanford-CongLab/LabHorizon
cd LabHorizon

Configure an OpenAI-compatible endpoint in .env, then run the Level 1 / Level 2 evaluators or the Actor-Simulator-Selector agent following the GitHub README.

For evaluation, use the public LabHorizon code repository and point the evaluator to a compatible model endpoint or local serving stack. The model card itself only releases the adapter and training artifacts.

⚠️ Intended Use

This adapter is intended for academic research on laboratory action prediction, experimental planning, and AI scientist systems. It is not an autonomous wet-lab controller. Outputs should be treated as model predictions and should not be used for safety-critical experimental decisions without expert review.

Recommended use cases:

Evaluate protocol-aligned next-action prediction and long-horizon planning.
Study how training data improves laboratory action prediction.
Use the adapter as the Actor in the Actor-Simulator-Selector framework.
Analyze remaining failures in action order, parameter copying, dependency tracking, and protocol-stage consistency.

Not intended for:

Autonomous wet-lab execution.
Clinical, safety-critical, or regulated decision-making.
Generating executable biological protocols without expert validation.

📜 Citation

Coming soon...

Model provider

Stanford-CongLab

Model tree

Base

Qwen/Qwen3.6-35B-A3B

Adapter

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

🔎 Overview

Level 1: connect multi-view laboratory assets and historical actions to the gold next action.
Level 2: produce a structured long-horizon experimental action sequence from context, constraints, available inputs, and an action pool.

📰 News

2026-06-03: Released the LabHorizon LoRA adapter weights and reproducibility files on Hugging Face.
2026-06-03: Updated the public LabHorizon leaderboards with Claude Opus 4.8 and MiniMax M3 direct-prompting evaluations.

✨ Highlights

📦 Datasets

Table with columns: Level, Hugging Face Dataset, Input, Target, Metric
Level	Hugging Face Dataset	Input	Target	Metric
Level 1	LabHorizon-3D-Asset-Perception	Three asset views, historical actions, candidate next actions	Gold next action	Next-action accuracy
Level 2	LabHorizon Protocol-Aligned Planning	Context, goal, constraints, available inputs, action pool	Gold experimental action sequence	L2 Action Sequence Similarity, L2 Parameter Accuracy

📦 Model

🧾 Model Card

Table with columns: Field, Value
Field	Value
Base model	`Qwen/Qwen3.6-35B-A3B`
Adapter type	LoRA / PEFT adapter
Training data	6,000 LabHorizon train samples
Level 1 training split	3,000 multimodal laboratory 3D asset samples
Level 2 training split	3,000 text-only protocol-aligned planning samples
Main task	Protocol-aligned laboratory action prediction
Main metrics	Level 1 Next Action Accuracy; L2 Action Sequence Similarity and L2 Parameter Accuracy
Intended loading mode

The released weights are an adapter, not the base model. Users must load them with the corresponding Qwen3.6-35B-A3B base model.

📁 Files

Table with columns: File, Meaning
File	Meaning
`adapter_model.safetensors`	LoRA adapter weights.
`adapter_config.json`	PEFT adapter configuration.
`tokenizer.json`, `tokenizer_config.json`, `chat_template.jinja`	Tokenizer and chat template files used for training/evaluation.
`processor_config.json`	Processor configuration.
`train_results.json`, ,

📏 Evaluation

LabHorizon uses the same evaluation contracts across direct-prompting models, the trained adapter, and the trained+agents setting.

Table with columns: Level, Output format, Metric
Level	Output format	Metric
Level 1	Reasoning followed by a final next action	Next Action Accuracy
Level 2	Structured action sequence parsed by Python AST	L2 Action Sequence Similarity, L2 Parameter Accuracy, L2 Final Score

🏆 Leaderboard

🔬 Level 1: 3D Asset Perception

Table with columns: Rank, Model, Next Action Accuracy
Rank	Model	Next Action Accuracy
🥇	Grok 4.3	0.555
🥈	Kimi K2.6	0.550
🥉	GPT-5.5	0.535
4	GPT-5.4	0.520
5	Claude Opus 4.8	0.515
6	MiniMax M3	0.510

🧪 Level 2: Protocol-Aligned Planning

Table with columns: Rank, Model, L2 Final Score, L2 Action Sequence Similarity, L2 Parameter Accuracy
Rank	Model	L2 Final Score	L2 Action Sequence Similarity	L2 Parameter Accuracy
🥇	Gemini 3.1 Pro	0.3263	0.3195	0.3331
🥈	Grok 4.3	0.3244	0.3339	0.3148
🥉	Kimi K2.6	0.3150	0.2845	0.3456

🧬 Training Data and Setup

The adapter is trained on the public LabHorizon training split:

Table with columns: Component, Size, Role
Component	Size	Role
Level 1 train	3,000	Multi-view laboratory asset perception and next-action prediction
Level 2 train	3,000	Protocol-aligned long-horizon experimental action-sequence planning
Total train	6,000	Unified supervised fine-tuning data for laboratory action prediction

Main training settings:

Table with columns: Setting, Value
Setting	Value
LoRA rank / alpha / dropout	`32 / 64 / 0.10`
Learning rate	`1.0e-4`
Scheduler	Cosine
Warmup ratio	`0.10`
Cutoff length	`4096`
Image max pixels	`501760`

🧠 Training Result

The table compares direct-prompting SOTA/baseline systems, the base Qwen model, and the trained+agents system evaluated on the same LabHorizon test splits.

Table with columns: System, Level 1 Next Action Accuracy, L2 Action Sequence Similarity, L2 Parameter Accuracy, L2 Final Score
System	Level 1 Next Action Accuracy	L2 Action Sequence Similarity	L2 Parameter Accuracy	L2 Final Score
Grok 4.3	0.555	0.3339	0.3148	0.3244
Gemini 3.1 Pro	0.465	0.3195	0.3331	0.3263
GPT-5.5	0.535	0.2092	0.2459	0.2276

🤖 Actor-Simulator-Selector Agent

🚀 Quick Start

Load Adapter

python
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel

base_id = "Qwen/Qwen3.6-35B-A3B"
adapter_id = "Stanford-CongLab/LabHorizon-Model"

processor = AutoProcessor.from_pretrained(adapter_id, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, adapter_id)

Evaluate with LabHorizon

Use the public code repository for evaluation and agent workflows:

bash
git clone https://github.com/Stanford-CongLab/LabHorizon
cd LabHorizon

Configure an OpenAI-compatible endpoint in .env, then run the Level 1 / Level 2 evaluators or the Actor-Simulator-Selector agent following the GitHub README.

⚠️ Intended Use

Recommended use cases:

Evaluate protocol-aligned next-action prediction and long-horizon planning.
Study how training data improves laboratory action prediction.
Use the adapter as the Actor in the Actor-Simulator-Selector framework.
Analyze remaining failures in action order, parameter copying, dependency tracking, and protocol-stage consistency.

Not intended for:

Autonomous wet-lab execution.
Clinical, safety-critical, or regulated decision-making.
Generating executable biological protocols without expert validation.

📜 Citation

Coming soon...