README

License: mit

Overview

HiDream-O1-Image is a natively unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) without external VAEs or disjoint text encoders. It natively encodes raw pixels, text, and task-specific conditions in a single shared token space — supporting text-to-image, image editing, and subject-driven personalization at up to 2,048 × 2,048.

This repository contains the BF16-converted version for optimized storage and deployment.

Key Benefits of BF16 Conversion

  • 📦 50% Smaller Storage: Reduced from ~32 GB (FP32) to ~16 GB (BF16)
  • Faster Inference: ~1.5-2x speedup on modern GPUs with BF16 support
  • 💾 Lower VRAM Usage: Requires ~16 GB VRAM instead of ~32 GB
  • Comparable Quality: BF16 keeps the FP32 exponent range with a shorter mantissa, so quality loss for image generation is negligible (<0.1%)
  • 🔧 Ready to Use: Compatible with original inference scripts and pipelines

Conversion Details

| Property | Original (FP32) | Converted (BF16) |
| --- | --- | --- |
| Storage Size | ~32 GB | ~16 GB |
| Weight Precision | Float32 | BFloat16 |
| Inference Precision | BF16 (via torch_dtype=torch.bfloat16) | BF16 (native) |
| VRAM Requirement | ~32 GB | ~16 GB |
| Quality Loss | N/A | <0.1% (negligible) |
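
The storage figures in the table follow from the number of bytes stored per parameter. A quick check, assuming roughly 8 billion parameters (the count quoted for this model) and decimal gigabytes, might look like this:

```python
# Rough weight-size estimate: parameters x bytes per parameter.
# Assumptions: ~8e9 parameters and 1 GB = 1e9 bytes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_size_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("float32", "bfloat16"):
        print(f"{dtype}: ~{weight_size_gb(8e9, dtype):.0f} GB")
```

This covers the weights only. Activations and framework overhead come on top, so actual VRAM use during inference is higher than the weight size alone.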

Conversion Method

All safetensors files were converted using direct tensor manipulation:

python

tensor.to(torch.bfloat16) # FP32 → BF16
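
To see what this cast does to a single weight, the sketch below reproduces the FP32 to BF16 rounding with the standard library only. It assumes round-to-nearest-even, which is the usual behavior for this kind of cast, and it skips NaN handling. The function names are illustrative and are not part of the conversion script.

```python
import struct


def fp32_to_bf16_bits(value: float) -> int:
    """Round an FP32 value to BF16 and return the 16-bit pattern.

    Assumes round-to-nearest-even; NaN inputs are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Add 0x7FFF plus the lowest kept bit so that ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(bits16: int) -> float:
    """Expand a BF16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


def roundtrip(value: float) -> float:
    """Return the value a weight would hold after the BF16 cast."""
    return bf16_bits_to_float(fp32_to_bf16_bits(value))


if __name__ == "__main__":
    for w in (1.0, 0.1, -3.14159):
        print(w, "->", roundtrip(w))
```

BF16 keeps the 8 exponent bits of FP32 and 7 stored mantissa bits, so the relative rounding error per weight is at most about 0.4% (2^-8). That per-weight bound is a separate quantity from the end-to-end quality figure quoted in this README.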

Configuration files (config.json, tokenizer_config.json, etc.) were updated to reflect dtype: "bfloat16".
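
A config update of this kind could be scripted as follows. This is a minimal sketch: the helper name is hypothetical, and it assumes the field is a top-level key called `dtype`, as the sentence above suggests. Some configs use `torch_dtype` or keep the field in a nested section, so check the actual files first.

```python
import json
from pathlib import Path


def set_config_dtype(config_path: str, dtype: str = "bfloat16",
                     key: str = "dtype") -> dict:
    """Set the dtype field of a JSON config file in place and return it.

    Assumption: the field is a top-level key named by `key`.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config[key] = dtype
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

A call such as `set_config_dtype("./HiDream-O1-Image-BF16/config.json")` would rewrite the file in place, so keep a backup when experimenting.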

Original Model Information

Project Updates

  • 🚀 May 14, 2026: HiDream-O1-Image-Dev-2604 with prompt refiner
  • 🛠️ May 13, 2026: Inference & pipeline updates — accelerated IP inference; IP pipeline now supports layout and skeleton conditioning
  • 🤗 May 10, 2026: Try online on Hugging Face Spaces — 🤗 HiDream-O1-Image
  • 📕 May 10, 2026: Technical report — 📑 HiDream-O1-Image.pdf
  • 🚀 May 8, 2026: Open-sourced HiDream-O1-Image (8B) with undistilled and distilled Dev variants

Key Features (from Original Model)

  • 🧬 Pixel-Level Unified Transformer — One end-to-end model on raw pixels, no VAE, no disjoint text encoder
  • 🎨 One Model, Many Tasks — Text-to-image, long-text rendering, instruction editing, subject-driven personalization, storyboard generation
  • 🧠 Reasoning-Driven Prompt Agent — Built-in "thinking" agent for layout, attributes, physical logic, text-rendering
  • 🖼️ Native High Resolution — Direct synthesis up to 2,048 × 2,048
  • Exceptional Efficiency at 8B Scale — 8B parameters, performance parity with larger models

Usage

Installation

  1. Clone the original repository:

bash

git clone https://github.com/HiDream-ai/HiDream-O1-Image.git
cd HiDream-O1-Image

  2. Install dependencies:

bash

pip install -r requirements.txt

  3. Download this BF16 model or use it directly from HuggingFace:

python

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="morikomorizz/HiDream-O1-Image-BF16",
    local_dir="./HiDream-O1-Image-BF16",
)
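
After downloading, it can be useful to confirm that the weights really are stored as BF16. The sketch below reads only the JSON header of each safetensors file (an 8-byte little-endian length followed by the header) and counts tensors per dtype. The function names are illustrative, and the directory path is the `local_dir` used above.

```python
import json
import struct
from collections import Counter
from pathlib import Path


def safetensors_dtypes(file_path: str) -> Counter:
    """Count tensors per dtype by reading only the safetensors JSON header."""
    with open(file_path, "rb") as f:
        # The file starts with an unsigned 64-bit little-endian header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )


def summarize(model_dir: str) -> Counter:
    """Aggregate dtype counts over every .safetensors file under a directory."""
    total = Counter()
    for path in sorted(Path(model_dir).rglob("*.safetensors")):
        total += safetensors_dtypes(str(path))
    return total
```

Calling `summarize("./HiDream-O1-Image-BF16")` returns a counter keyed by the dtype strings used in the safetensors header, such as `BF16` or `F32`.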

1. Text-to-Image Generation

bash

python inference.py \
--model_path /path/to/HiDream-O1-Image-BF16 \
--prompt "your prompt here" \
--output_image results/output.png \
--height 2048 \
--width 2048
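
For several prompts in a row, the same command can be assembled from Python. This is a sketch only: it uses just the flags shown above, and the prompts, output file names, and helper name are placeholders.

```python
import shlex


def build_command(model_path: str, prompt: str, output_image: str,
                  height: int = 2048, width: int = 2048) -> list[str]:
    """Build the argv list for one text-to-image run of inference.py."""
    return [
        "python", "inference.py",
        "--model_path", model_path,
        "--prompt", prompt,
        "--output_image", output_image,
        "--height", str(height),
        "--width", str(width),
    ]


prompts = ["a red fox in snow", "a lighthouse at dusk"]  # placeholder prompts
commands = [
    build_command("/path/to/HiDream-O1-Image-BF16", p, f"results/output_{i}.png")
    for i, p in enumerate(prompts)
]
for cmd in commands:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` from the repository root would execute the runs one after another. Because every run starts a new process, the model is reloaded each time.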

2. Image Editing

bash

python inference.py \
--model_path /path/to/HiDream-O1-Image-BF16 \
--prompt "remove the earphones" \
--ref_images assets/edit/test.jpg \
--output_image results/edit.png \
--keep_original_aspect

3. Subject-Driven Personalization

bash

python inference.py \
--model_path /path/to/HiDream-O1-Image-BF16 \
--shift 1 \
--prompt "A young boy with blonde hair..." \
--ref_images assets/IP/1.jpg assets/IP/2.jpg assets/IP/3.jpg \
--output_image results/subject.png

4. Multi-Reference Subject-Driven Personalization with Skeleton

bash

python inference.py \
--model_path /path/to/HiDream-O1-Image-BF16 \
--shift 1 \
--seed 42 \
--prompt "Create a realistic try-on image of the person wearing the provided clothing." \
--ref_images assets/IP_skeleton/0.face.jpg assets/IP_skeleton/0.bg.jpg assets/IP_skeleton/0.openpose.jpg assets/IP_skeleton/0.part_1.jpg assets/IP_skeleton/0.part_2.jpg assets/IP_skeleton/0.part_3.jpg \
--output_image results/subject.png

5. Multi-Reference Subject-Driven Personalization with Layout

bash

python inference.py \
--model_path /path/to/HiDream-O1-Image-BF16 \
--shift 1 \
--seed 42 \
--prompt "City council members pose with relaxed smiles on a sunlit terrace, warm approachable mood, golden hour, cinematic soft glow." \
--ref_images assets/IP_layout/0.jpg assets/IP_layout/1.jpg \
--layout_bboxes "[[0.20507812, 0.43945312, 0.48828125, 0.7421875 ], [0.57617188, 0.80078125, 0.08789062, 0.34179688]]" \
--output_image results/ip_layout.png
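
The `--layout_bboxes` value is a JSON-style list of boxes with coordinates between 0 and 1. This README does not state the coordinate order or the reference canvas, so the sketch below only shows how such a string could be produced from pixel values. The values in the example above are consistent with pixel coordinates divided by 1,024, which is the assumption used here. The helper names are hypothetical.

```python
import json


def normalize_boxes(pixel_boxes, canvas_size: int = 1024) -> list:
    """Divide every pixel coordinate by the canvas size, keeping the order."""
    return [[value / canvas_size for value in box] for box in pixel_boxes]


def to_layout_arg(pixel_boxes, canvas_size: int = 1024) -> str:
    """Serialize normalized boxes as a string for --layout_bboxes."""
    return json.dumps(normalize_boxes(pixel_boxes, canvas_size))


# Pixel values chosen so the result matches the example above up to rounding.
# The meaning of each position (x or y, min or max) is not asserted here.
boxes_px = [[210, 450, 500, 760], [590, 820, 90, 350]]
print(to_layout_arg(boxes_px))
```

The example passes two boxes alongside two reference images, which suggests one box per reference. Consult the original repository for the authoritative coordinate convention before relying on this.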

Command Line Arguments

  • --model_path: Path to this BF16 model directory
  • --prompt: Text prompt for generation or editing
  • --ref_images: Paths to reference images (optional, space-separated)
  • --output_image: Path to save generated image (default: output.png)
  • --height / --width: Output dimensions (default: 2048 × 2048)
  • --model_type: full or dev (default: full)
  • --seed: Random seed (default: 32)
  • --guidance_scale: Guidance scale (default: 5.0, only for full model)

See the original README for complete documentation.

Model Architecture

| Component | Configuration |
| --- | --- |
| Base Architecture | Qwen3VLForConditionalGeneration |
| Vision Encoder | Qwen3VLVisionModel (27 layers, hidden_size=1152) |
| Language Model | Qwen3VLTextModel (36 layers, hidden_size=4096, 8B parameters) |
| Vocabulary Size | 151,936 |
| Attention | Multi-Head Attention with RoPE |
| Total Parameters | ~8B |
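
As a small worked example of how the table relates to the parameter count, the token embedding matrix alone has vocabulary size times hidden size entries. The sketch below uses only the two numbers from the table. Whether the output projection shares these weights is not stated here, so only one matrix is counted.

```python
VOCAB_SIZE = 151_936   # from the table above
HIDDEN_SIZE = 4_096    # language model hidden size from the table above


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of entries in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


params = embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
# Two bytes per parameter in BF16, reported in decimal gigabytes.
print(f"{params:,} parameters, {params * 2 / 1e9:.2f} GB in BF16")
```

That is about 0.62B of the roughly 8B parameters, or about 1.24 GB of the BF16 checkpoint.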

Evaluation

Headline results reported for the original model (see the original model page for detailed benchmarks):

  • GenEval: 0.90 Overall (2nd best)
  • DPG-Bench: 89.83 Overall (2nd best)
  • HPSv3: 10.37 All (2nd best)
  • CVTG-2K: 0.9128 Average (2nd best)
  • LongText-Bench: 0.979 EN, 0.978 ZH (2nd best)

License

This converted model inherits the MIT License from the original HiDream-O1-Image model.

Citation

If you use this model, please cite the original work:

bibtex

@article{hidreamolimage,
title={HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer},
author={Cai, Qi and Chen, Jingwen and Gao, Chengmin and Gong, Zijian and Li, Yehao and Mei, Tao and Pan, Yingwei and Peng, Yi and Qiu, Zhaofan and Yao, Ting and Yu, Kai and Zhang, Yiheng and others},
journal={arXiv preprint arXiv:2605.11061},
year={2026}
}

Acknowledgments


Note: This is an unofficial conversion. For the official model, visit HiDream-ai/HiDream-O1-Image.

Model provider

morikomorizz

Model tree

Base

HiDream-ai/HiDream-O1-Image

Fine-tuned

this model

Modalities

Input

Text, Image

Output

Image
