morikomorizz

HiDream-O1-Image-BF16

Overview

HiDream-O1-Image is a natively unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) without external VAEs or disjoint text encoders. It natively encodes raw pixels, text, and task-specific conditions in a single shared token space — supporting text-to-image, image editing, and subject-driven personalization at up to 2,048 × 2,048.

This repository contains the BF16-converted version for optimized storage and deployment.

Key Benefits of BF16 Conversion

📦 50% Smaller Storage: Reduced from ~32 GB (FP32) to ~16 GB (BF16)
⚡ Faster Inference: ~1.5-2x speedup over FP32 execution on modern GPUs with native BF16 support
💾 Lower VRAM Usage: ~16 GB for the weights instead of ~32 GB in FP32
✅ Near-Identical Quality: BF16 keeps the FP32 exponent range but stores a shorter mantissa; the reported quality loss for image generation is negligible (<0.1%)
🔧 Ready to Use: Compatible with original inference scripts and pipelines
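
To make the precision trade-off behind the quality claim concrete, here is a minimal standard-library sketch that emulates FP32 → BF16 rounding on single values (keep the upper 16 bits, round to nearest even). It is an illustration only, not the code used to convert this repository; the helper names are hypothetical and NaN/inf handling is omitted.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x (NaN/inf not handled in this sketch)."""
    # Reinterpret the value as its 32-bit IEEE 754 single-precision pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Bias so that dropping the low 16 bits rounds to nearest, ties to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Expand a BF16 pattern back to a Python float (low 16 bits are zero)."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

def bf16_round(x: float) -> float:
    """Round x to the nearest BF16-representable value."""
    return bf16_bits_to_float(fp32_to_bf16_bits(x))

for value in (1.0, 0.1, 3.14159):
    rounded = bf16_round(value)
    rel_err = abs(rounded - value) / abs(value)
    print(f"{value!r} -> {rounded!r} (relative error {rel_err:.2e})")
```

With 8 significant bits, each weight is off by at most about 2^-8 (roughly 0.4%) in relative terms. That bound describes per-weight rounding only; it does not by itself say anything about image quality, which the card reports separately.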

Conversion Details

Property	Original (FP32)	Converted (BF16)
Storage Size	~32 GB	~16 GB
Weight Precision	Float32	BFloat16
Inference Precision	BF16 (via `torch_dtype=torch.bfloat16`)	BF16 (native)
VRAM Requirement	~32 GB	~16 GB
Quality Loss	N/A	<0.1% (negligible)
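
The storage figures in the table follow from simple arithmetic on the parameter count. A minimal sketch, assuming the "~8B" parameter count quoted in this card and decimal gigabytes (10^9 bytes); the function name is hypothetical:

```python
def weight_size_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

PARAMS = 8e9  # "~8B" from the model card; the exact count may differ

fp32_gb = weight_size_gb(PARAMS, 4)  # FP32 stores 4 bytes per parameter
bf16_gb = weight_size_gb(PARAMS, 2)  # BF16 stores 2 bytes per parameter

print(f"FP32: ~{fp32_gb:.0f} GB, BF16: ~{bf16_gb:.0f} GB")
```

These numbers cover the weights only. Peak VRAM during inference also includes activations and framework overhead, so treat ~16 GB as a lower bound rather than a guarantee.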

Conversion Method

All safetensors files were converted using direct tensor manipulation:

python
tensor.to(torch.bfloat16)  # FP32 → BF16

Configuration files (config.json, tokenizer_config.json, etc.) were updated to reflect dtype: "bfloat16".
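
The configuration update described above could look roughly like the following. This is a sketch under assumptions, not the script used for this repository: the key name `dtype` is taken from the sentence above (some library versions use a different key, so check the actual files), and the function name is hypothetical.

```python
import json
from pathlib import Path

def set_config_dtype(config_path, dtype="bfloat16", key="dtype"):
    """Set `key` to `dtype` in a JSON config file and return the updated dict.

    The key name is an assumption based on the model card text.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config[key] = dtype
    # Rewrite the file with stable formatting; all other keys are kept as-is.
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `set_config_dtype("HiDream-O1-Image-BF16/config.json")` would rewrite that file in place, leaving every other key untouched.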

Original Model Information

Project Updates

🚀 May 14, 2026: HiDream-O1-Image-Dev-2604 with prompt refiner
🛠️ May 13, 2026: Inference & pipeline updates — accelerated IP inference; IP pipeline now supports layout and skeleton conditioning
🤗 May 10, 2026: Try online on Hugging Face Spaces — 🤗 HiDream-O1-Image
📕 May 10, 2026: Technical report — 📑 HiDream-O1-Image.pdf
🚀 May 8, 2026: Open-sourced HiDream-O1-Image (8B) with undistilled and distilled Dev variants

Key Features (from Original Model)

🧬 Pixel-Level Unified Transformer — One end-to-end model on raw pixels, no VAE, no disjoint text encoder
🎨 One Model, Many Tasks — Text-to-image, long-text rendering, instruction editing, subject-driven personalization, storyboard generation
🧠 Reasoning-Driven Prompt Agent — Built-in "thinking" agent for layout, attributes, physical logic, text-rendering
🖼️ Native High Resolution — Direct synthesis up to 2,048 × 2,048
⚡ Exceptional Efficiency at 8B Scale — 8B parameters, performance parity with larger models

Usage

Installation

Clone the original repository:

bash
git clone https://github.com/HiDream-ai/HiDream-O1-Image.git
cd HiDream-O1-Image

Install dependencies:

bash
pip install -r requirements.txt

Download this BF16 model, or use it directly from Hugging Face:

python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="morikomorizz/HiDream-O1-Image-BF16",
    local_dir="./HiDream-O1-Image-BF16"
)
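
After the download finishes, a quick sanity check is to total the size of the weight files and compare it with the ~16 GB quoted above. A minimal standard-library sketch; the function name is hypothetical, and the exact number and layout of `.safetensors` shards in this repository is not assumed:

```python
from pathlib import Path

def safetensors_size_gb(model_dir):
    """Return (file_count, total_size_in_decimal_GB) for *.safetensors files."""
    files = sorted(Path(model_dir).rglob("*.safetensors"))
    total_bytes = sum(f.stat().st_size for f in files)
    return len(files), total_bytes / 1e9

# Example usage, assuming the local_dir from the snippet above:
# count, size_gb = safetensors_size_gb("./HiDream-O1-Image-BF16")
# print(f"{count} weight files, {size_gb:.1f} GB")
```

A total far below the expected size usually points to an interrupted download.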

1. Text-to-Image Generation

bash
python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --prompt "your prompt here" \
    --output_image results/output.png \
    --height 2048 \
    --width 2048

2. Image Editing

bash
python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --prompt "remove the earphones" \
    --ref_images assets/edit/test.jpg \
    --output_image results/edit.png \
    --keep_original_aspect

3. Subject-Driven Personalization

bash
python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --shift 1 \
    --prompt "A young boy with blonde hair..." \
    --ref_images assets/IP/1.jpg assets/IP/2.jpg assets/IP/3.jpg \
    --output_image results/subject.png

4. Multi-Reference Subject-Driven Personalization with Skeleton

bash
python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --shift 1 \
    --seed 42 \
    --prompt "Create a realistic try-on image of the person wearing the provided clothing." \
    --ref_images assets/IP_skeleton/0.face.jpg assets/IP_skeleton/0.bg.jpg assets/IP_skeleton/0.openpose.jpg assets/IP_skeleton/0.part_1.jpg assets/IP_skeleton/0.part_2.jpg assets/IP_skeleton/0.part_3.jpg \
    --output_image results/subject.png

5. Multi-Reference Subject-Driven Personalization with Layout

bash
python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --shift 1 \
    --seed 42 \
    --prompt "City council members pose with relaxed smiles on a sunlit terrace, warm approachable mood, golden hour, cinematic soft glow." \
    --ref_images assets/IP_layout/0.jpg assets/IP_layout/1.jpg \
    --layout_bboxes "[[0.20507812, 0.43945312, 0.48828125, 0.7421875], [0.57617188, 0.80078125, 0.08789062, 0.34179688]]" \
    --output_image results/ip_layout.png
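
The `--layout_bboxes` value is a string containing a nested list of four numbers per box. A small helper can build that string and catch malformed input before a long inference run. This is a sketch under assumptions: the values are treated as normalized to [0, 1] because every number in the example above falls in that range, the coordinate order is not stated here and is therefore not interpreted, and the function name is hypothetical.

```python
import json

def format_layout_bboxes(boxes):
    """Serialize boxes as a nested-list string for --layout_bboxes.

    Assumes four values per box, each within [0, 1]; the meaning of
    each position is left to the original README.
    """
    cleaned = []
    for i, box in enumerate(boxes):
        if len(box) != 4:
            raise ValueError(f"box {i} has {len(box)} values, expected 4")
        if not all(0.0 <= float(v) <= 1.0 for v in box):
            raise ValueError(f"box {i} has a value outside [0, 1]")
        cleaned.append([float(v) for v in box])
    return json.dumps(cleaned)

print(format_layout_bboxes([[0.25, 0.5, 0.5, 0.75], [0.5, 0.75, 0.125, 0.25]]))
```

The example above passes two reference images and two boxes, which suggests one box per reference image; confirm this against the original README before relying on it.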

Command Line Arguments

--model_path: Path to this BF16 model directory
--prompt: Text prompt for generation or editing
--ref_images: Paths to reference images (optional, space-separated)
--output_image: Path to save generated image (default: output.png)
--height / --width: Output dimensions (default: 2048 × 2048)
--model_type: full or dev (default not stated here; see the original README)

See original README for complete documentation.
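
For scripted or batch runs, the same command line can be assembled from Python. The sketch below uses only flags that appear in this card; the function name and example paths are hypothetical, and it assumes the script is launched from inside the cloned repository.

```python
import shlex

def build_inference_command(model_path, prompt, output_image,
                            ref_images=None, height=None, width=None):
    """Build the argv list for inference.py from the documented flags."""
    cmd = [
        "python", "inference.py",
        "--model_path", str(model_path),
        "--prompt", prompt,
        "--output_image", str(output_image),
    ]
    if ref_images:
        # --ref_images takes space-separated paths, i.e. one argv entry each.
        cmd += ["--ref_images", *[str(p) for p in ref_images]]
    if height is not None:
        cmd += ["--height", str(height)]
    if width is not None:
        cmd += ["--width", str(width)]
    return cmd

cmd = build_inference_command(
    "/path/to/HiDream-O1-Image-BF16",
    "remove the earphones",
    "results/edit.png",
    ref_images=["assets/edit/test.jpg"],
)
print(shlex.join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain quotes or commas.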

Model Architecture

Component	Configuration
Base Architecture	Qwen3VLForConditionalGeneration
Vision Encoder	Qwen3VLVisionModel (27 layers, hidden_size=1152)
Language Model	Qwen3VLTextModel (36 layers, hidden_size=4096, 8B parameters)
Vocabulary Size	151,936
Attention	Multi-Head Attention with RoPE
Total Parameters	~8B

Evaluation

Headline results reported for the original model (see the original model page for detailed benchmarks):

GenEval: 0.90 Overall (2nd best)
DPG-Bench: 89.83 Overall (2nd best)
HPSv3: 10.37 All (2nd best)
CVTG-2K: 0.9128 Average (2nd best)
LongText-Bench: 0.979 EN, 0.978 ZH (2nd best)

License

This converted model inherits the MIT License from the original HiDream-O1-Image model.

Citation

If you use this model, please cite the original work:

bibtex
@article{hidreamolimage,
  title={HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer},
  author={Cai, Qi and Chen, Jingwen and Gao, Chengmin and Gong, Zijian and Li, Yehao and Mei, Tao and Pan, Yingwei and Peng, Yi and Qiu, Zhaofan and Yao, Ting and Yu, Kai and Zhang, Yiheng and others},
  journal={arXiv preprint arXiv:2605.11061},
  year={2026}
}

Acknowledgments

Original model by HiDream.ai
BF16 conversion by morikomorizz
Based on HiDream-O1-Image

Note: This is an unofficial conversion. For the official model, visit HiDream-ai/HiDream-O1-Image.

