License: MIT
Overview
HiDream-O1-Image is a natively unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) without external VAEs or disjoint text encoders. It natively encodes raw pixels, text, and task-specific conditions in a single shared token space — supporting text-to-image, image editing, and subject-driven personalization at up to 2,048 × 2,048.
This repository contains the BF16-converted version for optimized storage and deployment.
Key Benefits of BF16 Conversion
- 📦 50% Smaller Storage: Reduced from ~32 GB (FP32) to ~16 GB (BF16)
- ⚡ Faster Inference: ~1.5-2x speedup over FP32 inference on modern GPUs with BF16 support
- 💾 Lower VRAM Usage: Weights require ~16 GB VRAM instead of ~32 GB when loaded in FP32
- ✅ Same Quality: BF16 keeps the FP32 exponent range but stores fewer mantissa bits; the resulting quality loss for image generation is negligible (<0.1%)
- 🔧 Ready to Use: Compatible with original inference scripts and pipelines
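The storage and VRAM figures above follow from bytes per parameter. A rough check, assuming exactly 8 billion parameters (the card says "~8B") and counting only the weights, with no activations or file metadata; the helper name is illustrative:

```python
# Rough weight-size estimate. Assumption: 8e9 parameters, weights only.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_size_gb(num_params: int, dtype: str) -> float:
    """Return the raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_size_gb(8_000_000_000, dtype):.0f} GB")
```

Actual VRAM use during inference will be higher than the weight size, since activations and intermediate buffers are not included in this estimate.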
Conversion Details
| Property | Original (FP32) | Converted (BF16) |
|---|---|---|
| Storage Size | ~32 GB | ~16 GB |
| Weight Precision | Float32 | BFloat16 |
| Inference Precision | BF16 (via torch_dtype=torch.bfloat16) | BF16 (native) |
| VRAM Requirement | ~32 GB | ~16 GB |
| Quality Loss | N/A | <0.1% (negligible) |
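To make the Weight Precision row concrete, the sketch below emulates the FP32 to BF16 cast on a single value using only the standard library. It assumes round-to-nearest-even and ignores NaN handling; `to_bf16` is an illustrative helper, not part of the conversion script. It measures per-value rounding error, which is a different quantity from the end-to-end quality figure in the table.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest BF16 value and return it as a Python float."""
    # Reinterpret the value as its 32-bit IEEE 754 single-precision pattern.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Round to nearest, ties to even, on the 16 low bits about to be dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    # Keep the top 16 bits: sign, 8 exponent bits, 7 mantissa bits.
    bits = ((bits >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 0.1, 3.14159265):
    rounded = to_bf16(value)
    rel_err = abs(rounded - value) / value
    print(f"{value!r} -> {rounded!r} (relative error {rel_err:.2e})")
```

Values that need no more than 8 significant bits, such as 1.0 or -2.5, survive the cast unchanged; others pick up a relative error of at most 2^-8.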
Conversion Method
All safetensors files were converted using direct tensor manipulation:
python
tensor.to(torch.bfloat16) # FP32 → BF16
Configuration files (config.json, tokenizer_config.json, etc.) were updated to reflect dtype: "bfloat16".
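One way to check the result of such a conversion is to read the JSON header of each safetensors file and count tensors per dtype, without loading any weights. This is a minimal sketch using only the standard library; the function names are hypothetical, and it relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header).

```python
import json
import struct
from collections import Counter
from pathlib import Path


def safetensors_dtype_counts(path: str) -> Counter:
    """Count tensors per dtype by reading only the safetensors JSON header."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )


def check_model_dir(model_dir: str) -> Counter:
    """Aggregate dtype counts over every .safetensors file in a directory."""
    total = Counter()
    for file in sorted(Path(model_dir).glob("*.safetensors")):
        total += safetensors_dtype_counts(str(file))
    return total
```

Calling `check_model_dir("./HiDream-O1-Image-BF16")` on a downloaded copy would be expected to report `BF16` for the converted tensors; any remaining `F32` entries would point to tensors that were left unconverted.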
Original Model Information
Project Updates
- 🚀 May 14, 2026: HiDream-O1-Image-Dev-2604 with prompt refiner
- 🛠️ May 13, 2026: Inference & pipeline updates — accelerated IP inference; IP pipeline now supports layout and skeleton conditioning
- 🤗 May 10, 2026: Try online on Hugging Face Spaces — 🤗 HiDream-O1-Image
- 📕 May 10, 2026: Technical report — 📑 HiDream-O1-Image.pdf
- 🚀 May 8, 2026: Open-sourced HiDream-O1-Image (8B) with undistilled and distilled Dev variants
Key Features (from Original Model)
- 🧬 Pixel-Level Unified Transformer — One end-to-end model on raw pixels, no VAE, no disjoint text encoder
- 🎨 One Model, Many Tasks — Text-to-image, long-text rendering, instruction editing, subject-driven personalization, storyboard generation
- 🧠 Reasoning-Driven Prompt Agent — Built-in "thinking" agent for layout, attributes, physical logic, text-rendering
- 🖼️ Native High Resolution — Direct synthesis up to 2,048 × 2,048
- ⚡ Exceptional Efficiency at 8B Scale — 8B parameters, performance parity with larger models
Usage
Installation
- Clone the original repository:
bash
git clone https://github.com/HiDream-ai/HiDream-O1-Image.git
cd HiDream-O1-Image
- Install dependencies:
bash
pip install -r requirements.txt
- Download this BF16 model or use it directly from HuggingFace:
python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="morikomorizz/HiDream-O1-Image-BF16",
    local_dir="./HiDream-O1-Image-BF16",
)
1. Text-to-Image Generation
bash
python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --prompt "your prompt here" \
    --output_image results/output.png \
    --height 2048 \
    --width 2048
2. Image Editing
bash
python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --prompt "remove the earphones" \
    --ref_images assets/edit/test.jpg \
    --output_image results/edit.png \
    --keep_original_aspect
3. Subject-Driven Personalization
bash
python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --shift 1 \
    --prompt "A young boy with blonde hair..." \
    --ref_images assets/IP/1.jpg assets/IP/2.jpg assets/IP/3.jpg \
    --output_image results/subject.png
4. Multi-Reference Subject-Driven Personalization with Skeleton
bash
python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --shift 1 \
    --seed 42 \
    --prompt "Create a realistic try-on image of the person wearing the provided clothing." \
    --ref_images assets/IP_skeleton/0.face.jpg assets/IP_skeleton/0.bg.jpg assets/IP_skeleton/0.openpose.jpg assets/IP_skeleton/0.part_1.jpg assets/IP_skeleton/0.part_2.jpg assets/IP_skeleton/0.part_3.jpg \
    --output_image results/subject.png
5. Multi-Reference Subject-Driven Personalization with Layout
bash
python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --shift 1 \
    --seed 42 \
    --prompt "City council members pose with relaxed smiles on a sunlit terrace, warm approachable mood, golden hour, cinematic soft glow." \
    --ref_images assets/IP_layout/0.jpg assets/IP_layout/1.jpg \
    --layout_bboxes "[[0.20507812, 0.43945312, 0.48828125, 0.7421875], [0.57617188, 0.80078125, 0.08789062, 0.34179688]]" \
    --output_image results/ip_layout.png
Command Line Arguments
- --model_path: Path to this BF16 model directory
- --prompt: Text prompt for generation or editing
- --ref_images: Paths to reference images (optional, space-separated)
- --output_image: Path to save the generated image (default: output.png)
- --height / --width: Output dimensions (default: 2048 × 2048)
- --model_type: full or dev (default: full)
- --seed: Random seed (default: 32)
- --guidance_scale: Guidance scale (default: 5.0, only for the full model)
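When the commands above are launched from Python, building the argument list explicitly avoids shell-quoting problems with long prompts. A minimal sketch, assuming it is run from the root of the cloned repository; `build_inference_cmd` is a hypothetical helper, and it uses only flags that appear in this README:

```python
import shlex
import sys
from typing import List, Optional, Sequence


def build_inference_cmd(
    model_path: str,
    prompt: str,
    output_image: str,
    ref_images: Sequence[str] = (),
    seed: Optional[int] = None,
    shift: Optional[int] = None,
) -> List[str]:
    """Assemble an argv list for inference.py from flags shown in this README."""
    cmd = [
        sys.executable, "inference.py",
        "--model_path", model_path,
        "--prompt", prompt,
        "--output_image", output_image,
    ]
    if ref_images:
        # Reference images are passed as space-separated paths after one flag.
        cmd += ["--ref_images", *ref_images]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if shift is not None:
        cmd += ["--shift", str(shift)]
    return cmd


cmd = build_inference_cmd(
    model_path="/path/to/HiDream-O1-Image-BF16",
    prompt="remove the earphones",
    output_image="results/edit.png",
    ref_images=["assets/edit/test.jpg"],
)
# Print a shell-safe rendering; the command itself is not executed here.
print(shlex.join(cmd))
```

To actually run the command, the list can be handed to `subprocess.run(cmd, check=True)`, which passes each argument through unchanged regardless of spaces or quotes in the prompt.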
See original README for complete documentation.
Model Architecture
| Component | Configuration |
|---|---|
| Base Architecture | Qwen3VLForConditionalGeneration |
| Vision Encoder | Qwen3VLVisionModel (27 layers, hidden_size=1152) |
| Language Model | Qwen3VLTextModel (36 layers, hidden_size=4096, 8B parameters) |
| Vocabulary Size | 151,936 |
| Attention | Multi-Head Attention with RoPE |
| Total Parameters | ~8B |
Evaluation
See original model page for detailed benchmarks:
- GenEval: 0.90 Overall (2nd best)
- DPG-Bench: 89.83 Overall (2nd best)
- HPSv3: 10.37 All (2nd best)
- CVTG-2K: 0.9128 Average (2nd best)
- LongText-Bench: 0.979 EN, 0.978 ZH (2nd best)
License
This converted model inherits the MIT License from the original HiDream-O1-Image model.
Citation
If you use this model, please cite the original work:
bibtex
@article{hidreamolimage,
  title={HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer},
  author={Cai, Qi and Chen, Jingwen and Gao, Chengmin and Gong, Zijian and Li, Yehao and Mei, Tao and Pan, Yingwei and Peng, Yi and Qiu, Zhaofan and Yao, Ting and Yu, Kai and Zhang, Yiheng and others},
  journal={arXiv preprint arXiv:2605.11061},
  year={2026}
}
Acknowledgments
- Original model by HiDream.ai
- BF16 conversion by morikomorizz
- Based on HiDream-O1-Image
Note: This is an unofficial conversion. For the official model, visit HiDream-ai/HiDream-O1-Image.