README

License: mit

Highlights

  • Efficient Foundation — Trained on Lens-800M, an 800M image-text corpus with long GPT-4.1 captions, maximizing information density per training batch.
  • Compact & Expressive — A 48-block MMDiT denoiser leverages FLUX.2 latents and concatenated multi-layer GPT-OSS features for stronger prompt following and multilingual generalization.
  • Flexible Resolution — Mixed-resolution training enables inference across aspect ratios from 1:2 to 2:1 and resolutions up to 1440×1440.
  • Post-trained Variants — RL tuning improves visual quality and artifact suppression; the distilled Lens-Turbo supports fast 4-step generation.
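
To make the supported range concrete, here is a minimal sketch of how a base resolution and an aspect ratio could be turned into pixel dimensions. The area-preserving rule, the reading of ratios as width:height, and the rounding to multiples of 16 are assumptions for illustration, and `size_for` is a hypothetical helper that is not part of the Lens API. The pipeline itself may choose different sizes.

```python
import math


def size_for(base_resolution: int, aspect_ratio: str, multiple: int = 16) -> tuple[int, int]:
    """Return (width, height) whose area is close to base_resolution**2.

    Assumptions: ratios are written width:height, sizing preserves area,
    and both sides snap to `multiple`. Lens may bucket sizes differently.
    """
    w_part, h_part = (int(x) for x in aspect_ratio.split(":"))
    ratio = w_part / h_part
    if not 0.5 <= ratio <= 2.0:
        raise ValueError("aspect ratio must lie between 1:2 and 2:1")
    width = base_resolution * math.sqrt(ratio)
    height = base_resolution / math.sqrt(ratio)

    def snap(value: float) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


print(size_for(1440, "1:1"))
print(size_for(1024, "16:9"))
```

Under these assumptions, a square image at the largest base resolution comes out as 1440×1440, matching the stated upper bound.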

Installation

Tested environment: Python 3.12 · CUDA 12.6 · PyTorch 2.11.0+cu126 · TorchVision 0.26.0+cu126

```bash
conda create -n lens python=3.12 -y
conda activate lens
uv pip install torch==2.11.0+cu126 torchvision==0.26.0+cu126 \
    --index-url https://download.pytorch.org/whl/cu126
uv pip install -r requirements.txt
```

The default GPT-OSS encoder and FLUX.2 VAE are loaded from Hugging Face. Make sure your environment has access to any gated model repositories you use.

Checkpoints

| Repo | Description | Steps | CFG |
| --- | --- | --- | --- |
| microsoft/Lens | Default. RL-tuned for visual quality | 20 | 5.0 |
| microsoft/Lens-Turbo | Distilled from the RL model for fast 4-step sampling | 4 | 1.0 |
| microsoft/Lens-Base | Supervised base model (no RL, no distillation) | 50 | 5.0 |

Pick a variant by passing its repo id to --repo_id (CLI) or LensPipeline.from_pretrained(...) (Python).
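
Because each variant has its own recommended step count and CFG scale, it can help to keep them in one lookup. The sketch below copies the values from the checkpoint table. `SamplingDefaults`, `RECOMMENDED`, and `sampling_kwargs` are hypothetical names introduced here, and the keyword names follow the Python API example in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingDefaults:
    steps: int
    cfg: float


# Values copied from the checkpoint table in this README.
RECOMMENDED = {
    "microsoft/Lens": SamplingDefaults(steps=20, cfg=5.0),
    "microsoft/Lens-Turbo": SamplingDefaults(steps=4, cfg=1.0),
    "microsoft/Lens-Base": SamplingDefaults(steps=50, cfg=5.0),
}


def sampling_kwargs(repo_id: str) -> dict:
    """Map a repo id to keyword arguments for the pipeline call."""
    defaults = RECOMMENDED[repo_id]
    return {
        "num_inference_steps": defaults.steps,
        "guidance_scale": defaults.cfg,
    }


print(sampling_kwargs("microsoft/Lens-Turbo"))
```

The returned dictionary can be unpacked into the pipeline call, for example `pipe(prompt=..., **sampling_kwargs(repo_id))`.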

Inference

Important: run from the cloned repo root so that `from lens import LensPipeline` resolves to this package. Importing `lens` is what registers `LensGptOssEncoder` and `LensTransformer2DModel` with the transformers and diffusers namespaces that `model_index.json` references.

Python API:

```python
import torch
from lens import LensPipeline

# Load the default RL-tuned checkpoint in bf16 and move it to the GPU.
pipe = LensPipeline.from_pretrained(
    "microsoft/Lens", torch_dtype=torch.bfloat16
).to("cuda")

# Generate one image with the recommended steps and CFG for this variant.
image = pipe(
    prompt="A cat holding a sign that says \"hello world\"",
    base_resolution=1440, aspect_ratio="1:1",
    num_inference_steps=20, guidance_scale=5.0,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("lens.png")
```

To reduce VRAM use at the cost of speed, drop `.to("cuda")` and call `pipe.enable_model_cpu_offload()` on the loaded pipeline instead.
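
As a minimal sketch of that switch, the helper below loads the pipeline once and picks between full GPU placement and CPU offload. `load_pipeline` is a hypothetical wrapper rather than part of the package, and it assumes the `lens` package is importable (that is, the code runs from the cloned repo root).

```python
def load_pipeline(repo_id: str = "microsoft/Lens", offload: bool = False):
    """Load a Lens pipeline, optionally with diffusers CPU offload."""
    # Imported lazily so the helper can be defined on a machine without a GPU.
    import torch
    from lens import LensPipeline  # resolves only from the repo root

    pipe = LensPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
    if offload:
        # Keeps components on the CPU and moves each to the GPU while in use.
        pipe.enable_model_cpu_offload()
    else:
        pipe.to("cuda")
    return pipe
```

Calling `load_pipeline(offload=True)` corresponds to the lower-VRAM path; the default keeps everything resident on the GPU.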

CLI — basic usage:

```bash
python inference.py \
    --repo_id "microsoft/Lens" \
    --prompt "A cinematic mountain lake at sunrise, soft mist, detailed reflections" \
    --base_resolution 1440 --aspect_ratio 1:1 \
    --steps 20 --cfg 5.0 --n 1 --seed 42 \
    --out ./outputs
```

Batch generation — join multiple prompts with |:

```bash
python inference.py \
    --repo_id "microsoft/Lens" \
    --steps 20 --cfg 5.0 \
    --prompt "a red fox in snow|a glass greenhouse at night"
```
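
The `|` convention amounts to splitting one string into several prompts. A minimal sketch of that step follows; `split_prompts` is a hypothetical helper, and trimming whitespace and dropping empty entries are assumptions, since the exact parsing in `inference.py` is not shown here.

```python
def split_prompts(arg: str, sep: str = "|") -> list[str]:
    """Split a batch prompt string into individual, trimmed prompts."""
    # Assumption: surrounding whitespace is ignored and empty parts are skipped.
    return [part.strip() for part in arg.split(sep) if part.strip()]


prompts = split_prompts("a red fox in snow|a glass greenhouse at night")
print(prompts)
```

One consequence of this convention is that a prompt cannot itself contain a literal `|` unless the script offers an escape, which this README does not describe.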

A100 / V100 (no MXFP4 kernels) — dequantize the GPT-OSS encoder to bf16:

```bash
python inference.py \
    --repo_id "microsoft/Lens" \
    --steps 20 --cfg 5.0 \
    --prompt "a cat" \
    --disable_mxfp4 --offload
```
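
When a script has to run on mixed hardware, the choice of `--disable_mxfp4` can be derived from the GPU's CUDA compute capability. The sketch below reads "Hopper+" as compute capability 9.0 or higher, which is an assumption; V100 (7.0) and A100 (8.0) fall below it, and architectures in between are not covered by this README. `needs_disable_mxfp4` and `extra_flags` are hypothetical helpers.

```python
def needs_disable_mxfp4(major: int, minor: int = 0) -> bool:
    """Return True when --disable_mxfp4 should be passed.

    Assumption: MXFP4 kernels are treated as available from compute
    capability 9.0 (Hopper) upward, following the note in this README.
    """
    return (major, minor) < (9, 0)


def extra_flags(major: int, minor: int = 0) -> list[str]:
    """CLI flags to append for a GPU with the given compute capability."""
    return ["--disable_mxfp4"] if needs_disable_mxfp4(major, minor) else []


# On a real machine the capability can be read with
# torch.cuda.get_device_capability(), which returns (major, minor).
print(extra_flags(7, 0))  # V100
print(extra_flags(8, 0))  # A100
print(extra_flags(9, 0))  # H100
```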

Options

| Flag | Description | Default |
| --- | --- | --- |
| --repo_id | HF repo id (or local path) of the assembled Lens pipeline | microsoft/Lens |
| --base_resolution | 1024 or 1440 | 1440 |
| --aspect_ratio | 1:2, 9:16, 2:3, 3:4, 1:1, 4:3, 3:2, 16:9, 2:1 | 1:1 |
| --steps | Number of denoising steps | 20 |
| --cfg | Classifier-free guidance scale | 5.0 |
| --n | Number of images per prompt | 1 |
| --seed | Random seed (omit for non-deterministic) | |
| --out | Output directory | ./outputs |
| --dtype | Compute dtype: bfloat16, float16, float32 | bfloat16 |
| --disable_mxfp4 | Dequantize the GPT-OSS text encoder to --dtype (required on A100 / V100; Hopper+ keeps MXFP4 by default for less VRAM) | |
| --offload | Enable diffusers CPU offload (text_encoder->transformer->vae) to reduce peak VRAM | |
| --reasoner | Refine prompts with the loaded GPT-OSS encoder before generation | |
| --api_url / --api_key / --api_model | Use an OpenAI-compatible API for prompt refinement (takes precedence over --reasoner) | |
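
For scripted runs, the flags above can be assembled into an argument list instead of a hand-written shell line. This is a minimal sketch using only flags listed in the options; `build_command` is a hypothetical helper, and the `subprocess.run` call is left commented out because it needs the cloned repo and a GPU.

```python
import shlex


def build_command(prompt, repo_id="microsoft/Lens", steps=20, cfg=5.0,
                  seed=None, offload=False, disable_mxfp4=False):
    """Build an argv list for inference.py from a few common options."""
    cmd = [
        "python", "inference.py",
        "--repo_id", repo_id,
        "--prompt", prompt,
        "--steps", str(steps),
        "--cfg", str(cfg),
    ]
    if seed is not None:
        # Omitting --seed leaves generation non-deterministic.
        cmd += ["--seed", str(seed)]
    if offload:
        cmd.append("--offload")
    if disable_mxfp4:
        cmd.append("--disable_mxfp4")
    return cmd


cmd = build_command("a cat", seed=42, offload=True)
print(shlex.join(cmd))
# import subprocess; subprocess.run(cmd, check=True)  # run from the repo root
```

Passing a list to `subprocess.run` avoids shell quoting problems with prompts that contain spaces or quotes.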

Citation

```bibtex
@article{zhao2026lens,
  title = {Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models},
  author = {Guo, Baining and Luo, Chong and Chen, Dong and Chen, Dongdong and Wei, Fangyun and Li, Ji and Bao, Jianmin and Zhang, Jiawei and Zhao, Jinjing and Shi, Lei and Yang, Qinhong and Zhang, Sirui and Wu, Xiuyu and Feng, Xuelu and Lu, Yan and Dong, Yanchen and Yue, Yang and Wang, Yitong and Chen, Yunuo and Liang, Zhiyang and Wan, Ziyu},
  journal = {arXiv preprint arXiv:2605.21573},
  year = {2026}
}
```

Responsible AI

The model is released for research purposes only and is not intended for product or service deployment. Responsible AI considerations were incorporated throughout the development process, including data selection, model training, and evaluation. The training data includes a combination of public, licensed, and internal datasets that were processed to remove clearly identifiable personal information and reduce harmful content where possible. However, as the data is largely sourced from web-scale collections, it may contain biases or uneven representation. As a result, the model may generate outputs that are inaccurate, biased, or inappropriate under certain prompts, including content that could be misleading or raise copyright or IP-related concerns. Given these limitations, the model should be used in controlled research settings, with appropriate human oversight. Downstream users are responsible for applying additional safeguards, such as content moderation, validation, and compliance checks, before using the model in broader applications.

Privacy

This project does not collect any usage data. For more information, see the Microsoft Privacy Statement.

License

This project is released under the MIT License.
