openai

gpt-oss-120b

A MoE model designed for powerful reasoning and agentic tasks, with configurable reasoning effort, full chain-of-thought access, and MXFP4 quantization enabling single 80GB GPU deployment.

README

License: apache-2.0

Transformers

You can use gpt-oss-120b and gpt-oss-20b with Transformers. If you use the Transformers chat template, it will automatically apply the harmony response format. If you use model.generate directly, you need to apply the harmony format manually using the chat template or use our openai-harmony package.

To get started, install the necessary dependencies to set up your environment:

bash
pip install -U transformers kernels torch

Once set up, you can run the model with the snippet below:

py
from transformers import pipeline
import torch

model_id = "openai/gpt-oss-120b"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

Alternatively, you can run the model via Transformers Serve to spin up an OpenAI-compatible web server:

bash
transformers serve
transformers chat localhost:8000 --model-name-or-path openai/gpt-oss-120b

Learn more about how to use gpt-oss with Transformers.
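
Because the server is OpenAI-compatible, any HTTP client can talk to it. The sketch below builds a chat request using only the Python standard library. It assumes the server listens on localhost:8000 (as in the command above) and exposes the conventional `/v1/chat/completions` route; `build_chat_request` and `send` are illustrative helper names, not part of Transformers.

```python
import json
import urllib.request

# Assumption: the server started above listens on localhost:8000 and serves
# the conventional OpenAI-style chat completions route.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(messages, model="openai/gpt-oss-120b", max_tokens=256):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    [{"role": "user", "content": "Explain quantum mechanics clearly and concisely."}]
)
# With the server running: print(send(request))
```

Building the request separately from sending it makes it easy to inspect the payload before any network call is made.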

vLLM

vLLM recommends using uv for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver. The following command will automatically download the model and start the server.

bash
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve openai/gpt-oss-120b

Learn more about how to use gpt-oss with vLLM.

PyTorch / Triton

To learn about how to use this model with PyTorch and Triton, check out our reference implementations in the gpt-oss repository.

Ollama

If you are trying to run gpt-oss on consumer hardware, you can use Ollama by running the following commands after installing Ollama.

bash
# gpt-oss-120b
ollama pull gpt-oss:120b
ollama run gpt-oss:120b

Learn more about how to use gpt-oss with Ollama.

LM Studio

If you are using LM Studio, you can use the following command to download the model.

bash
# gpt-oss-120b
lms get openai/gpt-oss-120b

Check out our awesome list for a broader collection of gpt-oss resources and inference partners.

Download the model

You can download the model weights directly from the Hugging Face Hub using the Hugging Face CLI:

shell
# gpt-oss-120b
huggingface-cli download openai/gpt-oss-120b --include "original/*" --local-dir gpt-oss-120b/
pip install gpt-oss
# Point the chat script at the directory the weights were downloaded to
python -m gpt_oss.chat gpt-oss-120b/original/

Reasoning levels

You can choose among three reasoning levels to suit your task:

Low: Fast responses for general dialogue.
Medium: Balanced speed and detail.
High: Deep and detailed analysis.

The reasoning level can be set in the system prompt, e.g., "Reasoning: high".
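
As a minimal sketch of that convention, the helper below prepends a system message carrying the reasoning level to a chat message list. The function name `with_reasoning` is hypothetical; only the "Reasoning: <level>" system-prompt string comes from the text above.

```python
VALID_LEVELS = ("low", "medium", "high")


def with_reasoning(user_prompt, level="medium"):
    """Return chat messages whose system prompt sets the reasoning level."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    return [
        {"role": "system", "content": f"Reasoning: {level}"},
        {"role": "user", "content": user_prompt},
    ]


messages = with_reasoning("Explain quantum mechanics clearly and concisely.", "high")
```

The resulting list has the same shape as the `messages` list in the Transformers pipeline example, so it can be passed in its place.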

Tool use

The gpt-oss models are excellent for:

Web browsing (using built-in browsing tools)
Function calling with defined schemas
Agentic operations like browser tasks
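
For function calling, a tool is typically described by a JSON schema. The sketch below defines one in the OpenAI-style `tools` shape and validates the arguments of a returned call. The `get_weather` tool, its parameters, and the `parse_tool_call` helper are made up for illustration, and whether a given server or chat template accepts this exact shape is an assumption to verify.

```python
import json

# Hypothetical tool: the name, description, and parameters are illustrative.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}


def parse_tool_call(arguments_json, tool=get_weather_tool):
    """Decode a tool call's JSON arguments and check the required fields."""
    args = json.loads(arguments_json)
    required = tool["function"]["parameters"]["required"]
    missing = [name for name in required if name not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return args


args = parse_tool_call('{"city": "Paris", "unit": "celsius"}')
```

Checking required fields before dispatching to real code is a cheap guard against malformed tool calls.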

Fine-tuning

Both gpt-oss models can be fine-tuned for a variety of specialized use cases.

The larger model, gpt-oss-120b, can be fine-tuned on a single H100 node, whereas the smaller gpt-oss-20b can even be fine-tuned on consumer hardware.

Citation

bibtex
@misc{openai2025gptoss120bgptoss20bmodel,
      title={gpt-oss-120b & gpt-oss-20b Model Card}, 
      author={OpenAI},
      year={2025},
      eprint={2508.10925},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.10925}, 
}

Available on FriendliAI

Dedicated Endpoints

Run inference for this model on a single-tenant GPU with unmatched speed and reliability at scale.


Container

Run inference for this model with full control and performance in your own environment.