README

License: apache-2.0

Usage

Cohere Transcribe is supported natively in transformers. This is the recommended way to use the model for offline inference. For online inference, see the vLLM integration example below.

bash

pip install "transformers>=5.4.0" torch huggingface_hub soundfile librosa sentencepiece protobuf
pip install datasets # only needed for long-form and non-English examples

The model was tested with torch==2.10.0; other versions are expected to work.

Quick Start 🤗

Transcribe any audio file in a few lines:

python

from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio
from huggingface_hub import hf_hub_download
processor = AutoProcessor.from_pretrained("CohereLabs/cohere-transcribe-03-2026")
model = CohereAsrForConditionalGeneration.from_pretrained("CohereLabs/cohere-transcribe-03-2026", device_map="auto")
audio_file = hf_hub_download(
    repo_id="CohereLabs/cohere-transcribe-03-2026",
    filename="demo/voxpopuli_test_en_demo.wav",
)
audio = load_audio(audio_file, sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
inputs.to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)
print(text)

For audio longer than the feature extractor's max_audio_clip_s, the feature extractor automatically splits the waveform into chunks. The processor reassembles the per-chunk transcriptions using the returned audio_chunk_index.

This example transcribes a 55 minute earnings call:

python

from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from datasets import load_dataset
import time
processor = AutoProcessor.from_pretrained("CohereLabs/cohere-transcribe-03-2026")
model = CohereAsrForConditionalGeneration.from_pretrained("CohereLabs/cohere-transcribe-03-2026", device_map="auto")
ds = load_dataset("distil-whisper/earnings22", "full", split="test", streaming=True)
sample = next(iter(ds))
audio_array = sample["audio"]["array"]
sr = sample["audio"]["sampling_rate"]
duration_s = len(audio_array) / sr
print(f"Audio duration: {duration_s / 60:.1f} minutes")
inputs = processor(audio=audio_array, sampling_rate=sr, return_tensors="pt", language="en")
audio_chunk_index = inputs.get("audio_chunk_index")
inputs.to(model.device, dtype=model.dtype)
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True, audio_chunk_index=audio_chunk_index, language="en")[0]
elapsed = time.time() - start
rtfx = duration_s / elapsed
print(f"Transcribed in {elapsed:.1f}s — RTFx: {rtfx:.1f}")
print(f"Transcription ({len(text.split())} words):")
print(text[:500] + "...")
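
To make the chunk-and-reassemble flow described above more concrete, here is a simplified sketch in plain Python. It is not the library's implementation. The fixed-length splitting, the 30-second `max_clip_s` default, the `(sample_idx, chunk_idx)` index layout, and the helper names `split_into_chunks` and `reassemble` are all assumptions made for illustration. The real feature extractor may choose split points differently, and the real `audio_chunk_index` may have a different format.

```python
from collections import defaultdict


def split_into_chunks(waveforms, sampling_rate=16000, max_clip_s=30.0):
    """Split each waveform into clips of at most max_clip_s seconds.

    Returns a flat list of chunks plus a parallel list of
    (sample_idx, chunk_idx) pairs recording where each chunk came from.
    """
    max_len = int(max_clip_s * sampling_rate)
    chunks, chunk_index = [], []
    for sample_idx, wav in enumerate(waveforms):
        for chunk_idx, start in enumerate(range(0, len(wav), max_len)):
            chunks.append(wav[start:start + max_len])
            chunk_index.append((sample_idx, chunk_idx))
    return chunks, chunk_index


def reassemble(chunk_texts, chunk_index):
    """Join per-chunk transcriptions back into one string per input sample."""
    grouped = defaultdict(list)
    for text, (sample_idx, chunk_idx) in zip(chunk_texts, chunk_index):
        grouped[sample_idx].append((chunk_idx, text))
    # Sort chunks within each sample so the text comes out in audio order.
    return [" ".join(t for _, t in sorted(grouped[i])) for i in sorted(grouped)]


# Toy example: a 10-sample and a 25-sample "waveform" at 1 Hz, 10 s clips.
chunks, index = split_into_chunks([[0.0] * 10, [0.0] * 25], sampling_rate=1, max_clip_s=10)
print(index)
print(reassemble(["a", "b", "c", "d"], index))
```

The flat chunk list is what lets short and long recordings share one batch: each chunk is transcribed independently, and the index is the only information needed to stitch the text back together.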

Pass punctuation=False to obtain lower-cased output without punctuation marks.

python

inputs_pnc = processor(audio, sampling_rate=16000, return_tensors="pt", language="en", punctuation=True)
inputs_nopnc = processor(audio, sampling_rate=16000, return_tensors="pt", language="en", punctuation=False)

By default, punctuation is enabled.

Multiple audio files can be processed in a single call. When the batch mixes short-form and long-form audio, the processor handles chunking and reassembly.

python

from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio
processor = AutoProcessor.from_pretrained("CohereLabs/cohere-transcribe-03-2026")
model = CohereAsrForConditionalGeneration.from_pretrained("CohereLabs/cohere-transcribe-03-2026", device_map="auto")
audio_short = load_audio(
    "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
    sampling_rate=16000,
)
audio_long = load_audio(
    "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3",
    sampling_rate=16000,
)
inputs = processor([audio_short, audio_long], sampling_rate=16000, return_tensors="pt", language="en")
audio_chunk_index = inputs.get("audio_chunk_index")
inputs.to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(
    outputs, skip_special_tokens=True, audio_chunk_index=audio_chunk_index, language="en"
)
print(text)

Specify the language code to transcribe in any of the 14 supported languages. This example transcribes Japanese audio from the FLEURS dataset:

python

from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from datasets import load_dataset
processor = AutoProcessor.from_pretrained("CohereLabs/cohere-transcribe-03-2026")
model = CohereAsrForConditionalGeneration.from_pretrained("CohereLabs/cohere-transcribe-03-2026", device_map="auto")
ds = load_dataset("google/fleurs", "ja_jp", split="test", streaming=True)
ds_iter = iter(ds)
samples = [next(ds_iter) for _ in range(3)]
for sample in samples:
    audio = sample["audio"]["array"]
    sr = sample["audio"]["sampling_rate"]
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt", language="ja")
    inputs.to(model.device, dtype=model.dtype)
    outputs = model.generate(**inputs, max_new_tokens=256)
    text = processor.decode(outputs, skip_special_tokens=True)
    print(f"REF: {sample['transcription']}\nHYP: {text}\n")
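
The REF/HYP pairs printed by the loop above can be turned into an error rate. The following is a minimal sketch of word error rate (WER) and character error rate (CER) based on Levenshtein distance. The helper names `edit_distance` and `error_rate` are hypothetical, and no text normalization is applied, so the numbers it produces will not match leaderboard figures, which typically normalize casing and punctuation first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp, unit="word"):
    """WER when unit="word", CER when unit="char" (spaces ignored for CER)."""
    if unit == "word":
        ref_units, hyp_units = ref.split(), hyp.split()
    else:
        ref_units, hyp_units = list(ref.replace(" ", "")), list(hyp.replace(" ", ""))
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


print(error_rate("the cat sat", "the cat sit"))            # one of three words differs
print(error_rate("こんにちは", "こんにちわ", unit="char"))  # one of five characters differs
```

Character-level scoring suits languages such as Japanese that are not whitespace-delimited, since splitting on spaces would treat a whole sentence as a single "word".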

vLLM Integration

For production serving we recommend running via vLLM following the instructions below.

First install vLLM (refer to vLLM installation instructions):

bash

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install -U vllm==0.19.0 --torch-backend=auto
uv pip install "vllm[audio]"
uv pip install librosa

Start vLLM server

bash

vllm serve CohereLabs/cohere-transcribe-03-2026 --trust-remote-code

Send request

bash

curl -v -X POST http://localhost:8000/v1/audio/transcriptions \
-H "Authorization: Bearer $VLLM_API_KEY" \
-F "file=@$(realpath ${AUDIO_PATH})" \
-F "model=CohereLabs/cohere-transcribe-03-2026"
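
The same request can be sent from Python. The sketch below uses only the standard library to build the multipart form that the curl command sends. The helper names `build_transcription_request` and `transcribe` are hypothetical, and the assumption that the server replies with a JSON object containing a `text` field follows the OpenAI-style transcription response; verify this against your vLLM version.

```python
import json
import os
import urllib.request
import uuid


def build_transcription_request(audio_path, model, base_url="http://localhost:8000", api_key=None):
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    filename = os.path.basename(audio_path)
    body = b"".join([
        # "model" form field
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'.encode(),
        # "file" form field carrying the raw audio bytes
        (f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    url = f"{base_url}/v1/audio/transcriptions"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def transcribe(audio_path, model, **kwargs):
    """Send the request and return the transcript (assumes a JSON reply with a "text" key)."""
    request = build_transcription_request(audio_path, model, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]
```

With the server from the previous step running, a call would look like `transcribe(os.environ["AUDIO_PATH"], "CohereLabs/cohere-transcribe-03-2026", api_key=os.environ.get("VLLM_API_KEY"))`.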

Results

Link to the live leaderboard: Open ASR Leaderboard.

Human-preference results

We observe similarly strong performance in human evaluations, where trained annotators assess transcription quality across real-world audio for accuracy, coherence and usability. The consistency between automated metrics and human judgments suggests that the model’s improvements translate beyond controlled benchmarks to practical transcription settings.

Figure: Human preference evaluation of model transcripts. In a head-to-head comparison, annotators were asked to prefer generations that primarily preserved meaning, while also avoiding hallucination, correctly identifying named entities, and providing verbatim transcripts with appropriate formatting. A score of 50% or higher indicates that Cohere Transcribe was preferred on average in the comparison.

Figure: Per-language error rate averaged over the FLEURS, Common Voice 17.0, MLS and Wenet test sets (where relevant for a given language). CER is reported for zh, ja and ko; WER otherwise.

Resources

For more details and results:

Strengths and Limitations

Cohere Transcribe is a performant, dedicated ASR model intended for efficient speech transcription.

Strengths

Cohere Transcribe demonstrates best-in-class transcription accuracy in 14 languages. As a dedicated speech recognition model, it is also efficient, running up to three times faster (measured by real-time factor) than other dedicated ASR models in the same size range. The model was trained from scratch, and from the outset we focused on maximizing transcription accuracy while keeping production readiness in mind.

Limitations

  • Single language. The model performs best on audio in a single, pre-specified language from the 14 it supports. It does not feature explicit, automatic language detection and exhibits inconsistent performance on code-switched audio.

  • Timestamps/Speaker diarization. The model does not feature either of these.

  • Silence. Like most AED speech models, Cohere Transcribe is eager to transcribe, even non-speech sounds. The model thus benefits from a noise gate or VAD (voice activity detection) model placed in front of it, which prevents low-volume floor noise from turning into hallucinations.
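
As an illustration of the noise-gate idea in the last limitation, here is a minimal energy-based gate in plain Python. It is a sketch under stated assumptions, not a recommended preprocessing pipeline: the function name `noise_gate`, the 20 ms frame size, and the -40 dBFS threshold are all hypothetical choices, and samples are assumed to be floats in [-1, 1]. A trained VAD model will generally be more robust than a fixed threshold.

```python
import math


def noise_gate(samples, sampling_rate=16000, frame_ms=20, threshold_db=-40.0):
    """Zero out frames whose RMS level falls below threshold_db (dB relative to full scale)."""
    frame_len = max(1, int(sampling_rate * frame_ms / 1000))
    gated = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        # Keep frames at or above the threshold, silence the rest.
        gated.extend(frame if level_db >= threshold_db else [0.0] * len(frame))
    return gated


# One quiet frame (about -60 dBFS) followed by one loud frame (about -6 dBFS).
audio = [0.001] * 320 + [0.5] * 320
cleaned = noise_gate(audio)
print(cleaned[0], cleaned[320])
```

The gated signal keeps its original length, so it can be passed to the processor in place of the raw waveform without affecting duration calculations.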

Ecosystem support 🚀

Cohere Transcribe is supported on the following libraries/platforms:

If you have added support for the model somewhere not included above, please raise an issue/PR!

If you find issues with any of these, please raise an issue with the respective library.

Model Card Contact

For errors or additional questions about details in this model card, contact labs@cohere.com or raise an issue.

Terms of Use: We hope that releasing the weights of a highly performant 2 billion parameter model to researchers all over the world will make community-based research efforts more accessible. This model is governed by an Apache 2.0 license.

Citation

To cite this model please use the following bibtex:

bibtex

@misc{julian_mack_2026,
author = { Julian Mack and Ekagra Ranjan and Walter Beller-Morales and Bharat Venkitesh and Pierre Richemond },
title = { cohere-transcribe-03-2026 (Revision d96e814) },
year = 2026,
url = { https://huggingface.co/CohereLabs/cohere-transcribe-03-2026 },
doi = { 10.57967/hf/8653 },
publisher = { Hugging Face }
}
