Quick start
pip install transformers==4.57.6 torch huggingface_hub soundfile librosa sentencepiece protobuf
import re
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from transformers.audio_utils import load_audio
MODEL_ID = "syvai/cohere-transcribe-diarize"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_ID, dtype=torch.bfloat16
).to("cuda").eval()
PROMPT_TOKENS = [
    "<|startofcontext|>", "<|startoftranscript|>",
    "<|emo:undefined|>", "<|en|>", "<|en|>",
    "<|pnc|>", "<|noitn|>", "<|timestamp|>", "<|diarize|>",
]
prompt_ids = torch.tensor(
    [[processor.tokenizer.convert_tokens_to_ids(t) for t in PROMPT_TOKENS]]
).to(model.device)
audio = load_audio("clip.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = {k: v.to(model.device, dtype=model.dtype if v.is_floating_point() else None)
          for k, v in inputs.items()}
with torch.inference_mode():
    out = model.generate(
        input_features=inputs["input_features"],
        attention_mask=torch.ones(inputs["input_features"].shape[:2], device=model.device),
        decoder_input_ids=prompt_ids,
        max_new_tokens=400,
        do_sample=False,
        repetition_penalty=1.2,
    )
raw = processor.tokenizer.decode(out[0], skip_special_tokens=False)
print(raw)
Parsing the output into structured segments
SEG_RE = re.compile(r"<\|spltoken(\d+)\|><\|t:(\d+\.\d+)\|>(.*?)<\|t:(\d+\.\d+)\|>", re.DOTALL)
text = raw.split("<|diarize|>", 1)[-1].replace("<|endoftext|>", "")
segments = [
    {
        "speaker": int(m.group(1)),
        "start": float(m.group(2)),
        "end": float(m.group(4)),
        "text": re.sub(r"<\|[^|]+\|>", "", m.group(3)).strip(),
    }
    for m in SEG_RE.finditer(text)
]
for s in segments:
    print(f"[{s['start']:6.2f}–{s['end']:6.2f}] SPK{s['speaker']:02d} {s['text']}")
Output:
[ 0.00– 1.50] SPK00 Welcome back.
[ 1.50– 2.40] SPK01 Thanks for having me.
[ 2.40– 3.80] SPK00 Let's get into it.
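To see what the regex expects, here is a minimal, self-contained check on a hand-written token stream. The raw string is synthetic: its layout is inferred from SEG_RE and is only assumed to resemble a real decode. `parse_segments` and `raw_example` are illustrative names, not part of the repo.

```python
import re

# Same pattern as SEG_RE above: speaker token, start time, text, end time.
SEG_RE = re.compile(
    r"<\|spltoken(\d+)\|><\|t:(\d+\.\d+)\|>(.*?)<\|t:(\d+\.\d+)\|>", re.DOTALL
)


def parse_segments(raw: str) -> list:
    """Turn a raw decoded string into speaker/start/end/text dicts."""
    # Keep only what follows the prompt, and drop the end-of-text marker.
    text = raw.split("<|diarize|>", 1)[-1].replace("<|endoftext|>", "")
    return [
        {
            "speaker": int(m.group(1)),
            "start": float(m.group(2)),
            "end": float(m.group(4)),
            # Remove any other special tokens left inside the segment text.
            "text": re.sub(r"<\|[^|]+\|>", "", m.group(3)).strip(),
        }
        for m in SEG_RE.finditer(text)
    ]


# Synthetic example, NOT real model output.
raw_example = (
    "<|timestamp|><|diarize|>"
    "<|spltoken0|><|t:0.00|> Welcome back.<|t:1.50|>"
    "<|spltoken1|><|t:1.50|> Thanks for having me.<|t:2.40|>"
    "<|endoftext|>"
)
segments_example = parse_segments(raw_example)
```

Each match yields one dict, so a turn that lacks a closing timestamp token would be silently skipped by this pattern.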
The model uses 8 reusable speaker slots per clip (<|spltoken0|>…<|spltoken7|>). IDs are local to the clip — there is no global identity across separately decoded clips. For long-form audio that's split into windows, re-link windows with the helper below.
Audio longer than 30 s exceeds the encoder's maximum window. Two helpers in this repo do the windowing + cross-chunk speaker matching for you:
- diarize_long_vllm.py (recommended): calls a local vLLM server concurrently (continuous batching) and reuses one GPU for both decode and embedding. ~44× RTF on a 10-min clip on a single 3090.
- diarize_long.py: transformers-only fallback, no server needed. Slower (~7× RTF on the same clip) but minimal deps.
Both helpers:
- Slide 28 s windows with 2 s overlap over the full audio
- Decode each window with this model
- Embed each parsed segment with ReDimNet2 B6 (12 M params, 0.17 % EER, loaded automatically via torch.hub)
- Cluster embeddings globally with cosine-distance AHC so the same speaker keeps the same ID across windows
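The windowing step can be sketched as follows. This is not the helpers' actual code: `sliding_windows` is a hypothetical function, and truncating the last window at the end of the audio is an assumption about how the boundary is handled.

```python
def sliding_windows(duration_s, chunk_s=28.0, overlap_s=2.0):
    """Return (start, end) pairs in seconds covering [0, duration_s]."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    if duration_s <= 0:
        return []
    hop_s = chunk_s - overlap_s  # 26 s between consecutive window starts
    windows = []
    start = 0.0
    while True:
        # Assumption: the final window is cut short rather than padded.
        end = min(start + chunk_s, float(duration_s))
        windows.append((start, end))
        if end >= duration_s:
            break
        start += hop_s
    return windows


two_minute_windows = sliding_windows(120.0)
```

With the default 28 s chunk and 2 s overlap, a 2-minute clip is covered by 5 windows, the last one running from 104 s to 120 s.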
# Assumes vLLM is already serving (see next section)
python diarize_long_vllm.py podcast.wav \
--vllm http://127.0.0.1:8000 \
--model syvai/cohere-transcribe-diarize \
--language en \
--tau 0.45 \
--concurrency 32 \
--embed-batch 32
Or via the offline transformers helper (slower, no server):
from diarize_long import diarize_long_audio
segments = diarize_long_audio(
    audio="podcast.wav",
    diar_model_id="syvai/cohere-transcribe-diarize",
    language="en",
    chunk_s=28.0,
    overlap_s=2.0,
    cluster_threshold=0.45,
)
Additional dependencies for long-form inference: numpy, scipy, soundfile, torchaudio (required by ReDimNet2's feature extractor), plus aiohttp if using diarize_long_vllm.py.
Tuning the clustering threshold. cluster_threshold is the cosine-distance ceiling for AHC merges over ReDimNet2 embeddings. Around 0.45 is a good default for podcast / panel-style audio: a 2-min Bernie Sanders town-hall clip cleanly resolves Bernie as one consistent ID across all 5 sliding windows and the host as a second ID, while short audience interjections get their own IDs. Drop to 0.30–0.35 if the audio has many similar-sounding speakers; raise to 0.50–0.55 for noisier conditions where you'd rather collapse near-duplicate IDs.
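To make the role of the threshold concrete, here is a dependency-free sketch of agglomerative clustering over cosine distance. It assumes average linkage, which the text above does not specify, and it is not the helpers' implementation (a real one would more likely use scipy). `ahc_labels` and the toy 2-D vectors are illustrative only; embeddings are assumed to be non-zero.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def ahc_labels(embeddings, threshold=0.45):
    """Average-linkage agglomerative clustering; one label per embedding."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if dist > threshold:
            break  # no remaining pair is close enough to merge
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(embeddings)
    # Number clusters by their earliest member so labels are stable.
    for label, members in enumerate(sorted(clusters, key=min)):
        for i in members:
            labels[i] = label
    return labels


# Toy embeddings: two near-duplicates along each axis.
example = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.98]]
example_labels = ahc_labels(example, threshold=0.45)
```

On the toy data the default threshold yields two speakers; a very low threshold keeps all four segments separate, and a very high one collapses them into a single ID, which mirrors the trade-off described above.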
Serving with vLLM (recommended)
The transformers code path above works but is single-stream. For production we run this model on vLLM 0.19.0 (note: 0.19.1 is broken) — it gives continuous batching, a custom OpenAI-compatible diarized_json response format, and ~25× higher peak throughput than calling model.generate() in a loop.
One-time setup
Two scripts ship with this repo to handle the setup — both idempotent:
# Download the model locally first, then patch it
hf download syvai/cohere-transcribe-diarize --local-dir cohere-transcribe-diarize
# 1. Reshape the checkpoint files for vLLM compatibility
python fix_for_vllm.py ./cohere-transcribe-diarize
fix_for_vllm.py makes three edits to your local copy:
- tokenizer_config.json: drops the legacy extra_special_tokens list (transformers 4.57+ expects a dict; the actual tokens are still in tokenizer.json).
- config.json: sets head.num_classes and transf_decoder.config_dict.vocab_size to 16684 (the resized vocab).
- model.safetensors: strips the "model." prefix from weight names and drops the BatchNorm num_batches_tracked tensors, which vLLM's CohereAsr model doesn't register.
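The model.safetensors edit amounts to a key-renaming pass over the checkpoint. The sketch below is not the code in fix_for_vllm.py; it illustrates the idea on a plain dict, and both `rename_for_vllm` and the tensor names in the example are hypothetical.

```python
def rename_for_vllm(state_dict):
    """Strip a leading 'model.' from each key and drop num_batches_tracked."""
    cleaned = {}
    for name, tensor in state_dict.items():
        if name.endswith("num_batches_tracked"):
            continue  # BatchNorm bookkeeping counter, not a learned weight
        # removeprefix only touches a leading "model.", never one mid-name.
        cleaned[name.removeprefix("model.")] = tensor
    return cleaned


# Hypothetical key names; values would be tensors in a real checkpoint.
example_state = {
    "model.encoder.conv.weight": "w0",
    "model.encoder.bn.num_batches_tracked": "n0",
    "head.weight": "w1",
}
example_cleaned = rename_for_vllm(example_state)
```

Because keys that lack the prefix pass through untouched, running the function twice gives the same result as running it once, which is the property that makes such a script idempotent. On a real checkpoint the dict would typically come from `safetensors.torch.load_file` and be written back with `safetensors.torch.save_file`.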
# 2. Install vLLM 0.19.0 (NOT 0.19.1 — broken)
uv pip install "vllm==0.19.0" --torch-backend=cu128
uv pip install librosa
# 3. Patch vLLM's speech_to_text endpoint to add diarized_json
python vllm_diarized_patch.py
vllm_diarized_patch.py applies five edits inside the installed vLLM (also idempotent):
- protocol.py: add "diarized_json" to the AudioResponseFormat enum
- protocol.py: force skip_special_tokens=False in to_sampling_params so <|spltoken*|> and <|t:*|> survive into the response text
- speech_to_text.py: let the validator accept response_format="diarized_json"
- speech_to_text.py: parse the raw token stream with the segment regex and return OpenAI-compatible {task, language, duration, text, segments:[{speaker, start, end, text}], speakers, usage} JSON
- pass returns through unchanged (otherwise the diarized branch's return value gets misinterpreted as a streaming generator and the response body comes out empty)
Launch the server
vllm serve ./cohere-transcribe-diarize \
--served-model-name syvai/cohere-transcribe-diarize \
--trust-remote-code \
--host 127.0.0.1 --port 8000 \
--gpu-memory-utilization 0.55 # leaves ~10 GB for ReDimNet2 batching
--gpu-memory-utilization 0.55 is the sweet spot on a 24 GB card when you also run ReDimNet2 on the same GPU for long-form. If you only need short-form decode (≤ 30 s, no cross-chunk linking), bump it to 0.85 for better KV cache headroom.
Call the API
The endpoint follows the OpenAI transcription API; request diarized output with response_format=diarized_json:
curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
-F "file=@clip.wav" \
-F "model=syvai/cohere-transcribe-diarize" \
-F "language=en" \
-F "response_format=diarized_json" \
--form-string "prompt=<|startofcontext|><|startoftranscript|><|emo:undefined|><|en|><|en|><|pnc|><|noitn|><|timestamp|><|diarize|>"
Response shape (mirrors OpenAI's gpt-4o-transcribe-diarize):
{
  "task": "transcribe",
  "language": "en",
  "duration": 28.0,
  "text": "UM I REJECT THE IDEA I REALLY DO ...",
  "segments": [
    {"speaker": "SPEAKER_00", "start": 2.5, "end": 3.8, "text": "I REALLY DO"},
    {"speaker": "SPEAKER_01", "start": 3.6, "end": 15.0, "text": "IT'S ONE OF THINGS THAT BOTHERS ME ..."},
    {"speaker": "SPEAKER_02", "start": 15.5, "end": 28.0, "text": "IS RAISING A STARVATION MINIMUM WAGE ..."}
  ],
  "speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
  "usage": {"type": "duration", "seconds": 28}
}
The prompt field must be passed explicitly — vLLM's default prompt builder emits <|nodiarize|> which suppresses the speaker tokens.
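For readers who parse the raw output themselves, the response shape above can be approximated from a list of parsed segments. This is a sketch under assumptions, not the patch's code: `to_diarized_json` is a hypothetical name, and the rules used here for the top-level text (segments joined by spaces), speaker ordering (first appearance), and usage rounding are guesses.

```python
def to_diarized_json(segments, duration_s, language="en"):
    """Assemble a diarized_json-style dict from parsed segments."""
    out_segments = [
        {
            # Integer slot 0..7 becomes a zero-padded label such as SPEAKER_00.
            "speaker": f"SPEAKER_{seg['speaker']:02d}",
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
        }
        for seg in segments
    ]
    # Assumption: speakers are listed in order of first appearance.
    speakers = list(dict.fromkeys(s["speaker"] for s in out_segments))
    return {
        "task": "transcribe",
        "language": language,
        "duration": duration_s,
        # Assumption: top-level text is the segment texts joined by spaces.
        "text": " ".join(s["text"] for s in out_segments),
        "segments": out_segments,
        "speakers": speakers,
        "usage": {"type": "duration", "seconds": round(duration_s)},
    }


example_response = to_diarized_json(
    [
        {"speaker": 0, "start": 0.0, "end": 1.5, "text": "Welcome back."},
        {"speaker": 1, "start": 1.5, "end": 2.4, "text": "Thanks for having me."},
        {"speaker": 0, "start": 2.4, "end": 3.8, "text": "Let's get into it."},
    ],
    duration_s=3.8,
)
```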
Measured throughput (RTX 3090, 28 s clips)
| Concurrency | Throughput |
|---|---|
| 1 | 22× audio/wall |
| 8 | 117× |
| 32 | 171× |
| 128 | 249× (peak) |
vLLM does continuous (in-flight) batching automatically — fire concurrent requests at the endpoint and it batches them through one forward pass.
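On the client side, firing concurrent requests usually means capping the number of calls in flight. A minimal sketch with asyncio follows; `transcribe_one` is a hypothetical coroutine standing in for the HTTP POST (for example via aiohttp), and `fake_transcribe` exists only so the example runs without a server.

```python
import asyncio


async def transcribe_all(clips, transcribe_one, concurrency=32):
    """Run transcribe_one(clip) for every clip, at most `concurrency` at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(clip):
        async with semaphore:  # blocks while `concurrency` calls are in flight
            return await transcribe_one(clip)

    # gather returns results in the same order as the input clips.
    return await asyncio.gather(*(guarded(c) for c in clips))


async def fake_transcribe(clip):
    # Stand-in for a POST to /v1/audio/transcriptions; no network involved.
    await asyncio.sleep(0.01)
    return f"transcript of {clip}"


results = asyncio.run(
    transcribe_all(["a.wav", "b.wav", "c.wav"], fake_transcribe, concurrency=2)
)
```

The semaphore limit plays the same role as the --concurrency flag of diarize_long_vllm.py: the server does the batching, the client only decides how many requests it keeps open.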
Training
This model was produced by full fine-tuning of CohereLabs/cohere-transcribe-03-2026 on English diarization data. The base vocabulary was extended with 8 speaker tokens and 300 timestamp tokens at 100 ms resolution (300 × 0.1 s = 30 s, one decoder window); the new rows of the embedding and LM-head matrices were initialised from the existing token embedding statistics.
| Dataset | Rows | Description |
|---|---|---|
| AMI SDM (train split) | 19,928 | Single-distant-microphone meeting recordings, sliding 28 s windows with 14 s hop, up to 4 simultaneous speakers per window. Provides realistic multi-speaker conversation with overlap, hesitations, and turn-taking. |
| LibriSpeech synthetic mix | 11,813 | Synthetic K-speaker mixtures (K weighted 0.2 / 0.3 / 0.3 / 0.2 for K=1…4) constructed from LibriSpeech utterances, with realistic gap silences. Provides clean cross-talk-free speaker examples to anchor the diarization head. |
| Total | 31,741 | All segments are ≤ 30 s and capped at K ≤ 4 speakers. |
Training ran for 2 epochs at peak LR 3e-4 (linear warmup over 100 optimizer steps, then linear decay to 0). Effective batch size 128 (per-device batch 2 × 64 gradient-accumulation), bf16, gradient checkpointing, AdamW8bit optimizer. The full fine-tune updates all 2 B parameters. repetition_penalty=1.2 is baked into the generation config and is required at inference — without it, K=4 outputs occasionally loop on a single speaker token.
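The schedule and step count implied by these numbers can be worked out in a few lines. This is a sketch, not the training script: `lr_at` is a hypothetical function with the same shape as a standard linear warmup/decay schedule, and the step count assumes the final partial batch of each epoch is kept.

```python
import math


def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=100):
    """Linear warmup to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


ROWS, EPOCHS = 31_741, 2
EFFECTIVE_BATCH = 2 * 64  # per-device batch x gradient-accumulation steps = 128

# Assumption: a trailing partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(ROWS / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS
```

Under these assumptions the run is about 496 optimizer steps, so the 100-step warmup covers roughly the first fifth of training.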
Limitations
- 30 s hard cap per decoder pass: use diarize_long for longer audio. The Cohere feature extractor batches longer clips into multiple chunks, which the diarization decoder is not trained to consume.
- K ≤ 4 well-supported, K = 5–8 still emit but accuracy degrades on dense overlapping speech.
- Real-time factor ≈ 14× on RTX 3090 at bf16 — the 2 B autoregressive decoder is the bottleneck. For >100× RTF on long audio, pair with a smaller segmenter (e.g. DiariZen-base) or use this model only on the highlight regions.
- Speaker IDs are local to each generate call. Always cluster embeddings across windows when working with audio that crosses the 30 s boundary.
Citation
If you use this model, please cite Cohere Labs' base release alongside this fine-tune:
@misc{cohere-transcribe-diarize-2026,
author = {{syv.ai}},
title = {Cohere Transcribe — Diarize + Timestamps (English)},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/syvai/cohere-transcribe-diarize}},
}
License
Apache 2.0, inherited from the base model.