License: apache-2.0
Quick start
```bash
pip install transformers==4.57.6 torch huggingface_hub soundfile librosa sentencepiece protobuf
```
```python
import re

import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from transformers.audio_utils import load_audio

MODEL_ID = "syvai/cohere-transcribe-diarize"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_ID, dtype=torch.bfloat16).to("cuda").eval()

# Prompt that activates diarization + timestamps. The base Cohere model
# uses special control tokens to switch features on/off; we keep that contract.
# `<|en|><|en|>` is the canonical Cohere prompt: the two slots are
# audio-language + transcript-language; setting them to the same code means
# "transcribe" (different codes would be "translate"). To run on another
# Cohere language, swap BOTH tokens, e.g. `<|de|><|de|>`.
# Each `<|...|>` is a single special token in the tokenizer vocab. Resolve
# via convert_tokens_to_ids: running the prompt string through the tokenizer
# re-tokenizes each marker into 6-12 subword pieces, which weakens the
# control-token signal the model trained on.
PROMPT_TOKENS = [
    "<|startofcontext|>", "<|startoftranscript|>",
    "<|emo:undefined|>", "<|en|>", "<|en|>",
    "<|pnc|>", "<|noitn|>", "<|timestamp|>", "<|diarize|>",
]
prompt_ids = torch.tensor(
    [[processor.tokenizer.convert_tokens_to_ids(t) for t in PROMPT_TOKENS]]
).to(model.device)

# Load any ≤ 30 s audio clip.
audio = load_audio("clip.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = {
    k: v.to(model.device, dtype=model.dtype if v.is_floating_point() else None)
    for k, v in inputs.items()
}

with torch.inference_mode():
    out = model.generate(
        input_features=inputs["input_features"],
        attention_mask=torch.ones(inputs["input_features"].shape[:2], device=model.device),
        decoder_input_ids=prompt_ids,
        max_new_tokens=400,
        do_sample=False,
        repetition_penalty=1.2,  # baked into generation_config but explicit here
    )

raw = processor.tokenizer.decode(out[0], skip_special_tokens=False)
print(raw)
# → <|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>...
```
Parsing the output into structured segments
```python
# Continues from the quick-start snippet: uses `re` and `raw` defined there.
SEG_RE = re.compile(
    r"<\|spltoken(\d+)\|><\|t:(\d+\.\d+)\|>(.*?)<\|t:(\d+\.\d+)\|>",
    re.DOTALL,
)

# Drop the prompt prefix; the diarized text follows <|diarize|>
text = raw.split("<|diarize|>", 1)[-1].replace("<|endoftext|>", "")

segments = [
    {
        "speaker": int(m.group(1)),
        "start": float(m.group(2)),
        "end": float(m.group(4)),
        # Strip any remaining control tokens from the segment text
        "text": re.sub(r"<\|[^|]+\|>", "", m.group(3)).strip(),
    }
    for m in SEG_RE.finditer(text)
]

for s in segments:
    print(f"[{s['start']:6.2f}–{s['end']:6.2f}] SPK{s['speaker']:02d} {s['text']}")
```
Output:
```text
[  0.00–  1.50] SPK00 Welcome back.
[  1.50–  2.40] SPK01 Thanks for having me.
[  2.40–  3.80] SPK00 Let's get into it.
```
The model uses 8 reusable speaker slots per clip (<|spltoken0|>…<|spltoken7|>). IDs are local to the clip — there is no global identity across separately decoded clips. For long-form audio that's split into windows, re-link windows with the helper below.
Long-form audio (> 30 s)
Audio longer than 30 s exceeds the encoder's maximum window. Two helpers in this repo do the windowing + cross-chunk speaker matching for you:
- diarize_long_vllm.py: recommended. Calls a local vLLM server concurrently (continuous batching) and reuses one GPU for both decode and embedding. ~44× RTF on a 10-min clip on a single 3090.
- diarize_long.py: transformers-only fallback, no server needed. Slower (~7× RTF on the same clip) but minimal deps.
Both helpers:
- Slide 28 s windows with 2 s overlap over the full audio
- Decode each window with this model
- Embed each parsed segment with ReDimNet2 B6 (12 M params, 0.17 % EER, loaded automatically via torch.hub)
- Cluster embeddings globally with cosine-distance AHC so the same speaker keeps the same ID across windows
```bash
# Assumes vLLM is already serving (see next section)
python diarize_long_vllm.py podcast.wav \
    --vllm http://127.0.0.1:8000 \
    --model syvai/cohere-transcribe-diarize \
    --language en \
    --tau 0.45 \
    --concurrency 32 \
    --embed-batch 32
```
Or via the offline transformers helper (slower, no server):
```python
from diarize_long import diarize_long_audio

segments = diarize_long_audio(
    audio="podcast.wav",
    diar_model_id="syvai/cohere-transcribe-diarize",
    language="en",
    chunk_s=28.0,
    overlap_s=2.0,
    cluster_threshold=0.45,
)
```
Additional dependencies for long-form inference: numpy, scipy, soundfile, torchaudio (required by ReDimNet2's feature extractor), plus aiohttp if using diarize_long_vllm.py.
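The windowing step that both helpers perform (28 s windows, 2 s overlap) can be sketched in a few lines. This is a minimal illustration, not the helpers' actual code: the function name `sliding_windows` and the choice to clip the last window at the end of the audio are assumptions.

```python
def sliding_windows(duration_s, chunk_s=28.0, overlap_s=2.0):
    """Return (start, end) pairs in seconds covering [0, duration_s].

    Hypothetical helper: consecutive windows share `overlap_s` seconds,
    and the final window is clipped to the end of the audio.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    if duration_s <= 0:
        return []
    hop = chunk_s - overlap_s  # 26 s with the defaults
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += hop
    return windows


if __name__ == "__main__":
    for start, end in sliding_windows(120.0):
        print(f"{start:6.1f} s to {end:6.1f} s")
```

With the defaults the hop is 26 s, so a 2-minute clip produces five windows. Every window stays under the 30 s decoder limit.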
Tuning the clustering threshold. cluster_threshold is the cosine-distance ceiling for AHC merges over ReDimNet2 embeddings. Around 0.45 is a good default for podcast / panel-style audio: a 2-min Bernie Sanders town-hall clip cleanly resolves Bernie as one consistent ID across all 5 sliding windows and the host as a second ID, while short audience interjections get their own IDs. Drop to 0.30–0.35 if the audio has many similar-sounding speakers; raise to 0.50–0.55 for noisier conditions where you'd rather collapse near-duplicate IDs.
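To make the role of the threshold concrete, here is a small pure-Python sketch of threshold-based agglomerative clustering over cosine distance. It is an illustration only: the helpers in this repo may use a different linkage or library, and average linkage, the function names, and the toy 2-D vectors (standing in for ReDimNet2 embeddings) are all assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def ahc_labels(embeddings, threshold=0.45):
    """Merge the closest clusters until the best merge exceeds `threshold`.

    Average linkage is an assumption made for this sketch.
    Returns one integer label per embedding, numbered by first appearance.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        total = sum(cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:  # the threshold is a ceiling on merge distance
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(embeddings)
    for label, members in enumerate(sorted(clusters, key=min)):
        for m in members:
            labels[m] = label
    return labels


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
    print(ahc_labels(toy, threshold=0.45))
```

Lowering the threshold stops merging earlier and yields more speaker IDs; raising it collapses near-duplicates, which matches the tuning advice above.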
Serving with vLLM (recommended)
The transformers code path above works but is single-stream. For production we run this model on vLLM 0.19.0 (note: 0.19.1 is broken) — it gives continuous batching, a custom OpenAI-compatible diarized_json response format, and ~25× higher peak throughput than calling model.generate() in a loop.
One-time setup
Two scripts ship with this repo to handle the setup — both idempotent:
```bash
# Download the model locally first, then patch it
hf download syvai/cohere-transcribe-diarize --local-dir cohere-transcribe-diarize

# 1. Reshape the checkpoint files for vLLM compatibility
python fix_for_vllm.py ./cohere-transcribe-diarize
```
fix_for_vllm.py makes three edits to your local copy:
- tokenizer_config.json: drops the legacy extra_special_tokens list (transformers 4.57+ expects a dict; the actual tokens are still in tokenizer.json).
- config.json: sets head.num_classes and transf_decoder.config_dict.vocab_size to 16684 (the resized vocab).
- model.safetensors: strips the "model." weight-name prefix and drops the BatchNorm num_batches_tracked tensors that vLLM's CohereAsr model doesn't register.
```bash
# 2. Install vLLM 0.19.0 (NOT 0.19.1, which is broken)
uv pip install "vllm==0.19.0" --torch-backend=cu128
uv pip install librosa

# 3. Patch vLLM's speech_to_text endpoint to add diarized_json
python vllm_diarized_patch.py
```
vllm_diarized_patch.py applies five edits inside the installed vLLM (also idempotent):
- protocol.py: add "diarized_json" to the AudioResponseFormat enum
- protocol.py: force skip_special_tokens=False in to_sampling_params so <|spltoken*|> and <|t:*|> survive into the response text
- speech_to_text.py: let the validator accept response_format="diarized_json"
- speech_to_text.py: parse the raw token stream with the segment regex and return OpenAI-compatible {task, language, duration, text, segments:[{speaker, start, end, text}], speakers, usage} JSON
- api_router.py: pass JSONResponse returns through unchanged (otherwise the diarized branch's return value gets misinterpreted as a streaming generator and the response body comes out empty)
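The parsing edit can be pictured roughly as follows. This is a sketch under stated assumptions, not the code that vllm_diarized_patch.py installs: the function name `to_diarized_json` is hypothetical, the SPEAKER_NN label format is inferred from the response example in this README, and the `duration` and `usage` fields are left out because they depend on request metadata.

```python
import re

# Same segment pattern as in the parsing section of this README.
SEG_RE = re.compile(
    r"<\|spltoken(\d+)\|><\|t:(\d+\.\d+)\|>(.*?)<\|t:(\d+\.\d+)\|>",
    re.DOTALL,
)
TAG_RE = re.compile(r"<\|[^|]+\|>")  # any leftover control token


def to_diarized_json(raw, language="en"):
    """Turn a raw token stream into a diarized_json-like dict (hypothetical helper)."""
    segments = []
    for m in SEG_RE.finditer(raw):
        segments.append(
            {
                "speaker": f"SPEAKER_{int(m.group(1)):02d}",
                "start": float(m.group(2)),
                "end": float(m.group(4)),
                "text": TAG_RE.sub("", m.group(3)).strip(),
            }
        )
    return {
        "task": "transcribe",
        "language": language,
        "text": " ".join(s["text"] for s in segments),
        "segments": segments,
        "speakers": sorted({s["speaker"] for s in segments}),
    }


if __name__ == "__main__":
    sample = "<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>"
    print(to_diarized_json(sample))
```

Because the regex only matches when the special tokens are present, this step depends on the skip_special_tokens=False edit listed above.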
Launch the server
```bash
vllm serve ./cohere-transcribe-diarize \
    --served-model-name syvai/cohere-transcribe-diarize \
    --trust-remote-code \
    --host 127.0.0.1 --port 8000 \
    --gpu-memory-utilization 0.55  # leaves ~10 GB for ReDimNet2 batching
```
--gpu-memory-utilization 0.55 is the sweet spot on a 24 GB card when you also run ReDimNet2 on the same GPU for long-form. If you only need short-form decode (≤ 30 s, no cross-chunk linking), bump it to 0.85 for better KV cache headroom.
Call the API
The endpoint follows the OpenAI transcription API; pass response_format=diarized_json to get speaker-labelled segments:
```bash
curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
    -F "file=@clip.wav" \
    -F "model=syvai/cohere-transcribe-diarize" \
    -F "language=en" \
    -F "response_format=diarized_json" \
    --form-string "prompt=<|startofcontext|><|startoftranscript|><|emo:undefined|><|en|><|en|><|pnc|><|noitn|><|timestamp|><|diarize|>"
```
Response shape (mirrors OpenAI's gpt-4o-transcribe-diarize):
```json
{
  "task": "transcribe",
  "language": "en",
  "duration": 28.0,
  "text": "UM I REJECT THE IDEA I REALLY DO ...",
  "segments": [
    {"speaker": "SPEAKER_00", "start": 2.5, "end": 3.8, "text": "I REALLY DO"},
    {"speaker": "SPEAKER_01", "start": 3.6, "end": 15.0, "text": "IT'S ONE OF THINGS THAT BOTHERS ME ..."},
    {"speaker": "SPEAKER_02", "start": 15.5, "end": 28.0, "text": "IS RAISING A STARVATION MINIMUM WAGE ..."}
  ],
  "speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
  "usage": {"type": "duration", "seconds": 28}
}
```
The prompt field must be passed explicitly — vLLM's default prompt builder emits <|nodiarize|> which suppresses the speaker tokens.
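The same request can be made from Python. The sketch below assumes the third-party `requests` package and a server running at the address used in this README; the function names `build_prompt` and `transcribe` are hypothetical. Building the prompt from a language code keeps both language slots in sync, as the quick-start comments require.

```python
def build_prompt(language="en"):
    """Assemble the control-token prompt; both language slots get the same code."""
    lang = f"<|{language}|>"
    return "".join(
        [
            "<|startofcontext|>", "<|startoftranscript|>", "<|emo:undefined|>",
            lang, lang,
            "<|pnc|>", "<|noitn|>", "<|timestamp|>", "<|diarize|>",
        ]
    )


def transcribe(path, base_url="http://127.0.0.1:8000", language="en"):
    """POST one clip (≤ 30 s) and return the parsed diarized_json response."""
    import requests  # third-party; imported lazily so build_prompt has no dependencies

    with open(path, "rb") as f:
        resp = requests.post(
            f"{base_url}/v1/audio/transcriptions",
            files={"file": f},
            data={
                "model": "syvai/cohere-transcribe-diarize",
                "language": language,
                "response_format": "diarized_json",
                "prompt": build_prompt(language),  # must be explicit, see note above
            },
            timeout=120,
        )
    resp.raise_for_status()
    return resp.json()
```

A call such as `transcribe("clip.wav")["segments"]` would then yield the list of speaker segments shown in the response shape.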
Measured throughput (RTX 3090, 28 s clips)
| Concurrency | Throughput (audio seconds per wall-clock second) |
|---|---|
| 1 | 22× |
| 8 | 117× |
| 32 | 171× |
| 128 | 249× (peak) |
vLLM does continuous (in-flight) batching automatically — fire concurrent requests at the endpoint and it batches them through one forward pass.
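As a rough reading aid, the table can be converted into wall-clock estimates. The snippet below is simple arithmetic on the measured figures; it assumes the workload is 28 s clips on the same hardware and that enough requests are in flight to sustain the listed concurrency. The name `wall_seconds` is made up for the example.

```python
# Measured throughput from the table above (RTX 3090, 28 s clips).
THROUGHPUT = {1: 22.0, 8: 117.0, 32: 171.0, 128: 249.0}


def wall_seconds(audio_seconds, concurrency):
    """Estimated wall-clock time to decode `audio_seconds` of audio."""
    return audio_seconds / THROUGHPUT[concurrency]


if __name__ == "__main__":
    for c in sorted(THROUGHPUT):
        print(f"concurrency {c:>3}: 1 h of audio in about {wall_seconds(3600, c):.0f} s")
```

These numbers cover decoding only; long-form runs also spend time on embedding and clustering.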
Training
This model was produced by full fine-tuning of CohereLabs/cohere-transcribe-03-2026 on English diarization data. The base vocabulary was extended with 8 speaker tokens and 300 timestamp tokens at 100 ms resolution (enough to cover a 30 s window); the new rows of the embedding and LM-head matrices were initialised from the existing token embedding statistics.
| Dataset | Rows | Description |
|---|---|---|
| AMI SDM (train split) | 19,928 | Single-distant-microphone meeting recordings, sliding 28 s windows with 14 s hop, up to 4 simultaneous speakers per window. Provides realistic multi-speaker conversation with overlap, hesitations, and turn-taking. |
| LibriSpeech synthetic mix | 11,813 | Synthetic K-speaker mixtures (K weighted 0.2 / 0.3 / 0.3 / 0.2 for K=1…4) constructed from LibriSpeech utterances, with realistic gap silences. Provides clean cross-talk-free speaker examples to anchor the diarization head. |
| Total | 31,741 | All segments are ≤ 30 s and capped at K ≤ 4 speakers. |
Training ran for 2 epochs at peak LR 3e-4 (linear warmup over 100 optimizer steps, then linear decay to 0). Effective batch size 128 (per-device batch 2 × 64 gradient-accumulation), bf16, gradient checkpointing, AdamW8bit optimizer. The full fine-tune updates all 2 B parameters. repetition_penalty=1.2 is baked into the generation config and is required at inference — without it, K=4 outputs occasionally loop on a single speaker token.
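The learning-rate schedule described above can be written out as a short function. This is a sketch of the stated shape (linear warmup, then linear decay to 0), not the training script. The total step count is an assumption derived from the table: 31,741 rows at an effective batch of 128, with a final partial batch kept in each epoch.

```python
import math

PEAK_LR = 3e-4
WARMUP_STEPS = 100
ROWS = 31_741
EFFECTIVE_BATCH = 2 * 64  # per-device batch x gradient accumulation = 128
EPOCHS = 2

# Assumption: a single device, and the last partial batch of each epoch is kept.
STEPS_PER_EPOCH = math.ceil(ROWS / EFFECTIVE_BATCH)
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS


def lr_at(step, total_steps=TOTAL_STEPS):
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - WARMUP_STEPS)


if __name__ == "__main__":
    print(f"steps per epoch: {STEPS_PER_EPOCH}, total: {TOTAL_STEPS}")
    for step in (0, 50, 100, 298, TOTAL_STEPS):
        print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

Under these assumptions the warmup takes up roughly the first fifth of training, so the peak rate is reached early in the first epoch.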
Limitations
- 30 s hard cap per decoder pass: use diarize_long for longer audio. The Cohere feature extractor batches longer clips into multiple chunks, which the diarization decoder is not trained to consume.
- K ≤ 4 is well supported; K = 5–8 still emits output, but accuracy degrades on dense overlapping speech.
- Real-time factor ≈ 14× on RTX 3090 at bf16 — the 2 B autoregressive decoder is the bottleneck. For >100× RTF on long audio, pair with a smaller segmenter (e.g. DiariZen-base) or use this model only on the highlight regions.
- Speaker IDs are local to each generate call. Always cluster embeddings across windows when working with audio that crosses the 30 s boundary.
Citation
If you use this model, please cite Cohere Labs' base release alongside this fine-tune:
```bibtex
@misc{cohere-transcribe-diarize-2026,
  author       = {{syv.ai}},
  title        = {Cohere Transcribe --- Diarize + Timestamps (English)},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/syvai/cohere-transcribe-diarize}},
}
```
License
Apache 2.0, inherited from the base model.