
License: apache-2.0

Quick start

bash

pip install transformers==4.57.6 torch huggingface_hub soundfile librosa sentencepiece protobuf

python

import re
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from transformers.audio_utils import load_audio

MODEL_ID = "syvai/cohere-transcribe-diarize"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_ID, dtype=torch.bfloat16
).to("cuda").eval()

# Prompt that activates diarization + timestamps. The base Cohere model
# uses special control tokens to switch features on/off; we keep that contract.
# `<|en|><|en|>` is the canonical Cohere prompt: the two slots are
# audio-language + transcript-language; setting them to the same code means
# "transcribe" (different codes would be "translate"). To run on another
# Cohere language, swap BOTH tokens, e.g. `<|de|><|de|>`.
# Each `<|...|>` is a single special token in the tokenizer vocab. Resolve
# via convert_tokens_to_ids; running the prompt string through the tokenizer
# re-tokenizes each marker into 6-12 subword pieces, which weakens the
# control-token signal the model trained on.
PROMPT_TOKENS = [
    "<|startofcontext|>", "<|startoftranscript|>",
    "<|emo:undefined|>", "<|en|>", "<|en|>",
    "<|pnc|>", "<|noitn|>", "<|timestamp|>", "<|diarize|>",
]
prompt_ids = torch.tensor(
    [[processor.tokenizer.convert_tokens_to_ids(t) for t in PROMPT_TOKENS]]
).to(model.device)

# Load any ≤ 30 s audio clip.
audio = load_audio("clip.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = {k: v.to(model.device, dtype=model.dtype if v.is_floating_point() else None)
          for k, v in inputs.items()}

with torch.inference_mode():
    out = model.generate(
        input_features=inputs["input_features"],
        attention_mask=torch.ones(inputs["input_features"].shape[:2], device=model.device),
        decoder_input_ids=prompt_ids,
        max_new_tokens=400,
        do_sample=False,
        repetition_penalty=1.2,  # baked into generation_config but explicit here
    )

raw = processor.tokenizer.decode(out[0], skip_special_tokens=False)
print(raw)
# → <|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>...
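
The language swap described in the comments above can be wrapped in a small helper. This is a minimal sketch: `build_prompt_tokens` is a hypothetical name that does not exist in the repo, and it only covers the transcribe case, where both language slots carry the same code.

```python
def build_prompt_tokens(language: str = "en") -> list[str]:
    """Return the control-token prompt for diarized, timestamped transcription.

    Hypothetical helper: both language slots get the same code, which the
    comments above describe as "transcribe" rather than "translate".
    """
    lang_token = f"<|{language}|>"
    return [
        "<|startofcontext|>", "<|startoftranscript|>",
        "<|emo:undefined|>", lang_token, lang_token,
        "<|pnc|>", "<|noitn|>", "<|timestamp|>", "<|diarize|>",
    ]


german_prompt = build_prompt_tokens("de")
print("".join(german_prompt))
```

Each returned token should still be resolved with convert_tokens_to_ids, as in the quick start, rather than by tokenizing the joined string.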

Parsing the output into structured segments

python

SEG_RE = re.compile(r"<\|spltoken(\d+)\|><\|t:(\d+\.\d+)\|>(.*?)<\|t:(\d+\.\d+)\|>", re.DOTALL)

# Drop the prompt prefix; the diarized text follows <|diarize|>
text = raw.split("<|diarize|>", 1)[-1].replace("<|endoftext|>", "")
segments = [
    {
        "speaker": int(m.group(1)),
        "start": float(m.group(2)),
        "end": float(m.group(4)),
        # Strip any leftover control tokens from the segment text
        "text": re.sub(r"<\|[^|]+\|>", "", m.group(3)).strip(),
    }
    for m in SEG_RE.finditer(text)
]
for s in segments:
    print(f"[{s['start']:6.2f}–{s['end']:6.2f}] SPK{s['speaker']:02d} {s['text']}")

Output:

text

[ 0.00– 1.50] SPK00 Welcome back.
[ 1.50– 2.40] SPK01 Thanks for having me.
[ 2.40– 3.80] SPK00 Let's get into it.

The model uses 8 reusable speaker slots per clip (<|spltoken0|> through <|spltoken7|>). IDs are local to the clip: there is no global identity across separately decoded clips. For long-form audio that is split into windows, re-link speakers across windows with the helpers below.
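
Once segments are parsed, per-speaker statistics take only a few lines. The following is a minimal sketch that assumes the `segments` list shape produced by the parser above; `talk_time` is an illustrative name, not something the repo provides. Because speaker IDs are clip-local, the totals are only meaningful within one clip.

```python
from collections import defaultdict


def talk_time(segments: list[dict]) -> dict[int, float]:
    """Sum segment durations per clip-local speaker ID."""
    totals: dict[int, float] = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    # Round to the 100 ms timestamp resolution to hide float noise.
    return {spk: round(t, 1) for spk, t in sorted(totals.items())}


example = [
    {"speaker": 0, "start": 0.0, "end": 1.5, "text": "Welcome back."},
    {"speaker": 1, "start": 1.5, "end": 2.4, "text": "Thanks for having me."},
    {"speaker": 0, "start": 2.4, "end": 3.8, "text": "Let's get into it."},
]
print(talk_time(example))  # {0: 2.9, 1: 0.9}
```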

Long-form audio (> 30 s)

Audio longer than 30 s exceeds the encoder's maximum window. Two helpers in this repo do the windowing + cross-chunk speaker matching for you:

  • diarize_long_vllm.py — recommended. Calls a local vLLM server concurrently (continuous batching) and reuses one GPU for both decode and embedding. ~44× RTF on a 10-min clip on a single 3090.
  • diarize_long.py — transformers-only fallback, no server needed. Slower (~7× RTF on the same clip) but minimal deps.

Both helpers:

  1. Slide 28 s windows with 2 s overlap over the full audio
  2. Decode each window with this model
  3. Embed each parsed segment with ReDimNet2 B6 (12 M params, 0.17 % EER, loaded automatically via torch.hub)
  4. Cluster embeddings globally with cosine-distance AHC so the same speaker keeps the same ID across windows
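
Step 1, and the timestamp bookkeeping it implies, can be sketched as follows. This is an illustration under stated assumptions, not the helpers' actual code: `window_bounds` and `to_global` are hypothetical names, the hop is taken to be window length minus overlap (26 s), and the last window is simply truncated at the end of the audio.

```python
def window_bounds(duration_s: float, chunk_s: float = 28.0,
                  overlap_s: float = 2.0) -> list[tuple[float, float]]:
    """Return (start, end) times of sliding windows covering the audio."""
    hop = chunk_s - overlap_s
    if hop <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        start += hop
    return bounds


def to_global(segment: dict, window_start: float) -> dict:
    """Shift window-local timestamps onto the full-audio timeline."""
    return {**segment,
            "start": round(segment["start"] + window_start, 1),
            "end": round(segment["end"] + window_start, 1)}


print(window_bounds(120.0))
```

Under these assumptions a 120 s clip yields five windows, starting at 0, 26, 52, 78 and 104 s. Segments decoded from the window that starts at 26 s would be shifted by 26 s before clustering.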

bash

# Assumes vLLM is already serving (see next section)
python diarize_long_vllm.py podcast.wav \
    --vllm http://127.0.0.1:8000 \
    --model syvai/cohere-transcribe-diarize \
    --language en \
    --tau 0.45 \
    --concurrency 32 \
    --embed-batch 32

Or via the offline transformers helper (slower, no server):

python

from diarize_long import diarize_long_audio
segments = diarize_long_audio(
    audio="podcast.wav",
    diar_model_id="syvai/cohere-transcribe-diarize",
    language="en",
    chunk_s=28.0,
    overlap_s=2.0,
    cluster_threshold=0.45,
)

Additional dependencies for long-form inference: numpy, scipy, soundfile, torchaudio (required by ReDimNet2's feature extractor), plus aiohttp if using diarize_long_vllm.py.

Tuning the clustering threshold. cluster_threshold is the cosine-distance ceiling for AHC merges over ReDimNet2 embeddings. Around 0.45 is a good default for podcast / panel-style audio: a 2-min Bernie Sanders town-hall clip cleanly resolves Bernie as one consistent ID across all 5 sliding windows and the host as a second ID, while short audience interjections get their own IDs. Drop to 0.30–0.35 if the audio has many similar-sounding speakers; raise to 0.50–0.55 for noisier conditions where you'd rather collapse near-duplicate IDs.
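
To make the role of the threshold concrete, here is a toy agglomerative clustering in pure Python. It is a sketch, not the helpers' implementation: the README only says "cosine-distance AHC", so average linkage is an assumption, `ahc_labels` is a hypothetical name, and the 2-D vectors stand in for real ReDimNet2 embeddings. Production code would more likely use a library such as SciPy.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def ahc_labels(embeddings: list[list[float]], threshold: float = 0.45) -> list[int]:
    """Average-linkage AHC: merge the closest pair until none is within threshold."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1: list[int], c2: list[int]) -> float:
        dists = [cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        dist, i, j = min(pairs)
        if dist > threshold:
            break  # every remaining pair is farther apart than the ceiling
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(embeddings)
    # Number clusters by their earliest member so IDs follow order of appearance.
    for label, members in enumerate(sorted(clusters, key=min)):
        for idx in members:
            labels[idx] = label
    return labels


# Toy "embeddings": three vectors near the x-axis, two near the y-axis.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
print(ahc_labels(toy, threshold=0.45))  # [0, 0, 1, 1, 0]
```

A lower threshold stops merging earlier and yields more speaker IDs; a higher one collapses more segments into the same ID, which matches the tuning advice above.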

Serving with vLLM (recommended)

The transformers code path above works but is single-stream. For production we run this model on vLLM 0.19.0 (note: 0.19.1 is broken) — it gives continuous batching, a custom OpenAI-compatible diarized_json response format, and ~25× higher peak throughput than calling model.generate() in a loop.

One-time setup

Two scripts ship with this repo to handle the setup — both idempotent:

bash

# Download the model locally first, then patch it
hf download syvai/cohere-transcribe-diarize --local-dir cohere-transcribe-diarize
# 1. Reshape the checkpoint files for vLLM compatibility
python fix_for_vllm.py ./cohere-transcribe-diarize

fix_for_vllm.py makes three edits to your local copy:

  • tokenizer_config.json: drops the legacy extra_special_tokens list (transformers 4.57+ expects a dict; the actual tokens are still in tokenizer.json).
  • config.json: sets head.num_classes and transf_decoder.config_dict.vocab_size to 16684 (the resized vocab).
  • model.safetensors: strips the model. weight-name prefix and drops the BatchNorm num_batches_tracked tensors vLLM's CohereAsr model doesn't register.
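
The third edit amounts to a pass over tensor names. The sketch below shows only that name logic on a plain dict. It is not the contents of fix_for_vllm.py: `remap_state_dict` is a hypothetical name, the example keys are made up, and real code would load and save the tensors with the safetensors library.

```python
def remap_state_dict(state: dict) -> dict:
    """Strip a leading "model." prefix and drop BatchNorm step counters."""
    remapped = {}
    for name, tensor in state.items():
        if name.endswith("num_batches_tracked"):
            continue  # bookkeeping buffer that is dropped, per the list above
        if name.startswith("model."):
            name = name[len("model."):]
        remapped[name] = tensor
    return remapped


# Made-up tensor names; the string values stand in for real tensors.
fake = {
    "model.encoder.conv.weight": "W",
    "model.encoder.bn.num_batches_tracked": "N",
    "head.bias": "B",
}
print(sorted(remap_state_dict(fake)))  # ['encoder.conv.weight', 'head.bias']
```

Running the function on its own output changes nothing for names like these, which is one way such a script can stay idempotent.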

bash

# 2. Install vLLM 0.19.0 (NOT 0.19.1 — broken)
uv pip install "vllm==0.19.0" --torch-backend=cu128
uv pip install librosa
# 3. Patch vLLM's speech_to_text endpoint to add diarized_json
python vllm_diarized_patch.py

vllm_diarized_patch.py applies five edits inside the installed vLLM (also idempotent):

  1. protocol.py — add "diarized_json" to the AudioResponseFormat enum
  2. protocol.py — force skip_special_tokens=False in to_sampling_params so <|spltoken*|> and <|t:*|> survive into the response text
  3. speech_to_text.py — let the validator accept response_format="diarized_json"
  4. speech_to_text.py — parse the raw token stream with the segment regex and return OpenAI-compatible {task, language, duration, text, segments:[{speaker, start, end, text}], speakers, usage} JSON
  5. api_router.py — pass JSONResponse returns through unchanged (otherwise the diarized branch's return value gets misinterpreted as a streaming generator and the response body comes out empty)
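
Edit 4 can be pictured with the following standalone sketch, which turns raw decoder text into a dict with the listed keys. It is not the patch itself: `to_diarized_json` is a hypothetical name, and how the real patch fills `text`, `speakers` and `usage` is an assumption here (joined segment text, sorted unique labels, rounded duration).

```python
import re

SEG_RE = re.compile(
    r"<\|spltoken(\d+)\|><\|t:(\d+\.\d+)\|>(.*?)<\|t:(\d+\.\d+)\|>", re.DOTALL
)


def to_diarized_json(raw: str, duration: float, language: str = "en") -> dict:
    """Build a diarized_json-shaped dict from raw decoder text."""
    segments = []
    for m in SEG_RE.finditer(raw):
        segments.append({
            "speaker": f"SPEAKER_{int(m.group(1)):02d}",
            "start": float(m.group(2)),
            "end": float(m.group(4)),
            # Remove any control tokens left inside the segment text.
            "text": re.sub(r"<\|[^|]+\|>", "", m.group(3)).strip(),
        })
    return {
        "task": "transcribe",
        "language": language,
        "duration": duration,
        "text": " ".join(s["text"] for s in segments),
        "segments": segments,
        "speakers": sorted({s["speaker"] for s in segments}),
        "usage": {"type": "duration", "seconds": round(duration)},
    }


raw = ("<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|>"
       "<|spltoken1|><|t:1.5|> Thanks. <|t:2.4|><|endoftext|>")
print(to_diarized_json(raw, duration=2.4)["speakers"])  # ['SPEAKER_00', 'SPEAKER_01']
```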

Launch the server

bash

vllm serve ./cohere-transcribe-diarize \
    --served-model-name syvai/cohere-transcribe-diarize \
    --trust-remote-code \
    --host 127.0.0.1 --port 8000 \
    --gpu-memory-utilization 0.55 # leaves ~10 GB for ReDimNet2 batching

--gpu-memory-utilization 0.55 is the sweet spot on a 24 GB card when you also run ReDimNet2 on the same GPU for long-form. If you only need short-form decode (≤ 30 s, no cross-chunk linking), bump it to 0.85 for better KV cache headroom.

Call the API

The endpoint follows the OpenAI transcription API. For diarized output, pass response_format=diarized_json together with an explicit prompt:

bash

curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
    -F "file=@clip.wav" \
    -F "model=syvai/cohere-transcribe-diarize" \
    -F "language=en" \
    -F "response_format=diarized_json" \
    --form-string "prompt=<|startofcontext|><|startoftranscript|><|emo:undefined|><|en|><|en|><|pnc|><|noitn|><|timestamp|><|diarize|>"

Response shape (mirrors OpenAI's gpt-4o-transcribe-diarize):

json

{
  "task": "transcribe",
  "language": "en",
  "duration": 28.0,
  "text": "UM I REJECT THE IDEA I REALLY DO ...",
  "segments": [
    {"speaker": "SPEAKER_00", "start": 2.5, "end": 3.8, "text": "I REALLY DO"},
    {"speaker": "SPEAKER_01", "start": 3.6, "end": 15.0, "text": "IT'S ONE OF THINGS THAT BOTHERS ME ..."},
    {"speaker": "SPEAKER_02", "start": 15.5, "end": 28.0, "text": "IS RAISING A STARVATION MINIMUM WAGE ..."}
  ],
  "speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
  "usage": {"type": "duration", "seconds": 28}
}

The prompt field must be passed explicitly — vLLM's default prompt builder emits <|nodiarize|> which suppresses the speaker tokens.
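
The same request can be issued from Python. This is a minimal sketch, assuming the third-party `requests` package is installed (it is not in the install line above); `form_fields` and `transcribe` are hypothetical names. The fields mirror the curl call, including the explicit prompt.

```python
PROMPT = ("<|startofcontext|><|startoftranscript|><|emo:undefined|>"
          "<|en|><|en|><|pnc|><|noitn|><|timestamp|><|diarize|>")


def form_fields(model: str = "syvai/cohere-transcribe-diarize",
                language: str = "en") -> dict[str, str]:
    """Form fields mirroring the curl call, with the explicit diarize prompt."""
    return {
        "model": model,
        "language": language,
        "response_format": "diarized_json",
        # Swap both language slots together, as the quick start describes.
        "prompt": PROMPT.replace("<|en|>", f"<|{language}|>"),
    }


def transcribe(path: str, base_url: str = "http://127.0.0.1:8000") -> dict:
    """POST one audio file to the local server and return the parsed JSON body."""
    import requests  # third-party; imported lazily so form_fields has no deps

    with open(path, "rb") as fh:
        resp = requests.post(
            f"{base_url}/v1/audio/transcriptions",
            files={"file": fh},
            data=form_fields(),
            timeout=120,
        )
    resp.raise_for_status()
    return resp.json()


# Usage, with the server from the previous section running:
#   for seg in transcribe("clip.wav")["segments"]:
#       print(seg["speaker"], seg["start"], seg["end"], seg["text"])
```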

Measured throughput (RTX 3090, 28 s clips)

| Concurrency | Throughput (audio duration / wall time) |
| --- | --- |
| 1 | 22× |
| 8 | 117× |
| 32 | 171× |
| 128 | 249× (peak) |

vLLM does continuous (in-flight) batching automatically — fire concurrent requests at the endpoint and it batches them through one forward pass.

Training

This model was produced by full fine-tuning of CohereLabs/cohere-transcribe-03-2026 on English diarization data. The base vocabulary was extended with 8 speaker tokens and 300 timestamp tokens at 100 ms resolution; the new rows of the embedding and LM-head matrices were initialised from the existing token embedding statistics.

| Dataset | Rows | Description |
| --- | --- | --- |
| AMI SDM (train split) | 19,928 | Single-distant-microphone meeting recordings, sliding 28 s windows with 14 s hop, up to 4 simultaneous speakers per window. Provides realistic multi-speaker conversation with overlap, hesitations, and turn-taking. |
| LibriSpeech synthetic mix | 11,813 | Synthetic K-speaker mixtures (K weighted 0.2 / 0.3 / 0.3 / 0.2 for K=1…4) constructed from LibriSpeech utterances, with realistic gap silences. Provides clean cross-talk-free speaker examples to anchor the diarization head. |
| Total | 31,741 | All segments are ≤ 30 s and capped at K ≤ 4 speakers. |

Training ran for 2 epochs at peak LR 3e-4 (linear warmup over 100 optimizer steps, then linear decay to 0). Effective batch size 128 (per-device batch 2 × 64 gradient-accumulation), bf16, gradient checkpointing, AdamW8bit optimizer. The full fine-tune updates all 2 B parameters. repetition_penalty=1.2 is baked into the generation config and is required at inference — without it, K=4 outputs occasionally loop on a single speaker token.
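
The schedule described above can be written out as a short calculation. This is a sketch under assumptions, not the training script: the step count assumes the last partial batch of each epoch still counts as one optimizer step, and `lr_at` is a hypothetical name.

```python
import math

ROWS, EPOCHS, EFFECTIVE_BATCH = 31_741, 2, 128
PEAK_LR, WARMUP_STEPS = 3e-4, 100

# Assumption: a trailing partial batch still triggers an optimizer step.
STEPS_PER_EPOCH = math.ceil(ROWS / EFFECTIVE_BATCH)
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


print(TOTAL_STEPS)  # 496
for s in (0, 50, 100, 298, TOTAL_STEPS):
    print(s, f"{lr_at(s):.2e}")
```

Under these assumptions the run is roughly 496 optimizer steps, so the 100-step warmup covers about the first fifth of training.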

Limitations

  • 30 s hard cap per decoder pass — use diarize_long for longer audio. The Cohere feature extractor batches longer clips into multiple chunks, which the diarization decoder is not trained to consume.
  • K ≤ 4 is well supported; K = 5–8 still emits speaker tokens, but accuracy degrades on dense overlapping speech.
  • Real-time factor ≈ 14× on RTX 3090 at bf16 — the 2 B autoregressive decoder is the bottleneck. For >100× RTF on long audio, pair with a smaller segmenter (e.g. DiariZen-base) or use this model only on the highlight regions.
  • Speaker IDs are local to each generate call. Always cluster embeddings across windows when working with audio that crosses the 30 s boundary.

Citation

If you use this model, please cite Cohere Labs' base release alongside this fine-tune:

bibtex

@misc{cohere-transcribe-diarize-2026,
author = {{syv.ai}},
title = {Cohere Transcribe — Diarize + Timestamps (English)},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/syvai/cohere-transcribe-diarize}},
}

License

Apache 2.0, inherited from the base model.
