cohere-transcribe-diarize API & Inference Endpoint

Quick start

bash
pip install transformers==4.57.6 torch huggingface_hub soundfile librosa sentencepiece protobuf

python
import re
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from transformers.audio_utils import load_audio

MODEL_ID = "syvai/cohere-transcribe-diarize"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_ID, dtype=torch.bfloat16
).to("cuda").eval()

# Prompt that activates diarization + timestamps. The base Cohere model
# uses special control tokens to switch features on/off; we keep that contract.
# `<|en|><|en|>` is the canonical Cohere prompt — the two slots are
# audio-language + transcript-language; setting them to the same code means
# "transcribe" (different codes would be "translate"). To run on another
# Cohere language, swap BOTH tokens, e.g. `<|de|><|de|>`.
# Each `<|...|>` is a single special token in the tokenizer vocab. Resolve
# via convert_tokens_to_ids — running the prompt string through the tokenizer
# re-tokenizes each marker into 6-12 subword pieces, which weakens the
# control-token signal the model trained on.
PROMPT_TOKENS = [
    "<|startofcontext|>", "<|startoftranscript|>",
    "<|emo:undefined|>", "<|en|>", "<|en|>",
    "<|pnc|>", "<|noitn|>", "<|timestamp|>", "<|diarize|>",
]
prompt_ids = torch.tensor(
    [[processor.tokenizer.convert_tokens_to_ids(t) for t in PROMPT_TOKENS]]
).to(model.device)

# Load any ≤ 30 s audio clip.
audio = load_audio("clip.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = {k: v.to(model.device, dtype=model.dtype if v.is_floating_point() else None)
          for k, v in inputs.items()}

with torch.inference_mode():
    out = model.generate(
        input_features=inputs["input_features"],
        attention_mask=torch.ones(inputs["input_features"].shape[:2], device=model.device),
        decoder_input_ids=prompt_ids,
        max_new_tokens=400,
        do_sample=False,
        repetition_penalty=1.2,  # baked into generation_config but explicit here
    )

raw = processor.tokenizer.decode(out[0], skip_special_tokens=False)
print(raw)
# → <|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>...

Parsing the output into structured segments

python
SEG_RE = re.compile(r"<\|spltoken(\d+)\|><\|t:(\d+\.\d+)\|>(.*?)<\|t:(\d+\.\d+)\|>", re.DOTALL)

# Drop the prompt prefix; the diarized text follows <|diarize|>
text = raw.split("<|diarize|>", 1)[-1].replace("<|endoftext|>", "")

segments = [
    {
        "speaker": int(m.group(1)),
        "start":   float(m.group(2)),
        "end":     float(m.group(4)),
        "text":    re.sub(r"<\|[^|]+\|>", "", m.group(3)).strip(),
    }
    for m in SEG_RE.finditer(text)
]
for s in segments:
    print(f"[{s['start']:6.2f}–{s['end']:6.2f}] SPK{s['speaker']:02d}  {s['text']}")

Output:

text
[  0.00–  1.50] SPK00  Welcome back.
[  1.50–  2.40] SPK01  Thanks for having me.
[  2.40–  3.80] SPK00  Let's get into it.

The model uses 8 reusable speaker slots per clip (<|spltoken0|>…<|spltoken7|>). IDs are local to the clip — there is no global identity across separately decoded clips. For long-form audio that's split into windows, re-link windows with the helper below.

Long-form audio (> 30 s)

Audio longer than 30 s exceeds the encoder's maximum window. Two helpers in this repo do the windowing + cross-chunk speaker matching for you:

diarize_long_vllm.py — recommended. Calls a local vLLM server concurrently (continuous batching) and reuses one GPU for both decode and embedding. ~44× RTF on a 10-min clip on a single 3090.
diarize_long.py — transformers-only fallback, no server needed. Slower (~7× RTF on the same clip) but minimal deps.

Both helpers:

Slide 28 s windows with 2 s overlap over the full audio
Decode each window with this model
Embed each parsed segment with ReDimNet2 B6 (12 M params, 0.17 % EER, loaded automatically via torch.hub)
Cluster embeddings globally with cosine-distance AHC so the same speaker keeps the same ID across windows
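
The windowing step can be sketched as follows. This is a minimal illustration rather than the code in diarize_long.py; the function name `sliding_windows` is hypothetical, and the defaults simply mirror the 28 s window and 2 s overlap described above.

```python
def sliding_windows(duration_s, chunk_s=28.0, overlap_s=2.0):
    """Return (start, end) pairs in seconds that cover the whole clip.

    Hypothetical helper: consecutive windows advance by chunk_s - overlap_s,
    and the last window is truncated at the end of the audio.
    """
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be non-negative and smaller than chunk_s")
    if duration_s <= 0:
        return []
    hop = chunk_s - overlap_s
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += hop
    return windows


if __name__ == "__main__":
    for start, end in sliding_windows(120.0):
        print(f"{start:6.1f} s to {end:6.1f} s")
```

With these defaults the hop is 26 s, so a 2-minute clip is covered by five windows.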

bash
# Assumes vLLM is already serving (see next section)
python diarize_long_vllm.py podcast.wav \
    --vllm http://127.0.0.1:8000 \
    --model syvai/cohere-transcribe-diarize \
    --language en \
    --tau 0.45 \
    --concurrency 32 \
    --embed-batch 32

Or via the offline transformers helper (slower, no server):

python
from diarize_long import diarize_long_audio

segments = diarize_long_audio(
    audio="podcast.wav",
    diar_model_id="syvai/cohere-transcribe-diarize",
    language="en",
    chunk_s=28.0,
    overlap_s=2.0,
    cluster_threshold=0.45,
)

Additional dependencies for long-form inference: numpy, scipy, soundfile, torchaudio (required by ReDimNet2's feature extractor), plus aiohttp if using diarize_long_vllm.py.

Tuning the clustering threshold. cluster_threshold is the cosine-distance ceiling for AHC merges over ReDimNet2 embeddings. Around 0.45 is a good default for podcast / panel-style audio: a 2-min Bernie Sanders town-hall clip cleanly resolves Bernie as one consistent ID across all 5 sliding windows and the host as a second ID, while short audience interjections get their own IDs. Drop to 0.30–0.35 if the audio has many similar-sounding speakers; raise to 0.50–0.55 for noisier conditions where you'd rather collapse near-duplicate IDs.
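
To make the role of the threshold concrete, here is a dependency-free sketch of agglomerative clustering over cosine distances. It is not the helper's implementation: average linkage is an assumption (the linkage used by the helpers is not stated here), `cluster_speakers` is a hypothetical name, and the toy 2-D vectors stand in for real ReDimNet2 embeddings. In practice `scipy.cluster.hierarchy.linkage` and `fcluster` do the same job far more efficiently.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_speakers(embeddings, threshold=0.45):
    """Average-linkage AHC (assumed linkage): merge the closest pair of
    clusters until the smallest average distance exceeds `threshold`.
    Returns one global speaker label per embedding."""
    clusters = [[i] for i in range(len(embeddings))]

    def avg_dist(c1, c2):
        total = sum(cosine_distance(embeddings[i], embeddings[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = avg_dist(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # no remaining pair is close enough to merge
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # Number clusters by the first segment that belongs to each one.
    labels = [0] * len(embeddings)
    for cluster_id, members in enumerate(sorted(clusters, key=min)):
        for i in members:
            labels[i] = cluster_id
    return labels


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(cluster_speakers(toy, threshold=0.45))
```

A lower threshold stops merging earlier and therefore yields more speaker IDs; a higher one keeps merging and collapses near-duplicates, which is the trade-off described above.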

Serving with vLLM (recommended)

The transformers code path above works but is single-stream. For production we run this model on vLLM 0.19.0 (note: 0.19.1 is broken) — it gives continuous batching, a custom OpenAI-compatible diarized_json response format, and ~25× higher peak throughput than calling model.generate() in a loop.

One-time setup

Two scripts ship with this repo to handle the setup — both idempotent:

bash
# Download the model locally first, then patch it
hf download syvai/cohere-transcribe-diarize --local-dir cohere-transcribe-diarize

# 1. Reshape the checkpoint files for vLLM compatibility
python fix_for_vllm.py ./cohere-transcribe-diarize

fix_for_vllm.py makes three edits to your local copy:

tokenizer_config.json: drops the legacy extra_special_tokens list (transformers 4.57+ expects a dict; the actual tokens are still in tokenizer.json).
config.json: sets head.num_classes and transf_decoder.config_dict.vocab_size to 16684 (the resized vocab).
model.safetensors: strips the model. weight-name prefix and drops the BatchNorm num_batches_tracked tensors vLLM's CohereAsr model doesn't register.

bash
# 2. Install vLLM 0.19.0 (NOT 0.19.1 — broken)
uv pip install "vllm==0.19.0" --torch-backend=cu128
uv pip install librosa

# 3. Patch vLLM's speech_to_text endpoint to add diarized_json
python vllm_diarized_patch.py

vllm_diarized_patch.py applies five edits inside the installed vLLM (also idempotent):

protocol.py — add "diarized_json" to the AudioResponseFormat enum
protocol.py — force skip_special_tokens=False in to_sampling_params so <|spltoken*|> and <|t:*|> survive into the response text
speech_to_text.py — let the validator accept response_format="diarized_json"
speech_to_text.py — parse the raw token stream with the segment regex and return OpenAI-compatible {task, language, duration, text, segments:[{speaker, start, end, text}], speakers, usage} JSON
— pass returns through unchanged (otherwise the diarized branch's return value gets misinterpreted as a streaming generator and the response body comes out empty)
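
For orientation, the parsing edit can be pictured roughly like this. It is a sketch, not the code that vllm_diarized_patch.py installs: the function name `to_diarized_json` is hypothetical, the segment regex is the one from the parsing section, and two details are assumptions made for illustration (the top-level `text` is built by joining segment texts, and `usage.seconds` is the duration rounded up).

```python
import math
import re

SEG_RE = re.compile(
    r"<\|spltoken(\d+)\|><\|t:(\d+\.\d+)\|>(.*?)<\|t:(\d+\.\d+)\|>", re.DOTALL
)
TOKEN_RE = re.compile(r"<\|[^|]+\|>")


def to_diarized_json(raw, duration, language="en"):
    """Turn a raw decoded token stream into a diarized_json-shaped dict."""
    segments = []
    for m in SEG_RE.finditer(raw):
        segments.append({
            "speaker": f"SPEAKER_{int(m.group(1)):02d}",
            "start": float(m.group(2)),
            "end": float(m.group(4)),
            # Strip any leftover control tokens from the segment body.
            "text": TOKEN_RE.sub("", m.group(3)).strip(),
        })
    return {
        "task": "transcribe",
        "language": language,
        "duration": duration,
        "text": " ".join(s["text"] for s in segments),  # assumption
        "segments": segments,
        "speakers": sorted({s["speaker"] for s in segments}),
        "usage": {"type": "duration", "seconds": math.ceil(duration)},  # assumption
    }


if __name__ == "__main__":
    raw = ("<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|>"
           "<|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>")
    print(to_diarized_json(raw, duration=2.4))
```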

Launch the server

bash
vllm serve ./cohere-transcribe-diarize \
    --served-model-name syvai/cohere-transcribe-diarize \
    --trust-remote-code \
    --host 127.0.0.1 --port 8000 \
    --gpu-memory-utilization 0.55     # leaves ~10 GB for ReDimNet2 batching

--gpu-memory-utilization 0.55 is the sweet spot on a 24 GB card when you also run ReDimNet2 on the same GPU for long-form. If you only need short-form decode (≤ 30 s, no cross-chunk linking), bump it to 0.85 for better KV cache headroom.

Call the API

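The endpoint follows the OpenAI transcription API; request diarized output by setting response_format=diarized_json and passing the diarization prompt: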

bash
curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
    -F "file=@clip.wav" \
    -F "model=syvai/cohere-transcribe-diarize" \
    -F "language=en" \
    -F "response_format=diarized_json" \
    --form-string "prompt=<|startofcontext|><|startoftranscript|><|emo:undefined|><|en|><|en|><|pnc|><|noitn|><|timestamp|><|diarize|>"
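
The same request can be made from Python. The sketch below assumes the third-party `requests` package is installed and that the server from the previous section is listening on 127.0.0.1:8000; the helper names `build_prompt`, `build_form`, and `transcribe` are hypothetical. The form fields mirror the curl call one to one.

```python
def build_prompt(language="en"):
    """Control-token prompt; both language slots carry the same code."""
    return "".join([
        "<|startofcontext|>", "<|startoftranscript|>", "<|emo:undefined|>",
        f"<|{language}|>", f"<|{language}|>",
        "<|pnc|>", "<|noitn|>", "<|timestamp|>", "<|diarize|>",
    ])


def build_form(language="en", model="syvai/cohere-transcribe-diarize"):
    """Non-file multipart fields, identical to the -F arguments in the curl call."""
    return {
        "model": model,
        "language": language,
        "response_format": "diarized_json",
        "prompt": build_prompt(language),
    }


def transcribe(path, base_url="http://127.0.0.1:8000", language="en"):
    """POST one clip (30 s or shorter) and return the parsed JSON response."""
    import requests  # third-party dependency, imported lazily

    with open(path, "rb") as f:
        resp = requests.post(
            f"{base_url}/v1/audio/transcriptions",
            files={"file": f},
            data=build_form(language),
            timeout=300,
        )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    result = transcribe("clip.wav")
    for seg in result["segments"]:
        print(seg["speaker"], seg["start"], seg["end"], seg["text"])
```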

Response shape (mirrors OpenAI's gpt-4o-transcribe-diarize):

json
{
  "task": "transcribe",
  "language": "en",
  "duration": 28.0,
  "text": "UM I REJECT THE IDEA I REALLY DO ...",
  "segments": [
    {"speaker": "SPEAKER_00", "start": 2.5,  "end": 3.8,  "text": "I REALLY DO"},
    {"speaker": "SPEAKER_01", "start": 3.6,  "end": 15.0, "text": "IT'S ONE OF THINGS THAT BOTHERS ME ..."},
    {"speaker": "SPEAKER_02", "start": 15.5, "end": 28.0, "text": "IS RAISING A STARVATION MINIMUM WAGE ..."}
  ],
  "speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
  "usage": {"type": "duration", "seconds": 28}
}

The prompt field must be passed explicitly: vLLM's default prompt builder emits <|nodiarize|>, which suppresses the speaker tokens.
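
Once a response is in hand, the `segments` list is easy to post-process. As a small illustration (the function name `talk_time` is hypothetical), the following sums speaking time per speaker, using the example response above as input.

```python
from collections import defaultdict


def talk_time(response):
    """Total seconds attributed to each speaker in a diarized_json response."""
    totals = defaultdict(float)
    for seg in response["segments"]:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    # Round to milliseconds to hide floating-point noise.
    return {speaker: round(seconds, 3) for speaker, seconds in totals.items()}


if __name__ == "__main__":
    example = {
        "segments": [
            {"speaker": "SPEAKER_00", "start": 2.5, "end": 3.8, "text": "I REALLY DO"},
            {"speaker": "SPEAKER_01", "start": 3.6, "end": 15.0, "text": "..."},
            {"speaker": "SPEAKER_02", "start": 15.5, "end": 28.0, "text": "..."},
        ]
    }
    print(talk_time(example))
```

Segments may overlap in time (here SPEAKER_01 starts at 3.6 s while SPEAKER_00 ends at 3.8 s), so the per-speaker totals can add up to more than the clip duration.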

Measured throughput (RTX 3090, 28 s clips)

Concurrency	Throughput (× real time: seconds of audio per wall-clock second)
1	22×
8	117×
32	171×
128	249× (peak)

vLLM does continuous (in-flight) batching automatically — fire concurrent requests at the endpoint and it batches them through one forward pass.
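
To translate these multipliers into wall-clock time, divide the audio duration by the throughput factor. The snippet below is plain arithmetic on the measured figures above; the names `THROUGHPUT` and `wall_clock_seconds` are made up for the example.

```python
# Measured aggregate throughput (x real time) per concurrency level,
# copied from the table above (RTX 3090, 28 s clips).
THROUGHPUT = {1: 22, 8: 117, 32: 171, 128: 249}


def wall_clock_seconds(audio_seconds, concurrency):
    """Estimated time to process a backlog of audio at a given concurrency."""
    return audio_seconds / THROUGHPUT[concurrency]


if __name__ == "__main__":
    one_hour = 3600.0
    for concurrency in sorted(THROUGHPUT):
        seconds = wall_clock_seconds(one_hour, concurrency)
        print(f"concurrency {concurrency:>3}: {seconds:6.1f} s per hour of audio")
```

These are aggregate figures: at concurrency 128 the server works through about 249 s of audio per second in total, which does not mean a single 28 s request returns 249× faster than real time.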

Training

This model was produced by full fine-tuning of CohereLabs/cohere-transcribe-03-2026 on English diarization data. The base vocabulary was extended with 8 speaker tokens and 300 timestamp tokens at 100 ms resolution (enough to cover one 30 s window); the new rows of the embedding and LM-head matrices were initialised from the existing token embedding statistics.
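
As a quick sanity check on the token counts, the added vocabulary can be enumerated as below. The token spellings follow the `<|spltokenN|>` and `<|t:S.S|>` forms seen in the decoded output; the assumption that the timestamp grid runs from 0.0 to 29.9 s is mine and is not confirmed by the model card, and `new_special_tokens` is a hypothetical name.

```python
def new_special_tokens(num_speakers=8, num_timestamps=300, step_ms=100):
    """List the speaker and timestamp tokens added on top of the base vocab.

    Assumption: timestamps start at 0.0 s and advance in 100 ms steps.
    """
    speakers = [f"<|spltoken{i}|>" for i in range(num_speakers)]
    timestamps = [f"<|t:{i * step_ms / 1000:.1f}|>" for i in range(num_timestamps)]
    return speakers + timestamps


if __name__ == "__main__":
    tokens = new_special_tokens()
    print(len(tokens), tokens[:3], tokens[-3:])
```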

Dataset	Rows	Description
AMI SDM (train split)	19,928	Single-distant-microphone meeting recordings, sliding 28 s windows with 14 s hop, up to 4 simultaneous speakers per window. Provides realistic multi-speaker conversation with overlap, hesitations, and turn-taking.
LibriSpeech synthetic mix	11,813	Synthetic K-speaker mixtures (K weighted 0.2 / 0.3 / 0.3 / 0.2 for K=1…4) constructed from LibriSpeech utterances, with realistic gap silences. Provides clean cross-talk-free speaker examples to anchor the diarization head.
Total	31,741	All segments are ≤ 30 s and capped at K ≤ 4 speakers.

Training ran for 2 epochs at peak LR 3e-4 (linear warmup over 100 optimizer steps, then linear decay to 0). Effective batch size 128 (per-device batch 2 × 64 gradient-accumulation), bf16, gradient checkpointing, AdamW8bit optimizer. The full fine-tune updates all 2 B parameters. repetition_penalty=1.2 is baked into the generation config and is required at inference — without it, K=4 outputs occasionally loop on a single speaker token.
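
The learning-rate schedule described above can be written out as a short sketch. The step count is derived from the dataset size and effective batch size given here, assuming the final partial batch of each epoch is kept; the function names are hypothetical and this is not the training script.

```python
import math


def total_optimizer_steps(num_examples=31_741, effective_batch=128, epochs=2):
    """Optimizer steps for the run, assuming the last partial batch is kept."""
    return math.ceil(num_examples / effective_batch) * epochs


def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=100):
    """Linear warmup to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    total = total_optimizer_steps()
    for step in (0, 50, 100, total // 2, total):
        print(f"step {step:>3}: lr = {lr_at(step, total):.2e}")
```

Under these assumptions the run is 496 optimizer steps long, so the 100-step warmup covers roughly the first fifth of training.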

Limitations

30 s hard cap per decoder pass — use diarize_long for longer audio. The Cohere feature extractor batches longer clips into multiple chunks, which the diarization decoder is not trained to consume.
K ≤ 4 is well supported; K = 5–8 still emits speaker tokens, but accuracy degrades on dense overlapping speech.
Real-time factor ≈ 14× on RTX 3090 at bf16 — the 2 B autoregressive decoder is the bottleneck. For >100× RTF on long audio, pair with a smaller segmenter (e.g. DiariZen-base) or use this model only on the highlight regions.
Speaker IDs are local to each generate call. Always cluster embeddings across windows when working with audio that crosses the 30 s boundary.

Citation

If you use this model, please cite Cohere Labs' base release alongside this fine-tune:

bibtex
@misc{cohere-transcribe-diarize-2026,
  author       = {{syv.ai}},
  title        = {Cohere Transcribe — Diarize + Timestamps (English)},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/syvai/cohere-transcribe-diarize}},
}

License

Apache 2.0, inherited from the base model.

