License: apache-2.0

Usage

Neo_EP is supported in Hugging Face Transformers. To run the model, first install the Transformers library. For this example, we'll also install Datasets to load audio data from the Hugging Face Hub:

bash

pip install --upgrade pip
pip install --upgrade transformers "datasets[audio]" accelerate

The model can be used with the pipeline class to transcribe audio of arbitrary length:

python

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available, otherwise CPU with float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Neoscopio-SA/Neo_EP"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# The processor bundles the tokenizer and the feature extractor.
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Path to a local audio file.
result = pipe("audio.mp3")
print(result["text"])

Multiple audio files can be transcribed in parallel by specifying them as a list and setting the batch_size parameter:

python

result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
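When called with a list, the pipeline returns one result dictionary per input, in input order. A minimal sketch of pairing each file with its transcript follows. The helper name transcripts_by_file is hypothetical, and the results literal is a stand-in for what the pipe call above would return, so the snippet runs without loading the model:

```python
def transcripts_by_file(paths, results):
    """Map each input path to the transcript at the same position."""
    if len(paths) != len(results):
        raise ValueError("expected one result per input file")
    return {path: result["text"].strip() for path, result in zip(paths, results)}


paths = ["audio_1.mp3", "audio_2.mp3"]
# Hypothetical stand-in for: results = pipe(paths, batch_size=2)
results = [{"text": " bom dia"}, {"text": " boa tarde"}]

for path, text in transcripts_by_file(paths, results).items():
    print(f"{path}: {text}")
```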

To transcribe with timestamps, pass the return_timestamps argument:

python

result = pipe("audio.mp3", return_timestamps=True)
print(result["chunks"])
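Each entry in result["chunks"] is a dictionary with a "text" string and a "timestamp" pair of start and end times in seconds. As one possible use, the sketch below renders such chunks as SRT subtitles. The helper names and the sample chunks are illustrative assumptions, and the fallback for a missing end time is a simplification, since the last segment can come back with an end of None:

```python
def format_srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Render pipeline chunks as SRT subtitle text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # No end time available: fall back to the start time.
            end = start
        entries.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hypothetical stand-in for result["chunks"]
sample_chunks = [
    {"timestamp": (0.0, 2.5), "text": " bom dia a todos"},
    {"timestamp": (2.5, None), "text": " vamos começar"},
]
print(chunks_to_srt(sample_chunks))
```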

And for word-level timestamps:

python

result = pipe("audio.mp3", return_timestamps="word")
print(result["chunks"])
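With return_timestamps="word", every chunk holds a single word and its time span. A small sketch of regrouping those words into phrases wherever the speaker pauses is shown below. The function name, the 0.6 second threshold, and the sample data are assumptions chosen for illustration:

```python
def group_words_by_pause(word_chunks, max_gap_s=0.6):
    """Group word-level chunks into phrases, splitting at long pauses."""
    phrases = []
    current = []
    previous_end = None
    for chunk in word_chunks:
        start, end = chunk["timestamp"]
        gap_too_long = (
            previous_end is not None
            and start is not None
            and start - previous_end > max_gap_s
        )
        if current and gap_too_long:
            phrases.append(" ".join(current))
            current = []
        current.append(chunk["text"].strip())
        if end is not None:
            previous_end = end
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hypothetical stand-in for result["chunks"] with word-level timestamps
words = [
    {"text": " bom", "timestamp": (0.0, 0.3)},
    {"text": " dia", "timestamp": (0.3, 0.6)},
    {"text": " vamos", "timestamp": (1.8, 2.2)},
    {"text": " começar", "timestamp": (2.2, 2.9)},
]
print(group_words_by_pause(words))
```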

Neo_EP is pre-configured for European Portuguese, so no language argument is required. To set the language and task explicitly, pass them through generate_kwargs:

python

result = pipe("audio.mp3", generate_kwargs={"language": "portuguese", "task": "transcribe"})

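The model can also be run without the pipeline by calling the processor and model.generate directly, which gives full control over generation parameters such as beam search:

python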

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Neoscopio-SA/Neo_EP"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# audio_array must already hold the audio as a 1-D float NumPy array
# of mono samples at 16 kHz; it is not defined in this snippet.
inputs = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 5,
    "no_repeat_ngram_size": 3,
    "return_timestamps": True,
    "language": "pt",
    "task": "transcribe",
}

pred_ids = model.generate(**inputs, **gen_kwargs)

# Drop special tokens and timestamp markers from the decoded text.
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)
print(pred_text)

Additional Speed & Memory Improvements

You can apply additional optimizations to Neo_EP to further increase inference speed and reduce VRAM requirements.

Chunked Long-Form

Neo_EP has a receptive field of 30 seconds. To transcribe audio longer than this, pass the chunk_length_s parameter to the pipeline. For Neo_EP, a chunk length of 30 seconds is optimal. To batch the resulting chunks of a long audio file, also pass the batch_size argument:

python

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)
result = pipe("long_audio.mp3")
print(result["text"])
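The chunking above can be pictured as a series of overlapping windows over the audio. The sketch below is a rough approximation, assuming the pipeline's documented default stride of one sixth of the chunk length on each side (stride_length_s). The function name chunk_windows is hypothetical, and the real pipeline works on samples rather than seconds and may treat the final window differently:

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_s=None):
    """Return approximate (start, end) windows, in seconds, covering the audio."""
    if stride_s is None:
        # Assumed default: one sixth of the chunk length on each side.
        stride_s = chunk_length_s / 6
    step = chunk_length_s - 2 * stride_s
    if step <= 0:
        raise ValueError("stride is too large for the chunk length")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows


# A 70-second file with 30-second chunks
for window in chunk_windows(70.0):
    print(window)
```

Under these assumptions consecutive windows overlap by 10 seconds, which is what lets the pipeline stitch transcripts together at chunk boundaries.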

Flash Attention 2

We recommend using Flash Attention 2 if your GPU supports it. To do so, first install the flash-attn package:

bash

pip install flash-attn --no-build-isolation

Then pass attn_implementation="flash_attention_2" to from_pretrained:

python

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2"
)

Torch Scaled Dot-Product Attention (SDPA)

If your GPU does not support Flash Attention, we recommend making use of PyTorch scaled dot-product attention (SDPA). SDPA is activated by default for PyTorch versions 2.1.1 or greater. It can also be set explicitly:

python

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa"
)
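The choice between the two attention implementations can be made at run time. A minimal sketch follows, assuming Flash Attention 2 is only worth requesting when a CUDA device is present, the model is loaded in half precision, and the flash_attn package is importable. The helper name is hypothetical, and the check does not verify that the GPU architecture itself is supported:

```python
import importlib.util


def pick_attn_implementation(has_cuda, half_precision):
    """Choose a value for the attn_implementation argument of from_pretrained."""
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if has_cuda and half_precision and flash_installed:
        return "flash_attention_2"
    # SDPA is the fallback recommended above.
    return "sdpa"


# With torch available, the arguments could be derived as:
#   has_cuda = torch.cuda.is_available()
#   half_precision = torch_dtype in (torch.float16, torch.bfloat16)
print(pick_attn_implementation(has_cuda=False, half_precision=False))
```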

Model Details

Neo_EP is a Transformer-based encoder-decoder model with the same architecture as openai/whisper-large-v3:

| Property | Value |
|---|---|
| Parameters | 1550M |
| Encoder Layers | 32 |
| Decoder Layers | 32 |
| Attention Heads | 20 |
| Hidden Size | 1280 |
| Mel Frequency Bins | 128 |
| Max Sequence Length | 448 tokens |
| Receptive Field | 30 seconds |
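The figures in the table allow a quick back-of-the-envelope check of the per-head dimension and of the memory needed for the weights alone. This is a rough estimate under the assumption that every parameter is stored at the same precision; activations, the decoder cache, and framework overhead are not included:

```python
PARAMETERS = 1_550_000_000  # 1550M, from the table above
HIDDEN_SIZE = 1280
ATTENTION_HEADS = 20


def weights_gib(num_parameters, bytes_per_parameter):
    """Memory taken by the weights alone, in GiB."""
    return num_parameters * bytes_per_parameter / 1024**3


head_dim = HIDDEN_SIZE // ATTENTION_HEADS
print("dimension per attention head:", head_dim)  # 64
print("float16 weights (GiB):", round(weights_gib(PARAMETERS, 2), 2))
print("float32 weights (GiB):", round(weights_gib(PARAMETERS, 4), 2))
```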

Base Model Lineage

markdown

openai/whisper-large-v3 → inesc-id/WhisperLv3-FT → Neoscopio-SA/Neo_EP

Training

Neo_EP was fine-tuned in two sequential stages on NVIDIA A100 GPUs (Deucalion HPC, Portugal) using Hugging Face Transformers Seq2SeqTrainer:

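| Stage | Dataset | Epochs | Batch Size | Learning Rate | Warmup Steps | Scheduler | Precision |
|---|---|---|---|---|---|---|---|
| 1 | EuroSpeech | 1 | 16 | 5e-6 | 200 | Linear | bf16 |
| 2 | FalaBracarense | 1 | 16 | 5e-6 | 200 | Linear | bf16 |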

Both stages used gradient checkpointing and no evaluation split (100% training data).
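To make the schedule concrete, the sketch below computes the learning rate at a given optimizer step. It reads the table as a peak learning rate of 5e-6 with 200 warmup steps, and assumes that "Linear" means linear warmup followed by linear decay to zero, as in the linear schedule shipped with Transformers. The total step count is not stated in this card, so the value used here is purely hypothetical:

```python
def linear_schedule_lr(step, total_steps, peak_lr=5e-6, warmup_steps=200):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 10_000  # hypothetical; depends on dataset size and batch size
for step in (0, 100, 200, 5_100, 10_000):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```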

Evaluated Use

The primary intended users of this model are developers and researchers working on European Portuguese speech processing. Neo_EP is suitable for:

  • Transcription of meetings, interviews, lectures, and phone calls in PT-PT
  • Voice-driven applications targeting European Portuguese speakers
  • Research on ASR for European Portuguese

We recommend that users perform robust evaluations of the model in their particular context and domain before deploying it in production.
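One common way to run such an evaluation is to compute the word error rate (WER) of the model output against reference transcripts from the target domain. The following is a minimal pure-Python sketch, not the evaluation used by the model authors; the function name and sample sentences are illustrative:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("bom dia a todos", "bom dia todos"))
```

Because casing and punctuation differences count as errors in this sketch, references should be normalized to the same conventions as the model output before scoring.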

Performance and Limitations

Neo_EP demonstrates improved transcription accuracy for European Portuguese compared to the base model. However, the following limitations apply:

  • 30-second receptive field: Standard Whisper constraint. Use the pipeline with chunk_length_s=30 for longer audio.
  • No punctuation or casing: Output is lowercase and unpunctuated.
  • Hallucination: Like all Whisper-based models, Neo_EP may generate text not actually spoken in the audio, especially on silent or noisy segments.
  • Repetition: The sequence-to-sequence architecture can produce repetitive text, which can be mitigated with no_repeat_ngram_size and beam search.
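As a complement to the generation-time mitigations listed above, repetition can also be flagged after the fact. The sketch below reports word n-grams that occur more than once in a transcript. It is a simple heuristic under the assumption that repeated word trigrams are suspicious, and it works on words, whereas no_repeat_ngram_size operates on tokens. The function name and sample text are illustrative:

```python
def repeated_ngrams(text, n=3):
    """Return the word n-grams that occur more than once, in order of first repeat."""
    words = text.split()
    seen = set()
    repeated = []
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i:i + n])
        if ngram in seen and ngram not in repeated:
            repeated.append(ngram)
        seen.add(ngram)
    return repeated


sample = "muito obrigado a todos muito obrigado a todos"
print(repeated_ngrams(sample))
```

Legitimate speech can repeat phrases too, so a hit is a cue for manual review rather than proof of a decoding loop.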

Model provider: Neoscopio-SA

Model tree: inesc-id/WhisperLv3-FT (base) → this model (fine-tuned)

Modalities: audio input, text output
