License: apache-2.0
Usage
Neo_EP is supported in Hugging Face Transformers. To run the model, first install the Transformers library. For this example, we'll also install Datasets to load audio data from the Hugging Face Hub:
```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```
The model can be used with the pipeline class to transcribe audio of arbitrary length:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Neoscopio-SA/Neo_EP"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("audio.mp3")
print(result["text"])
```
Multiple audio files can be transcribed in parallel by specifying them as a list and setting the batch_size parameter:
```python
result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
```
To transcribe with timestamps, pass the return_timestamps argument:
```python
result = pipe("audio.mp3", return_timestamps=True)
print(result["chunks"])
```
And for word-level timestamps:
```python
result = pipe("audio.mp3", return_timestamps="word")
print(result["chunks"])
```
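The timestamped chunks can be post-processed into other formats. As a minimal sketch, assuming each chunk is a dict with a `text` string and a `timestamp` pair of `(start, end)` seconds, the hypothetical helper `chunks_to_srt` below renders SubRip (SRT) subtitles. The helper names, the fallback duration, and the sample data are all invented for illustration and are not part of the model or of Transformers.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks, fallback_duration=2.0):
    """Render pipeline-style chunks as an SRT document (hypothetical helper)."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: the end time of a final chunk may be missing,
            # so fall back to a fixed display duration.
            end = start + fallback_duration
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle entries.
    return "\n".join(blocks)


# Invented sample data in the assumed chunk shape.
sample_chunks = [
    {"text": " bom dia a todos", "timestamp": (0.0, 2.5)},
    {"text": " vamos começar", "timestamp": (2.5, None)},
]
print(chunks_to_srt(sample_chunks))
```

With real output, `chunks_to_srt(result["chunks"])` would be the equivalent call. Word-level chunks usually make for very short subtitle entries, so segment-level timestamps are likely the better input.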
Neo_EP is pre-configured for European Portuguese. If you want to explicitly set the language and task:
```python
result = pipe("audio.mp3", generate_kwargs={"language": "portuguese", "task": "transcribe"})
```
For finer control over generation, the model and processor can also be called directly:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Neoscopio-SA/Neo_EP"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# audio_array: numpy array, sampling_rate: 16000
inputs = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 5,
    "no_repeat_ngram_size": 3,
    "return_timestamps": True,
    "language": "pt",
    "task": "transcribe",
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)
print(pred_text)
```
Additional Speed & Memory Improvements
You can apply additional speed and memory improvements to Neo_EP to further reduce inference time and VRAM requirements.
Chunked Long-Form
Neo_EP has a receptive field of 30 seconds. To transcribe audio longer than this, pass the chunk_length_s parameter to the pipeline. For Neo_EP, a chunk length of 30 seconds is optimal. To activate batching over long audio files, pass the batch_size argument:
```python
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("long_audio.mp3")
print(result["text"])
```
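To give a feel for how a long file is split, here is a simplified sketch of overlapping 30-second windows. It is not the pipeline's actual code: the function name `chunk_windows` is hypothetical, and the overlap of one sixth of the chunk length on each side is an assumption about the pipeline's default stride that should be checked against the installed Transformers version.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_s=None):
    """Return (start, end) windows in seconds that cover an audio clip."""
    if stride_s is None:
        # Assumption: overlap of one sixth of the chunk length on each side.
        stride_s = chunk_length_s / 6
    # Each new window advances by the part of the chunk that is not overlap.
    step = chunk_length_s - 2 * stride_s
    if step <= 0:
        raise ValueError("stride is too large for the chunk length")

    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the clip is fully covered
        start += step
    return windows


print(chunk_windows(70.0))
# [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

Under these assumptions a 70-second file yields three windows, which would fit in a single batch when `batch_size=16`. Longer files produce more windows and therefore more batches.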
Flash Attention 2
We recommend using Flash Attention 2 if your GPU supports it. To do so, first install Flash Attention:
```bash
pip install flash-attn --no-build-isolation
```
Then pass attn_implementation="flash_attention_2" to from_pretrained:
```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")
```
Torch Scaled Dot-Product Attention (SDPA)
If your GPU does not support Flash Attention, we recommend making use of PyTorch scaled dot-product attention (SDPA). SDPA is activated by default for PyTorch versions 2.1.1 or greater. It can also be set explicitly:
```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa")
```
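The choice between the two attention backends can be made at runtime. The sketch below is one possible way to do it, not an official recipe: `pick_attn_implementation` is a hypothetical helper, and the rule that Flash Attention 2 needs a GPU with compute capability 8.0 or higher (Ampere or newer) is stated here as an assumption to verify against the flash-attn documentation.

```python
import importlib.util


def pick_attn_implementation(has_cuda, compute_capability_major=None):
    """Pick a value for the attn_implementation argument (hypothetical helper)."""
    # Check whether the flash_attn package is importable, without importing it.
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    # Assumption: Flash Attention 2 requires compute capability 8.x or newer.
    gpu_ok = (
        has_cuda
        and compute_capability_major is not None
        and compute_capability_major >= 8
    )
    if flash_installed and gpu_ok:
        return "flash_attention_2"
    # Fall back to PyTorch scaled dot-product attention.
    return "sdpa"


# No GPU available, so the fallback is selected.
print(pick_attn_implementation(has_cuda=False))
```

In a real script the two arguments could come from `torch.cuda.is_available()` and `torch.cuda.get_device_capability()[0]`, and the returned string would be passed as `attn_implementation` to `from_pretrained`.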
Model Details
Neo_EP is a Transformer-based encoder-decoder model with the same architecture as openai/whisper-large-v3:
| Property | Value |
|---|---|
| Parameters | 1550M |
| Encoder Layers | 32 |
| Decoder Layers | 32 |
| Attention Heads | 20 |
| Hidden Size | 1280 |
| Mel Frequency Bins | 128 |
| Max Sequence Length | 448 tokens |
| Receptive Field | 30 seconds |
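The figures in the table allow a rough back-of-the-envelope check of the memory needed for the weights, which is why the usage examples load the model in float16 on GPU. The sketch below treats 1550M as an approximate parameter count and counts weights only. Activations, beam search state, and framework overhead are not included, so actual VRAM use will be higher.

```python
# Values taken from the model details table.
NUM_PARAMS = 1550e6  # approximate parameter count
HIDDEN_SIZE = 1280
ATTENTION_HEADS = 20

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params, dtype):
    """Estimate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Each attention head works on an equal slice of the hidden dimension.
head_dim = HIDDEN_SIZE // ATTENTION_HEADS
print(f"head dimension: {head_dim}")

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.1f} GB of weights")
```

This gives roughly 6.2 GB in float32 and 3.1 GB in float16, with a per-head dimension of 64.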
Base Model Lineage
```markdown
openai/whisper-large-v3 → inesc-id/WhisperLv3-FT → Neoscopio-SA/Neo_EP
```
Training
Neo_EP was fine-tuned in two sequential stages on NVIDIA A100 GPUs (Deucalion HPC, Portugal) using Hugging Face Transformers Seq2SeqTrainer:
| Stage | Dataset | Epochs | Batch Size | Learning Rate | Warmup Steps | Scheduler | Precision |
|---|---|---|---|---|---|---|---|
| 1 | EuroSpeech | 1 | 16 | 5e-6 | 200 | Linear | bf16 |
| 2 | FalaBracarense | 1 | 16 | 5e-6 | 200 | Linear | bf16 |
Both stages used gradient checkpointing and trained on 100% of the data, with no held-out evaluation split.
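To illustrate the learning rate settings in the table, the sketch below computes a linear schedule with warmup. It assumes that "Linear" means a linear ramp from zero to the peak rate over the warmup steps, followed by a linear decay to zero at the end of training. The total number of training steps is not given in this card, so `total_steps=10_000` in the example is a made-up value.

```python
def linear_schedule_lr(step, total_steps, peak_lr=5e-6, warmup_steps=200):
    """Learning rate at a given step for linear warmup then linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from the peak to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# total_steps is hypothetical; the card does not state it.
for step in (0, 100, 200, 5_100, 10_000):
    print(step, linear_schedule_lr(step, total_steps=10_000))
```

The peak rate of 5e-6 is reached at step 200 and then shrinks steadily, which keeps updates small when fine-tuning from an already fine-tuned checkpoint.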
Evaluated Use
The primary intended users of this model are developers and researchers working on European Portuguese speech processing. Neo_EP is suitable for:
- Transcription of meetings, interviews, lectures, and phone calls in PT-PT
- Voice-driven applications targeting European Portuguese speakers
- Research on ASR for European Portuguese
We recommend that users perform robust evaluations of the model in their particular context and domain before deploying it in production.
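A common starting point for such an evaluation is word error rate (WER). The sketch below is a minimal pure-Python version, not the evaluation used by the model authors. Because this card describes the output as lowercase and unpunctuated, the reference text is normalized the same way before comparison. The normalization is deliberately simplistic: it strips ASCII punctuation only, which also removes hyphens from hyphenated Portuguese forms.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")

    # Levenshtein distance over words, keeping one previous row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Invented example: punctuation and casing differences do not count as errors.
print(word_error_rate("Bom dia, a todos!", "bom dia a todos"))
```

For real evaluations, a maintained library with a Portuguese-aware text normalizer is likely a better choice than this sketch.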
Performance and Limitations
Neo_EP demonstrates improved transcription accuracy for European Portuguese compared to the base model. However, the following limitations apply:
- 30-second receptive field: Standard Whisper constraint. Use the `pipeline` with `chunk_length_s=30` for longer audio.
- No punctuation or casing: Output is lowercase and unpunctuated.
- Hallucination: Like all Whisper-based models, Neo_EP may generate text not actually spoken in the audio, especially on silent or noisy segments.
- Repetition: The sequence-to-sequence architecture can produce repetitive text, which can be mitigated with `no_repeat_ngram_size` and beam search.
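Repetition can also be flagged after the fact. The sketch below is a hypothetical word-level check that lists every n-gram occurring more than once in a transcript. It is only a rough heuristic for spotting suspicious output: `no_repeat_ngram_size` acts on tokens during generation rather than on words afterwards, and legitimate speech can repeat phrases too.

```python
def repeated_ngrams(text, n=3):
    """Return word n-grams that occur more than once, in first-repeat order."""
    words = text.split()
    seen = set()
    repeated = []
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        if gram in seen and gram not in repeated:
            repeated.append(gram)
        seen.add(gram)
    return repeated


# Invented transcript with a looped phrase.
transcript = "muito obrigado a todos muito obrigado a todos"
print(repeated_ngrams(transcript))
# [('muito', 'obrigado', 'a'), ('obrigado', 'a', 'todos')]
```

A transcript that returns a long list here may be worth re-running with a larger beam or a different `no_repeat_ngram_size`.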
Model provider: Neoscopio-SA
Model tree: inesc-id/WhisperLv3-FT (base) → Neoscopio-SA/Neo_EP (fine-tuned, this model)
Modalities: audio input, text output