Neoscopio-SA/Neo_EP API & Inference Endpoint

Usage

Neo_EP is supported in Hugging Face Transformers. To run the model, first install the Transformers library. For this example, we'll also install Datasets to load audio data from the Hugging Face Hub:

bash
pip install --upgrade pip
pip install --upgrade transformers "datasets[audio]" accelerate

The model can be used with the pipeline class to transcribe audio of arbitrary length:

python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Neoscopio-SA/Neo_EP"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("audio.mp3")
print(result["text"])

Multiple audio files can be transcribed in parallel by specifying them as a list and setting the batch_size parameter:

python
result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
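
When a list is passed, the pipeline returns one dictionary per input, in input order. A minimal sketch of pairing each file with its transcription follows; the `result` list here is a hypothetical stand-in for real pipeline output, and `pair_transcripts` is an illustrative helper, not part of Transformers:

```python
# Hypothetical stand-in for the list the pipeline returns for a list input:
# one dict per file, in the same order as the inputs.
paths = ["audio_1.mp3", "audio_2.mp3"]
result = [{"text": " bom dia"}, {"text": " boa tarde"}]


def pair_transcripts(paths, result):
    """Map each input path to its whitespace-stripped transcription."""
    if len(paths) != len(result):
        raise ValueError("expected one result per input file")
    return {path: item["text"].strip() for path, item in zip(paths, result)}


transcripts = pair_transcripts(paths, result)
for path, text in transcripts.items():
    print(f"{path}: {text}")
```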

To transcribe with timestamps, pass the return_timestamps argument:

python
result = pipe("audio.mp3", return_timestamps=True)
print(result["chunks"])

And for word-level timestamps:

python
result = pipe("audio.mp3", return_timestamps="word")
print(result["chunks"])
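
Each entry in `result["chunks"]` is a dictionary with a `text` string and a `timestamp` pair of start and end times in seconds. As one possible use, the sketch below turns such chunks into SRT subtitles. The helper names and the sample chunks are illustrative assumptions, and the code treats a missing end time (which may occur on the final chunk) as equal to the start time:

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Convert pipeline-style chunks into the text of an SRT file."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # assumption: fall back to the start time
            end = start
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hypothetical chunks in the shape returned with return_timestamps=True.
sample_chunks = [
    {"text": " bom dia a todos", "timestamp": (0.0, 2.5)},
    {"text": " vamos começar", "timestamp": (2.5, 4.08)},
]
print(chunks_to_srt(sample_chunks))
```

The same function works for word-level chunks, although one subtitle entry per word is rarely what you want; grouping words into phrases first is left to the application.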

Neo_EP is pre-configured for European Portuguese. To set the language and task explicitly, pass them through generate_kwargs:

python
result = pipe("audio.mp3", generate_kwargs={"language": "portuguese", "task": "transcribe"})

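For finer control over the generation parameters, you can also call the model and processor directly:

python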
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Neoscopio-SA/Neo_EP"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# audio_array must be defined beforehand: a 1-D NumPy array of mono audio samples at 16 kHz
inputs = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 5,
    "no_repeat_ngram_size": 3,
    "return_timestamps": True,
    "language": "pt",
    "task": "transcribe",
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)
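
The example above leaves `audio_array` undefined. One way to obtain it without extra dependencies is sketched below, under the assumption that the input is a 16-bit PCM mono WAV file already sampled at 16 kHz. `load_wav_16k_mono` is a hypothetical helper, not part of Transformers; it returns a plain list of floats, which you can wrap with `numpy.asarray` if you prefer an array:

```python
import array
import sys
import wave


def load_wav_16k_mono(path):
    """Read a 16-bit PCM mono WAV at 16 kHz as floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        if wav.getframerate() != 16000:
            raise ValueError("expected a 16 kHz sampling rate; resample first")
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return [sample / 32768.0 for sample in samples]


# Usage (the file name is a placeholder):
# audio_array = load_wav_16k_mono("audio.wav")
```

For other formats or sampling rates, decode and resample with an audio library first; this sketch deliberately rejects such files rather than guessing.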

Additional Speed & Memory Improvements

You can apply additional speed and memory improvements to Neo_EP to further reduce inference time and VRAM requirements.

Chunked Long-Form

Neo_EP has a receptive field of 30 seconds. To transcribe audio longer than this, pass the chunk_length_s parameter to the pipeline. For Neo_EP, a chunk length of 30 seconds is optimal. To activate batching over long audio files, pass the batch_size argument:

python
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("long_audio.mp3")
print(result["text"])
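
To give an intuition for what chunking does, the sketch below computes overlapping 30-second windows over a long recording. It is a simplified illustration, not the pipeline's actual implementation: it assumes an overlap of `chunk_length_s / 6` on each side of a chunk, and the function name `chunk_windows` is hypothetical:

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_length_s=None):
    """Return (start, end) times in seconds of overlapping chunks."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if stride_length_s is None:
        stride_length_s = chunk_length_s / 6  # assumed overlap per side
    # Each new chunk advances by the chunk length minus both overlaps.
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("stride is too large for the chunk length")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows


for start, end in chunk_windows(70):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

With these assumptions a 70-second file is covered by three windows that overlap by 10 seconds, which lets the overlapping regions be reconciled when the per-chunk transcriptions are merged.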

Flash Attention 2

We recommend using Flash Attention 2 if your GPU supports it. To do so, first install Flash Attention:

bash
pip install flash-attn --no-build-isolation

Then pass attn_implementation="flash_attention_2" to from_pretrained:

python
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2"
)

Torch Scaled Dot-Product Attention (SDPA)

If your GPU does not support Flash Attention, we recommend making use of PyTorch scaled dot-product attention (SDPA). SDPA is activated by default for PyTorch versions 2.1.1 or greater. It can also be set explicitly:

python
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa"
)
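
If the same script has to run on machines with and without Flash Attention, the choice between the two options can be made at runtime. The sketch below is one way to do it; `pick_attn_implementation` is a hypothetical helper. It only checks that the `flash_attn` package is importable, which does not guarantee that the GPU itself supports Flash Attention:

```python
import importlib.util


def pick_attn_implementation(cuda_available):
    """Prefer Flash Attention 2 when it looks usable, else fall back to SDPA."""
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and flash_installed:
        return "flash_attention_2"
    return "sdpa"


# Without CUDA the helper always selects SDPA.
attn_implementation = pick_attn_implementation(cuda_available=False)
print(attn_implementation)  # sdpa
```

In a real script you would pass `torch.cuda.is_available()` as the argument and forward the returned string as `attn_implementation` to `from_pretrained`.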

Model Details

Neo_EP is a Transformer-based encoder-decoder model with the same architecture as openai/whisper-large-v3:

Property	Value
Parameters	1550M
Encoder Layers	32
Decoder Layers	32
Attention Heads	20
Hidden Size	1280
Mel Frequency Bins	128
Max Sequence Length	448 tokens
Receptive Field	30 seconds
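
The figures in the table translate into concrete input shapes. The arithmetic below assumes that Neo_EP keeps the standard Whisper feature-extraction settings (16 kHz audio, a hop of 160 samples between mel frames, and a stride-2 convolution in the encoder); these values are not stated in this card:

```python
SAMPLING_RATE = 16_000  # Hz, as used in the usage examples
WINDOW_S = 30           # receptive field in seconds (from the table)
N_MELS = 128            # mel frequency bins (from the table)
HOP_LENGTH = 160        # assumed: standard Whisper hop between frames
CONV_STRIDE = 2         # assumed: standard Whisper encoder downsampling

n_samples = SAMPLING_RATE * WINDOW_S          # samples per 30 s window
n_frames = n_samples // HOP_LENGTH            # mel frames per window
encoder_positions = n_frames // CONV_STRIDE   # encoder sequence length

print(f"samples per window:   {n_samples}")
print(f"input features shape: ({N_MELS}, {n_frames})")
print(f"encoder positions:    {encoder_positions}")
```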

Base Model Lineage

markdown
openai/whisper-large-v3 → inesc-id/WhisperLv3-FT → Neoscopio-SA/Neo_EP

Training

Neo_EP was fine-tuned in two sequential stages on NVIDIA A100 GPUs (Deucalion HPC, Portugal) using Hugging Face Transformers Seq2SeqTrainer:

Stage	Dataset	Epochs	Batch Size	Learning Rate	Warmup Steps	Scheduler	Precision
1	EuroSpeech	1	16	5e-6	200	Linear	bf16
2	FalaBracarense	1	16	5e-6	200	Linear	bf16

Both stages used gradient checkpointing and no evaluation split (100% training data).
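
For reference, a linear scheduler with warmup raises the learning rate from zero to its peak over the warmup steps and then decays it linearly to zero. The sketch below reproduces that shape with the values from the table (peak 5e-6, 200 warmup steps). The total number of training steps is not given in this card, so `total_steps=1000` is a purely hypothetical value, and `linear_schedule_lr` is an illustrative function rather than the trainer's own code:

```python
def linear_schedule_lr(step, total_steps, peak_lr=5e-6, warmup_steps=200):
    """Learning rate at a given step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical; depends on dataset size and batch size
for step in (0, 100, 200, 600, 1000):
    print(f"step {step:4d}: lr = {linear_schedule_lr(step, total_steps):.2e}")
```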

Evaluated Use

The primary intended users of this model are developers and researchers working on European Portuguese speech processing. Neo_EP is suitable for:

Transcription of meetings, interviews, lectures, and phone calls in PT-PT
Voice-driven applications targeting European Portuguese speakers
Research on ASR for European Portuguese

We recommend that users perform robust evaluations of the model in their particular context and domain before deploying it in production.
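
One common metric for such an evaluation is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. This card does not prescribe a metric, so the dependency-free sketch below is only a starting point; the function name and the sample sentences are illustrative:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "o gato dorme no sofá"
hypothesis = "o gato dorme sofá"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")  # WER: 0.20
```

Apply the same text normalization (casing, punctuation) to references and model output before comparing them, otherwise formatting differences are counted as errors.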

Performance and Limitations

Neo_EP demonstrates improved transcription accuracy for European Portuguese compared to the base model. However, the following limitations apply:

30-second receptive field: Standard Whisper constraint. Use the pipeline with chunk_length_s=30 for longer audio.
No punctuation or casing: Output is lowercase and unpunctuated.
Hallucination: Like all Whisper-based models, Neo_EP may generate text not actually spoken in the audio, especially on silent or noisy segments.
Repetition: The sequence-to-sequence architecture can produce repetitive text, which can be mitigated with no_repeat_ngram_size and beam search.
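
Repetition can also be checked after the fact. The sketch below flags word-level n-grams that occur more than once in a transcript, which may help spot looping output worth re-transcribing. Note that this is a post-hoc, word-level heuristic with a hypothetical helper name; `no_repeat_ngram_size` itself operates on tokens during generation, and legitimate speech can of course contain repeated phrases:

```python
def repeated_ngrams(text, n=3):
    """Return word n-grams that appear more than once, in first-repeat order."""
    words = text.split()
    seen = set()
    repeated = []
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        if gram in seen and gram not in repeated:
            repeated.append(gram)
        seen.add(gram)
    return repeated


transcript = "muito obrigado a todos muito obrigado a todos"
for gram in repeated_ngrams(transcript):
    print(" ".join(gram))
```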
