License: other

Intended Use

This model is intended for Japanese ASR on speech with galgame- or anime-style delivery, including:

  • visual novel and game voice transcription
  • subtitle generation workflows
  • Japanese character dialogue with expressive voice acting
  • research on domain adaptation from general ASR models to anime-style speech

It is not yet validated as a general-purpose Japanese ASR model. For broad Japanese speech, compare against the original base model before production use.

Training

The eval loss above comes from the training run's internal evaluation split. It should not be treated as an external benchmark score.

Evaluation

Fixed 800-Clip Benchmark

The following numbers are from a fixed 800-clip Japanese ASR evaluation set sampled with seed 20260531. The set contains 200 clips from each source:

| source | dataset | split | clips | duration |
|---|---|---|---|---|
| Nekopara | grider-transwithai/nekopara-speech | train | 200 | 991.0s |
| Anime Speech | joujiboi/japanese-anime-speech | train | 200 | 1053.5s |
| JSUT Basic5000 | japanese-asr/ja_asr.jsut_basic5000 | test | 200 | 1067.4s |
| Common Voice 8.0 JA | japanese-asr/ja_asr.common_voice_8_0 | test | 200 | 996.4s |

Total: 800 clips, 4108.3s audio, 17354 reference characters.
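A fixed sample of this kind can be drawn reproducibly by seeding a local random generator. The sketch below is an assumption about the procedure, not the author's script: the `sample_fixed` helper and the placeholder clip IDs are hypothetical, and only the seed and the per-source clip count come from the description above.

```python
import random

SEED = 20260531          # seed stated in the benchmark description
CLIPS_PER_SOURCE = 200   # clips drawn from each source


def sample_fixed(clip_ids, k=CLIPS_PER_SOURCE, seed=SEED):
    """Return k clip IDs chosen reproducibly from clip_ids (hypothetical helper)."""
    # Sort first so the result does not depend on the input order.
    ordered = sorted(clip_ids)
    # A local Random instance avoids touching the global random state.
    rng = random.Random(seed)
    return sorted(rng.sample(ordered, k))


# Placeholder IDs standing in for real dataset rows.
pool = [f"clip_{i:05d}" for i in range(5000)]
subset = sample_fixed(pool)
```

Sorting the pool before sampling means the same clips are selected even if the dataset rows are listed in a different order on another machine.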

Metric: strict character error rate (CER) after removing whitespace and common Japanese/ASCII punctuation. S, I, and D are the substitution, insertion, and deletion counts divided by the number of reference characters, so CER = S + I + D. The same decoding and normalization were used for all models.
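The metric can be sketched as a character-level edit-distance alignment that tracks each error type separately. This is a minimal illustration under stated assumptions, not the author's evaluation code: the exact punctuation set is not given in the card, so `PUNCT` below is a guess, and the rates are assumed to be micro-averaged over all reference characters.

```python
# Assumed punctuation set; the card does not list the exact characters removed.
PUNCT = set("、。，．・！？「」『』（）…,.!?\"'()[]:;-")


def normalize(text):
    """Drop whitespace (including full-width spaces) and assumed punctuation."""
    return "".join(ch for ch in text if not ch.isspace() and ch not in PUNCT)


def edit_counts(ref, hyp):
    """Return (substitutions, insertions, deletions) for a minimum-edit alignment."""
    # Each cell holds (total_edits, S, I, D) for ref[:i] versus hyp[:j].
    prev = [(j, 0, j, 0) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        cur = [(i, 0, 0, i)]
        for j, h in enumerate(hyp, start=1):
            if r == h:
                cur.append(prev[j - 1])  # match: no new edit
                continue
            s, n, d = prev[j - 1], cur[j - 1], prev[j]
            cur.append(min(
                (s[0] + 1, s[1] + 1, s[2], s[3]),  # substitution
                (n[0] + 1, n[1], n[2] + 1, n[3]),  # insertion (extra hyp char)
                (d[0] + 1, d[1], d[2], d[3] + 1),  # deletion (missing ref char)
                key=lambda c: c[0],
            ))
        prev = cur
    _, subs, ins, dels = prev[-1]
    return subs, ins, dels


def corpus_cer(pairs):
    """Micro-averaged CER and S/I/D rates over (reference, hypothesis) pairs."""
    subs = ins = dels = chars = 0
    for ref, hyp in pairs:
        ref, hyp = normalize(ref), normalize(hyp)
        s, i, d = edit_counts(ref, hyp)
        subs, ins, dels, chars = subs + s, ins + i, dels + d, chars + len(ref)
    if chars == 0:
        raise ValueError("no reference characters to score")
    return {"cer": (subs + ins + dels) / chars, "S": subs / chars,
            "I": ins / chars, "D": dels / chars}
```

When several alignments have the same minimum edit count, the split between S, I, and D depends on tie-breaking, so component rates from different tools may differ slightly even when the total CER agrees.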

| model | rows | CER | S | I | D |
|---|---|---|---|---|---|
| Qwen/Qwen3-ASR-0.6B | 800 | 0.1673 | 0.1025 | 0.0214 | 0.0434 |
| jaykwok/Qwen3-ASR-0.6B-JA-Anime-Galgame | 800 | 0.1438 | 0.0962 | 0.0228 | 0.0249 |
| Qwen/Qwen3-ASR-1.7B | 800 | 0.1437 | 0.0851 | 0.0169 | 0.0418 |
| jaykwok/Qwen3-ASR-1.7B-JA-Anime-Galgame | 800 | 0.1285 | 0.0812 | 0.0231 | 0.0242 |

CER by source:

| model | Nekopara | Anime Speech | JSUT | Common Voice |
|---|---|---|---|---|
| Qwen/Qwen3-ASR-0.6B | 0.2900 | 0.1244 | 0.1297 | 0.1552 |
| jaykwok/Qwen3-ASR-0.6B-JA-Anime-Galgame | 0.2392 | 0.0811 | 0.1207 | 0.1568 |
| Qwen/Qwen3-ASR-1.7B | 0.2803 | 0.1091 | 0.0948 | 0.1269 |
| jaykwok/Qwen3-ASR-1.7B-JA-Anime-Galgame | 0.2276 | 0.0799 | 0.0998 | 0.1312 |

For this 0.6B checkpoint, full supervised fine-tuning (SFT) improves overall CER from 0.1673 to 0.1438, a 14.0% relative reduction. The largest component gain is in the deletion rate, which drops from 0.0434 to 0.0249. In-domain gains are stronger: Nekopara CER improves by 17.5% relative, and Anime Speech CER improves by 34.8% relative. Common Voice is effectively flat (0.1552 to 0.1568), so this checkpoint should still be treated primarily as a galgame/anime-domain model rather than a general Japanese ASR upgrade.
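The relative figures quoted here follow from (base - tuned) / base. As a quick check, the snippet below recomputes them from the 0.6B rows of the tables above; the `relative_reduction` helper name is illustrative only.

```python
def relative_reduction(base, tuned):
    """Relative CER reduction of tuned versus base, as a percentage."""
    return 100.0 * (base - tuned) / base


# (base CER, fine-tuned CER) for the 0.6B models, copied from the tables above.
scores = {
    "overall": (0.1673, 0.1438),
    "Nekopara": (0.2900, 0.2392),
    "Anime Speech": (0.1244, 0.0811),
    "JSUT": (0.1297, 0.1207),
    "Common Voice": (0.1552, 0.1568),
}

for name, (base, tuned) in scores.items():
    # A positive value means the fine-tuned model has lower CER.
    print(f"{name}: {relative_reduction(base, tuned):+.1f}%")
```

This prints +14.0% overall, +17.5% for Nekopara, +34.8% for Anime Speech, +6.9% for JSUT, and -1.0% for Common Voice, where the negative sign marks a slight regression.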

These numbers are a small reproducible sanity benchmark, not a comprehensive public leaderboard. Strict character CER can over-penalize kana/kanji variants, long-vowel spelling, expressive writing, and transcript style differences.

Additional Evaluation Candidates

For a larger follow-up benchmark, use a fixed sample instead of evaluating every available hour. Recommended candidate sets for a practical next pass:

| dataset | domain | suggested subset | reason |
|---|---|---|---|
| ntaquan0125/steinsgate-voice | visual novel | 500-2000 clips | small, strongly in-domain, but check access/license first |
| grider-transwithai/nekopara-speech | visual novel/game voice | 500-2000 fixed random clips | relevant character voice with metadata; use the full distribution unless you need content filtering |
| joujiboi/japanese-anime-speech | anime/VN dialogue | 1000-3000 fixed random clips | broader anime-style speech; full set is larger, so sample first |
| makiligon/Blue-Archive-Japanese-Voicelines | game/anime voice lines | 500 clips if transcripts exist | very small download, but card/viewer metadata appears incomplete |
| ja_asr.common_voice_8_0 | general Japanese | full or 1000 clips | quick out-of-domain sanity check |
| ja_asr.jsut_basic5000 | read Japanese | full or 1000 clips | compact read-speech regression check |

Report CER plus substitution, insertion, and deletion rates, with the exact normalization and decoding settings.

Repository Contents

This repository intentionally includes training recovery artifacts:

  • model.safetensors
  • tokenizer and processor files
  • optimizer.pt
  • scheduler.pt
  • rng_state.pth
  • trainer_state.json
  • training_args.bin

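For inference-only use, only the model weights and the tokenizer and processor files are needed. The optimizer and scheduler files, along with the other trainer state files, are not required.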
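One way to skip the recovery artifacts is to filter the file listing by name before downloading. The sketch below is illustrative: `inference_files` is a hypothetical helper, the sample listing uses placeholder tokenizer and config file names, and treating `rng_state.pth`, `trainer_state.json`, and `training_args.bin` as training-only is an assumption based on their usual role as trainer state.

```python
from fnmatch import fnmatch

# Training-recovery files named in the list above; assumed unused at inference.
TRAINING_ONLY = ["optimizer.pt", "scheduler.pt", "rng_state.pth",
                 "trainer_state.json", "training_args.bin"]


def inference_files(filenames, ignore=TRAINING_ONLY):
    """Keep only files that do not match any training-only pattern."""
    return [f for f in filenames if not any(fnmatch(f, pat) for pat in ignore)]


# Hypothetical listing; the real tokenizer and processor file names may differ.
listing = ["config.json", "model.safetensors", "tokenizer_config.json",
           "optimizer.pt", "scheduler.pt", "rng_state.pth",
           "trainer_state.json", "training_args.bin"]
print(inference_files(listing))
```

If you download with `huggingface_hub`, the same list can be passed as the `ignore_patterns` argument of `snapshot_download` to avoid fetching the optimizer state at all.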

Inference

Use the same inference stack as the upstream Qwen3-ASR models, replacing the model id with:

jaykwok/Qwen3-ASR-0.6B-JA-Anime-Galgame

Refer to the upstream Qwen3-ASR documentation for the latest supported inference commands and runtime requirements.

Limitations

  • The model is specialized for galgame/anime-style Japanese speech and may be less reliable on news, meetings, lectures, or spontaneous conversation.
  • The training data may contain adult or NSFW source material. Downstream users should account for domain and content bias.
  • The published benchmark is small and should be treated as a sanity check rather than a full leaderboard result.
  • Transcriptions may still contain hallucinations, punctuation differences, or style-specific handling of non-speech vocalizations.

License and Use

The base model Qwen/Qwen3-ASR-0.6B is released under Apache-2.0.

This fine-tuned checkpoint was trained on litagin/Galgame_Speech_ASR_16kHz. Users must review and comply with the dataset license and upstream terms before redistribution, commercial use, or further fine-tuning. This model card does not grant rights beyond the upstream model and dataset licenses.

Model provider: jaykwok

Base model: Qwen/Qwen3-ASR-0.6B

Modalities: audio input, text output
