zhifeixie

AudioInteraction

License: apache-2.0

Model Details

Model name: Audio-Interaction
Task: Streaming audio-conditioned text generation (audio in, text out)
Audio encoder: Qwen2.5-Omni audio tower (chunk-wise)
Audio framing: 16 kHz, padded to 0.4-second (6400-sample) boundaries; 10 encoder-output frames per chunk
Decoding states: LISTENING (emits KEEP_SILENCE / TEXT_BEGIN) and SPEAKING (emits text until TEXT_END)
Default sampling: temperature 0.3, top-k 3
Default max new tokens: 4096 per session
License: Apache-2.0
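
The default sampling settings above (temperature 0.3, top-k 3) can be illustrated with a generic sketch of temperature plus top-k sampling. This is not the project's sampling code: the function name `sample_top_k` and its signature are hypothetical, and it works on a plain Python list of logits rather than a tensor.

```python
import math
import random


def sample_top_k(logits, temperature=0.3, top_k=3, rng=random):
    """Sample one token index using temperature scaling and top-k filtering.

    Hypothetical helper for illustration; defaults mirror the model card.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the indices of the k largest logits.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights, shifted by the max for stability.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]
```

With a low temperature and a small k, most of the probability mass lands on the highest-scoring token, which keeps the LISTENING-state decision between KEEP_SILENCE and TEXT_BEGIN close to greedy.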

Repository Contents

text
Audio-Interaction/
├── model-00001-of-00004.safetensors      # LM weights, sharded (≈4 GB each)
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json          # Shard index consumed by safetensors loader
├── config.json                           # Top-level model config
├── generation_config.json                # Generation defaults
├── model_config.yaml                     # GPT config consumed by Config.from_file
├── hyperparameters.yaml                  # Training-time hyperparameters (reference)
├── tokenizer.json                        # Tokenizer
├── tokenizer_config.json
├── MiniOmni3_ChunkwisedEncoder.pth       # Audio encoder weights (Qwen2.5-Omni audio tower)
└── qwen25OmniConfig/                     # Audio-encoder config (nested: thinker_config.audio_config)
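
After downloading, it can help to confirm that every entry in the listing above is present before loading the model. The following is a minimal sketch using only the standard library; the helper name `missing_checkpoint_entries` is hypothetical, and the file list is copied from the tree above.

```python
from pathlib import Path

# File and directory names copied from the repository listing.
EXPECTED_FILES = [
    "model-00001-of-00004.safetensors",
    "model-00002-of-00004.safetensors",
    "model-00003-of-00004.safetensors",
    "model-00004-of-00004.safetensors",
    "model.safetensors.index.json",
    "config.json",
    "generation_config.json",
    "model_config.yaml",
    "hyperparameters.yaml",
    "tokenizer.json",
    "tokenizer_config.json",
    "MiniOmni3_ChunkwisedEncoder.pth",
]
EXPECTED_DIRS = ["qwen25OmniConfig"]


def missing_checkpoint_entries(checkpoint_dir):
    """Return the expected files and directories absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

Calling `missing_checkpoint_entries("checkpoints")` returns an empty list when the download is complete.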

Intended Use

Audio-Interaction is intended for streaming conversational agents that need to react to audio as it arrives. Examples include voice assistants that may interject mid-utterance, alarms that respond to ambient sound, and low-latency dialogue systems where waiting for a full utterance before replying is too slow.

Quick Start

Installation

bash
git clone https://github.com/xzf-thu/Audio-Interaction.git
cd Audio-Interaction
conda create -n Audio-Interaction python=3.10 -y
conda activate Audio-Interaction
pip install -r requirements.txt

Download the checkpoint

From the Audio-Interaction project root, pull the weights into checkpoints/:

python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="zhifeixie/Audio-Interaction", local_dir="checkpoints")

snapshot_download is the recommended path: it fetches every file in the repository and, if a download is interrupted, resumes it when re-run.

Python Usage

python
from src.miniomni3.generate.run import run_inference

run_inference(
    checkpoint_dir="checkpoints",
    audio_paths=["/path/to/audio.wav"], # offline mode: one round per path
    device="cuda:0",                    # or "mps" / "cpu"
)
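
Because the model works on 16 kHz audio, it can be useful to inspect a file before passing it to run_inference. The sketch below uses the standard-library `wave` module, which reads uncompressed PCM WAV files only; the helper names `describe_wav` and `is_16k_mono` are hypothetical and not part of the project.

```python
import wave


def describe_wav(path):
    """Read sample rate, channel count, and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "seconds": wav.getnframes() / rate,
        }


def is_16k_mono(path):
    """True if the file is already 16 kHz, single-channel audio."""
    info = describe_wav(path)
    return info["sample_rate"] == 16000 and info["channels"] == 1
```

The check is informational: it reports whether a file already matches the expected format and does not perform any resampling itself.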

Streaming Protocol

A single session looks like:

text
[system prompt tokens]
  ┌─── LISTENING ───┐
  │ AUDIO_BEGIN PAD*10 ASSISTANT  →  KEEP_SILENCE          (keep listening)
  │ AUDIO_BEGIN PAD*10 ASSISTANT  →  TEXT_BEGIN EMOTION    (start replying)
  └─────────────────┘
  ┌─── SPEAKING ────┐
  │ … text tokens … TEXT_END                                (reply finished)
  └─────────────────┘
  ┌─── LISTENING ───┐  (next audio chunk)
  …

The model is trained to emit at most one TEXT_BEGIN per audio chunk. Each assistant turn begins with TEXT_BEGIN, followed by an emotion token, the reply tokens, and TEXT_END. A KEEP_SILENCE output means the model chose not to respond to that chunk and keeps listening.
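
The two-state protocol can be sketched as a small parser over the model's output tokens. This is an illustration under stated assumptions, not the project's decoder: tokens are represented as plain strings (the real model works with token ids), the function name `parse_session` is hypothetical, and the emotion token `"<happy>"` in the usage example is made up.

```python
def parse_session(tokens):
    """Group a flat stream of output tokens into silence and reply events."""
    events = []
    state = "LISTENING"
    reply = None
    stream = iter(tokens)
    for token in stream:
        if state == "LISTENING":
            if token == "KEEP_SILENCE":
                # The model chose not to respond to this audio chunk.
                events.append({"type": "silence"})
            elif token == "TEXT_BEGIN":
                # The token right after TEXT_BEGIN is the emotion token.
                emotion = next(stream, None)
                reply = {"type": "reply", "emotion": emotion,
                         "text": [], "complete": False}
                state = "SPEAKING"
            else:
                raise ValueError(f"unexpected token while listening: {token!r}")
        elif token == "TEXT_END":
            reply["complete"] = True
            events.append(reply)
            reply = None
            state = "LISTENING"
        else:
            reply["text"].append(token)
    if reply is not None:
        # Stream ended mid-reply, e.g. the max-new-tokens budget ran out.
        events.append(reply)
    return events
```

For example, `parse_session(["KEEP_SILENCE", "TEXT_BEGIN", "<happy>", "Hi", "TEXT_END"])` yields one silence event followed by one completed reply whose text is `["Hi"]`.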

Limitations

The model produces text, not speech. Pair it with a TTS system for end-to-end voice interaction.
Audio is processed as 16 kHz mono; inputs in other formats are resampled, and all inputs are padded to 0.4-second boundaries.
Decisions are made at 0.4-second granularity (one encoder chunk), which sets a floor on response-onset latency.
Trailing partial audio chunks shorter than 10 encoder frames are dropped before generation.
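
The framing numbers above follow from simple arithmetic: 0.4 s at 16 kHz is 6400 samples, and each such chunk yields 10 encoder frames. The sketch below works through that arithmetic for a clip of arbitrary length. The function names are hypothetical, and the sketch models only the padding step; how padding interacts with the dropping of trailing partial chunks is not specified here.

```python
SAMPLE_RATE = 16_000      # Hz, from the model card
CHUNK_SAMPLES = 6_400     # 0.4 s * 16 kHz
FRAMES_PER_CHUNK = 10     # encoder-output frames per chunk


def padded_length(num_samples):
    """Round a sample count up to the next multiple of CHUNK_SAMPLES."""
    return -(-num_samples // CHUNK_SAMPLES) * CHUNK_SAMPLES


def chunk_plan(num_samples):
    """Summarize padding, chunk count, and encoder frames for a clip."""
    padded = padded_length(num_samples)
    chunks = padded // CHUNK_SAMPLES
    return {
        "padded_samples": padded,
        "pad_samples": padded - num_samples,
        "chunks": chunks,
        "encoder_frames": chunks * FRAMES_PER_CHUNK,
        "seconds": padded / SAMPLE_RATE,
    }
```

A one-second clip (16000 samples) pads to 19200 samples, giving 3 chunks and 30 encoder frames, so the model makes three listen-or-speak decisions over that clip.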

Citation

bibtex
@misc{xie2026audiointeractionmodel,
      title={Audio Interaction Model}, 
      author={Zhifei Xie and Zihang Liu and Ze An and Xiaobin Hu and Yue Liao and Ziyang Ma and Dongchao Yang and Mingbao Lin and Deheng Ye and Shuicheng Yan and Chunyan Miao},
      year={2026},
      eprint={2606.05121},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2606.05121}, 
}

Acknowledgements

Audio-Interaction builds on the Qwen2.5-Omni audio encoder. We thank the Qwen team and the maintainers of OpenAI Whisper for the audio-loading utilities used in this project.
