Developer Uses ONNX Runtime and Pyannote 3.0 to Split Two-Speaker Audio Into Segments
A developer has demonstrated how to detect speaker changes in a two-person audio conversation using an ONNX version of the Pyannote Segmentation 3.0 model running on CPU via ONNX Runtime. The experiment uses FFmpeg to decode a roughly 14-second MP3 recording into a 16 kHz mono waveform, which is then processed in 10-second windows to identify where one speaker gives way to another. The pipeline successfully separates six alternating utterances into six individual WAV files while maintaining consistent speaker indexing throughout. Post-processing steps handle silence, brief fluctuations, and potential overlapping speech using probability thresholds and minimum segment duration rules. The author notes this is not a full diarization pipeline, as it relies on the model's internal speaker indexes rather than embedding comparison or clustering across longer recordings.
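The windowing and post-processing steps described above can be sketched in a few lines of Python. This is an illustrative sketch, not the author's code. It assumes the model output has already been reduced to one probability per speaker per frame, and that windows do not overlap and the last one is zero-padded; the summary specifies neither. The function names, the 0.5 threshold, the minimum duration, and the `max_gap` parameter are all hypothetical choices made for the example.

```python
SAMPLE_RATE = 16_000   # Hz, as stated in the summary
WINDOW_SECONDS = 10    # window length, as stated in the summary


def split_into_windows(samples, window_len=SAMPLE_RATE * WINDOW_SECONDS):
    """Cut a mono waveform into fixed-size windows, zero-padding the last one.

    Assumption: windows do not overlap (the summary does not say either way).
    """
    windows = []
    for start in range(0, len(samples), window_len):
        chunk = list(samples[start:start + window_len])
        chunk += [0.0] * (window_len - len(chunk))  # pad a short final window
        windows.append(chunk)
    return windows


def frames_to_segments(probs, frame_sec, threshold=0.5, min_dur=0.25, max_gap=0.5):
    """Turn per-frame speaker probabilities into (speaker, start_s, end_s) tuples.

    probs: list of frames, each a list holding one probability per speaker.
    All default parameter values are hypothetical, not taken from the article.
    """
    # 1. Label each frame with its most likely speaker, or None for silence.
    labels = []
    for frame in probs:
        best = max(range(len(frame)), key=lambda k: frame[k])
        labels.append(best if frame[best] >= threshold else None)

    # 2. Collapse consecutive identical labels into [label, start, end) runs.
    runs = []
    for i, label in enumerate(labels):
        if runs and runs[-1][0] == label:
            runs[-1][2] = i + 1
        else:
            runs.append([label, i, i + 1])

    # 3. Drop silence and too-short runs; rejoin a speaker across small gaps.
    segments = []
    for label, start, end in runs:
        if label is None or (end - start) * frame_sec < min_dur:
            continue
        start_s, end_s = start * frame_sec, end * frame_sec
        if segments and segments[-1][0] == label and start_s - segments[-1][2] <= max_gap:
            segments[-1] = (label, segments[-1][1], end_s)
        else:
            segments.append((label, start_s, end_s))
    return segments
```

Under these assumptions, a 14-second clip at 16 kHz (224,000 samples) yields two windows of 160,000 samples each, the second mostly padding. Each returned segment's start and end times could then be converted to sample offsets to cut the individual WAV files.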
This is an AI-generated summary. ShortSingh links to the original source for the complete article.