Silero VAD and ONNX Runtime Detect 12 Speech Segments in 14-Second Audio Clip
A developer used the Silero VAD ONNX model with ONNX Runtime's CPU execution provider to detect speech in a 14.171-second, two-speaker MP3 conversation. FFmpeg decoded the audio into a 16 kHz mono waveform, which was processed in 32-millisecond chunks (512 samples at 16 kHz) to produce a speech probability for each chunk. With a threshold of 0.5 to open a segment and a lower threshold of 0.35 to close it, the system identified 12 distinct speech segments, discarding any shorter than 250 milliseconds. Detection took 0.028 seconds on a Mac Studio, a real-time factor of about 0.002 (processing time divided by audio duration). Each detected segment was saved as a separate 16-bit PCM WAV file, and the full reproducible code is available in the kiarina/labs GitHub repository.
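The two-threshold rule described above can be illustrated with a minimal Python sketch. This is not the code from the kiarina/labs repository: the function names `segment_speech` and `real_time_factor` are hypothetical, the probabilities in the usage lines are made up, and the sketch assumes a segment opens at a probability of 0.5 or higher, closes when the probability drops below 0.35, and is filtered by length only after it closes. The original may handle boundaries or padding differently.

```python
from typing import List, Sequence, Tuple

SAMPLE_RATE = 16_000   # Hz, as stated in the summary
CHUNK_SAMPLES = 512    # 32 ms at 16 kHz
CHUNK_SEC = CHUNK_SAMPLES / SAMPLE_RATE


def segment_speech(
    probs: Sequence[float],
    open_thr: float = 0.5,
    close_thr: float = 0.35,
    min_dur: float = 0.25,
) -> List[Tuple[float, float]]:
    """Turn per-chunk speech probabilities into (start_sec, end_sec) pairs.

    Assumption: open when p >= open_thr, close when p < close_thr.
    """
    spans = []    # (start_chunk, end_chunk), end exclusive
    start = None  # index of the chunk that opened the current segment
    for i, p in enumerate(probs):
        if start is None and p >= open_thr:
            start = i
        elif start is not None and p < close_thr:
            spans.append((start, i))
            start = None
    if start is not None:
        # Audio ended while a segment was still open.
        spans.append((start, len(probs)))
    # Drop segments shorter than min_dur and convert chunk indices to seconds.
    return [
        (s * CHUNK_SEC, e * CHUNK_SEC)
        for s, e in spans
        if (e - s) * CHUNK_SEC >= min_dur
    ]


def real_time_factor(processing_sec: float, audio_sec: float) -> float:
    """Processing time divided by audio duration; below 1 is faster than real time."""
    return processing_sec / audio_sec


# Made-up probabilities: one 8-chunk segment, then a 1-chunk blip that is discarded.
demo = [0.1, 0.6, 0.4, 0.4, 0.6, 0.7, 0.8, 0.9, 0.6, 0.3, 0.1, 0.9, 0.2]
print(segment_speech(demo))
print(round(real_time_factor(0.028, 14.171), 3))
```

Because the closing threshold is lower than the opening one, the brief dips to 0.4 in the demo data do not split the first segment; only the drop to 0.3 ends it.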
This is an AI-generated summary. ShortSingh links to the original source for the complete article.