optionally use temp files on load_audio() to avoid running out of RAM when dealing with long audio files by to-audiobook · Pull Request #1221 · m-bain/whisperX

to-audiobook · 2025-09-02T20:22:30Z

Very long audio files can crash whisperX during the call to load_audio() if the system runs out of RAM.

This PR adds the parameter useTmpFiles to load_audio(). When it is enabled, ffmpeg resamples the audio into a temporary file instead of trying to do it all in memory, which substantially increases the audio length whisperX can handle.
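
To illustrate the idea, here is a minimal sketch, not the code in this patch. The function names (`build_ffmpeg_cmd`, `read_pcm_s16le`, `load_audio_via_tmp_file`), the 16 kHz mono target, and the `.pcm` suffix are assumptions made for the example. ffmpeg writes raw 16-bit PCM to a temporary file, so its output never has to sit in a pipe buffer in RAM, and the file is then read into a numpy array.

```python
import os
import subprocess
import tempfile

import numpy as np

SAMPLE_RATE = 16000  # assumption: 16 kHz mono target


def build_ffmpeg_cmd(src: str, dst: str, sr: int = SAMPLE_RATE) -> list[str]:
    """Build an ffmpeg command that writes mono 16-bit PCM at `sr` Hz to `dst`."""
    return [
        "ffmpeg", "-nostdin", "-y", "-threads", "0",
        "-i", src,
        "-f", "s16le", "-ac", "1", "-acodec", "pcm_s16le", "-ar", str(sr),
        dst,
    ]


def read_pcm_s16le(path: str) -> np.ndarray:
    """Read raw little-endian 16-bit PCM from disk as float32 in [-1, 1)."""
    audio = np.fromfile(path, dtype="<i2").astype(np.float32)
    audio /= 32768.0  # scale in place to avoid one more full-size temporary
    return audio


def load_audio_via_tmp_file(src: str, sr: int = SAMPLE_RATE) -> np.ndarray:
    """Hypothetical helper: resample `src` through a temporary file on disk."""
    fd, tmp_path = tempfile.mkstemp(suffix=".pcm")
    os.close(fd)  # ffmpeg reopens the path itself
    try:
        subprocess.run(build_ffmpeg_cmd(src, tmp_path, sr),
                       check=True, capture_output=True)
        return read_pcm_s16le(tmp_path)
    finally:
        os.remove(tmp_path)  # always clean up the temporary file
```

The trade-off is extra disk I/O and temporary disk space in exchange for a lower peak RAM usage during loading.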

NOTE: if the audio file is long enough to crash the system during the load_audio() call, so that temporary files are required, the system will probably run out of memory during the diarization stage too. I dealt with that by splitting the source audio into two or more parts, using the timestamps from the alignment stage to avoid splitting the audio in the middle of a sentence. That code is not included in this patch and, even if it were, it still has some issues: if we split the audio, the diarization might assign different speaker IDs to the same speaker in each of those parts.
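
The splitting approach described in the note could look roughly like the sketch below. It is not the author's code, which is not part of this patch. It assumes the alignment result is available as a list of dicts that each carry an `"end"` time in seconds; `choose_split_points` and `max_chunk_seconds` are hypothetical names. Cuts are only ever placed at the end of a segment, so no sentence is split.

```python
from typing import Dict, List, Optional, Sequence


def choose_split_points(segments: Sequence[Dict[str, float]],
                        max_chunk_seconds: float) -> List[float]:
    """Return cut times (in seconds) that always coincide with a segment end.

    A cut is placed at the end of the previous segment whenever adding the
    current segment would make the running chunk longer than the limit.
    A single segment longer than the limit is kept whole.
    """
    cuts: List[float] = []
    chunk_start = 0.0
    last_end: Optional[float] = None
    for seg in segments:
        too_long = seg["end"] - chunk_start > max_chunk_seconds
        if too_long and last_end is not None and last_end > chunk_start:
            cuts.append(last_end)
            chunk_start = last_end
        last_end = seg["end"]
    return cuts


# Usage with made-up timestamps
segments = [
    {"start": 0.0, "end": 10.0},
    {"start": 10.0, "end": 25.0},
    {"start": 25.0, "end": 40.0},
    {"start": 40.0, "end": 55.0},
    {"start": 55.0, "end": 70.0},
]
print(choose_split_points(segments, max_chunk_seconds=30.0))
```

This only picks the cut positions. It does nothing about the speaker ID problem mentioned above, since each part would still be diarized independently.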

Freeing the ffmpeg output buffer once its contents have been loaded into a numpy array leaves much more free RAM for the operations that follow. This helps avoid running out of RAM when dealing with very long audio files.
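
A minimal sketch of that pattern follows, assuming the ffmpeg output arrives as a `bytes` object of little-endian 16-bit PCM. The helper names `pcm_bytes_to_float32` and `load_and_release` are hypothetical and only serve the illustration.

```python
import gc
from typing import Callable

import numpy as np


def pcm_bytes_to_float32(raw: bytes) -> np.ndarray:
    """Convert little-endian 16-bit PCM bytes to float32 in [-1, 1)."""
    # np.frombuffer is a zero-copy view onto `raw`; astype() then allocates
    # an independent float32 array that no longer references the bytes.
    audio = np.frombuffer(raw, dtype="<i2").astype(np.float32)
    audio /= 32768.0  # in-place scaling, no extra full-size temporary
    return audio


def load_and_release(get_raw: Callable[[], bytes]) -> np.ndarray:
    """Fetch the raw buffer, convert it, then drop the buffer before returning."""
    raw = get_raw()  # e.g. the captured stdout of an ffmpeg call
    audio = pcm_bytes_to_float32(raw)
    del raw          # drop this function's reference to the raw buffer
    gc.collect()     # ask the collector to reclaim anything still pending
    return audio
```

The buffer is only actually freed if nothing else still holds a reference to it, so the caller should avoid keeping the subprocess result object around.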

Apparently some default argument values changed after transformers v4.51.0. I believe num_beams is the culprit: it used to be 1 and is now set to 5. See huggingface/transformers#40682. According to them, if you pass num_beams=1 to the pipeline, versions >4.51.0 should be as fast as before. Since I am not sure exactly where to put that, I'll just lock the version for now.

Thanks to https://github.com/mve/whisperX/tree/update-diarization-model for the pyannote 4.0.0 update.

Starting from v4.5.0, CTranslate2 requires cudnn-9 instead of cudnn-8.
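
As a small illustration of that rule, the sketch below maps a CTranslate2 version string to the cuDNN major version it expects. The function name `required_cudnn_major` is hypothetical, and the 8-versus-9 split encodes nothing more than the statement above.

```python
def required_cudnn_major(ctranslate2_version: str) -> int:
    """Return the cuDNN major version expected by a CTranslate2 release.

    Assumption taken from the note above: 4.5.0 and later need cuDNN 9,
    earlier releases need cuDNN 8.
    """
    parts = ctranslate2_version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return 9 if (major, minor) >= (4, 5) else 8


# Usage: in a real environment the version string could come from
# importlib.metadata.version("ctranslate2").
print(required_cudnn_major("4.5.0"))
```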

to-audiobook and others added 18 commits September 2, 2025 18:50

free ffmpeg output buffer before manipulating np array

f1aef55

execute garbage collection after deleting ffmpeg out buffer

d3697fc

maybe if we use a temporary file

73146e5

temp files worked. Lets try both

7d8f84c

oops

28721fd

oops I did it again

d64cb0b

try catch does not work. So let the user decide

46a06ff

update to pyannote 4.0.0

f083aac

updated ctranslate2 to >= 4.5.0 (needs cudnn-9 now)

eb7d1f9

added ssh to gitignore

c7d320c

Merge branch 'dev-to.audiobook'

be8c5c5

force torch/torchcodec versions because torchcodec has crappy error reporting

82a47cf

missed a 0 in torchcodec version

8741d1a

Update pyproject.toml

b7a1626

Update pyproject.toml

69a243a

Update pyproject.toml

1f00bfb

Update pyproject.toml

2403889

to-audiobook wants to merge 18 commits into m-bain:main from to-audiobook:main

to-audiobook commented Sep 2, 2025

2 participants
