optionally use temp files on load_audio() to avoid running out of RAM when dealing with long audio files #1221
Open
to-audiobook wants to merge 18 commits into m-bain:main
Freeing the ffmpeg output buffer once its contents have been loaded into a numpy array leaves much more RAM free for the subsequent operations. This helps avoid running out of RAM when dealing with very long audio files.
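A minimal sketch of the idea, not the code from this PR: the helper name `pcm16_to_float32` is hypothetical, and it assumes ffmpeg was asked for raw little-endian 16-bit PCM. The point is to keep only one full-size copy alive at a time and to drop every reference to the raw byte buffer as soon as the float array exists.

```python
import numpy as np


def pcm16_to_float32(raw: bytes) -> np.ndarray:
    """Convert raw little-endian 16-bit PCM bytes to float32 samples in [-1, 1)."""
    # Zero-copy, read-only view onto the byte buffer.
    samples = np.frombuffer(raw, dtype="<i2")
    # The one copy we cannot avoid: int16 -> float32.
    audio = samples.astype(np.float32)
    # Drop the view so it no longer keeps the byte buffer alive.
    del samples
    # Scale in place instead of allocating a second float32 array.
    audio /= 32768.0
    return audio
```

On the caller's side, the buffer returned by ffmpeg (for example the `stdout` of `subprocess.run`) would then be released with `del` right after the conversion, so that only the float32 array remains in memory for the following stages.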
Apparently some default argument values changed after transformers v4.51.0, and I believe num_beams is the culprit: it used to be 1 and is now set to 5. See huggingface/transformers#40682. According to them, if you pass num_beams=1 to the pipeline, versions >4.51.0 should be as fast as before. But since I am not exactly sure where to put that, I'll just lock the version for now.
Starting from v4.5.0, CTranslate2 requires cuDNN 9 instead of cuDNN 8.
Very long audio files might crash whisperX during the call to `load_audio()` if the system runs out of RAM.

This PR adds the parameter `useTmpFiles` to `load_audio()`, which makes `ffmpeg` resample the audio using a temporary file instead of trying to do it all in memory, thus substantially increasing the audio length whisperX can handle.

NOTE: if the audio file is long enough to crash the system during the `load_audio()` call, requiring the use of temporary files, the system will probably run out of memory during the diarization stage too. I dealt with that by splitting the source audio into two or more parts, using the timestamps from the alignment stage to avoid splitting the audio in the middle of a sentence. That code is not included in this patch and, even if it were, it would still have some issues: if we split the audio, the diarization might assign different speaker IDs to the same speaker in each of those parts.
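For readers who want a picture of the temporary-file approach, here is a rough sketch. It is not the implementation in this PR: the function names `build_ffmpeg_cmd` and `load_audio_via_tmp_file` are hypothetical, and the ffmpeg flags are an assumption about how the audio is resampled (mono, 16 kHz, raw 16-bit PCM). The idea is that ffmpeg writes its output to a file on disk rather than to a pipe that Python has to buffer entirely in RAM.

```python
import os
import subprocess
import tempfile


def build_ffmpeg_cmd(src: str, dst: str, sr: int = 16000) -> list:
    """Build an ffmpeg command that writes mono 16-bit PCM at `sr` Hz to `dst`."""
    return [
        "ffmpeg", "-nostdin", "-y", "-threads", "0",
        "-i", src,
        "-f", "s16le", "-ac", "1", "-acodec", "pcm_s16le",
        "-ar", str(sr),
        dst,  # a file path instead of "-" (stdout)
    ]


def load_audio_via_tmp_file(src: str, sr: int = 16000):
    """Resample `src` through a temporary file and return float32 samples."""
    import numpy as np  # imported lazily; only needed to read the result

    # Reserve a temp path, then close the handle so ffmpeg can write to it.
    fd, tmp_path = tempfile.mkstemp(suffix=".pcm")
    os.close(fd)
    try:
        subprocess.run(build_ffmpeg_cmd(src, tmp_path, sr),
                       check=True, capture_output=True)
        # Read the samples straight from disk into an int16 array.
        samples = np.fromfile(tmp_path, dtype="<i2")
    finally:
        os.remove(tmp_path)  # always clean up the temp file

    audio = samples.astype(np.float32)
    del samples            # release the int16 copy before returning
    audio /= 32768.0       # scale in place to [-1, 1)
    return audio
```

Only ffmpeg's stderr is captured here, which is small; the audio data itself never passes through a Python byte buffer.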