How to transcribe audio files using OpenAI API

When preparing data for an AI Persona Chatbot or any AI program that draws from pre-recorded audio, you might find yourself needing to transcribe large volumes of audio files. One of the tools available for this task is OpenAI’s “Whisper” Automatic Speech Recognition (ASR) system. Whisper has been trained on an extensive dataset that’s both multilingual and multitask, sourced from the web.

To transcribe audio using the Whisper ASR API, you’ll interact with OpenAI’s API. It’s essential to note that the API comes with associated costs, and there are limitations, especially when dealing with large batches of MP3 files. As an illustration, I incurred a charge of 2.24 USD to transcribe approximately 4 hours of audio, which equated to 411 MB of MP3 files.

The Whisper API accepts files up to 25 MB. If an audio file exceeds this limit, you’ll need to split it into segments of 25 MB or smaller, or re-encode it in a more heavily compressed audio format.
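Before splitting, a quick back-of-the-envelope estimate can tell you whether re-encoding at a lower bitrate would bring a file under the limit instead. The sketch below is plain arithmetic; the 64 kbps default is an illustrative assumption, not a figure from the API:

```python
API_LIMIT_MB = 25  # Whisper API upload limit

def cbr_size_mb(duration_s: float, bitrate_kbps: int) -> float:
    """Approximate size of constant-bitrate audio: 1 kbps = 125 bytes per second."""
    return duration_s * bitrate_kbps * 125 / (1024 * 1024)

def fits_after_reencode(duration_s: float, bitrate_kbps: int = 64) -> bool:
    """True if re-encoding at bitrate_kbps should land under the upload limit."""
    return cbr_size_mb(duration_s, bitrate_kbps) <= API_LIMIT_MB
```

A 30-minute recording at 64 kbps works out to roughly 13.7 MB, so it could be sent whole; a 2-hour recording at the same bitrate would still need splitting.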

Code Overview

To get started, you’ll need to install two Python libraries. Run the following command:

pip install pydub "openai<1.0"

The version pin matters: this script uses the pre-1.0 interface of the openai library (openai.Audio.transcribe), which was removed in v1.0. pydub additionally depends on the ffmpeg binary for MP3 decoding and encoding; ffmpeg is not a pip package, so install it with your operating system’s package manager.
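Because pydub only wraps the ffmpeg command-line tool, a small pre-flight check avoids confusing decoder errors later. This is a sketch using only the standard library:

```python
import shutil

def tool_available(name: str) -> bool:
    """Return True if an external binary is on PATH; pydub shells out to ffmpeg."""
    return shutil.which(name) is not None

if __name__ == "__main__":
    if not tool_available("ffmpeg"):
        raise SystemExit("ffmpeg not found on PATH; install it with your OS package manager.")
```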

The subsequent code will:

  • Scan a designated directory and read every MP3 file.
  • Split any file larger than 25 MB into smaller segments.
  • Transcribe each file or segment with the Whisper API.
  • Save each transcription to its own text file.
import os
from pydub import AudioSegment
import openai

# Load OpenAI API key from a file
with open('api_key.txt', 'r') as key_file:
    openai.api_key = key_file.readline().strip()

# Define constants
MAX_FILE_SIZE_MB = 25
AUDIO_CHUNK_LENGTH_MS = 10 * 60 * 1000  # 10 minutes in milliseconds

def split_audio_file(file_path):
    """Splits an audio file into chunks of 10 minutes each.
    
    Args:
        file_path (str): Path to the audio file.
        
    Returns:
        list: List of audio chunks.
    """
    audio = AudioSegment.from_mp3(file_path)
    audio_chunks = []
    
    for i in range(0, len(audio), AUDIO_CHUNK_LENGTH_MS):
        chunk = audio[i:i + AUDIO_CHUNK_LENGTH_MS]
        audio_chunks.append(chunk)
    
    return audio_chunks

def transcribe_chunk(chunk, prompt=""):
    """Transcribes an audio chunk using OpenAI's API.
    
    Args:
        chunk (AudioSegment): Audio segment to be transcribed.
        prompt (str, optional): Additional text to help guide the transcription. Defaults to "".
        
    Returns:
        str: Transcription of the audio chunk.
    """
    # Save chunk to a temporary file
    temp_file = "temp_chunk.mp3"
    chunk.export(temp_file, format="mp3")
    
    with open(temp_file, "rb") as audio_file:
        # Pass the prompt parameter to the transcription API
        transcript = openai.Audio.transcribe("whisper-1", audio_file, prompt=prompt).get('text')
    
    # Remove temporary file
    os.remove(temp_file)
    
    return transcript

def main():
    """Main function to transcribe audio files in a directory."""
    directory = '/path/to/mp3/files'
    
    for filename in os.listdir(directory):
        if filename.endswith('.mp3'):
            file_path = os.path.join(directory, filename)
            file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
            
            # If the file is larger than the MAX_FILE_SIZE_MB, split it into chunks
            if file_size_mb > MAX_FILE_SIZE_MB:
                audio_chunks = split_audio_file(file_path)
                transcriptions = []
                previous_transcript = ""  # Initialize an empty previous transcript
                
                for chunk in audio_chunks:
                    # Pass the previous transcript as a prompt to give the model context
                    transcript_chunk = transcribe_chunk(chunk, prompt=previous_transcript)
                    transcriptions.append(transcript_chunk)
                    # Update previous_transcript for the next chunk
                    previous_transcript = transcript_chunk
                
                full_transcription = '\n'.join(transcriptions)
                
            else:
                # If the file is smaller than the maximum size, transcribe it directly
                with open(file_path, "rb") as audio_file:
                    full_transcription = openai.Audio.transcribe("whisper-1", audio_file).get('text')
            
            # Write transcription to a .txt file
            with open(os.path.splitext(file_path)[0] + '.txt', 'w', encoding='utf-8') as txt_file:
                txt_file.write(full_transcription)

if __name__ == "__main__":
    # Execute the main function when the script is run directly
    main()

Download the code from GitHub.

Key Points in the Code

  • File Setup: Ensure your OpenAI API key is written to a file named api_key.txt in the same directory as the script.
  • Chunk Adjustments: You can modify the AUDIO_CHUNK_LENGTH_MS variable to alter the chunk length, but keep the Whisper API’s 25 MB limit in mind. This script breaks files larger than 25 MB into 10-minute chunks; at a high enough bitrate, a 10-minute chunk can itself exceed 25 MB. If that happens, shorten the chunk length or derive it from the file’s actual bitrate.
  • Maintaining Context: To retain context across segmented files, the script prompts the model with the transcript from the prior segment, which improves accuracy at chunk boundaries. The previous_transcript variable holds the last chunk’s transcript; transcribe_chunk passes it to the API as the prompt for the next chunk, and after each chunk is processed, previous_transcript is updated with that chunk’s transcription.
  • Function Breakdown:
    • split_audio_file splits an audio file into 10-minute chunks.
    • transcribe_chunk transcribes an audio chunk, and its transcription can be guided by an optional prompt.
    • main processes each audio file in a designated directory. For files bigger than 25MB, they are divided, transcribed by chunks, and then concatenated. Smaller files are directly transcribed. The final transcription is saved as a .txt file in the same directory.
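As a refinement to the fixed 10-minute chunks, the chunk length can be derived from the file’s own size and duration, so each exported chunk stays under the limit regardless of bitrate. This is a sketch, not part of the original script; the 10% safety margin is an assumption to cover re-export overhead:

```python
def safe_chunk_length_ms(file_size_bytes: int, duration_ms: int,
                         limit_mb: int = 25, margin: float = 0.9) -> int:
    """Longest chunk (in ms) whose share of the file stays under limit_mb.

    Assumes roughly constant bitrate, so size scales linearly with duration.
    """
    limit_bytes = limit_mb * 1024 * 1024 * margin
    bytes_per_ms = file_size_bytes / duration_ms
    return int(limit_bytes / bytes_per_ms)
```

For the batch mentioned above (411 MB over roughly 4 hours), this works out to about 13 minutes per chunk; the result could stand in for the AUDIO_CHUNK_LENGTH_MS constant used by split_audio_file.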

Conclusion

With the above steps and the provided script, you’ll be able to transcribe large volumes of audio files efficiently using OpenAI’s Whisper ASR API. Always remember to stay updated with API costs and adjust your strategies based on the sizes of your audio files.
