When preparing data for an AI Persona Chatbot or any AI program that draws from pre-recorded audio, you might find yourself needing to transcribe large volumes of audio files. One of the tools available for this task is OpenAI’s “Whisper” Automatic Speech Recognition (ASR) system. Whisper has been trained on an extensive dataset that’s both multilingual and multitask, sourced from the web.
It's essential to note that the Whisper API is a paid service, and there are limitations, especially when dealing with large batches of MP3 files. As a point of reference, I was charged 2.24 USD to transcribe approximately 4 hours of audio, which amounted to 411 MB of MP3 files.
The Whisper API accepts files up to 25 MB. If an audio file exceeds this size, you'll need to divide it into segments of 25 MB or smaller, or re-encode it into a more compressed audio format.
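If you'd rather compress than split, re-encoding at a lower bitrate is often enough to bring a recording under the limit. Here is a minimal sketch using pydub (installed in the next section); the 64 kbps bitrate and the file names are illustrative assumptions:

from pydub import AudioSegment

# Re-encode an MP3 at a lower bitrate to shrink it below the 25 MB limit.
# 64 kbps is an assumed value; use the lowest bitrate that still sounds
# clear enough for speech.
audio = AudioSegment.from_mp3("large_recording.mp3")
audio.export("large_recording_64k.mp3", format="mp3", bitrate="64k")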
Code Overview
To get started, you’ll need to install a few libraries. Run the following command:
pip install pydub openai
Note that pydub depends on the ffmpeg binary for MP3 support; ffmpeg is not a Python package, so install it with your operating system’s package manager.
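For example, on a Debian-based Linux distribution or on macOS with Homebrew:

# Debian/Ubuntu
sudo apt-get install ffmpeg

# macOS with Homebrew
brew install ffmpeg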
The subsequent code will:
- Scan and read every MP3 file from a designated directory.
- Split any MP3 files that exceed the size limit into smaller segments.
- Utilize the Whisper API to transcribe each segment.
- Save the transcriptions to individual text files.
import os

from pydub import AudioSegment
import openai

# Load the OpenAI API key from a file
with open('api_key.txt', 'r') as key_file:
    openai.api_key = key_file.readline().strip()

# Define constants
MAX_FILE_SIZE_MB = 25
AUDIO_CHUNK_LENGTH_MS = 10 * 60 * 1000  # 10 minutes in milliseconds


def split_audio_file(file_path):
    """Splits an audio file into chunks of 10 minutes each.

    Args:
        file_path (str): Path to the audio file.

    Returns:
        list: List of audio chunks.
    """
    audio = AudioSegment.from_mp3(file_path)
    audio_chunks = []
    for i in range(0, len(audio), AUDIO_CHUNK_LENGTH_MS):
        chunk = audio[i:i + AUDIO_CHUNK_LENGTH_MS]
        audio_chunks.append(chunk)
    return audio_chunks


def transcribe_chunk(chunk, prompt=""):
    """Transcribes an audio chunk using OpenAI's API.

    Args:
        chunk (AudioSegment): Audio segment to be transcribed.
        prompt (str, optional): Additional text to help guide the transcription. Defaults to "".

    Returns:
        str: Transcription of the audio chunk.
    """
    # Save the chunk to a temporary file
    temp_file = "temp_chunk.mp3"
    chunk.export(temp_file, format="mp3")
    with open(temp_file, "rb") as audio_file:
        # Pass the prompt to the transcription API to provide context
        transcript = openai.Audio.transcribe("whisper-1", audio_file, prompt=prompt).get('text')
    # Remove the temporary file
    os.remove(temp_file)
    return transcript


def main():
    """Main function to transcribe audio files in a directory."""
    directory = '/path/to/mp3/files'
    for filename in os.listdir(directory):
        if filename.endswith('.mp3'):
            file_path = os.path.join(directory, filename)
            file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
            # If the file is larger than MAX_FILE_SIZE_MB, split it into chunks
            if file_size_mb > MAX_FILE_SIZE_MB:
                audio_chunks = split_audio_file(file_path)
                transcriptions = []
                previous_transcript = ""  # Initialize an empty previous transcript
                for chunk in audio_chunks:
                    # Pass the previous transcript as a prompt to preserve context
                    transcript_chunk = transcribe_chunk(chunk, prompt=previous_transcript)
                    transcriptions.append(transcript_chunk)
                    # Update previous_transcript for the next chunk
                    previous_transcript = transcript_chunk
                full_transcription = '\n'.join(transcriptions)
            else:
                # If the file is within the size limit, transcribe it directly
                with open(file_path, "rb") as audio_file:
                    full_transcription = openai.Audio.transcribe("whisper-1", audio_file).get('text')
            # Write the transcription to a .txt file
            with open(os.path.splitext(file_path)[0] + '.txt', 'w') as txt_file:
                txt_file.write(full_transcription)


if __name__ == "__main__":
    # Execute the main function when the script is run directly
    main()
Download the code from GitHub.
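To use it, point the directory variable in main() at the folder containing your MP3 files, then run the script; a .txt transcription will appear alongside each MP3 (the script filename here is an assumption):

python transcribe_audio.py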
Key Points in the Code
- File Setup: Ensure your OpenAI API key is written to a file named api_key.txt in the same directory as the script.
- Chunk Adjustments: You can modify the AUDIO_CHUNK_LENGTH_MS variable to alter the chunk length, but keep the Whisper API's limits in mind. This script breaks files larger than 25 MB into 10-minute chunks for transcription. If your 10-minute chunks still exceed the 25 MB limit (for example, with high-bitrate recordings), shorten the chunk length or devise a more refined splitting technique, such as the silence-based splitting sketched after this list.
- Maintaining Context: To retain context across segments of a split file, the script prompts the model with the transcript from the prior segment, which improves transcription accuracy. The variable previous_transcript holds the transcription of the last chunk; it is passed as the prompt when the next chunk is transcribed, and after each chunk is processed it is updated with that chunk's transcription.
- Function Breakdown: split_audio_file splits an audio file into 10-minute chunks. transcribe_chunk transcribes an audio chunk, optionally guided by a prompt. main processes each audio file in a designated directory: files larger than 25 MB are split, transcribed chunk by chunk, and the chunk transcriptions concatenated; smaller files are transcribed directly. The final transcription is saved as a .txt file in the same directory.
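As one example of a more refined splitting technique, pydub can cut on stretches of silence rather than at fixed intervals, so chunks end at natural pauses instead of mid-sentence. A minimal sketch; the silence threshold and minimum silence length are assumptions you would tune to your recordings:

from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_mp3("recording.mp3")
# Cut wherever the audio stays below -40 dBFS for at least one second,
# keeping half a second of silence at each boundary. Both values are
# assumptions; tune them to your material.
chunks = split_on_silence(audio, min_silence_len=1000, silence_thresh=-40, keep_silence=500)

Note that silence-based chunks vary in length, so you would still need to check each chunk against the 25 MB limit before sending it to the API.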
Conclusion
With the steps and script above, you can efficiently transcribe large volumes of audio files using OpenAI's Whisper ASR API. Keep an eye on API costs, and adjust your chunking strategy to match the sizes of your audio files.