How to Embed Data into Pinecone using OpenAI and LangChain

In the previous post, “How to transcribe audio files using OpenAI API”, I went through the process of transcribing audio files as a preparation step for creating an AI chatbot that can answer questions based on many podcast episodes.

In this post, I will go through the next step of data preparation – embedding vector data into a vector store.

Understanding Vector Stores and Vector Embeddings

When it comes to AI chatbots, especially those capable of retrieving and providing detailed information on a specific topic, the concepts of vector stores and vector embeddings become fundamental. To elucidate these concepts, let’s walk through the process of how an AI chatbot operates using vector stores like Pinecone and vector embeddings provided by OpenAI.

Vector Embeddings

Vector embeddings are a representation of text in the form of vectors, which are essentially arrays of numbers. These numerical representations capture the semantic meaning of words, sentences, or even entire documents like podcast transcripts. In AI, particularly in natural language processing, embeddings transform complex, nuanced language into a format that machines can understand and process.

By using embeddings from OpenAI, transcripts of podcasts are converted into a mathematical space where each point (i.e., each embedding) represents a segment of the transcript. These points cluster together in this space if their contents are semantically similar, much like how books on similar topics might be found close to each other on a library shelf.
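As a quick illustration (this sketch is not part of the notebook; the example sentences and the use of numpy are my own), you can embed a few sentences and compare them with cosine similarity. The two sleep-related sentences should score noticeably closer to each other than either does to the unrelated one.

from langchain.embeddings.openai import OpenAIEmbeddings
import numpy as np

embeddings = OpenAIEmbeddings()  # requires OPENAI_API_KEY to be set
a, b, c = embeddings.embed_documents([
    "How can I fall asleep faster?",
    "Tips for getting better sleep at night",
    "The history of the Roman Empire",
])

def cosine(u, v):
    u, v = np.array(u), np.array(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # related pair: higher similarity
print(cosine(a, c))  # unrelated pair: lower similarity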

Vector Stores

A vector store, such as Pinecone, is like a highly efficient and specialized library for these vectors. It doesn’t just store them; it arranges them in such a way that makes it easy to find the most relevant vectors quickly. Pinecone, specifically, is designed for scalable vector search, ensuring that even as the library of podcast embeddings grows, users can still find the most relevant podcast segments in response to their queries promptly.
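Under the hood, a query against a vector store is a nearest-neighbor search: embed the question, then ask the index for the k closest stored vectors. Here is a rough conceptual sketch with the v2 Pinecone client, where the credentials, index name, and query are illustrative placeholders:

import pinecone
from langchain.embeddings.openai import OpenAIEmbeddings

pinecone.init(api_key="...", environment="...")  # placeholder credentials
index = pinecone.Index("my-index")               # placeholder index name

# Embed the question, then fetch the 3 most similar stored vectors
query_vector = OpenAIEmbeddings().embed_query("How to relax?")
result = index.query(vector=query_vector, top_k=3, include_metadata=True)
for match in result.matches:
    print(match.score, match.metadata)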

Code Overview

The provided notebook includes code that:

  • Sets environment variables for API keys from a keys.txt file and user input.
  • Loads documents and splits them into chunks using LangChain's text splitter.
  • Embeds text using the OpenAI API.
  • Stores embeddings in Pinecone, a vector database for similarity search.
  • Tests the database with a QA example.
  • Adds more transcripts to an existing Pinecone index.

LangChain

LangChain is an open-source framework designed to help developers integrate large language models (LLMs) with external components, facilitating the creation of AI and machine learning applications. It is especially tuned for crafting applications that capitalize on the power of natural language processing (NLP).

With LangChain, developers can connect robust LLMs such as OpenAI’s GPT-3.5 and GPT-4 to various external data sources, thereby enhancing the capabilities of NLP applications. The framework offers packages in Python, JavaScript, and TypeScript.

Dependencies

Make sure you have the following libraries installed.

pip install openai langchain pinecone-client
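Note that these libraries evolve quickly. The snippets below were written against the pre-1.0 LangChain API and the v2 Pinecone client (the one that still exposes pinecone.init and pinecone.list_indexes). If newer releases break the imports, pinning older versions should help; the exact pins here are my assumption rather than tested bounds.

pip install "openai<1" "langchain<0.1" "pinecone-client<3"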

Handling API keys

Make sure you have an OpenAI API key, a Pinecone API key, and your Pinecone environment name.

For simplicity, the API keys can be stored in a file and then read by the Python script. This is fine for testing, but it is not a safe way to handle API keys in production.

# keys.txt
PINECONE_API_KEY = XXXX-XXXX-XXXX
PINECONE_ENV = XXXX-XXXX-XXXX
OPENAI_API_KEY = XXXX-XXXX-XXXX

The keys are then read and stored as environment variables.

import os

def set_env_variables_from_file(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first '=' only, and strip the surrounding
            # whitespace so "KEY = value" lines parse correctly
            key, value = line.split('=', 1)
            os.environ[key.strip()] = value.strip()

set_env_variables_from_file('keys.txt')

Another way to set the API keys as environment variables is to prompt the user for them.

import os
import getpass

os.environ["PINECONE_API_KEY"] = getpass.getpass("Pinecone API Key:")
os.environ["PINECONE_ENV"] = getpass.getpass("Pinecone Environment:")
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

Loading text files

First, load all transcripts from a single text file, then split the text into chunks.

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader("transcripts.txt")
documents = loader.load()

length_function = len

# The default list of split characters is [\n\n, \n, " ", ""]
# Tries to split on them in order until the chunks are small enough
# Keep paragraphs, sentences, words together as long as possible
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=1000, 
    chunk_overlap=100,
    length_function=length_function,
)

docs = splitter.split_documents(documents)
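A quick sanity check (my addition, not part of the original notebook) is to print how many chunks the split produced and peek at the first one:

print(f"Split {len(documents)} document(s) into {len(docs)} chunks")
print(docs[0].page_content[:200])  # preview the first chunk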

Alternatively, load the transcripts from multiple text files in a single folder, then split the texts into chunks.

from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_loader_kwargs={'autodetect_encoding': True}
loader = DirectoryLoader('./transcripts', glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
documents = loader.load()

length_function = len

# The default list of split characters is [\n\n, \n, " ", ""]
# Tries to split on them in order until the chunks are small enough
# Keep paragraphs, sentences, words together as long as possible
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=1000, 
    chunk_overlap=100,
    length_function=length_function,
)

docs = splitter.split_documents(documents)

Next, embed the chunks and store the embeddings in Pinecone.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
import pinecone

embeddings = OpenAIEmbeddings()

# initialize pinecone
pinecone.init(
    api_key=os.getenv("PINECONE_API_KEY"),  # find at app.pinecone.io
    environment=os.getenv("PINECONE_ENV"),  # next to api key in console
)

index_name = "aichatbot-alex"

# First, check if our index already exists. If it doesn't, we create it.
# The OpenAI embedding model `text-embedding-ada-002` uses 1536 dimensions.
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=1536,
    )

docsearch = Pinecone.from_documents(docs, embeddings, index_name=index_name)
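To verify that the vectors actually landed in the index, you can check the index statistics with the v2 client; this check is my addition, not part of the original notebook.

# total_vector_count should roughly match len(docs)
print(pinecone.Index(index_name).describe_index_stats())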

Now, test the database with a sample query.

index_name = "aichatbot-alex"
embeddings = OpenAIEmbeddings()

docsearch = Pinecone.from_existing_index(index_name, embeddings)

query = "How to relax?"
docs = docsearch.similarity_search(query)

print(docs[0].page_content)
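The similarity search above returns raw transcript chunks. To turn those chunks into an actual answer, you can put a question-answering chain on top of the retriever. Here is a minimal sketch using the pre-1.0 LangChain API; the chain setup is my illustration, not part of the original notebook.

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),  # temperature 0 for more deterministic answers
    chain_type="stuff",         # stuff all retrieved chunks into one prompt
    retriever=docsearch.as_retriever(),
)

print(qa.run("How to relax?"))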

Adding More Transcripts to an Existing Index

The following code loads all text files from a specified directory and splits them into chunks.

from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_loader_kwargs={'autodetect_encoding': True}
loader = DirectoryLoader('./transcripts2', glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
documents = loader.load()

length_function = len

# The default list of split characters is [\n\n, \n, " ", ""]
# Tries to split on them in order until the chunks are small enough
# Keep paragraphs, sentences, words together as long as possible
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=1000, 
    chunk_overlap=100,
    length_function=length_function,
)

docs = splitter.split_documents(documents)

Then embed the new chunks and add them to the existing Pinecone index.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
import pinecone

embeddings = OpenAIEmbeddings()

# initialize pinecone
pinecone.init(
    api_key=os.getenv("PINECONE_API_KEY"),  # find at app.pinecone.io
    environment=os.getenv("PINECONE_ENV"),  # next to api key in console
)

index_name = "aichatbot-alex"

vectorstore = Pinecone.from_existing_index(index_name, embeddings)

vectorstore.add_documents(docs)
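One caveat: add_documents happily inserts duplicates if you re-run it on the same files. A possible workaround (my suggestion, not from the original notebook) is to pass deterministic ids derived from the chunk text, so re-inserting an identical chunk overwrites it instead of duplicating it:

import hashlib

# Hash each chunk's text into a stable id (an illustration; any stable
# scheme works)
ids = [hashlib.md5(d.page_content.encode("utf-8")).hexdigest() for d in docs]
vectorstore.add_documents(docs, ids=ids)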

You can find this Jupyter Notebook on GitHub.

Conclusion

By leveraging vector embeddings and vector stores, AI chatbots can navigate through extensive information with astonishing efficiency and accuracy. They transform the way users interact with content, making it possible to access specific knowledge from a sea of data through a simple conversation. This is the power of AI at work, combining the subtleties of human language with the speed and precision of machine intelligence.
