In this guide, we will see how to build a chatGPT for your PDF documents, i.e. an AI that answers your questions based on a particular PDF document.

You could use this to ask questions about your textbooks, ebooks, or anything else, as long as it’s in PDF format.

We will be using Python, LangChain, FAISS, and the OpenAI GPT-3/3.5 API.

Let’s go.

The process to build a chatGPT for your PDF documents

These are the main steps we will follow to build a chatGPT for your PDF documents:

First, we will extract the text from a PDF document and process it to make it ready for the next step.
Next, we will use an embedding model to create embeddings from this text.
Finally, we will build the query part, which takes the user’s question, uses the embeddings created from the PDF document to find the relevant text, and uses the GPT-3/3.5 API to answer the question.

Requirements to build a chatGPT for your PDF documents

We will be using the OpenAI GPT-3/3.5 API for this. Grab your API key from your OpenAI account.
Python 3 installed on your computer.

Install Python packages

First, install the necessary Python packages. Depending on your Python installation, you can use pip install <package> or python -m pip install <package>. Run these from your command line.

The Python packages you need to install are:

PyPDF2
langchain
openai
faiss-cpu
python-dotenv
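
Before moving on, it can help to confirm that the packages are importable. The snippet below is a minimal sketch using only the standard library. The helper name missing_packages and the REQUIRED mapping are made up for this guide. The mapping is needed because some pip names differ from the module names you import (faiss-cpu is imported as faiss, and python-dotenv, which the import section relies on for load_dotenv, is imported as dotenv).

```python
from importlib.util import find_spec

# pip name -> module name used in "import ..." (hypothetical helper mapping)
REQUIRED = {
    "PyPDF2": "PyPDF2",
    "langchain": "langchain",
    "openai": "openai",
    "faiss-cpu": "faiss",
    "python-dotenv": "dotenv",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Still to install:", ", ".join(missing))
else:
    print("All packages found.")
```

Anything listed in the output can be installed with pip install <package> as described above.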

Set up your working directory/folder

Create a new directory or folder, create a .env file inside it, and write the text below into the file.

OPENAI_API_KEY=your-openai-api-key

Make sure to replace the text your-openai-api-key with your actual OpenAI API key.

Import the required Python packages

You can do this in a Jupyter Notebook / Google Colab Notebook or in a Python .py file on your computer.

Make sure it’s in the same folder as the .env file you created above.

# import the modules
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
# load the .env file so OPENAI_API_KEY is available as an environment variable
from dotenv import load_dotenv
load_dotenv()
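
Once load_dotenv() has run, the key should be visible as an environment variable. A quick check like the one below can catch a missing .env file or a forgotten placeholder early. This is only a sketch: the function name require_api_key is made up for this guide, and it just reads the environment; it does not verify the key with OpenAI.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise if it is unusable."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Reject an empty value and the placeholder text from the .env example
    if not key or key == "your-openai-api-key":
        raise RuntimeError(f"{name} is not set. Check your .env file.")
    return key


# Example with an explicit mapping; call require_api_key() with no
# arguments to check the real environment instead.
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```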

Process the PDF

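We start by reading in the PDF document.

reader = PdfReader('my_pdf_doc.pdf')

# collect the text of every page into one string
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    # skip pages with no extractable text (for example, scanned images)
    if text:
        raw_text += text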

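Next, we split the PDF contents into chunks.

text_splitter = CharacterTextSplitter(
    separator = "\n",        # split the text on line breaks
    chunk_size = 1000,       # maximum chunk length, measured by length_function
    chunk_overlap = 200,     # amount of text shared by neighbouring chunks
    length_function = len,   # measure length in characters
)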
texts = text_splitter.split_text(raw_text)
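
To see what chunk_size and chunk_overlap mean, here is a simplified sketch of overlapping chunks in plain Python. It is not how CharacterTextSplitter works internally (the real splitter breaks the text on the separator and then merges the pieces), and the function name chunk_with_overlap is made up for this guide. It only shows the idea that each chunk repeats the tail of the previous one, so a sentence cut at a chunk boundary still appears whole somewhere.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(sample)
print([len(c) for c in chunks])
# The last 200 characters of one chunk open the next chunk
print(chunks[0][-200:] == chunks[1][:200])
```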

Create embeddings

Now, it’s time to create embeddings from the text chunks we created above from the PDF document. Building the FAISS index sends each chunk to the embedding model and stores the resulting vectors.

embeddings = OpenAIEmbeddings()
docsearch = FAISS.from_texts(texts, embeddings)

We then save the index so that we do not need to create the embeddings again and again. The code below saves it to disk. However, you could also save the embeddings to various vector databases covered here.

# writes the index and the text chunks to a folder named faiss_index
docsearch.save_local("faiss_index")

Query the PDF document using the embeddings

First, we load the saved index.

# the embeddings object is still needed to embed each new query
# (newer LangChain releases also require allow_dangerous_deserialization=True here)
new_docsearch = FAISS.load_local("faiss_index", embeddings)

There are two ways to query the PDF document using embeddings.

The method below lists the most similar chunks that might contain the answer to the query.

query = "Your query here"
docs = new_docsearch.similarity_search(query)
print(docs[0].page_content)
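
Under the hood, a similarity search compares the embedding of the query with the embedding of every chunk and returns the closest ones. The sketch below shows the idea with tiny made-up vectors and cosine similarity. It is only an illustration: real embeddings have many more dimensions, FAISS uses its own optimised index, and the real index may use a different distance measure. The names cosine_similarity and top_k are made up for this guide.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=4):
    """Return the indexes of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional "embeddings" for three chunks and one query
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [1.0, 0.1]
print(top_k(query_vector, chunk_vectors, k=2))
```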

Another way to query is to use the chunks found through the embeddings to build a prompt, and then use an LLM such as GPT-3 to answer the question directly.

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

# "stuff" puts all the retrieved chunks into a single prompt
chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")
answer = chain.run(input_documents=docs, question=query)
print(answer)
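
The "stuff" chain type simply places all the retrieved chunks into one prompt together with the question. A rough sketch of that idea in plain Python looks like this. The prompt wording and the function name build_stuff_prompt are made up for this guide and are not the template LangChain actually uses.

```python
def build_stuff_prompt(chunks, question):
    """Join the retrieved chunks into one context block and add the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
print(build_stuff_prompt(example_chunks, "Your query here"))
```

Because every chunk goes into a single prompt, this approach only works while the chunks and the question fit within the model’s context limit.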

Conclusion

You can use this technique for all kinds of text data beyond just PDFs. You can also use the techniques explained here to turn this into a web-based knowledge retrieval system.

Also, here is the complete code used in this guide.

HarishGarg.com

Easiest Guide to build a chatGPT for your PDF documents using GPT-3/3.5