In this guide, we will see how to build a ChatGPT for your PDF documents, i.e. an AI that answers questions based on a particular PDF document.
You could use this to ask questions about your textbooks, ebooks, or anything else, as long as it is in PDF format.
We will be using the OpenAI GPT-3/3.5 API, along with the LangChain, FAISS, and PyPDF2 Python libraries.
The process to build a ChatGPT for your PDF documents
These are the main steps we will follow to build a ChatGPT for your PDF documents:
- First, we will extract the text from a PDF document, process it, and make it ready for the next step.
- Next, we will use an embedding AI model to create embeddings from this text.
- Finally, we will build the query part that takes the user’s question, searches the embeddings created from the PDF document, and calls the GPT-3/3.5 API to answer that question.
Requirements to build a ChatGPT for your PDF documents
- We will be using the OpenAI GPT-3/3.5 API. Grab your API key from your OpenAI account.
- Python 3 installed on your computer.
Install Python packages
First, install the necessary Python packages. Depending on your Python installation, you can use `pip install <package>` or `python -m pip install <package>`. Run these from your command-line program.
Based on the imports used below, the Python packages you need to install are PyPDF2, langchain, openai, python-dotenv, and faiss-cpu, e.g. `pip install PyPDF2 langchain openai python-dotenv faiss-cpu`.
Setup your working directory/folder
Create a new directory or folder, create a .env file inside it, and write the text below into it.
Make sure to replace the text your-openai-api-key with your actual OpenAI API key.
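The exact file contents were not shown here; assuming the standard `OPENAI_API_KEY` variable name that the openai and langchain libraries read from the environment, the .env file would contain a single line:

```
OPENAI_API_KEY=your-openai-api-key
```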
Import the required Python packages
You can do this in a Jupyter Notebook / Google Colab notebook or a Python .py file on your computer.
Make sure it is in the same folder as the .env file you created above.
```python
# import the modules
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import ElasticVectorSearch, Pinecone, Weaviate, FAISS
import os

# load .env file
from dotenv import load_dotenv
load_dotenv()
```
Process the PDF
We start by reading in the PDF document and extracting the text from every page:

```python
# read in the PDF document
reader = PdfReader('my_pdf_doc.pdf')

# extract the text from each page and concatenate it
raw_text = ''
for i, page in enumerate(reader.pages):
    text = page.extract_text()
    if text:
        raw_text += text
```
Next, we split the raw text into overlapping chunks, so that each piece is small enough to embed and to fit in the model's context window:

```python
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
```
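To build intuition for why the chunks overlap, here is a minimal, illustrative sketch of fixed-size chunking with overlap in plain Python. This is my own simplification, not what CharacterTextSplitter actually does internally (the real splitter also breaks on the separator rather than cutting mid-line):

```python
def chunk_text(text, chunk_size=10, overlap=4):
    """Naive fixed-size chunking with overlap (illustration only)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# each chunk repeats the last `overlap` characters of the previous one,
# so content cut at a chunk boundary still appears whole in some chunk
print(chunk_text("abcdefghijklmnop"))
```

The overlap means a sentence that straddles a boundary is still fully contained in at least one chunk, which keeps answers from being split across retrieval results.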
Now, it’s time to create embeddings from the text chunks we created above from the PDF document.

```python
# create the OpenAI embeddings wrapper (reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()
```
We save the embeddings object to disk with pickle so we can reuse it later without recreating it:

```python
import pickle

with open("foo.pkl", 'wb') as f:
    pickle.dump(embeddings, f)
```
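For intuition, embedding search ranks chunks by how close their vectors are to the query's vector, typically measured by cosine similarity. Here is a minimal sketch with made-up 3-dimensional vectors (real OpenAI embeddings have far more dimensions, and FAISS does this lookup efficiently for us):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy vectors standing in for embedded text chunks
query_vec = [1.0, 0.0, 1.0]
chunk_vecs = {"chunk A": [0.9, 0.1, 0.8], "chunk B": [0.0, 1.0, 0.0]}

# the chunk whose vector points in the most similar direction is retrieved
best = max(chunk_vecs, key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]))
print(best)
```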
Query the PDF document using the embeddings
First, we load the saved embeddings:

```python
with open("foo.pkl", 'rb') as f:
    new_docsearch = pickle.load(f)
```
There are two ways to query the PDF document using the embeddings.
The first method lists the chunks most similar to the query, which are likely to contain the answer. We build a FAISS index from the text chunks using the loaded embeddings:

```python
docsearch = FAISS.from_texts(texts, new_docsearch)

query = "Your query here"
docs = docsearch.similarity_search(query)

# similarity_search returns a list of documents; print the closest match
print(docs[0].page_content)
```
The second method goes a step further and uses a question-answering chain with the GPT-3/3.5 API to generate an answer from the matching chunks:

```python
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

# temperature=0 makes the answers more deterministic and factual
chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")
chain.run(input_documents=docs, question=query)
```
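The two querying steps can be wrapped into one small helper. This is a sketch under my own naming (`answer_question` is not from the original guide); it assumes the `docsearch` and `chain` objects built above:

```python
def answer_question(query, docsearch, chain, k=4):
    """Retrieve the k most similar chunks, then let the chain answer from them."""
    docs = docsearch.similarity_search(query, k=k)
    return chain.run(input_documents=docs, question=query)
```

Called as `answer_question("Your query here", docsearch, chain)`, it returns the model's answer as a string.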
You can use this technique for all kinds of text data beyond just PDFs. You can also use the techniques explained here to turn this into a web-based knowledge retrieval system.
Also, here is the complete code used in this guide.