Easiest Guide to build a ChatGPT for your PDF documents using GPT-3/3.5

Build a ChatGPT for your PDF documents

In this guide, we will see how to build a ChatGPT for your PDF documents, i.e. an AI that answers your questions based on a particular PDF document.

You could use this to ask questions about your textbooks, ebooks, or anything else, as long as it’s in PDF format.

We will be using Python with the PyPDF2, langchain, openai, and faiss-cpu packages.

Let’s go.

The process to build a ChatGPT for your PDF documents

These are the main steps we will follow to build a ChatGPT for your PDF documents:

  1. First, we will extract the text from a PDF document and process it so it is ready for the next step.
  2. Next, we will use an embedding AI model to create embeddings from this text.
  3. Finally, we will build the query part, which takes the user’s question, uses the embeddings created from the PDF document to find the relevant text, and uses the GPT-3/3.5 API to answer the question.
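
To make the three steps concrete, here is a minimal end-to-end sketch that runs with only the Python standard library. It is a toy under stated assumptions, not the setup used in this guide: `toy_embed` is a bag-of-words stand-in for a real embedding model, the sorted cosine search stands in for a vector store, and `build_prompt` is a hypothetical helper that only assembles the text you would send to the GPT-3/3.5 API (no API call is made).

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble the text that would be sent to the language model."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Step 1 (already done here): text extracted from a document and split into chunks.
chunks = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The French Revolution began in 1789.",
    "Chlorophyll absorbs light, mostly in the blue and red wavelengths.",
]

# Steps 2 and 3: embed, find the closest chunk, and build the prompt.
question = "How do plants use light energy?"
prompt = build_prompt(question, retrieve(question, chunks, k=1))
print(prompt)
```

The point of the sketch is the shape of the pipeline: only the chunks closest to the question are placed in the prompt, so the model answers from the document rather than from memory.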

Requirements to build a ChatGPT for your PDF documents

  1. We will be using the OpenAI GPT-3/3.5 API for this. Grab your API key from your OpenAI account.
  2. Python 3 installed on your computer.

Install Python packages

First, install the necessary Python packages. Depending on your Python installation, you can use pip install <package> or python -m pip install <package>. Run these from your command line.

The Python packages you need to install are:

  • PyPDF2
  • langchain
  • openai
  • faiss-cpu
  • python-dotenv
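
If you want to confirm the installation before going further, a small check like the one below can help. It is only a sketch: `missing_packages` is a hypothetical helper, and the mapping reflects the module name each package is imported under (for example, faiss-cpu is imported as `faiss`). python-dotenv is included because the import code in this guide uses `from dotenv import load_dotenv`.

```python
import importlib.util

# Maps each pip package name to the module name it is imported under.
REQUIRED = {
    "PyPDF2": "PyPDF2",
    "langchain": "langchain",
    "openai": "openai",
    "faiss-cpu": "faiss",
    "python-dotenv": "dotenv",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Install these first: python -m pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```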

Set up your working directory/folder

Create a new directory or folder, create a .env file inside it, and write the text below into the file:

OPENAI_API_KEY=your-openai-api-key

Make sure to replace the text your-openai-api-key with your actual OpenAI API key.

Import the required Python packages

You can do this in a Jupyter Notebook / Google Colab notebook or in a Python .py file on your computer.

Make sure it’s in the same folder as the .env file you created above.

# import the modules
import os

from dotenv import load_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from PyPDF2 import PdfReader

# load OPENAI_API_KEY from the .env file into the environment
load_dotenv()
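
A common stumbling block is that the key never reaches the environment, for example because the .env file is in a different folder. A quick check right after load_dotenv() can catch this early. The sketch below assumes the variable name OPENAI_API_KEY from the .env file above; `get_api_key` is a hypothetical helper, not part of any library.

```python
import os


def get_api_key(env=os.environ):
    """Return the OpenAI API key, or raise if it is missing or still the placeholder."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == "your-openai-api-key":
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Check that the .env file is in the "
            "same folder as this script and contains your real key."
        )
    return key
```

Calling get_api_key() once after load_dotenv() fails fast with a readable message instead of a less obvious authentication error later on.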

Process the PDF

We start by reading in the PDF document.
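
Reading the PDF and preparing the text could look roughly like the sketch below. The PDF part uses PyPDF2's PdfReader, which is imported earlier in this guide. The chunking part is a plain-Python sliding window used here as a stand-in so the idea is visible; the CharacterTextSplitter imported earlier is meant for the same job. The chunk size of 1000 characters and overlap of 200 are assumptions, not values from this guide, and both function names are hypothetical.

```python
def extract_pdf_text(pdf_path):
    """Concatenate the text of every page in the PDF at pdf_path."""
    # Imported here so the chunking helper below works without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    pages = []
    for page in reader.pages:
        text = page.extract_text()
        if text:  # scanned or image-only pages may yield no text
            pages.append(text)
    return "\n".join(pages)


def split_into_chunks(text, chunk_size=1000, overlap=200):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With a file of your own (the name my-document.pdf is just a placeholder), usage would be raw_text = extract_pdf_text("my-document.pdf") followed by chunks = split_into_chunks(raw_text). The overlap keeps a sentence that straddles a chunk boundary available in both neighbouring chunks.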

Create embeddings

Now it’s time to create embeddings from the text chunks we created above from the PDF document.
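
A minimal sketch of this step, using the OpenAIEmbeddings and FAISS classes imported earlier, might look like this. It assumes the import paths shown in this guide (newer langchain releases have moved these modules, so the paths may need adjusting) and that OPENAI_API_KEY is already in the environment. `build_index` and `top_chunks` are hypothetical helper names. Note that building the index sends your text to the OpenAI API.

```python
def build_index(chunks):
    """Embed each chunk with OpenAI and store the vectors in a FAISS index."""
    # Imported lazily; these are the same modules the guide imports earlier.
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import FAISS

    embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
    return FAISS.from_texts(chunks, embeddings)


def top_chunks(index, question, k=4):
    """Return the text of the k chunks closest to the question."""
    docs = index.similarity_search(question, k=k)
    return [doc.page_content for doc in docs]
```

Given a list of text chunks, index = build_index(chunks) creates the searchable store, and top_chunks(index, "your question here") returns the passages most relevant to a question. Those passages are what gets handed to the GPT-3/3.5 API as context.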

Conclusion

You can use this technique for all kinds of text data beyond just PDFs. You can also use the techniques explained here to build a web-based knowledge retrieval system.

Also, here is the complete code used in this guide.