Research & Opinions

How To Get ChatGPT To Answer Questions Using Your Trusted Documents

This article discusses the strengths and shortcomings of ChatGPT, a large language model by OpenAI, emphasizing the need for using trusted sources to avoid 'hallucinations' or factually incorrect statements. It offers a detailed explanation of concepts like hallucinations, trusted sources, contextual data, and context window. The article also delves into the use of vector databases for effective query management, challenges of time, cost, and effort in setting up such a system, and the potential future improvements in LLMs.
Shanif Dhanani
7.5 minutes

ChatGPT is one of the most brilliant software tools ever created. It's going to make you better, faster, and stronger. But if you're not careful, it can also make you look foolish.

ChatGPT is a large language model that was trained on the world's data... before 2021. It's brilliant and incredibly efficient at coming up with creative ideas, new content, and often, factual answers to your questions. But it's too creative for its own good, and is infamous for just making things up when it doesn't have enough information to answer you properly.

I've seen it cite websites that don't exist, provide fake news, and falsify facts. So while I think it's going to change the world, I think it's also important to make sure you stop it from hallucinating like this as much as possible.

That's why it's important to build up a library of trusted sources that contain reliable information, and make sure it cites those sources whenever possible.

What's a hallucination?

You may have never heard the term "hallucination" in the context of an LLM before. It's a relatively new term, but I bet it's going to get a lot more popular over the next 12 months. In the context of large language models, a "hallucination" is simply a false statement that's presented as true. If large language models were alive, we might say they were lying, but even that wouldn't be quite accurate. It would be more accurate to say that the model is doing its best to complete a thought by extrapolating what it thinks makes sense to say next.

In any case, a hallucination is a factually incorrect statement, and hallucinations are particularly troublesome because they're delivered with the same confident tone as accurate answers. They frequently even come with false or made-up sources that could lead you to believe they're legitimate statements of fact. Hallucinations may be innocent enough if you're just using ChatGPT to talk like a pirate, but if you're using it for business productivity, they can cause real damage.

That's why it's important to ensure the answers you receive are well-sourced and reliable.

Trusted sources

Today, ChatGPT doesn't provide its sources of truth. There's a lot of work being done to fix this, but for now, the most common way to ensure you're getting reliable answers is to provide ChatGPT with contextual data from trusted sources that you've identified, and to ask it to use only those sources whenever possible. Taking it one step further, you can even ask ChatGPT to cite which trusted source it used when providing your answer.

Contextual data

You can think of ChatGPT as having a giant, imperfect memory. It was trained on a ton of digital documents, websites, online forums, and other digital sources. Because it's such a massive model, it can more-or-less remember what it has been trained on, and when you don't give it any contextual information to work off of, it tries to optimize its answer to you by recreating the most relevant sources it was trained on in a manner that most reasonably answers your prompt.

But it doesn't need to be that way.

You can prompt it to use specific sources of information when answering your question. A common way to do this is to first identify any sources that you think might be helpful in discussing some subject matter, copy and paste the content of those sources into ChatGPT, and then prompt ChatGPT to provide an answer using only that context.

This method works quite well at delivering relevant answers, and can even be used to identify which document was used to answer your question. But as always, there's a catch - the "context window."

Large language models can only consider a finite range of tokens when responding to a prompt. Said differently, you can only provide ChatGPT with so many words when you ask it a question. This means that if you have a huge document library of thousands of docs, you can't just take them all, copy/paste, and ask ChatGPT to answer your question. You need to selectively provide only those documents, and only those paragraphs within those documents, that have the highest likelihood of being able to answer your questions.

By selecting a small subset of the context within your trusted data sources that are most likely to have the best content, you optimize the use of the context window and the likelihood that ChatGPT will be able to find an answer within your documents.
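To make the idea concrete, here's a minimal sketch of selecting chunks to fit a context window. Token counts are approximated by word counts here (a real system would use the model's actual tokenizer), and the scoring function is a deliberately crude word-overlap stand-in for real relevance ranking, not any official API:

```python
def score(chunk: str, question: str) -> int:
    """Toy relevance score: how many words the chunk shares with the question."""
    question_words = set(question.lower().split())
    return sum(1 for word in chunk.lower().split() if word in question_words)

def select_chunks(chunks: list[str], question: str, token_budget: int = 3000) -> list[str]:
    """Pick the highest-scoring chunks that fit within the token budget."""
    ranked = sorted(chunks, key=lambda c: score(c, question), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        cost = len(chunk.split())  # crude stand-in for a token count
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected
```

The important design point is the budget: no matter how many trusted documents you have, only the best-scoring slices of them get forwarded to the model.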

Vector databases

In order to make this all work, you'll need a way of taking a user's query, finding the documents that are most likely to contain answers to it, and sending the content of those documents to ChatGPT to answer the question. The hard part lies in finding documents that might match a user's query. That's where vector databases come in.
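The overall flow can be sketched in a few lines. Both helpers here are hypothetical stand-ins: `retrieve` uses trivial word overlap where a real system would query a vector database, and `ask_llm` is a placeholder for a call to the OpenAI API:

```python
def retrieve(query: str, documents: dict[str, str], top_k: int = 2) -> list[str]:
    """Toy retriever: rank document names by word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda name: len(query_words & set(documents[name].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def answer(query: str, documents: dict[str, str], ask_llm) -> str:
    """Find likely-relevant documents, then ask the model to answer from them."""
    names = retrieve(query, documents)
    context = "\n\n".join(f"[{name}]\n{documents[name]}" for name in names)
    prompt = f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"
    return ask_llm(prompt)
```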

A vector database is a database that can store a vector - a list of numbers that represents some piece of text. The vector has a mathematical structure that captures the semantic meaning of a word or group of words. That means the vectors for two words that are semantically related will be mathematically closer than the vectors for two words with entirely different meanings. By representing your documents as vectors and storing them in a vector database, you can easily identify sentences, paragraphs, and even entire documents whose "meaning" is similar to the meaning implied by a user's question.
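Here's a tiny illustration of that "closeness" property using made-up 3-dimensional vectors and cosine similarity, one common distance measure. Real embeddings (from OpenAI's embedding models, for example) have hundreds or thousands of dimensions, but the math is the same:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Near 1.0 means very similar meaning; near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend these vectors came from an embedding model:
dog = [0.9, 0.1, 0.0]
puppy = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# "dog" and "puppy" point in nearly the same direction; "invoice" doesn't.
assert cosine_similarity(dog, puppy) > cosine_similarity(dog, invoice)
```

A vector database does essentially this comparison, at scale, against every stored chunk of your documents.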

Once you have those documents, you can feed their associated text, along with the user's question, into ChatGPT to get a final answer. Additionally, you can prompt ChatGPT to answer only using the context you've provided; if the answer is not in the documents you've given it, you can try again with new paragraphs and new documents, or have it fall back to its built-in knowledge if you need to.
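A sketch of the kind of prompt this describes. The exact wording is illustrative rather than canonical; the key ideas are to inline the retrieved text, ask the model to answer only from it and cite its source, and give it an explicit way to say the answer isn't there (which is what lets you retry with different documents):

```python
def build_prompt(question: str, sources: dict[str, str]) -> str:
    """Assemble retrieved source text and a question into a grounded prompt."""
    context = "\n\n".join(f"Source: {name}\n{text}" for name, text in sources.items())
    return (
        "Answer the question using ONLY the sources below. "
        "Cite the source you used. If the answer is not in the sources, "
        "reply exactly: NOT_FOUND.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```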

Time, cost, and effort

The unfortunate part about this whole process is that it takes time and effort to put this system together, and it requires you to know how to code. A vector database is just another type of data lookup tool, like a relational database or a data warehouse, and you'll need to interact with it using a software development kit provided by the makers of the tool. Right now, one of the most popular tools out there is Pinecone, which provides SDKs for Node.js, Java, and Python, among others. It also has a free tier that should be enough to get you started. You'll need to use your OpenAI API key to turn each trusted document you have into chunks with associated vectors, but that's straightforward to do using OpenAI's library.

It's relatively easy to get up and running, and it doesn't take a lot of time, but like with any software project, the complexity, time, and cost grow as you scale. As you get more users, you'll need to make sure you provide proper namespaces so that Pinecone can minimize the number of documents it needs to search through, and you'll need to scale your vector databases up as you get more and more indexed documents. You'll likely still need to maintain a relational database to keep track of which documents have been indexed, along with the chunk IDs for each chunk that's associated with each document (since you'll generally want to break apart your document into chunks of 500-1,000 tokens).
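The chunking and ID bookkeeping described above can be sketched like this. Tokens are approximated by words here (a real pipeline would chunk by the model's tokenizer), and IDs of the form `doc-42-chunk-0` are just an invented convention for tying vector-database entries back to rows in your relational database, not a Pinecone requirement:

```python
def chunk_document(doc_id: str, text: str, chunk_size: int = 500) -> list[tuple[str, str]]:
    """Split a document into fixed-size chunks, returning (chunk_id, chunk_text) pairs.

    The chunk IDs are what you'd store in your relational database so you
    can track which chunks belong to which indexed document.
    """
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunk_id = f"{doc_id}-chunk-{i // chunk_size}"
        chunks.append((chunk_id, " ".join(words[i:i + chunk_size])))
    return chunks
```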

Larger context sources + wrapping up

It's likely that future versions of LLMs will solve this problem intrinsically. We've already seen that GPT-4 can provide a context window of up to 32K tokens. One day we might see a world where you can ask ChatGPT a question and it can provide its source without you having to input any source documents (for publicly available documents, at least). But until then, it's on us as LLM users to do our best to validate the answers we get from an LLM that is known to make things up.

If you're not looking to build your own system of vector databases, infrastructure, and ChatGPT APIs, you can always use Locusive's chatbot. Just set up a demo and then connect the apps and data you already use before downloading our chatbot for Slack.


Image by studiogstock on Freepik