What Are Vector Databases And Why Do We Need Them?

The more time that goes by, the more I hear about people using ChatGPT for their work. Most folks who have used it have told me their mind has been blown by what it can do, and that it's saving them lots of time in their jobs. It's truly changing the way we work.

But even with how powerful and useful it is today, we're still only in the beginning stages of how people will use it in the future. Today, most people use ChatGPT through its website, and while that works well for a large number of use cases, it falls short when you need to search through a large knowledge base or document store. ChatGPT only allows you to write a set number of words in its chat window, generally a few thousand words at a time. It does let you paste multiple messages, and it uses those messages when responding to your prompts, but it isn't feasible to copy and paste a large number of documents into ChatGPT's chat box when you're trying to quickly find information that you know is in one of your documents but don't know which one.

Contextual information retrieval

One of the use cases we're solving for at Locusive is something we're calling "contextual information retrieval." The idea is simple: you have a question that can be answered by information you already have stored somewhere (for now, let's call it a document store). You ask that question to a system that has access to all your data, and the system answers using the knowledge in your document store, optionally citing the document it used. This is a fairly common use case that we come across, but it's not one that ChatGPT can handle out of the box.

Since ChatGPT was trained on public data, it has built-in knowledge about a wide variety of topics, but it doesn't have access to, nor was it trained on, your business's specific information. To leverage ChatGPT to answer your question, you need to pass it relevant context about that question so that it can provide an intelligent answer. In addition, because you can only pass in a few thousand words at a time, you need to make sure that the context you provide has a high probability of containing the answer, because if you don't give it the right details, it will either make something up (known as a hallucination) or tell you it can't answer your question at all. This is where vector databases come in.

Identifying potentially relevant content

At the highest level, vector databases store textual content in a way that makes it easy to quickly look up stored data that is similar to a question or query you have. This means that if you've got a large store of data that you want to use to answer questions, you can store that data in a vector database and use it to find documents (or even paragraphs, sentences, or words) that are likely to contain the answer to your question. Whenever you have a new question that you need ChatGPT to answer, you can first use a vector database to find the documents that might contain the answer, and then feed those documents, along with your original question, into ChatGPT to get your final answer.
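The retrieve-then-ask flow described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `TinyStore`, `build_prompt`, and the sample documents are hypothetical names made up for this example, and the word-overlap score is a crude stand-in for the embedding similarity a real vector database would compute. The resulting prompt is what you would send to ChatGPT; no model is actually called here.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, text):
    """Word overlap (Jaccard) as a stand-in for real vector similarity."""
    q, t = tokenize(query), tokenize(text)
    union = q | t
    return len(q & t) / len(union) if union else 0.0


class TinyStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.docs = {}  # doc_id -> text

    def add(self, doc_id, text):
        self.docs[doc_id] = text

    def search(self, query, top_k=2):
        """Return the top_k (score, doc_id, text) tuples, best first."""
        scored = [(overlap_score(query, text), doc_id, text)
                  for doc_id, text in self.docs.items()]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


def build_prompt(question, hits):
    """Combine the retrieved documents and the question into one prompt."""
    context = "\n\n".join(f"[{doc_id}] {text}" for _, doc_id, text in hits)
    return ("Answer using only the context below and cite the document.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")


store = TinyStore()
store.add("refunds.txt", "Refunds are issued within 14 days of a return request.")
store.add("shipping.txt", "Standard shipping takes 5 business days.")
store.add("hiring.txt", "Interviews are scheduled by the recruiting team.")

question = "How many days do refunds take?"
hits = store.search(question, top_k=2)
prompt = build_prompt(question, hits)  # this is what would be sent to ChatGPT
```

Only the top few documents end up in the prompt, which is how this approach stays within the limit on how much text you can pass in at once.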

Using this strategy, you can get around ChatGPT's limitation on the number of words you can feed it when you need to ask a question. Vector databases are a powerful tool that businesses can use to increase their efficiency and productivity when searching for new information, but they do come with a few caveats that make them harder to use for some.

API interface

Today, vector databases require you to store and retrieve documents via an API. This means that you either need to use a pre-existing product that plugs into your ChatGPT workflow, or you need an engineering team that can build a system for you. That system has to store (or index) your documents inside a vector database, query that database whenever you have a new question, invoke ChatGPT with the results (alongside your original question), and then return the final answer to you. Vector databases aren't meant to be used within a chatbot window on the ChatGPT website (at least as they stand today). They're primarily designed to be used as a component of an application. They work in the background as you ask your questions, and you as an end user should never really know that they even exist, unless, of course, you're curious about software.

Scoring

Also, unlike ChatGPT, vector databases aren't amazing at picking out the documents that best match the semantic intent of a query, at least not every time. While they're pretty good at identifying documents that probably contain the information you need, they don't analyze your question and reason about which documents are best for it. Rather, they work with embeddings: your documents are turned into vectors of numbers in an intelligent and meaningful way, the same is done for your query, and the database then finds the documents whose embeddings are most similar to the embedding of your query.
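To make "most similar" concrete, here is a minimal sketch that ranks documents by cosine similarity, one common way to compare embeddings. The three-dimensional vectors and document names below are invented for illustration; real embeddings come from an embedding model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up embeddings, purely for illustration.
doc_embeddings = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.2, 0.8, 0.1],
    "holiday_party": [0.0, 0.1, 0.9],
}
query_embedding = [0.8, 0.3, 0.0]

# Rank document names from most to least similar to the query.
ranked = sorted(
    doc_embeddings,
    key=lambda name: cosine_similarity(query_embedding, doc_embeddings[name]),
    reverse=True,
)
```

Notice that the ranking is purely geometric: the closest vector wins, whether or not that document actually answers the question.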

That's a lot of technobabble to say that while vector databases are good, they may not always return the right documents for your query. This means that any application you build using a vector database should be robust enough to handle multiple documents at once, and should be able to iterate through those documents, with the help of ChatGPT, to find the answer you need.
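One way to iterate through several candidate documents might look like the sketch below. Everything here is an assumption made for illustration: `answer_from_candidates` is a hypothetical helper, the `NOT_FOUND` sentinel is an invented convention, and `fake_llm` is a stub standing in for a real ChatGPT call so that the example runs on its own.

```python
NOT_FOUND = "NOT_FOUND"  # invented sentinel the model is asked to reply with


def answer_from_candidates(question, candidates, ask_llm):
    """Try each (doc_id, text) candidate in order until one yields an answer."""
    for doc_id, text in candidates:
        prompt = (
            "Using only this document, answer the question. "
            f"If the document does not contain the answer, reply {NOT_FOUND}.\n\n"
            f"Document:\n{text}\n\nQuestion: {question}"
        )
        reply = ask_llm(prompt).strip()
        if reply != NOT_FOUND:
            return doc_id, reply
    return None, None  # no candidate contained the answer


def fake_llm(prompt):
    """Stub for a real model call: 'answers' only if the refund text is present."""
    if "within 14 days" in prompt:
        return "Refunds are issued within 14 days."
    return NOT_FOUND


candidates = [
    ("shipping.txt", "Standard shipping takes 5 business days."),
    ("refunds.txt", "Refunds are issued within 14 days of a return request."),
]
source, answer = answer_from_candidates(
    "How long do refunds take?", candidates, fake_llm
)
```

Returning the document id alongside the answer also gives you the citation for free. In a real application you would swap `fake_llm` for a call to the model and might batch several documents per prompt instead of sending one at a time.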

Vector databases provide a score for how well each result matches your query. Pinecone, one of the leading vector databases today, provides a score between 0 and 1, and we've found that, in general, results scoring above 0.75 or 0.8 have a high probability of containing information that matches your request. So if you're building an application that leverages vector databases, it's important to put a threshold on the score you get back, because by default Pinecone returns the top documents up to a total limit, regardless of score.
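Applying such a threshold is a one-line filter once you have the results in hand. The sketch below assumes each match is a plain dictionary with `id` and `score` keys; that shape is a simplification for illustration, not Pinecone's exact response format, and the 0.75 default is just the rule of thumb mentioned above.

```python
def filter_by_score(matches, min_score=0.75):
    """Keep only the matches whose similarity score meets the threshold."""
    return [m for m in matches if m["score"] >= min_score]


# Hypothetical query results, already sorted best-first by the database.
matches = [
    {"id": "refunds.txt", "score": 0.91},
    {"id": "shipping.txt", "score": 0.78},
    {"id": "hiring.txt", "score": 0.42},
]

relevant = filter_by_score(matches)
relevant_ids = [m["id"] for m in relevant]
```

If the filtered list comes back empty, it is usually better for the application to say it couldn't find anything than to pass weak matches to ChatGPT.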

Updating data

Another issue with vector databases is keeping the data in the vector store up to date. Vector databases aren't inherently linked to any of your source data; you as a developer must insert and query documents yourself, using code. So when one of your source documents changes, it's on you to update the relevant vector in your vector database with the new content. In practice, you'll need a separate system either for tracking changes or for re-indexing data on a schedule.
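A simple way to track changes is to store a hash of each document's content at indexing time and compare it against the current source later. The sketch below is one possible approach, not a prescribed one: `plan_reindex` and the sample data are hypothetical, and the actual re-embedding and upserting into the vector database is left out.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(source_docs, indexed_hashes):
    """Compare current sources against the hashes recorded at last indexing.

    source_docs: doc_id -> current text
    indexed_hashes: doc_id -> hash stored when the doc was last indexed
    Returns (ids to re-embed and upsert, ids to delete from the index).
    """
    to_upsert = [doc_id for doc_id, text in source_docs.items()
                 if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes
                 if doc_id not in source_docs]
    return sorted(to_upsert), sorted(to_delete)


# What the index believed at the last run.
indexed_hashes = {
    "a.txt": content_hash("old version of a"),
    "b.txt": content_hash("unchanged"),
    "c.txt": content_hash("document that was later removed"),
}
# What the source of truth looks like now.
source_docs = {
    "a.txt": "new version of a",
    "b.txt": "unchanged",
    "d.txt": "brand new document",
}

to_upsert, to_delete = plan_reindex(source_docs, indexed_hashes)
```

Running this on a schedule covers both options mentioned above: it detects changes, and it limits each re-indexing pass to the documents that actually need it.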

The future of information retrieval

As we start to see increased usage of ChatGPT, particularly as a component that's embedded within an application, we'll likely start to see vector databases becoming more and more ubiquitous. It's likely that they'll be a part of every major application in the next few years — everything from search engines to accounting systems to meme generators. The world of search and information retrieval is changing quickly, and vector databases are going to be a major player in every software industry for the foreseeable future.