How To Build An Internal Search Engine With ChatGPT: A Complete Guide

Employees today have a huge number of data sources that they use to get things done, but one of the problems with having a lot of data is that it's not always easy to find what you're looking for when you need it. You might have some old Slack messages that can address a customer question now, product documentation that helps your engineering team figure out the specs for a new feature they're building this week, and FAQs in a customer support doc in Google Drive that your team needs to solve a ticket that comes in tomorrow. But all of this information is separated and hard to find. Your employees may not even know that it all exists, much less how to access it.

If you're really organized, you might have a data catalog or a directory of all of your key documents across your company, but more likely, each of your employees has their own way of figuring out what they need at any given point. There's probably also a lot of internal chatting, and even phone calls between colleagues, just to get answers to commonly asked questions.

With all of these different data sources and a growing number of new tools and services that you're onboarding every month, it will only get harder to find the answers, documents, and media files that you need at any given point in time. At some point, you'll likely need to create an internal search engine for your company, and integrating all of these tools and services with ChatGPT is one of the best ways to do so.

What's an internal search engine?

You've probably already used the search functionality in some of your existing tools to find what you need. Google Drive has a search feature, so does Slack, and so does almost every other tool that you've used. Some of them are decent and reliable, others are terrible. But it's likely that no single search feature is able to find what you need across all of your data sources and systems. An internal search engine can change that. A search engine across all your data sources and tools can provide a single place for you to find what you need, whether that be a concise answer to a well-defined question, a list of images that you used for your last ad campaign, or the documents you need to fill out your next legal policy. An internal search engine is a tool that helps you find what you need when you need it, so long as it already exists somewhere in your organization's tools and data sources.

The key difference between an internal search engine and something like Google is that your internal search engine has access to your private data and knows how to access it properly given your user's questions, whereas public search engines use publicly available sources to find the most likely websites, public files, and other publicly available documents that are likely to answer a question. Internal search engines can be integrated into existing tools, like a data catalog that you already have, or can be created as custom-built, specific applications for your company. In both cases, they require access to your most important and commonly used data sources to do their jobs well.

1. Integrating your data

The key to having an effective internal search engine is to ensure it has access to all of your data and data sources while simultaneously ensuring it respects permissions and access controls. It's important that these tools can pull in the right data that they need when a user makes a request, but not pull in data that users shouldn't have access to, even if that data could answer their questions. This could take a concerted effort and custom code, but is well worth the engineering effort.

Building a data source catalog

The first step in creating an internal search engine is to create a single source for listing all integrated tools and documents. This can be as simple as a database table that lists out all of the different tools that you've integrated, or as complicated as integrating a data cataloguing tool, but there needs to be a single source of truth for identifying which data sources and datasets you've included in your search engine. We recommend starting with a table in your database that keeps track of each type of data source you've integrated, and an associated table that keeps track of which specific data sets have been integrated. For example, if you're integrating data from Google Drive, your "data sources" table would have a single record to represent your Google Drive integration, and a series of associated records in your "data sets" table to keep track of which folders you've included from Google Drive.
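The two-table layout described above can be sketched in a few lines. This is a minimal illustration using SQLite from Python's standard library; the table and column names (`data_sources`, `data_sets`, `kind`, `external_ref`) and the helper functions are our own assumptions for the example, not a required schema.

```python
import sqlite3


def create_catalog(conn: sqlite3.Connection) -> None:
    """Create one table for integrations and one for the data sets inside them."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS data_sources (
            id   INTEGER PRIMARY KEY,
            kind TEXT NOT NULL UNIQUE            -- e.g. 'google_drive', 'slack'
        );
        CREATE TABLE IF NOT EXISTS data_sets (
            id             INTEGER PRIMARY KEY,
            data_source_id INTEGER NOT NULL REFERENCES data_sources(id),
            external_ref   TEXT NOT NULL,        -- e.g. a folder or channel identifier
            UNIQUE (data_source_id, external_ref)
        );
        """
    )


def register_data_set(conn: sqlite3.Connection, kind: str, external_ref: str) -> int:
    """Record a data set, creating its parent data source row if needed."""
    conn.execute("INSERT OR IGNORE INTO data_sources (kind) VALUES (?)", (kind,))
    (source_id,) = conn.execute(
        "SELECT id FROM data_sources WHERE kind = ?", (kind,)
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO data_sets (data_source_id, external_ref) VALUES (?, ?)",
        (source_id, external_ref),
    )
    conn.commit()
    return source_id


# One Google Drive integration, two folders included from it.
conn = sqlite3.connect(":memory:")
create_catalog(conn)
register_data_set(conn, "google_drive", "folder:product-docs")
register_data_set(conn, "google_drive", "folder:support-faqs")
```

In a real system you'd likely add columns for ownership, sync status, and timestamps, but the one-to-many relationship between a data source and its data sets is the important part.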

Adding connected data sources

Once you've got a way to keep track of which data sources you're integrating, you need to actually integrate those data sources. This includes storing data from these data sources in a readily accessible way, and updating any stored data after it changes. You also need to ensure that your internal search engine properly tracks permissions and incorporates standard data security practices.

OAuth

Most third-party tools today use OAuth to manage access to protected data. For example, Google Drive uses OAuth to issue an access token that's tied to a single user. That access token grants access to that user's data, but it expires quickly, and you'll need to use a refresh token to get a new access token when the old one expires. Your internal search engine must be able to track and refresh OAuth tokens on an as-needed basis. To enable this, you'll also need a frontend that allows your users to log in to their third-party accounts.
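Here's a rough sketch of what tracking and refreshing tokens on an as-needed basis might look like. Everything in it is an assumption made for illustration: the `TokenStore` class, the `refresh_fn` callback (which stands in for the real HTTP call to your provider's token endpoint), and the 60-second leeway. A production version would also persist tokens securely and handle revoked refresh tokens.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class OAuthToken:
    access_token: str
    refresh_token: str
    expires_at: float  # Unix timestamp at which the access token expires


# Hypothetical callback: takes a refresh token and returns
# (new_access_token, lifetime_in_seconds). In practice this would call
# the provider's token endpoint.
RefreshFn = Callable[[str], Tuple[str, float]]


class TokenStore:
    def __init__(self, refresh_fn: RefreshFn, leeway_seconds: float = 60.0) -> None:
        self._refresh_fn = refresh_fn
        self._leeway = leeway_seconds
        self._tokens: Dict[str, OAuthToken] = {}

    def save(self, user_id: str, token: OAuthToken) -> None:
        self._tokens[user_id] = token

    def get_access_token(self, user_id: str, now: Optional[float] = None) -> str:
        """Return a usable access token, refreshing it first if it is about to expire."""
        now = time.time() if now is None else now
        token = self._tokens[user_id]
        if token.expires_at - self._leeway <= now:
            new_access, lifetime = self._refresh_fn(token.refresh_token)
            token.access_token = new_access
            token.expires_at = now + lifetime
        return token.access_token


def fake_refresh(refresh_token: str) -> Tuple[str, float]:
    # Stand-in for a real token endpoint call.
    return f"new-access-for-{refresh_token}", 3600.0


store = TokenStore(fake_refresh)
store.save("alice", OAuthToken("old-access", "r1", expires_at=1000.0))
fresh = store.get_access_token("alice", now=2000.0)  # expired, so it is refreshed
```

Passing `now` explicitly is only there to make the behavior easy to test; normal callers would leave it out.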

Indexing data

In order to quickly find and retrieve the right data when a user searches for a document or the answer to a question, the data from large documents or integrated data sources must also be stored in an index for fast lookup. Later in this article, we'll talk about how to use ChatGPT to help find the right documents or the right answer to a question when a user makes a query, but in order to enable ChatGPT to work properly, the textual contents of your data must be stored in a vector database, which enables you to find the right documents (or paragraphs within your documents) that are most likely to contain the answer to a user's question.

When you integrate a new data source, or you allow a user to upload a document or save the contents of a public URL, you need to index the contents of that data source into a vector database so that you can look it up later. When indexing a large document, you'll need to break down that document into chunks. We've found that chunk sizes of 200-1,000 words work well for most use cases. However, you'll need to adjust this strategy as you start to integrate new data sources, like Slack messages or Google Sheets.
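As a simple illustration of chunking, the sketch below splits a document into word-based chunks. The function name `chunk_words` is ours, the default of 500 words is just a value inside the 200-1,000 range mentioned above, and the overlap between neighboring chunks is an extra assumption (a common way to avoid cutting a thought in half), not something you're required to do.

```python
from typing import List


def chunk_words(text: str, max_words: int = 500, overlap: int = 50) -> List[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that content near a
    boundary appears in both neighbors.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the document
    return chunks


document = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_words(document, max_words=500, overlap=50)
```

Each chunk would then be embedded and written to your vector database along with a reference back to the document it came from.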

Adding specialized data

It's likely that much of your organization's data will be stored in a non-standard format, like chat messages, Google Sheets, internal databases, and similar formats. In order to make this data available for search queries, you'll likely need to transform and index this data in a structured way, specific to each data source. For example, when we built the Locusive Slackbot, we explored a few different ways to store the data in a Google Sheet, and ultimately decided to store it in different ways depending on how our users were using their Sheets. If users specified they were using a Google Sheet as a database table, we indexed each row as a JSON object using a key-value format. If you're indexing historical Slack messages, on the other hand, you might choose to index every 20-30 messages together as its own chunk.
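Both of those strategies are easy to sketch. The helpers below are illustrative only: `sheet_rows_to_json` and `group_messages` are names we made up for this example, and the group size of 25 is simply a value inside the 20-30 range mentioned above.

```python
import json
from typing import List, Sequence


def sheet_rows_to_json(header: Sequence[str], rows: Sequence[Sequence[str]]) -> List[str]:
    """Turn each spreadsheet row into a JSON object keyed by column name."""
    return [json.dumps(dict(zip(header, row)), sort_keys=True) for row in rows]


def group_messages(messages: Sequence[str], group_size: int = 25) -> List[str]:
    """Join consecutive chat messages into chunks of group_size messages."""
    return [
        "\n".join(messages[i:i + group_size])
        for i in range(0, len(messages), group_size)
    ]


row_docs = sheet_rows_to_json(["name", "plan"], [["Acme", "pro"], ["Globex", "free"]])
message_chunks = group_messages([f"message {i}" for i in range(60)], group_size=25)
```

Each string these functions return is what you'd embed and index, so a query like "which plan is Acme on?" can match the row that contains both the column names and the values.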

This is one of the trickiest parts of creating an internal search engine, since it requires a specialized understanding of the different types of data that your organization contains, and how you most expect to work with that data. A good data scientist or engineer can help you come up with the most appropriate schema for every type of data source you have.

You could also choose not to index your data when it's first integrated, but rather, store a schema that represents how your data is structured. Then, when your user provides a query, you could use this schema to run a real-time lookup of the actual data that you need to answer it. We'll discuss this approach in the next section. This strategy works well when you have a data source that's very large (for example, a Postgres database of transactions, or something similar) or a data source that changes frequently. The most important part of this strategy is that you incorporate tools that understand how to best query your data source for the data that's needed when a user makes a query.
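To make the "store a schema instead of the data" idea concrete, here's a small sketch that captures table and column names from a database. It uses SQLite so that it runs anywhere; for Postgres you'd read the same information from its catalog tables instead. The `describe_schema` function and the `transactions` table are assumptions made for the example.

```python
import sqlite3
from typing import Dict, List


def describe_schema(conn: sqlite3.Connection) -> Dict[str, List[str]]:
    """Return {table_name: ["column TYPE", ...]} for every user table."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    schema: Dict[str, List[str]] = {}
    for (table,) in tables:
        # Each PRAGMA row is (cid, name, type, notnull, default, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [f"{col[1]} {col[2]}" for col in columns]
    return schema


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL, created_at TEXT)")
schema = describe_schema(db)
```

The resulting dictionary is small enough to store in your catalog and include in a prompt, which is what lets an LLM write a query against a table it has never seen the rows of.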

Key considerations

Integrating your data sources and documents will likely be the most challenging and time-consuming part of creating your internal search engine. It requires an in-depth understanding of the different systems that you use in your business and also a good engineering team to build systems for data management and authentication. Doing this right will take time, but a solid investment in this step will pay off in the long run.

2. Building the search functionality

Once you've got all your data integrated into a single system, you'll need to create the functionality that can actually identify the proper sources or snippets within your data sources that are most likely to answer your questions. This is where ChatGPT, or a similar large language model (LLM), can play a role.

Creating a search and information retrieval agent

One of the key requirements for an effective search engine is to quickly identify and return the most likely answer, or documents, to answer a user's question. But when you integrate all of your data sources into a single system, it could be hard to sift through all of the different documents, folders, and APIs that might contain the answer to a user's question. You'll need a tool that can both find the most likely data sources and intelligently identify what data within those data sources contains the answers to your users' questions.

The first part, identifying what documents and datasets are most likely to answer your users' questions, can be handled by a combination of an LLM and a vector database. When a user asks a question or searches for some content, your search engine first needs to identify which data sources might contain the information that you need. An LLM like ChatGPT can help you identify which data sources you should query first based on the user's question.
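The vector lookup half of this can be sketched without any external services. Note the big assumption here: the `embed` function below is a toy bag-of-words counter standing in for a real embedding model, and the in-memory list stands in for a vector database. The ranking logic (cosine similarity, then take the top results) is the part that carries over.

```python
import math
import re
from collections import Counter
from typing import List, Sequence, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: Sequence[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


indexed_chunks = [
    "The export feature supports CSV and PDF formats.",
    "Our office is closed on public holidays.",
    "Refunds are processed within five business days.",
]
best = top_k("Which formats does export support?", indexed_chunks, k=1)
```

With real embeddings, the match doesn't depend on shared words, so a question phrased very differently from the document can still find it.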

For example, if a user asks how many leads they have in their Salesforce account, an LLM can help your search engine know that it needs to run a query against the Salesforce API, but if a user asks about key features of your product, which can be answered by using your product documentation in your Google Drive folder, your LLM might recommend you query that specific document for the data. Once your LLM identifies which data source you need to query, you can then run the commands you need to get data from your data source, or query for the right datasets within your vector database so that you can get the snippets of information that are most likely to contain the answers to a user's question. You could even combine the two steps above into a single step, where you first use the user's query to identify the actual snippets or datasets that could contain the answer to a user's question and then use those snippets to prompt ChatGPT to provide a final answer to the user's query using the provided context.
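A routing step like this might look something like the sketch below. The `llm` argument is a hypothetical callable that takes a prompt and returns text; you'd wrap whichever LLM client you use behind it. The source names and descriptions are also made up for the example.

```python
from typing import Callable, Dict

# Hypothetical interface: prompt text in, completion text out.
LLM = Callable[[str], str]


def build_routing_prompt(question: str, sources: Dict[str, str]) -> str:
    """Describe the available data sources and ask the model to pick one."""
    listing = "\n".join(f"- {name}: {description}" for name, description in sources.items())
    return (
        "Pick the single best data source for the question.\n"
        "Reply with the source name only.\n\n"
        f"Sources:\n{listing}\n\n"
        f"Question: {question}\nSource:"
    )


def route(question: str, sources: Dict[str, str], llm: LLM, default: str) -> str:
    """Ask the LLM which source to query, falling back if the reply is unrecognized."""
    reply = llm(build_routing_prompt(question, sources)).strip().lower()
    return reply if reply in sources else default


sources = {
    "salesforce_api": "CRM data such as leads and opportunities",
    "product_docs": "Product documentation stored in Google Drive",
}


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the example runs offline.
    return " Salesforce_API\n"


chosen = route("How many leads do we have?", sources, fake_llm, default="product_docs")
```

Validating the reply against the known source names matters, since a model can return a name that doesn't exist in your catalog.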

As mentioned in the "Adding specialized data" section above, you might need to add functionality to your search engine so that it can query data sources that aren't practical to index. For example, if you've got a database table that contains millions of rows, you probably shouldn't index all of those rows, but you could store the schema of the table, and create the code that allows your search engine to query your database table using SQL. As you can tell, creating a fully-featured internal search engine might require you to do more than just retrieve documents. You may need to create an agent using an AutoGPT strategy.
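If you let an LLM write SQL against a stored schema, you'll want a guard between the generated query and your database. The sketch below is one deliberately conservative way to do that, and it is an assumption of ours rather than a complete defense: it accepts only a single statement that starts with SELECT, so it will also reject some harmless queries (for example, ones containing a semicolon inside a string). Connecting with read-only database credentials is still the stronger protection.

```python
import sqlite3
from typing import Any, List, Tuple


def run_read_only_query(
    conn: sqlite3.Connection, sql: str, max_rows: int = 100
) -> List[Tuple[Any, ...]]:
    """Run a single SELECT statement and return at most max_rows rows."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement or not statement.lower().startswith("select"):
        raise ValueError("only a single SELECT statement is allowed")
    return conn.execute(statement).fetchmany(max_rows)


crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE leads (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO leads (name) VALUES (?)", [("Acme",), ("Globex",), ("Initech",)])

# Imagine this string came back from the LLM after it was shown the table schema.
generated_sql = "SELECT COUNT(*) FROM leads;"
rows = run_read_only_query(crm, generated_sql)
```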

Data security

As you build your data retrieval functionality, you'll need to build in proper permissions management from the start. The easiest way to do this is to use existing access controls or relationship-based access management tools to filter out data sources that the querying user doesn't have access to read. There are two approaches to handling this:

  1. Storing permissions for each data source in a separate permissions management system, refreshing those permissions periodically or when they change on the underlying data source, and using those permissions at query time to filter out documents that a user shouldn't have access to
  2. Querying the underlying data source's permissions management system when a user makes a query and filtering out documents that users don't have access to using the response from the third party system prior to sending context to ChatGPT

Both approaches have their benefits and disadvantages. In the first, where you re-build your own permissions management tracking system, you'll need to invest a significant amount of resources, time, and money to create the infrastructure and software to handle and update permissions management, but you'll have access to a system that will always provide a well-structured response that your system knows how to process. In the second scenario, you'll need to ensure you can handle temporary errors or outages from third party tools and you'll need to ensure you have an effective way to map from your querying user to every user that might have listed permissions within every third party system at query time.
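The first approach, where permissions are stored on your side and applied at query time, can be sketched as a simple filter. The `Chunk` type, the `acl` mapping from document ID to allowed user IDs, and the function name are all assumptions for this example. Note that it denies by default: a document with no recorded permissions is never returned.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Set


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str


def filter_by_permission(
    user_id: str, chunks: Sequence[Chunk], acl: Dict[str, Set[str]]
) -> List[Chunk]:
    """Keep only chunks whose source document the user is allowed to read."""
    return [chunk for chunk in chunks if user_id in acl.get(chunk.doc_id, set())]


acl = {
    "handbook": {"alice", "bob"},
    "salaries": {"alice"},
}
retrieved = [
    Chunk("handbook", "Vacation requests go through your manager."),
    Chunk("salaries", "Compensation bands for 2023."),
    Chunk("unlisted", "A document with no stored permissions."),
]
visible_to_bob = filter_by_permission("bob", retrieved, acl)
```

Whichever approach you choose, the filter needs to run before any context is sent to the LLM, since anything in the prompt can end up in the answer.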

Processing data to get to the final answer

Your system will need to properly process a user's query, identify the data sources that need to return the proper data, and find the most likely datasets that contain the data needed to respond to the user's query. Once you have the right candidate datasets available to respond to the user's query, you'll need to parse through them to produce the final response for the user. LLMs can help here as well. If you send the user's original request, along with the contextual data retrieved, and prompt the LLM to use the context to answer the user's question, there's a good chance you'll get back what's needed to answer the user's request, provided the retrieved context actually contains the answer. LLMs can process large amounts of text to provide intelligent and reasonable answers to nuanced and complicated questions, and they'll be able to read through the context that your data retrieval system has provided to answer the user's final question.
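Assembling that final prompt could look like the sketch below. The wording of the instructions, the `build_answer_prompt` name, and the character budget are illustrative assumptions; in practice you'd size the budget to your model's context limit and likely count tokens rather than characters.

```python
from typing import List, Sequence


def build_answer_prompt(question: str, snippets: Sequence[str], max_chars: int = 6000) -> str:
    """Combine retrieved snippets and the user's question into one prompt.

    Snippets are added in order (best first) until the budget is used up.
    """
    kept: List[str] = []
    used = 0
    for number, snippet in enumerate(snippets, start=1):
        entry = f"[{number}] {snippet.strip()}"
        if used + len(entry) > max_chars:
            break
        kept.append(entry)
        used += len(entry)
    context = "\n".join(kept)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


snippets = ["Export supports CSV and PDF.", "Refunds take five business days."]
prompt = build_answer_prompt("Which formats does export support?", snippets)
```

Numbering the snippets also gives you a cheap way to ask the model to cite which snippet it used.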

In the event that the provided context doesn't have enough data to answer the user's question, you can also have the LLM request additional context, reach out to additional systems, or look up more data. That's part of the beauty of using LLMs as the intelligence layer of your internal search engine: they can be flexible and powerful with their functionality, so long as your code accounts for it.
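One way your code might account for this is a bounded retry loop. The sketch below is an assumption-heavy illustration: `llm` and `retrieve` are hypothetical callables you'd supply, and the `NEED_MORE_CONTEXT` sentinel is just one convention for letting the model signal that it needs more data.

```python
from typing import Callable, List

NEED_MORE = "NEED_MORE_CONTEXT"  # hypothetical sentinel the model is told to reply with


def answer_with_retries(
    question: str,
    llm: Callable[[str], str],
    retrieve: Callable[[str, int], List[str]],
    max_rounds: int = 3,
) -> str:
    """Ask the LLM, fetching more context each round until it can answer."""
    context: List[str] = []
    for round_number in range(max_rounds):
        # retrieve() decides what "more" means: a wider search, another system, etc.
        context.extend(retrieve(question, round_number))
        prompt = (
            "Answer using only the context. If it is insufficient, "
            f"reply exactly {NEED_MORE}.\n\n"
            "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}\nAnswer:"
        )
        reply = llm(prompt).strip()
        if reply != NEED_MORE:
            return reply
    return "I could not find enough information to answer that."


def fake_retrieve(question: str, round_number: int) -> List[str]:
    pages = [["Nothing relevant here."], ["The account has 42 leads."]]
    return pages[round_number] if round_number < len(pages) else []


def fake_llm(prompt: str) -> str:
    # Stand-in model: it can only answer once the right snippet is in the prompt.
    return "42" if "42 leads" in prompt else NEED_MORE


result = answer_with_retries("How many leads do we have?", fake_llm, fake_retrieve)
```

Capping the number of rounds keeps a confused model from looping forever and keeps your LLM costs predictable.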

3. The search interface

Once you've built the core search functionality, you'll need to create the user interface that allows your users to actually provide their search query. This could be on an internal-facing website, a desktop application or plugin, an existing chat interface, or even a voice-to-text application. We recommend plugging into an interface that your employees are already familiar with so that you can ease their burden and easily fit into their workflow. This might still mean building a brand new application. For example, if your organization uses Slack for internal communications, you may need to create a Slack bot, even if you've never created a bot like this before. The initial engineering time will likely be large, but it will be worth it for your employees, as they'll be able to use your search engine from within the tools they're already using, which will reduce the time it takes to get acclimated to the new functionality at their fingertips.

If you do build an app on top of an existing chat tool, you'll likely be able to leverage existing functionality of the tool to enhance your search engine as well. For example, tools like Slack and Discord allow you to provide "slash commands", which are well-structured commands that have narrow, but often-used functionality. You might build a help command that a user can invoke by typing "/help", or a command to let your users easily add new trusted websites by typing "/add_trusted_source <url>". The most important thing is to ensure that you're plugging into the mental model that your users are already working with.
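On the backend, handling commands like these mostly comes down to parsing and dispatching. The sketch below works on plain text and doesn't use any chat platform's SDK; `parse_slash_command` and `handle_message` are hypothetical names, and a real Slack or Discord bot would receive the command name and arguments through the platform's own payload format.

```python
from typing import List, Optional, Tuple


def parse_slash_command(text: str) -> Optional[Tuple[str, str]]:
    """Split '/name argument' into (name, argument); return None for ordinary messages."""
    text = text.strip()
    if not text.startswith("/"):
        return None
    name, _, argument = text[1:].partition(" ")
    return name.lower(), argument.strip()


def handle_message(text: str, trusted_sources: List[str]) -> str:
    """Dispatch slash commands; anything else is treated as a search query."""
    parsed = parse_slash_command(text)
    if parsed is None:
        return f"Searching for: {text.strip()}"
    name, argument = parsed
    if name == "help":
        return "Commands: /help, /add_trusted_source <url>"
    if name == "add_trusted_source":
        if not argument:
            return "Usage: /add_trusted_source <url>"
        trusted_sources.append(argument)
        return f"Added trusted source: {argument}"
    return f"Unknown command: /{name}"


trusted: List[str] = []
reply = handle_message("/add_trusted_source https://example.com/docs", trusted)
```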

Productivity gains

Building an internal search engine could require a significant investment in engineering time and resources, but once you've got a fully functioning application, your employees will have a single, consolidated tool that they can use to access all of their organizational data. Even better, they'll be able to interact with that data using the simplest possible interface: natural language. If you create your internal search engine within an existing chat tool, your users will also get to work within a tool they already know, which will allow them to spend less time context-switching.

If you think an internal search engine could significantly boost your productivity, but you don't want to build one yourself, at Locusive, we've already created a system that lets you integrate your existing data sources and applications into a single, consolidated chatbot, which you can use for free. If you're interested in expanding it to include your own tools, or you'd like to chat more about how it can help you, feel free to contact us to learn more.

---

Image by macrovector on Freepik