Why You Can't Just Train ChatGPT (And What To Do Instead)

This article delves into the use of ChatGPT for businesses, addressing misconceptions about its integration with proprietary data and how it's not as straightforward as 'feeding' it documents. It further discusses strategies like fine-tuning ChatGPT, contextual information retrieval, few-shot learning, or training your own language model, each with its challenges, for the effective utilization of ChatGPT in your business.
Shanif Dhanani
8.5 minutes

Everyone from sales teams to entrepreneurs to professors wants to use ChatGPT to make their jobs easier. Its power, ease of use, and ability to automate time-consuming tasks are just too enticing to pass up. But there's a common misperception about how ChatGPT can be used with your own proprietary data, even among people who use it on a daily basis. While most users of ChatGPT are using it to improve their website today, many of them want to be able to customize ChatGPT so that it works on their data, and they think that doing so is as simple as "feeding in" all of their documents to ChatGPT so that it can become an expert on the content contained within. In reality, ChatGPT needs to be fine-tuned, or given contextual information that fits within its context window. The act of "feeding in documents" doesn't teach it anything today; it only provides context for a future question. But there are ways around these roadblocks, and while they require additional engineering effort and infrastructure, it is possible to use ChatGPT with your own data.

Teaching ChatGPT

When OpenAI created ChatGPT, they had to train it to provide the proper response to a user's prompt. This seemingly simple act was actually a huge feat of data science and engineering prowess, and required many complicated steps and internet-scale data. The resulting system, which we use today, is now able to respond in a seemingly intelligent manner to a wide variety of prompts. It's a generalist system that can provide high-quality responses on a myriad of topics, but what it can't do is use your specific data, which it hasn't seen before, in the way that you might need for your specific job. This means that you can't ask it to answer fact-based questions from your knowledge base, or provide an interface for your users to ask questions about their account on your SaaS platform, or provide customer service to your customers based on their order information. Any use case that requires specific, proprietary knowledge that it either hasn't been trained on, or hasn't seen enough of to respond properly, will result in either a hallucination or a short, insufficient response.

You can see that the training process for ChatGPT was complicated and beyond the reach of most individuals and businesses, but there are a few ways to use ChatGPT with your own data or for your own use case:

  1. Fine-tune GPT-3 with your own example prompts and responses
  2. Provide contextual information every time you ask a question
  3. Provide a few examples it can use to learn how to respond properly

There are several tradeoffs in each of these three scenarios, and it's certainly possible that none of them will fit your use case, but I've seen that at least one of these options will usually be sufficient for most businesses. In the rest of this article, I'll dive deeper into each option.

Fine-tuning

ChatGPT is a large language model (LLM), which means it has learned a variety of generic rules, tactics, probabilities, and approaches for providing a natural language response to a prompt. Just like with humans, there are a large number of skills that an LLM needs to learn to write — everything from simple grammatical compositions to high-level concepts and reasoning skills. OpenAI has provided a method to fine-tune ChatGPT, which simply means you can use the larger, generic model (which has learned all these skills) while retraining it to learn domain-specific information, allowing it to become an expert on new data.

This might sound like exactly what you need, but there are a few caveats that make this strategy unworkable for many companies.

The biggest blocker to using this strategy is that, at the time of this post, it isn't available for OpenAI's latest models, GPT-3.5 and GPT-4, which power ChatGPT. Currently, fine-tuning is only available on GPT-3, which is a significantly less capable model than its newer siblings. Even if you train it on your own data, GPT-3 has worse reasoning, analytical, and conceptualization skills than 3.5 or 4, so even if it becomes an expert on your data, it might not be smart enough to do what you need.

The second problem with using fine-tuning is that it requires you to have a large number of examples to use for the training process. The data you need to provide for fine-tuning ChatGPT should be a list of <prompt>, <answer> examples, and in general, the more examples you can provide, the better the model will learn. Creating these examples, a process known as "labeling" in the world of machine learning, is commonly known to be one of the most time-consuming, costly, and manual parts of building a machine learning system. You might be lucky enough to already have a large number of prompt + response examples to work with, but even if you have thousands of examples, a fine-tuned model still only learns from the examples you provide. It doesn't pick up any conceptual data or contextual information that's not present in your examples. This means that you need to have a broad enough set of examples that the model will learn everything it needs to respond to your prompts down the line. I've heard that it takes several thousand examples before fine-tuning ChatGPT will yield production-level results.
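To make the <prompt>, <answer> format concrete, here is a minimal sketch of how such a list might be written out as one JSON object per line. The "prompt" and "completion" key names and the sample questions are assumptions for illustration, so check your provider's current documentation for the exact schema it expects.

```python
import json

def to_jsonl(examples):
    """Serialize (prompt, answer) pairs as one JSON object per line."""
    lines = []
    for prompt, answer in examples:
        # The "prompt"/"completion" keys are an assumed schema, not a guaranteed one.
        lines.append(json.dumps({"prompt": prompt, "completion": answer}))
    return "\n".join(lines)

# Hypothetical examples; a real training set would need far more of these.
examples = [
    ("What is your refund window?", "Refunds are accepted within 30 days of purchase."),
    ("Do you ship internationally?", "Yes, we ship to most countries."),
]

training_file = to_jsonl(examples)
print(training_file)
```

Every line in the output is one labeled example, which is why the labeling effort grows directly with the breadth of questions you want the model to handle.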

For most businesses, fine-tuning ChatGPT won't be the best option to get started with using ChatGPT on their own data.

Contextual information retrieval

An alternative to fine-tuning ChatGPT is to provide enough context with every question sent to the system that it can respond to the question entirely from the context provided. In this strategy, you provide a broad-based set of relevant information related to the user's question every time you ask ChatGPT for an answer, and you instruct ChatGPT to answer using the contextual information that you've provided. This strategy works well for retrieving factual or context-based information from ChatGPT, but it has its downsides.
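As a rough illustration of this strategy, the sketch below assembles a prompt that carries both the context and an instruction to answer only from it. The function name, the wording of the instruction, and the sample passages are all hypothetical; the resulting string is what you would send to the model.

```python
def build_context_prompt(question, context_passages):
    """Assemble a prompt that asks the model to answer only from the supplied context."""
    # Number the passages so they are easy to tell apart in the prompt.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(context_passages))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_context_prompt(
    "How long is the warranty?",
    ["All devices carry a two-year warranty.", "Returns are accepted within 30 days."],
)
print(prompt)
```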

First, it doesn't help for use cases where you want ChatGPT to provide a large, context-specific document, like an essay or a script. It's primarily useful for information retrieval, search, and question answering.

While these might be acceptable tradeoffs, the larger issue with contextual information retrieval lies with providing the right context for each question. ChatGPT can only accept a few thousand words at once, which means you won't be able to provide it an encyclopedia's worth of information with your question. This means you'll need to be able to intelligently select the right context when you send it a prompt. The current state of the art in doing this is to store all potential sources of information in a vector database, and then, when a user asks a new question, use the vector database to look up the parts of your documents that have a high likelihood of containing the information to answer that question. This requires the user to ask a question with enough specificity that a vector database can find relevant documents efficiently, and it also requires creating, maintaining, and paying for the server infrastructure to store and retrieve these documents in the first place.
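The lookup step can be sketched in a few lines. This toy version is an assumption-heavy stand-in: it uses word counts in place of real embeddings and a Python list in place of a vector database, but the ranking idea (score every chunk against the question, keep the closest ones) is the same.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a word-count vector. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Hypothetical document chunks standing in for a knowledge base.
chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes five business days.",
]
best = top_k("When are refunds accepted?", chunks, k=1)
print(best)
```

Notice that a vague question with few distinctive words would score every chunk about the same, which is the specificity problem described above.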

Another concern with this strategy is that it's not always easy to find documents, or parts of documents, that are likely to contain the answer to a question, and given ChatGPT's limited context window, it might require multiple rounds of providing the right context to ChatGPT for it to find an answer. This could increase the time it takes for a user to get an answer, and also increase the cost of using ChatGPT, as more documents lead to more tokens used.
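One way to keep a request inside the context window is to pack ranked chunks greedily until a budget is reached. The sketch below assumes word count is an acceptable proxy for tokens, which is only roughly true, and the function name and budget are hypothetical.

```python
def pack_context(ranked_chunks, max_words=3000):
    """Greedily keep the highest-ranked chunks that fit within a word budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        n = len(chunk.split())
        if used + n > max_words:
            continue  # this chunk would overflow; a smaller, lower-ranked one may still fit
        selected.append(chunk)
        used += n
    return selected, used

# Three placeholder chunks of 5, 8, and 3 words, already ranked by relevance.
ranked = ["alpha " * 5, "beta " * 8, "gamma " * 3]
selected, used = pack_context(ranked, max_words=9)
print(len(selected), used)
```

Anything that doesn't fit has to wait for a second round, which is where the extra latency and token cost come from.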

One final caveat to implementing a contextual information retrieval strategy is that users may assume that once they provide some pieces of context to ChatGPT, it will learn from that context for future interactions. In reality, ChatGPT won't learn from these prompts and answers without OpenAI incorporating them into their training process, so users might get frustrated with having to provide multiple rounds of the same contextual information to ChatGPT every time they ask it something similar. Because of these concerns, this strategy is best implemented as an automated process using ChatGPT's API.

Few-shot learning

Whether you use fine-tuning or contextual information retrieval, there's one tactic that might help ChatGPT provide better, more informed answers using your own data: giving it examples of how you'd like it to respond when you ask it a question. This particular methodology is more of a tactic than a strategy because it's a way to optimize your answers, and it can be used whether or not you provide ChatGPT with your own data. With this approach, whenever you ask ChatGPT a question, you provide it with a few examples of similar questions and answers, which it can then use to infer how it should respond to you. This might not give ChatGPT any new knowledge about your data, but it should give you more targeted and relevant answers because you show ChatGPT how it should think and respond.
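In practice, this means prepending a handful of worked examples to the new question. The following is a minimal sketch, assuming a simple "Q:"/"A:" layout and made-up example pairs; other layouts can work just as well.

```python
def build_few_shot_prompt(examples, question):
    """Prepend worked question/answer pairs so the model can imitate their style."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    # Leave the final answer blank for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# Hypothetical examples showing the tone and length we want in the answers.
examples = [
    ("Can I cancel my order?", "Yes. Orders can be cancelled any time before they ship."),
    ("Do you offer gift wrapping?", "Yes. Gift wrapping is available at checkout."),
]
prompt = build_few_shot_prompt(examples, "Can I change my shipping address?")
print(prompt)
```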

A final alternative

If none of these strategies seem appealing to you, it is possible to train your own LLM on your own data using open source models from providers like HuggingFace. While these models are generally considered less capable than ChatGPT, and you'll likely need a huge amount of textual data to generate a reasonable model, this strategy will allow you to create a domain-specific system that's tailored to what you need. This will require a good engineering and devops team that's well-versed in data engineering, data processing, and infrastructure creation and maintenance. Most companies won't get much benefit from this approach, but it is possible that it could work for you.

No matter what approach you use, it's important to always have in mind the objective that you want from the system you're building, how you expect it to be used, and who its users will be. These are important considerations when creating any system, but are especially important in the world of large language models. If you're interested in exploring any of these techniques, at Locusive, we're creating the systems and software that businesses can use to connect their data to ChatGPT. Feel free to get in touch if you think we can be helpful.

---

Image by vectorjuice on Freepik