Research & Opinions

Everything You Need To Know About ChatGPT And Data Security

This article addresses businesses' concerns regarding the use of large language models (LLMs) like ChatGPT for improving productivity, focusing on data security and privacy. It clarifies misconceptions about how these models operate and interact with proprietary data. The piece outlines the current use of data by LLMs, detailing how data is used for answering specific business-related questions, and potential security risks involved.
Shanif Dhanani
6.5 minutes

Many businesses are interested in using ChatGPT (or other LLMs) to improve their productivity, but many of them are hesitant to do so because of concerns about data security and privacy. Many business executives are still unfamiliar with exactly how ChatGPT works and what it needs to operate properly, and they're also unsure about how their data and systems need to integrate into it to ensure it can add value to their operations. There are a lot of misconceptions about what it does, whether OpenAI uses your data to train its models, and if your data is at risk of leaking out to the larger world. In this guide, we'll give you a quick overview of everything you need to know.

How do LLMs use your data today?

Large language models like ChatGPT were trained on public data available on the internet. They've learned how to respond to natural language prompts, gathered a large amount of background knowledge about a lot of different topics, and have learned the rules of communicating reasonably, rationalizing their answers, and responding with proper grammatical rules. That's why you can go to ChatGPT and ask it to write you an article about cooking and it will have something you can publish right away.

But these systems are still limited. They don't have access to your data and they wouldn't know how to answer a question that's specific to your business unless you give them the context to do so. Because of this, when you ask a question that requires some specific knowledge, or access to your data, you need to provide enough information and context with your question that the system can answer your question using the context you've provided.

This means that if you're using a public, third-party tool like ChatGPT, your data needs to leave your systems, travel over a network, and be used by the third-party systems to answer your question. Depending on your business, this could lead to security concerns that you need to address, specifically around:

  1. Your proprietary data falling into the hands of a third party
  2. OpenAI using your data to train their models, allowing other users of ChatGPT to access your data
  3. Running afoul of regulatory issues

Most businesses are very reasonably worried that ChatGPT might be used to answer another users' question with their data, and certain types of businesses, particularly those in the medical and financial industries, are worried that they can't even send data to ChatGPT to begin with, or they'll run into legal troubles. Fortunately, there are solutions to each of these problems.

Guarantees against using your data


The good news is that OpenAI (the makers of ChatGPT) has said that they will not use data submitted to their API to improve or train their models (note, it's important to note that this same guarantee does not apply to their web-based chat system, which the majority of their users use today). So if you're building apps or services on top of ChatGPT's capabilities, you're undoubtedly using their API, which means your data won't be used to train their model and won't be leaked out to other users. If that was your only concern, then you're done, full stop. You can go and start building today.

Azure's OpenAI Service

If your business is on Azure, you have an alternative that might actually provide more benefits. Since Microsoft (Azure's owner) is also a major investor in OpenAI, they've struck a deal that allows them to provide ChatGPT access to existing Azure customers at a higher rate limit than OpenAI's other customers. Additionally, like ChatGPT's general API users, Microsoft guarantees that data sent to their Azure-based OpenAI API will not be used for training the underlying model in the future. Finally, they also guarantee that any data sent from a system inside of Azure and sent to their OpenAI service will remain entirely within Azure's network, ensuring it doesn't go out into the Internet. This is a good option to use if you're already an Azure customer and your data lives in Azure-based systems.

Training your own model using open-source technology

For many businesses, using one of the strategies above will be sufficient. But other businesses have to deal with concerns around their data leaving their owned and operated systems. For those businesses, the only viable solution (today) is to train and host their own models using existing open-source technology (they could, of course, hire their own researchers and create their own model, but for nearly every business today, this is a non-starter).

Fortunately, there are a large variety of open-source models available today that businesses can use to train with their own data (Facebook's LLaMA is probably the most well-known option). However, in order to properly train these models, businesses will not only need to have a massive amount of data to train them, but also senior engineers that are able to implement the code and infrastructure needed to train these models properly. On top of that, these engineers may need to fine-tune the model to get it to do exactly what your business needs, and moreover, they may not even be able to reach the same level of high-quality responses that ChatGPT provides.

If a business decides to implement their own model, they should be ready to invest significant time and resources into the effort, and they should be prepared for an iterative process that may not yield results soon (or ever). Nevertheless, if they're restricted in their options, creating their own LLM may be the best way to go, and it may even be preferable to other options, as it allows them to use their own data to create exactly the model they need.

Future offerings

While data security and privacy will always be a concern for businesses, those that are willing to build on ChatGPT's API today can think of ChatGPT like any other API, assuming, of course, that they trust OpenAI to stick to their word about not using API data to train their models. In the future, it's very likely that providers of LLMs will offer additional security and privacy features for enterprises. Just like AWS came out with specialized offerings for health services, LLM providers will have new capabilities and features that open up their software to other businesses.

If you're looking to get started with using LLMs for your own business, we recommend starting with a small proof-of-concept using data that's not sensitive. You can build out something quick in a weekend with your existing engineering team, and by doing so you'll have a sense for what's possible, how much effort it might take to build something bigger, potential costs, and what you'll really need to worry about when it comes to data privacy and security.

If you're interested in getting started, at Locusive, we help companies build and implement POCs using their own data, and we also provide a tool to help you find and search your existing data using a chat-based interface, enabling you to find the information you need from any connected data source in just minutes. Feel free to reach out if you want to learn more.


Image by vectorjuice on Freepik