
How Does AutoGPT Work?

This article delves into the concept of "autonomous agents" and introduces a new open-source tool called "AutoGPT." The author begins by defining "agents" as software applications that can carry out actions or a series of actions to accomplish a larger goal. The advent of large language models (LLMs), such as OpenAI's GPT-4, has enabled the creation of more sophisticated agents that can handle complex tasks.
Shanif Dhanani

There's been a lot of buzz recently around "autonomous agents": applications that can act without human input to accomplish a complex task. One very popular open-source tool on GitHub, "AutoGPT", uses agents to carry out a user's request, and it has wowed a lot of people. The recent advent of large language models (LLMs) lets these agents function in ways that were never really possible before. While these applications have a lot of potential, the concept behind them is fairly straightforward, and once you understand how they work, you (or your engineering team) can build your own AutoGPT with standard software engineering practices.

A deeper dive into agents

You may have heard the term "agent" before, and if you have, you might be confused about what it actually means when it comes to AutoGPT. The word "agent" has become a catchall term in the world of A.I. for any software application that can take an action, or a series of actions, to accomplish a larger goal. A few years ago, the term became very popular when advancements in reinforcement learning led to new software that could beat the world's best players in StarCraft and Go. In those cases, researchers built "agents" that could play each game by choosing a series of actions in a near-optimal manner, which ultimately let the software beat some of the world's top players.

In the world of LLMs, an "agent" is likewise a software application that takes a series of actions to accomplish a certain task. Unlike reinforcement learning agents, however, LLM agents aren't optimized to maximize a specific objective (like winning a game of StarCraft). Rather, they perform actions, like calling an API or searching the internet, as part of a series of tasks needed to accomplish some user-specified objective.

In this post, I'll use "agent" and "AutoGPT" more or less interchangeably. At the highest level, both refer to the same thing: an application that's sophisticated enough to accomplish a larger task.

Now that we know what agents (or AutoGPT-style applications) are, we can dive a bit deeper into what they do and how they work.

The flow of an AutoGPT application

If you've been a software developer for any amount of time, you'll know that, up until recently, creating software was an exercise in applied logic. As software developers, we have to create a strictly specified set of rules that a running application must follow in any given circumstance. You'll also know that applications are unforgiving when it comes to following this logic. We may want a piece of software to do one thing, but if we've missed an edge case or forgotten to account for an unusual scenario, it will bug out.

With the advent of LLMs, we now have a tool that gives us a lot more flexibility in accomplishing an objective. Rather than trying to specify a strict set of rules for a particular use case, we can leverage an LLM to "understand" what needs to be done at a certain point in time and tell us what to do next.

An example of where something like this might come in handy is a virtual assistant app that's responsible for handling a wide variety of instructions from you. For example, you might want an app that plugs into your calendar to schedule meetings for you, plugs into your email service to send emails from you, and connects to your social media accounts on demand to write a new post when you tell it to do so. An application like this would need to be able to handle a wide variety of commands.

Before LLMs, to create an app like this, you'd have to use some advanced keyword matching code to try to have the application understand what task it needed to accomplish next, and you'd need to have a lot of state management and error handling in your app to handle edge cases. However, with the advent of LLMs, you can now "outsource" a lot of the hard work to a language model, which will act as an orchestration layer.
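
To make the pre-LLM approach concrete, here is a minimal sketch of keyword-based routing. The task names and keyword table are hypothetical, invented purely for illustration, and real systems of this kind were usually far more elaborate. The point is only to show how quickly hand-written rules miss a rephrased command or pick the wrong task.

```python
from typing import Optional

# Hypothetical keyword table: each task is triggered by hand-picked words.
KEYWORD_RULES = {
    "schedule_meeting": ("schedule", "meeting", "calendar"),
    "send_email": ("email", "send"),
    "write_post": ("post", "tweet"),
}


def route_by_keywords(command: str) -> Optional[str]:
    """Return the first task whose keywords appear in the command, else None."""
    words = command.lower().split()
    for task, keywords in KEYWORD_RULES.items():
        if any(keyword in words for keyword in keywords):
            return task
    return None


# Works when the user happens to use the expected words.
matched = route_by_keywords("Schedule a meeting with Dana on Friday")

# Same intent, different wording: no rule fires.
missed = route_by_keywords("Find a time for Dana and me to talk")

# The word "meeting" triggers the wrong task for an email request.
wrong = route_by_keywords("Send the meeting notes to Dana")
```

Here `matched` is `"schedule_meeting"`, `missed` is `None`, and `wrong` is also `"schedule_meeting"`, even though the user asked to send something. Every such gap needs another rule, which is where the state management and edge-case handling pile up.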

LLMs as orchestrators

When you create an AutoGPT-style application, you leverage LLMs as intelligent decision-makers that tell you what action to take next. To see how, let's take the virtual assistant from above and break down how your application would work through its series of tasks using an LLM.

First, you'd need a pre-defined set of tools that your agent supports. These can be anything you need: calculators, API endpoints, search engines, etc. Your application will need to support invoking these tools and saving their output for later use.
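
One way to picture this is a small tool registry. The sketch below is an illustration under assumptions, not AutoGPT's actual design: the `Tool` and `ToolRegistry` classes and the toy calculator are hypothetical names. It shows the two responsibilities described above, invoking a tool by name and saving its output for later use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Tool:
    """A named capability the agent can invoke (hypothetical structure)."""
    name: str
    description: str
    run: Callable[[str], str]


class ToolRegistry:
    """Holds the supported tools and records every invocation."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}
        self.history: List[dict] = []  # saved outputs for later steps

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def describe(self) -> str:
        """Render the tool list as text suitable for an LLM prompt."""
        return "\n".join(f"- {t.name}: {t.description}" for t in self._tools.values())

    def invoke(self, name: str, tool_input: str) -> str:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        output = self._tools[name].run(tool_input)
        self.history.append({"tool": name, "input": tool_input, "output": output})
        return output


def add_numbers(expression: str) -> str:
    """Toy calculator: sums comma-separated numbers, e.g. "2, 3.5"."""
    return str(sum(float(part) for part in expression.split(",")))


registry = ToolRegistry()
registry.register(Tool("calculator", "Adds comma-separated numbers.", add_numbers))
result = registry.invoke("calculator", "2, 3.5")  # "5.5"
```

Keeping each tool behind a plain string-in, string-out function is a simplifying assumption; it makes the tools easy to describe to a language model and easy to log.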

Next, you'll need a way for the application's users to provide input, usually in a chat-based format.

Once a user provides a query or command, your first step is to send a message to an LLM with the user's request, along with the list of tools that your application supports, and ask the LLM to determine which tool needs to be run next. Your prompt to the LLM should be clear and provide instructions for how you want the answer to be formatted so your application can best understand its response, but this is straightforward to do.
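
As a rough sketch of this step, the code below builds a tool-selection prompt and parses the model's reply. The prompt wording, the JSON reply format, and the function names are all assumptions chosen for illustration; no real LLM API is called, and a canned reply stands in for the model's answer.

```python
import json
from typing import Dict, Tuple


def build_tool_selection_prompt(user_request: str, tools: Dict[str, str]) -> str:
    """Combine the user's request, the tool list, and formatting instructions."""
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    return (
        "You are the decision-making layer of an assistant.\n"
        f"Available tools:\n{tool_lines}\n\n"
        f"User request: {user_request}\n\n"
        "Reply with only a JSON object of the form "
        '{"tool": "<tool name>", "reason": "<one sentence>"}.'
    )


def parse_tool_selection(reply: str, tools: Dict[str, str]) -> Tuple[str, str]:
    """Validate the reply so the application never runs an unsupported tool."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply was not valid JSON: {reply!r}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"reply was not a JSON object: {reply!r}")
    tool = data.get("tool")
    if tool not in tools:
        raise ValueError(f"model chose an unsupported tool: {tool!r}")
    return tool, data.get("reason", "")


# Hypothetical tool descriptions and a canned reply standing in for the LLM.
TOOLS = {"calendar": "Schedules meetings.", "email": "Sends emails."}
prompt = build_tool_selection_prompt("Email Sam the Q3 report", TOOLS)
canned_reply = '{"tool": "email", "reason": "The user wants to send an email."}'
chosen_tool, reason = parse_tool_selection(canned_reply, TOOLS)
```

Asking for a fixed JSON shape and rejecting anything else is one simple way to make the response machine-readable; a model can still return malformed output, so the parsing step treats that as an error rather than guessing.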

Once the LLM provides its decision of which tool to run next given a user's query, your application will need to determine what inputs to provide the tool. You can use an LLM to do this as well, given the right prompt. Once the LLM does this, your application then invokes the selected tool with the given inputs, captures the outputs, and then presents them back to the LLM, asking it if it has enough information to properly respond to the user's request. If it does, then you ask it to provide a final answer and return that answer to the user. If it doesn't, you go through the same iterative process above until the user's request can be fulfilled.
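
The iterative process described above can be sketched as a short loop. This is a simplified illustration under assumptions, not AutoGPT's implementation: the `LLM` type is a stand-in for whatever model call you use, the JSON reply shapes are invented for the example, and a scripted fake model replaces a real one so the sketch can run on its own.

```python
import json
from typing import Callable, Dict, Iterable, List

LLM = Callable[[str], str]  # prompt in, reply text out (stand-in for a real model call)


def ask_json(llm: LLM, prompt: str) -> dict:
    """Send a prompt and parse the reply as a JSON object."""
    return json.loads(llm(prompt))


def run_agent(request: str, tools: Dict[str, Callable[[str], str]],
              llm: LLM, max_steps: int = 5) -> str:
    observations: List[str] = []
    for _ in range(max_steps):
        context = f"Request: {request}\nObservations: {observations}\n"
        # 1. Ask the LLM which tool to run next.
        tool_name = ask_json(
            llm, context + f'Tools: {sorted(tools)}\nReply as JSON: {{"tool": "<name>"}}'
        )["tool"]
        # 2. Ask the LLM what input to give that tool.
        tool_input = ask_json(
            llm, context + f'Tool: {tool_name}\nReply as JSON: {{"input": "<text>"}}'
        )["input"]
        # 3. Invoke the tool and capture its output.
        output = tools[tool_name](tool_input)
        observations.append(f"{tool_name}({tool_input!r}) -> {output}")
        # 4. Ask whether there is now enough information to answer.
        verdict = ask_json(
            llm,
            f"Request: {request}\nObservations: {observations}\n"
            'Reply as JSON: {"done": true, "answer": "<text>"} or {"done": false}',
        )
        if verdict.get("done"):
            return verdict["answer"]
    raise RuntimeError("step limit reached before the request was fulfilled")


def make_scripted_llm(replies: Iterable[str]) -> LLM:
    """Fake model that returns canned replies in order, ignoring the prompt."""
    remaining = iter(replies)
    return lambda prompt: next(remaining)


# Hypothetical search tool and scripted replies, for demonstration only.
tools = {"search": lambda query: "Paris" if "France" in query else "unknown"}
llm = make_scripted_llm([
    '{"tool": "search"}',
    '{"input": "capital of France"}',
    '{"done": true, "answer": "The capital of France is Paris."}',
])
answer = run_agent("What is the capital of France?", tools, llm)
```

The `max_steps` limit is an addition not mentioned in the text; without some cap, a model that never declares itself done would loop forever.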

Using this strategy, you iteratively make progress towards responding to the user's request by leveraging an LLM to instruct your application on what to do next. In this way, LLMs act as an orchestration layer for your application, making intelligent decisions about how it should behave at every step needed to respond to your user's request.

New capability

This level of autonomous decision-making has never really been possible before. Until recently, machines have not been able to reason and make intelligent decisions using unstructured natural language commands. That's why we've had to rely on tools that require point-and-click user interfaces, or command-line interfaces that provide a small, well-defined set of structured commands that the application supports. This new capability opens up a world of opportunities that can be created by autonomous, intelligent decision-making.

Software developers are already leveraging this new capability to create a world of interesting, powerful, autonomous applications. We're starting to see new virtual assistants emerge that have incredible sophistication and powerful abilities to accomplish a wide variety of tasks.

At Locusive, we're building a single, comprehensive agent for businesses to use with their data. We imagine a world where businesses can plug in any data source, software application, or third-party tool that they use and interact with a single agent that has access to them all. Businesses will be able to use this agent to quickly find information from any of their data sources, or send out smart and personalized emails to prospects, or create new reports using KPIs, all using a single chat interface.

Agents will soon take over the world of software, allowing us to be far more efficient and productive with our time.

