Some Early Tests and Notes on ChatGPT Agent

Earlier this week, OpenAI released ChatGPT agent, a new agentic model that combines the text-focused capabilities of Deep Research with the browser-based automation of Operator into a single, well, agent that can autonomously browse the web, read webpages, and interact with web apps. OpenAI describes the (lowercase) agent as ChatGPT having its own computer.

From their blog:

You can now ask ChatGPT to handle requests like “look at my calendar and brief me on upcoming client meetings based on recent news,” “plan and buy ingredients to make Japanese breakfast for four,” and “analyze three competitors and create a slide deck.” ChatGPT will intelligently navigate websites, filter results, prompt you to log in securely when needed, run code, conduct analysis, and even deliver editable slideshows and spreadsheets that summarize its findings.

At the core of this new capability is a unified agentic system. It brings together three strengths of earlier breakthroughs: Operator’s ability to interact with websites, deep research’s skill in synthesizing information, and ChatGPT’s intelligence and conversational fluency.

ChatGPT carries out these tasks using its own virtual computer, fluidly shifting between reasoning and action to handle complex workflows from start to finish, all based on your instructions.

In the model’s official system card, OpenAI refers to ChatGPT agent as a “new agentic model in the same family as OpenAI o3”.

There are also these interesting tidbits about ChatGPT agent’s ability to resist prompt injection from malicious prompts hidden in webpages:

We’ve placed a particular emphasis on safeguarding ChatGPT agent against adversarial manipulation through prompt injection, which is a risk for agentic systems generally, and have prepared more extensive mitigations accordingly. Prompt injections are attempts by third parties to manipulate its behavior through malicious instructions that ChatGPT agent may encounter on the web while completing a task. For example, a malicious prompt hidden in a webpage, such as in invisible elements or metadata, could trick the agent into taking unintended actions, like sharing private data from a connector with the attacker, or taking a harmful action on a site the user has logged into. Because ChatGPT agent can take direct actions, successful attacks can have greater impact and pose higher risks.

We’ve trained and tested the agent on identifying and resisting prompt injections, in addition to using monitoring to rapidly detect and respond to prompt injection attacks. Requiring explicit user confirmation before consequential actions further reduces the risk of harm from these attacks, and users can intervene in tasks as needed by taking over or pausing. Users should weigh these tradeoffs when deciding what information to provide to the agent, as well as take steps to minimize their exposure to these risks, such as disabling connectors when they aren’t needed for a task.

What’s fascinating for me in this announcement is the multi-tool nature of ChatGPT agent. The ability to search the web and reason over large amounts of text was adopted from Deep Research; the visual browser (itself looking like a slimmed-down version of Chrome running in a Linux VM) is an enhanced version of the Operator browser; there is an underlying terminal that agent can use to process text and make external API calls; agent also integrates with built-in connectors (for services like Gmail or Google Calendar) to directly access data from your accounts. All of these features come together as a single product that, based on my early tests, feels much better than Operator as far as an AI-driven web browser goes, and more customizable than Deep Research.

After reading about what agent could do and seeing some of the early tests from other folks online, I had an idea for a long-running task that I’d been personally putting off for a long time, and which felt perfect for a background agent. I asked ChatGPT agent to process hundreds of issues of MacStories Weekly on the Club MacStories website, open each one, find stuff that I wrote, and compile all my Club articles into a list. Then, I asked it to spot patterns and ideas for follow-up stories based on things I wrote months ago, and which I’ve likely forgotten about. I could have used a regular model to kick off a long-running task that, say, parsed an RSS feed for my articles on Club MacStories, but I was curious to see how the agent would perform on its own instead.

Long story short: it took three sessions of one hour each, and the output was outstanding. After authenticating with my own Club account in the agent’s virtual browser at the beginning of the chat, the agent got to work. It started scrolling the Club homepage and clicking issues to open them and see if I wrote something for the newsletter that week. It quickly came up with a plan: it hit Control-F (not ⌘, since it mentioned it was running on Linux), searched for my name, and every time it found a match in the newsletter, it tried to understand if that was part of an article I wrote or another MacStories writer mentioning me by name.

The agent worked for an hour straight, without interrupting me, to parse 33 issues of MacStories Weekly, navigating through several pages of the Club website in the process. Was it slow compared to ingesting a full RSS feed and processing text? Sure, but it was also incredible to watch it do its thing in the background. Also, the Club website is an exception here: not every paywalled site offers a personalized RSS feed for individual members; I have to imagine that these capabilities may come in handy in the future for hundreds of websites that cannot be “automated” with RSS or any other API.
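
For comparison, here is a rough idea of what the RSS route could look like. This is only a minimal sketch, not what the agent did: the sample feed, the author names, and the `articles_by_author` helper are all made up for illustration, and I'm assuming a plain RSS 2.0 feed that exposes bylines via `dc:creator`, which a real members-only feed may not do.

```python
import xml.etree.ElementTree as ET

# Namespace for the Dublin Core <dc:creator> element often used for RSS bylines.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical sample feed (assumption): a real feed's structure may differ.
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example Newsletter</title>
    <item>
      <title>Automation Corner</title>
      <link>https://example.com/issue-1/automation</link>
      <dc:creator>Alice Example</dc:creator>
    </item>
    <item>
      <title>App Debuts</title>
      <link>https://example.com/issue-1/apps</link>
      <dc:creator>Bob Example</dc:creator>
    </item>
    <item>
      <title>Interesting Links</title>
      <link>https://example.com/issue-2/links</link>
      <dc:creator>Alice Example</dc:creator>
    </item>
  </channel>
</rss>"""


def articles_by_author(feed_xml, author):
    """Return title/link pairs for feed items whose byline matches `author`."""
    root = ET.fromstring(feed_xml)
    wanted = author.strip().lower()
    matches = []
    for item in root.iter("item"):
        # Items without a byline fall back to an empty string and never match.
        creator = item.findtext("dc:creator", default="", namespaces=DC_NS)
        if creator.strip().lower() == wanted:
            matches.append({
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
            })
    return matches


if __name__ == "__main__":
    for article in articles_by_author(SAMPLE_FEED, "Alice Example"):
        print(article["title"], article["link"])
```

A script like this only works when the byline is available as structured data; the agent's approach of opening each issue and reading the page is the fallback for when it isn't.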

In subsequent prompts within the same chat, I was able to let agent process 27 more issues of the newsletter for a total of 50 before any additional request returned an error, likely suggesting a context window limit. For every session, ChatGPT agent prepared a Markdown report that I could view from the ChatGPT app on iOS and iPadOS, but which I could only save as a PDF from the ChatGPT web app on desktop. The results were very good, with a summary of everything I wrote for each issue of Weekly I contributed to, plus some actually good suggestions at the end of each document for topics I may want to revisit or expand upon.

I then decided to go even further and asked the agent to make me a Notion database about an upcoming trip we have this summer. Specifically, I gave it some initial details myself, but I told it to double-check everything via its Gmail connector. Once everything was confirmed, I asked it to log into my Notion account and create a new database with the necessary details for the key activities on the trip. I only had to take over the browser twice at the beginning: the first time to log into my Notion account; the second to bring up the modal window to create a new database because the agent couldn’t figure out how to display it. Then I left it to work on its own for an hour.

After 60 minutes, it created a basic Notion database, all on its own, with entries for my flights, hotel, and other activities. Honestly, 60 minutes to create a database with five items isn’t great. But if you consider that this is the worst ChatGPT agent will ever be, and that it successfully created a database with properties and other metadata on its first try by cross-referencing data with my email, I think that’s pretty impressive.

Then, when I was done with these experiments, I opened ChatGPT’s settings and wiped my data for remote browser connections, which I assume is equivalent to erasing all browsing data and activity from a device’s browser.

I could see myself using ChatGPT agent on a regular basis to work autonomously and asynchronously on data-crunching and other boring, text-heavy tasks I don’t want to do manually anymore, or which would literally take me days to complete instead of 60 minutes. Right now, my biggest issue with any ChatGPT product isn’t the model per se, but the lack of MCP support and third-party integrations. I don’t want my work to be stuck inside ChatGPT with a bunch of reports and PDFs I have to read inside chats. This is why, after the agent was done processing my newsletters, I took its PDFs, gave them to Claude, and asked it to extract article ideas and put them in my Content Planner in Notion. It worked perfectly, and there are some interesting follow-up ideas in there for old topics I may revisit soon.

I think it’s clear, however, that the puck is heading in one direction: more compute, heavy tool usage, and vision capabilities combined with large context windows are turning chatbots into long-promised agents that can work alongside us on tasks we can’t or don’t want to do ourselves anymore. I’m quite impressed by what OpenAI delivered with this first version of ChatGPT agent; I hope the company can make good on its promise of MCP support in the ChatGPT app in the near future.
