What Is GPTBot?

GPTBot is OpenAI's web crawler that fetches publicly available content from the internet to train its generative AI models. It sends HTTP requests with a user-agent string containing GPTBot/1.1 and a link to OpenAI's documentation, making it identifiable in server access logs. Traffic can also be verified against OpenAI's published IP ranges (OpenAI Developers, "Overview of OpenAI Crawlers", 2025).
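
As a rough illustration of spotting the crawler in access logs, the minimal Python sketch below pulls the GPTBot token out of a user-agent string. It matches on the GPTBot name and captures whatever version follows, on the assumption that the version number may change over time. The function name gptbot_version, the sample strings, and the example.com link are made up for illustration and are not taken from OpenAI's documentation.

```python
import re
from typing import Optional

# Matches the GPTBot product token and captures its version, e.g. "GPTBot/1.1".
GPTBOT_TOKEN = re.compile(r"\bGPTBot/(\d+(?:\.\d+)*)")


def gptbot_version(user_agent: str) -> Optional[str]:
    """Return the GPTBot version found in a user-agent string, or None."""
    match = GPTBOT_TOKEN.search(user_agent)
    return match.group(1) if match else None


# Hypothetical user-agent strings, for illustration only.
# The documentation link is a placeholder, not OpenAI's real URL.
sample_agents = [
    "Mozilla/5.0 (compatible; GPTBot/1.1; +https://example.com/bot-docs)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/128.0",
]

for agent in sample_agents:
    print(gptbot_version(agent))
```

A match here only shows that the request claims to be GPTBot, since any client can send this string.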

How Does GPTBot Access Your Content?

GPTBot crawls pages that are publicly reachable without authentication, following links much like a search engine bot does. Each request carries the GPTBot/1.1 identifier in the user-agent header, so web servers can recognize it in logs. To block GPTBot from your entire site, add the line User-agent: GPTBot followed by the line Disallow: / to your robots.txt file (OpenAI Developers, "Overview of OpenAI Crawlers", 2025). You can also permit crawling on specific paths while blocking others, using standard robots.txt path syntax.
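
One way to sanity-check a rule set before deploying it is to parse it with Python's standard urllib.robotparser module, as in the sketch below. The rules shown allow a hypothetical /blog/ directory and block everything else for GPTBot. The paths, the example.com URLs, and the helper name is_allowed are assumptions made for illustration. The Allow line is placed first so that parsers using either first-match or longest-match rules reach the same result.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: let GPTBot crawl /blog/ and block the rest of the site.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/blog/post"))
print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/private/page"))
# Crawlers with no matching User-agent group are unaffected by these rules.
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/page"))
```

This only checks how Python's parser reads the rules; it does not confirm how any particular crawler interprets them.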

Frequently Asked Questions

What Does GPTBot Collect?

GPTBot collects publicly accessible web content that OpenAI uses to train and improve its generative AI models, including future versions of GPT. It skips pages that require login and respects standard robots.txt directives.

How Do I Block GPTBot From My Site?

Add two lines to your site's robots.txt: User-agent: GPTBot on one line and Disallow: / on the next. This tells GPTBot to skip your entire site. To block only specific directories, replace Disallow: / with one Disallow line per path.

How Can I Verify That a Request Really Comes From GPTBot?

Check the request's user-agent string for GPTBot/1.1, then cross-reference the source IP against OpenAI's published IP ranges in their developer documentation (OpenAI Developers, "Overview of OpenAI Crawlers", 2025). A user-agent string can be spoofed by any client, so the IP check is what confirms the request's origin. Use both checks together.
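
The two-step check can be sketched in Python with the standard ipaddress module. The CIDR blocks below are placeholders taken from the reserved documentation ranges, not OpenAI's real addresses, so substitute the list OpenAI publishes before relying on the result. The function name is_verified_gptbot and the sample user-agent string are likewise hypothetical.

```python
import ipaddress
from typing import Iterable

# Placeholder CIDR blocks from the reserved documentation ranges (RFC 5737).
# These are NOT OpenAI's addresses; replace them with the published list.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def is_verified_gptbot(user_agent: str, source_ip: str,
                       cidr_ranges: Iterable[str]) -> bool:
    """Return True only if the user-agent names GPTBot AND the IP is in range."""
    if "GPTBot" not in user_agent:
        return False
    # ip_address raises ValueError if source_ip is not a valid address.
    address = ipaddress.ip_address(source_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


# Hypothetical user-agent string, for illustration only.
claimed_agent = "Mozilla/5.0 (compatible; GPTBot/1.1)"

print(is_verified_gptbot(claimed_agent, "192.0.2.10", PLACEHOLDER_RANGES))
print(is_verified_gptbot(claimed_agent, "203.0.113.5", PLACEHOLDER_RANGES))
```

The second call models a spoofed request: the user-agent claims GPTBot, but the source address falls outside the listed ranges, so the check fails.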