Web AI crawlers

Configuration to ban AI bots. The idea is to look for their User-Agents in your webserver logs.

You could also Disallow those user agents from crawling your websites in a robots.txt file. I personally prefer banning them, to save resources and be less cooperative with them. Note that an AI bot may present a browser-like User-Agent and go unnoticed...
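
For reference, the robots.txt route looks like this, one block per user agent (only bots that choose to honor robots.txt will respect it):

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /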

While the goal here is to prevent AI bots from feeding on your websites, banning search engine bots may affect how you appear in search results.

Most search engines seem to use separate user agents for AI and for search, but not all of them do.

You can either maintain your own list, or delegate this task to the ai-robots.txt project.

1 - Handcrafted list

An incomplete list of user agents based on the ai-robots.txt project.

  • ChatGPT-User
  • DuckAssistBot
  • Meta-ExternalFetcher
  • AI2Bot
  • Applebot-Extended
  • Bytespider
  • CCBot
  • ClaudeBot
  • Diffbot
  • FacebookBot
  • Google-Extended
  • GPTBot
  • Kangaroo Bot
  • Meta-ExternalAgent
  • omgili
  • Timpibot
  • Webzio-Extended
  • Amazonbot
  • Applebot
  • OAI-SearchBot
  • PerplexityBot
  • YouBot

As a pattern, we'll use ip (referenced as <ip> in the regexes below); see the patterns documentation.
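
For reference, here is roughly what a matching line looks like with nginx's default combined log format: the IP is the first field and the User-Agent is the last quoted field. The User-Agent string below is only illustrative:

203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /blog/ HTTP/1.1" 200 6342 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"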

Jsonnet example:

local bots = [
  "ChatGPT-User", "DuckAssistBot", "Meta-ExternalFetcher", "AI2Bot",
  "Applebot-Extended", "Bytespider", "CCBot", "ClaudeBot", "Diffbot",
  "FacebookBot", "Google-Extended", "GPTBot", "Kangaroo Bot",
  "Meta-ExternalAgent", "omgili", "Timpibot", "Webzio-Extended",
  "Amazonbot", "Applebot", "OAI-SearchBot", "PerplexityBot", "YouBot",
];
{
  streams: {
    nginx: {
      cmd: ['...'], // see ./nginx.md
      filters: {
        aiBots: {
          regex: [
            // User-Agent is the last field
            // Bot's name can be anywhere in the User-Agent
            // (hence the leading and trailing [^"]*)
            @'^<ip> .* "[^"]*(%s)[^"]*"$' % std.join('|', bots)
          ],
          actions: banFor('30d'),
        },
      },
    },
    traefik: {
      cmd: ['...'], // see ./traefik.md
      filters: {
        aiBots: {
          regex: [
            // request_User-Agent is the last field
            // the field is not present by default
            // see ./traefik.md to add this header field
            // Bot's name can be anywhere in the User-Agent
            // (hence the leading and trailing [^"]*)
            @'^.*"ClientHost":"<ip>".*"request_User-Agent":"[^"]*%s[^"]*"' % bot
            for bot in bots
          ],
          actions: banFor('30d'),
        },
      },
    },
  },
}
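
The banFor helper used above is not defined in this snippet; it comes from reaction's general documentation. If you don't have it yet, a minimal sketch could look like the following, assuming an ip46tables chain named reaction (adapt the commands to your own firewall setup):

// Sketch: ban <ip>, then undo the ban after `time`
local banFor(time) = {
  ban: {
    cmd: ['ip46tables', '-w', '-A', 'reaction', '-s', '<ip>', '-j', 'DROP'],
  },
  unban: {
    cmd: ['ip46tables', '-w', '-D', 'reaction', '-s', '<ip>', '-j', 'DROP'],
    after: time, // run this long after the ban
  },
};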

YAML example:

streams:
  nginx:
    cmd: ['...'] # see ./nginx.md
    filters:
      aiBots:
        regex:
          # User-Agent is the last field
          # Bot's name can be anywhere in the User-Agent
          # (hence the leading and trailing [^"]*)
          - '^<ip>.*"[^"]*ChatGPT-User[^"]*"$'
          - '^<ip>.*"[^"]*DuckAssistBot[^"]*"$'
          - '^<ip>.*"[^"]*Meta-ExternalFetcher[^"]*"$'
          - '...' # Repeat for each bot
        actions: '...' # your ban actions here

2 - Automatic list

The ai-robots.txt community project maintains a list of AI robots' user agents.

We can use that list to dynamically construct our regex list, thanks to Jsonnet and a bit of cleverness.
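
Their robots.json file uses the user agent names as top-level keys, which is exactly what we need. An abridged, illustrative excerpt (the exact fields may differ):

{
  "GPTBot": {
    "operator": "OpenAI",
    "description": "..."
  },
  "ClaudeBot": {
    "operator": "Anthropic",
    "description": "..."
  }
}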

We must first download their JSON file.

Automatically downloading the data

Assuming you use systemd, you could add this download step before reaction starts:

/etc/systemd/system/reaction.service:

[Service]
ExecStartPre=curl -o /var/lib/reaction/ai-robots.json https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.json
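
Two optional refinements, depending on your setup: if reaction.service is provided by a package, put the line in a drop-in override (systemctl edit reaction) instead of editing the unit in place, and prefix the command with - so that a temporarily unreachable GitHub does not prevent reaction from starting once the file already exists from a previous run:

/etc/systemd/system/reaction.service.d/override.conf:

[Service]
ExecStartPre=-curl -o /var/lib/reaction/ai-robots.json https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.json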

Creating the configuration file

// Import the JSON file
local aiRobots = import "/var/lib/reaction/ai-robots.json";

// Keep only the user agents, which are the keys of the loaded object
local names = std.objectFields(aiRobots);

// Join all user agents into one string
local joined = std.join("|", names);

{
  streams: {
    nginx: {
      cmd: ['...'], // see ./nginx.md
      filters: {
        aiBots: {
          regex: [ @'^<ip>.*"[^"]*(' + joined + ')[^"]*"$' ],
          actions: {
             // ban!
          } 
        }
      }
    }
  }
}
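
To check the regex that reaction will actually load, you can evaluate the configuration yourself with the jsonnet CLI, assuming it is installed and the JSON file has already been downloaded (replace the path with that of your own configuration file):

jsonnet /etc/reaction/reaction.jsonnet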

Optional: still allowing some bots from the list

If you want to allow some of those bots, you can remove them from the list first:

// Import the JSON file
local aiRobots = import "/var/lib/reaction/ai-robots.json";

// List of user agents we want to allow
local excluded = ["FacebookBot", "Google-Firebase"];

// Filter dynamic list based on excluded list
local exclude(elem) = !std.member(excluded, elem);

// Keep only the user agents, which are the keys of the loaded object
local names = std.filter(exclude, std.objectFields(aiRobots));

// Join all user agents into one string
local joined = std.join("|", names);
{
  streams: {
    nginx: {
      cmd: ['...'], // see ./nginx.md
      filters: {
        aiBots: {
          regex: [ @'^<ip>.*"[^"]*(' + joined + ')[^"]*"$' ],
          actions: {
             // ban!
          } 
        }
      }
    }
  }
}
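
Similarly, you can check that the excluded user agents are really gone with a jsonnet one-liner (again assuming the JSON file is already in place):

jsonnet -e 'std.filter(function(n) !std.member(["FacebookBot", "Google-Firebase"], n), std.objectFields(import "/var/lib/reaction/ai-robots.json"))'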