Web AI crawlers

Configuration to ban GPTBot and friends. The idea is to look for their User-Agents in your web server logs.

You can also Disallow those user agents from crawling your websites in a robots.txt file. I personally prefer banning them, to save resources and to cooperate with them less. Note that an AI bot may send a browser-like User-Agent and go unnoticed...
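For the robots.txt route, here is a minimal Python sketch that builds one rule group for a list of crawlers. The helper name `robots_txt` is hypothetical, and the two agents are only examples; a crawler is free to ignore the file, which is why banning remains the stricter option.

```python
def robots_txt(user_agents):
    """Hypothetical helper: one robots.txt group that disallows everything
    for each listed user agent."""
    lines = [f"User-agent: {ua}" for ua in user_agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


# Example agents only; extend with the names you want to turn away.
print(robots_txt(["GPTBot", "CCBot"]), end="")
```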

While the goal here is to keep AI bots from feeding on your websites, banning search engine bots may affect how you appear in search results.

They seem to have separate user agents for AI and for search, but who knows?

A (most probably incomplete) list of user agents based on https://darkvisitors.com/agents:

  • ChatGPT-User
  • DuckAssistBot
  • Meta-ExternalFetcher
  • AI2Bot
  • Applebot-Extended
  • Bytespider
  • CCBot
  • ClaudeBot
  • Diffbot
  • FacebookBot
  • Google-Extended
  • GPTBot
  • Kangaroo Bot
  • Meta-ExternalAgent
  • omgili
  • Timpibot
  • Webzio-Extended
  • Amazonbot
  • Applebot
  • OAI-SearchBot
  • PerplexityBot
  • YouBot

(Feel free to add your own discoveries to this list!)

As a pattern, we'll use ip. See here.

Jsonnet Example:

local bots = [ "ChatGPT-User", "DuckAssistBot", "Meta-ExternalFetcher", "AI2Bot", "Applebot-Extended", "Bytespider", "CCBot", "ClaudeBot", "Diffbot", "FacebookBot", "Google-Extended", "GPTBot", "Kangaroo Bot", "Meta-ExternalAgent", "omgili", "Timpibot", "Webzio-Extended", "Amazonbot", "Applebot", "OAI-SearchBot", "PerplexityBot", "YouBot" ];
{
  streams: {
    nginx: {
      cmd: ['...'], // see ./nginx.md
      filters: {
        aiBots: {
          regex: [
            // User-Agent is the last field
            // Bot's name can be anywhere in the User-Agent
            // (hence the leading and trailing [^"]*)
            @'^<ip>.*"[^"]*%s[^"]*"$' % bot
            for bot in bots
          ],
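          // banFor is not defined in this snippet; it is assumed to be a helper from the rest of your config
          actions: banFor('720h'),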
        },
      },
    },
  },
}
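To check the shape of this regex before deploying it, the following Python sketch runs it against two made-up log lines in nginx's default "combined" format. It is only an approximation: `IP` is a simplified IPv4-only stand-in for whatever `<ip>` expands to, `bot_regex` is a hypothetical helper, and Python's `re` engine may differ in details from the one that evaluates your filters.

```python
import re

# Assumption: simplified IPv4-only stand-in for the <ip> pattern.
IP = r"(?:\d{1,3}\.){3}\d{1,3}"


def bot_regex(bot):
    """Hypothetical helper: same shape as the filter regex, with <ip>
    turned into a named group. Bot names from the list contain no regex
    metacharacters, so they are inserted as-is."""
    return re.compile(r'^(?P<ip>' + IP + r').*"[^"]*' + bot + r'[^"]*"$')


# Made-up log lines; the User-Agent is the last quoted field.
bot_line = ('203.0.113.7 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" '
            '200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"')
human_line = ('198.51.100.4 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" '
              '200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"')

match = bot_regex("GPTBot").match(bot_line)
print(match.group("ip"))                      # address that would be banned
print(bot_regex("GPTBot").match(human_line))  # None: ordinary browser, no ban
```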

YAML Example:

# YAML has no variables or loops, so each bot from the list above gets its own regex line.

streams:
  nginx:
    cmd: ['...'] # see ./nginx.md
    filters:
      aiBots:
        regex:
          # User-Agent is the last field
          # Bot's name can be anywhere in the User-Agent
          # (hence the leading and trailing [^"]*)
          - '^<ip>.*"[^"]*ChatGPT-User[^"]*"$'
          - '^<ip>.*"[^"]*DuckAssistBot[^"]*"$'
          - '...' # Repeat for each bot
        actions: '...' # your ban actions here
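Since the YAML version needs one regex line per bot, a small script can generate those lines instead of typing them by hand. This is a minimal sketch: the helper name `yaml_regex_items` and its `indent` parameter are hypothetical, the bot list is shortened for the example, and names are inserted unescaped, which is only safe as long as they contain no regex metacharacters or single quotes.

```python
def yaml_regex_items(bots, indent=10):
    """Hypothetical helper: one YAML list item per bot, matching the
    hand-written lines in the example above."""
    pad = " " * indent
    return [
        pad + "- '" + '^<ip>.*"[^"]*' + bot + '[^"]*"$' + "'"
        for bot in bots
    ]


# Shortened list for the example; paste the full list from above.
bots = ["ChatGPT-User", "DuckAssistBot", "GPTBot"]
print("\n".join(yaml_regex_items(bots)))
```

The printed lines can be pasted under the `regex:` key; the default indent of 10 spaces matches the example.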