Web AI crawlers

Configuration to ban AI bots. The idea is to look for their User-Agents in your webserver logs.

You could also Disallow those user agents from crawling your websites in a robots.txt file. I personally prefer banning them, to save resources and be less cooperative with them. Note that an AI bot may present a browser-like User-Agent and go unnoticed...
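
For reference, the robots.txt route looks like this, one block per user agent (only bots that choose to honor robots.txt will respect it):

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /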

While the goal here is to prevent AI bots from feeding on your websites, banning search engine bots may affect how you appear in search results.

Most search engines seem to use separate user agents for AI and for search, but not all of them do.

You can either maintain your own list, or delegate this task to the ai-robots.txt project.

1 - Handcrafted list

An incomplete list of user agents based on the ai-robots.txt project.

  • ChatGPT-User
  • DuckAssistBot
  • Meta-ExternalFetcher
  • AI2Bot
  • Applebot-Extended
  • Bytespider
  • CCBot
  • ClaudeBot
  • Diffbot
  • FacebookBot
  • Google-Extended
  • GPTBot
  • Kangaroo Bot
  • Meta-ExternalAgent
  • omgili
  • Timpibot
  • Webzio-Extended
  • Amazonbot
  • Applebot
  • OAI-SearchBot
  • PerplexityBot
  • YouBot

As a pattern, we'll use ip (referenced as <ip> in the regexes below); see the patterns documentation.
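
For reference, here is roughly what a matching line looks like with nginx's default combined log format: the IP is the first field and the User-Agent is the last quoted field. The User-Agent string below is only illustrative:

203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /blog/ HTTP/1.1" 200 6342 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"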

Jsonnet example:

local bots = [
  "ChatGPT-User", "DuckAssistBot", "Meta-ExternalFetcher", "AI2Bot",
  "Applebot-Extended", "Bytespider", "CCBot", "ClaudeBot", "Diffbot",
  "FacebookBot", "Google-Extended", "GPTBot", "Kangaroo Bot",
  "Meta-ExternalAgent", "omgili", "Timpibot", "Webzio-Extended",
  "Amazonbot", "Applebot", "OAI-SearchBot", "PerplexityBot", "YouBot",
];
{
  streams: {
    nginx: {
      cmd: ['...'], // see ./nginx.md
      filters: {
        aiBots: {
          regex: [
            // User-Agent is the last field
            // Bot's name can be anywhere in the User-Agent
            // (hence the leading and trailing [^"]*)
            @'^<ip> .* "[^"]*(%s)[^"]*"$' % std.join('|', bots)
          ],
          actions: banFor('30d'),
        },
      },
    },
    traefik: {
      cmd: ['...'], // see ./traefik.md
      filters: {
        aiBots: {
          regex: [
            // request_User-Agent is the last field
            // the field is not present by default
            // see ./traefik.md to add this header field
            // Bot's name can be anywhere in the User-Agent
            // (hence the leading and trailing [^"]*)
            @'^.*"ClientHost":"<ip>".*"request_User-Agent":"[^"]*%s[^"]*"' % bot
            for bot in bots
          ],
          actions: banFor('30d'),
        },
      },
    },
  },
}
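
The banFor helper used above is not defined in this snippet; it comes from reaction's general documentation. If you don't have it yet, a minimal sketch could look like the following, assuming an ip46tables chain named reaction (adapt the commands to your own firewall setup):

// Sketch: ban <ip>, then undo the ban after `time`
local banFor(time) = {
  ban: {
    cmd: ['ip46tables', '-w', '-A', 'reaction', '-s', '<ip>', '-j', 'DROP'],
  },
  unban: {
    cmd: ['ip46tables', '-w', '-D', 'reaction', '-s', '<ip>', '-j', 'DROP'],
    after: time, // run this long after the ban
  },
};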

YAML example:

streams:
  nginx:
    cmd: ['...'] # see ./nginx.md
    filters:
      aiBots:
        regex:
          # User-Agent is the last field
          # Bot's name can be anywhere in the User-Agent
          # (hence the leading and trailing [^"]*)
          - '^<ip>.*"[^"]*ChatGPT-User[^"]*"$'
          - '^<ip>.*"[^"]*DuckAssistBot[^"]*"$'
          - '^<ip>.*"[^"]*Meta-ExternalFetcher[^"]*"$'
          - '...' # Repeat for each bot
        actions: '...' # your ban actions here

2 - Automatic list

The ai-robots.txt community project maintains a list of AI robots' user agents.

We can use that list to dynamically construct our regex list, thanks to Jsonnet and a bit of cleverness.
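
Their robots.json file uses the user agent names as top-level keys, which is exactly what we need. An abridged, illustrative excerpt (the exact fields may differ):

{
  "GPTBot": {
    "operator": "OpenAI",
    "description": "..."
  },
  "ClaudeBot": {
    "operator": "Anthropic",
    "description": "..."
  }
}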

We must first download their JSON file.

Automatically downloading the data

Assuming you use systemd, you could add this download step before reaction starts:

/etc/systemd/system/reaction.service:

[Service]
ExecStartPre=curl -o /var/lib/reaction/ai-robots.json https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.json
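
Two optional refinements, depending on your setup: if reaction.service is provided by a package, put the line in a drop-in override (systemctl edit reaction) instead of editing the unit in place, and prefix the command with - so that a temporarily unreachable GitHub does not prevent reaction from starting once the file already exists from a previous run:

/etc/systemd/system/reaction.service.d/override.conf:

[Service]
ExecStartPre=-curl -o /var/lib/reaction/ai-robots.json https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.json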

Creating the configuration file

// Import the JSON file
local aiRobots = import "/var/lib/reaction/ai-robots.json";

// Keep only the user agents, which are the keys of the loaded object
local names = std.objectFields(aiRobots);

// Join all user agents into one string
local joined = std.join("|", names);

{
  streams: {
    nginx: {
      cmd: ['...'], // see ./nginx.md
      filters: {
        aiBots: {
          regex: [ @'^<ip>.*"[^"]*(' + joined + ')[^"]*"$' ],
          actions: {
             // ban!
          } 
        }
      }
    }
  }
}
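
To check the regex that reaction will actually load, you can evaluate the configuration yourself with the jsonnet CLI, assuming it is installed and the JSON file has already been downloaded (replace the path with that of your own configuration file):

jsonnet /etc/reaction/reaction.jsonnet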

Optional: still allowing some bots from the list

If you want to allow some of those bots, you can remove them from the list first:

// Import the JSON file
local aiRobots = import "/var/lib/reaction/ai-robots.json";

// List of user agents we want to allow
local excluded = ["FacebookBot", "Google-Firebase"];

// Filter dynamic list based on excluded list
local exclude(elem) = !std.member(excluded, elem);

// Keep only the user agents, which are the keys of the loaded object
local names = std.filter(exclude, std.objectFields(aiRobots));

// Join all user agents into one string
local joined = std.join("|", names);
{
  streams: {
    nginx: {
      cmd: ['...'], // see ./nginx.md
      filters: {
        aiBots: {
          regex: [ @'^<ip>.*"[^"]*(' + joined + ')[^"]*"$' ],
          actions: {
             // ban!
          } 
        }
      }
    }
  }
}
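
Similarly, you can check that the excluded user agents are really gone with a jsonnet one-liner (again assuming the JSON file is already in place):

jsonnet -e 'std.filter(function(n) !std.member(["FacebookBot", "Google-Firebase"], n), std.objectFields(import "/var/lib/reaction/ai-robots.json"))'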