Web AI crawlers
Configuration to ban AI bots. The idea is to look for their User-Agents in your webserver logs.
You could also Disallow those user agents in a robots.txt file (a minimal sketch follows below).
I personally prefer banning them, to save resources and be less cooperative with them.
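For reference, here is what such a robots.txt could look like. A rule group may list several User-agent lines, and the names are the same user agents used in the filters below:

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /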
Note that an AI bot may present a browser-like User-Agent and go unnoticed...
While the goal here is to prevent AI bots from feeding on your websites, banning search engine bots may affect how your sites appear in search results.
Most search engines seem to use separate user agents for AI and for search, but not all of them do.
You can either maintain your own list, or delegate this task to the ai-robots.txt project.
1 - Handcrafted list
An incomplete list of user agents, based on the ai-robots.txt project:
ChatGPT-User, DuckAssistBot, Meta-ExternalFetcher, AI2Bot, Applebot-Extended, Bytespider, CCBot, ClaudeBot, Diffbot, FacebookBot, Google-Extended, GPTBot, Kangaroo Bot, Meta-ExternalAgent, omgili, Timpibot, Webzio-Extended, Amazonbot, Applebot, OAI-SearchBot, PerplexityBot, YouBot
As a pattern, we'll use <ip> (see the documentation on patterns).
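For illustration, a matching nginx access log line (combined log format) could look like this; the User-Agent string is a plausible one, check your own logs for the exact strings:

203.0.113.4 - - [10/Oct/2025:13:55:36 +0000] "GET /blog HTTP/1.1" 200 612 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot"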
JSONnet example:
local bots = [
  "ChatGPT-User", "DuckAssistBot", "Meta-ExternalFetcher", "AI2Bot",
  "Applebot-Extended", "Bytespider", "CCBot", "ClaudeBot", "Diffbot",
  "FacebookBot", "Google-Extended", "GPTBot", "Kangaroo Bot",
  "Meta-ExternalAgent", "omgili", "Timpibot", "Webzio-Extended",
  "Amazonbot", "Applebot", "OAI-SearchBot", "PerplexityBot", "YouBot",
];
{
  streams: {
    nginx: {
      cmd: ['...'], // see ./nginx.md
      filters: {
        aiBots: {
          regex: [
            // User-Agent is the last field.
            // A bot's name can be anywhere in the User-Agent
            // (hence the leading and trailing [^"]*)
            @'^<ip> .* "[^"]*(%s)[^"]*"$' % std.join('|', bots),
          ],
          // banFor() is a ban helper defined elsewhere in your config
          actions: banFor('30d'),
        },
      },
    },
    traefik: {
      cmd: ['...'], // see ./traefik.md
      filters: {
        aiBots: {
          regex: [
            // request_User-Agent is the last field.
            // It is not present by default:
            // see ./traefik.md to add this header field.
            // A bot's name can be anywhere in the User-Agent
            // (hence the leading and trailing [^"]*)
            @'^.*"ClientHost":"<ip>".*"request_User-Agent":"[^"]*%s[^"]*"' % bot
            for bot in bots
          ],
          actions: banFor('30d'),
        },
      },
    },
  },
}
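For reference, the nginx regex above expands to a single alternation of all the bot names, something like (truncated here):

^<ip> .* "[^"]*(ChatGPT-User|DuckAssistBot|Meta-ExternalFetcher|...)[^"]*"$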
YAML example:
streams:
  nginx:
    cmd: ['...'] # see ./nginx.md
    filters:
      aiBots:
        regex:
          # User-Agent is the last field.
          # A bot's name can be anywhere in the User-Agent
          # (hence the leading and trailing [^"]*)
          - '^<ip>.*"[^"]*ChatGPT-User[^"]*"$'
          - '^<ip>.*"[^"]*DuckAssistBot[^"]*"$'
          - '^<ip>.*"[^"]*Meta-ExternalFetcher[^"]*"$'
          - '...' # Repeat for each bot
        actions: '...' # your ban actions here
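If you prefer a single rule over one line per bot, the regex list can also contain one hand-written alternation, like in the JSONnet example (truncated here):

          - '^<ip>.*"[^"]*(ChatGPT-User|DuckAssistBot|Meta-ExternalFetcher|...)[^"]*"$'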
2 - Automatic list
The ai-robots.txt community project maintains a list of AI robots' user agents.
We can use that list to dynamically construct our regex list, thanks to JSONnet and a bit of cleverness.
We must first download their JSON file.
Automatically downloading the data
Assuming you use systemd, you can add the download as a step that runs right before reaction starts:
/etc/systemd/system/reaction.service:
[Service]
ExecStartPre=/usr/bin/curl -fsSL -o /var/lib/reaction/ai-robots.json https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.json
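The absolute /usr/bin/curl path is used because older systemd versions require absolute paths in Exec lines, and -f makes curl fail on HTTP errors instead of saving an error page. Also note that curl will not create the target directory, so create it once before the first start:

mkdir -p /var/lib/reaction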
Creating the configuration file
// Import the JSON file
local aiRobots = import "/var/lib/reaction/ai-robots.json";
// Keep only the user agents, which are the keys of the loaded object
local names = std.objectFields(aiRobots);
// Join all user agents into one alternation string
local joined = std.join("|", names);
{
  streams: {
    nginx: {
      cmd: ['...'], // see ./nginx.md
      filters: {
        aiBots: {
          regex: [@'^<ip>.*"[^"]*(' + joined + ')[^"]*"$'],
          actions: {
            // ban!
          },
        },
      },
    },
  },
}
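If you have the jsonnet CLI installed, you can render the file to inspect the generated regex (the config path here is an assumption, adapt it to yours):

jsonnet /etc/reaction/reaction.jsonnet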
Optional: Still allowing some bots from the list
If you want to allow some of those bots, you can remove them from the list first:
// Import the JSON file
local aiRobots = import "/var/lib/reaction/ai-robots.json";
// List of user agents we want to allow
local excluded = ["FacebookBot", "Google-Firebase"];
// Keep an element only if it is not in the excluded list
local exclude(elem) = !std.contains(excluded, elem);
// Keep only the user agents, which are the keys of the loaded object
local names = std.filter(exclude, std.objectFields(aiRobots));
// Join all user agents into one alternation string
local joined = std.join("|", names);
{
  streams: {
    nginx: {
      cmd: ['...'], // see ./nginx.md
      filters: {
        aiBots: {
          regex: [@'^<ip>.*"[^"]*(' + joined + ')[^"]*"$'],
          actions: {
            // ban!
          },
        },
      },
    },
  },
}
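To pick the names to exclude, you can list the user agents present in the downloaded file, for example with the jsonnet CLI:

jsonnet -e 'std.objectFields(import "/var/lib/reaction/ai-robots.json")'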