User-Agent impersonators
Configuration to ban bots that claim to belong to an organisation they are not part of.
We'll use Googlebot as the example, but most "legitimate" (whatever that means) actors publish the IP ranges their crawlers use.
Don't hesitate to add other bots to this page!
Googlebot
Great news! This can be fully automated, thanks to Jsonnet and a bit of cleverness.
Google documents how to check their crawlers.
They offer a JSON file to download, which contains all the IP ranges used by Googlebot.
Automatically downloading the data
So we first need to download it somewhere. Assuming you use systemd, you could add this download step before running reaction:
/etc/systemd/system/reaction.service:
[Service]
ExecStartPre=curl --fail --silent --show-error -o /var/lib/reaction/googlebot.json https://developers.google.com/search/apis/ipranges/googlebot.json
Parsing the data
Then we'll parse this JSON file directly in reaction's Jsonnet configuration:
// Import the googlebot.json file
// It must first be downloaded from here:
// https://developers.google.com/search/apis/ipranges/googlebot.json
local googlebot = std.parseJson(importstr "/var/lib/reaction/googlebot.json").prefixes;
// Helper functions to test which field an object has
// https://jsonnet.org/ref/stdlib.html#std-objectHas
local isIpv6(obj) = std.objectHas(obj, "ipv6Prefix");
local isIpv4(obj) = std.objectHas(obj, "ipv4Prefix");
// Helper functions to extract the correct field
local toIpv6(obj) = obj.ipv6Prefix;
local toIpv4(obj) = obj.ipv4Prefix;
// Split the big array of objects into IPv4 and IPv6 ranges arrays
// https://jsonnet.org/ref/stdlib.html#std-filterMap
local ipv6Adresses = std.filterMap(isIpv6, toIpv6, googlebot);
local ipv4Adresses = std.filterMap(isIpv4, toIpv4, googlebot);
// Merge all IPv4 and IPv6 addresses in one array
// https://jsonnet.org/ref/stdlib.html#std-flattenArrays
local allAdresses = std.flattenArrays([ipv4Adresses, ipv6Adresses]);
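To see what these helpers produce, here is a minimal sketch that runs the same steps on inline data instead of the downloaded file. The `sample` object is an assumption for illustration only: it mimics the `prefixes` / `ipv4Prefix` / `ipv6Prefix` layout used above, but its ranges are documentation placeholders, not real Googlebot prefixes.

```jsonnet
// Illustrative stand-in for googlebot.json: same keys as the real file,
// but placeholder documentation ranges (NOT real Googlebot prefixes).
local sample = {
  prefixes: [
    { ipv6Prefix: '2001:db8::/64' },
    { ipv4Prefix: '192.0.2.0/27' },
    { ipv4Prefix: '198.51.100.0/24' },
  ],
};

// Same helpers as in the configuration above
local isIpv6(obj) = std.objectHas(obj, 'ipv6Prefix');
local isIpv4(obj) = std.objectHas(obj, 'ipv4Prefix');
local toIpv6(obj) = obj.ipv6Prefix;
local toIpv4(obj) = obj.ipv4Prefix;

{
  // Keep only the objects carrying the wanted key, then extract its value
  ipv4: std.filterMap(isIpv4, toIpv4, sample.prefixes),
  ipv6: std.filterMap(isIpv6, toIpv6, sample.prefixes),
  // IPv4 ranges first, then IPv6 ranges, in one flat array
  all: std.flattenArrays([self.ipv4, self.ipv6]),
}
```

Evaluating this with the `jsonnet` command should give an `all` array of three plain CIDR strings (the two IPv4 ranges followed by the IPv6 one), which is the shape the IP pattern expects.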
Creating the right IP pattern
Now that we have our list of IP ranges, we can create an IP pattern that matches every IP except the genuine Googlebot ones:
{
  patterns: {
    // This pattern only matches addresses
    // that do not belong to Google's advertised ranges
    ipnogoogle: {
      type: 'ip',
      ignorecidr: allAdresses,
    },
  },
}
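The same idea can be extended to other crawlers by generating one pattern per published range file. The following is only a sketch under assumptions: it supposes the other file uses the same `prefixes` layout as Google's (not verified here for any other provider), and `cidrsOf`, `crawlerPattern`, `otherbot` and `ipnootherbot` are hypothetical names. Inline placeholder data stands in for the downloaded files.

```jsonnet
// Hypothetical helper: every CIDR string of a googlebot.json-shaped document.
// Assumption: other providers' files share the same `prefixes` layout.
local cidrsOf(doc) = [
  (if std.objectHas(p, 'ipv4Prefix') then p.ipv4Prefix else p.ipv6Prefix)
  for p in doc.prefixes
  if std.objectHas(p, 'ipv4Prefix') || std.objectHas(p, 'ipv6Prefix')
];

// Hypothetical helper: an IP pattern that skips one crawler's own ranges
local crawlerPattern(doc) = {
  type: 'ip',
  ignorecidr: cidrsOf(doc),
};

// Placeholder documents; in practice these would come from
// std.parseJson(importstr "...") as shown earlier.
local googlebot = { prefixes: [{ ipv4Prefix: '192.0.2.0/27' }] };
local otherbot = { prefixes: [{ ipv6Prefix: '2001:db8::/64' }] };

{
  patterns: {
    ipnogoogle: crawlerPattern(googlebot),
    ipnootherbot: crawlerPattern(otherbot),
  },
}
```

Keeping one pattern per crawler, rather than merging every range into a single list, means an address from one organisation's ranges is still caught when it claims to be another organisation's bot.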
Writing the correct filter
Then we can create a filter for our web server's logs that will ban all IPs falsely advertising themselves as Googlebot:
{
  streams: {
    nginx: {
      cmd: ['tail', '-Fn0', '/var/log/nginx/access.log'],
      filters: {
        googleimpersonators: {
          regex: [
            @'<ipnogoogle> .* "[^"]*Googlebot[^"]*"$',
          ],
          actions: {
            // ban! report!
          }
        }
      }
    }
  }
}
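The actions are left open above because they depend on your firewall setup. As a placeholder while testing, the sketch below only reports each match to syslog with the standard `logger` command. It assumes that reaction substitutes `<ipnogoogle>` in an action's `cmd` with the matched address, and that a filter without `retry` / `retryperiod` acts on the first match; the action name `report` and the `reaction-googlebot` tag are arbitrary choices.

```jsonnet
{
  streams: {
    nginx: {
      cmd: ['tail', '-Fn0', '/var/log/nginx/access.log'],
      filters: {
        googleimpersonators: {
          regex: [
            @'<ipnogoogle> .* "[^"]*Googlebot[^"]*"$',
          ],
          actions: {
            // Assumption: <ipnogoogle> is replaced by the matched IP.
            // This only logs the match; swap in your own ban command
            // once the filter catches what you expect.
            report: {
              cmd: ['logger', '-t', 'reaction-googlebot', 'fake Googlebot from <ipnogoogle>'],
            },
          },
        },
      },
    },
  },
}
```

Starting with a report-only action makes it possible to check the filter against real traffic before any address gets banned.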
Wrapping up
This file can be dropped as-is into your reaction configuration directory, as long as the googlebot.json file is automatically downloaded to the expected location:
/etc/reaction/googlebot.jsonnet:
// Import the googlebot.json file
// It must first be downloaded from here:
// https://developers.google.com/search/apis/ipranges/googlebot.json
local googlebot = std.parseJson(importstr "/var/lib/reaction/googlebot.json").prefixes;
// Helper functions to test which field an object has
// https://jsonnet.org/ref/stdlib.html#std-objectHas
local isIpv6(obj) = std.objectHas(obj, "ipv6Prefix");
local isIpv4(obj) = std.objectHas(obj, "ipv4Prefix");
// Helper functions to extract the correct field
local toIpv6(obj) = obj.ipv6Prefix;
local toIpv4(obj) = obj.ipv4Prefix;
// Split the big array of objects into IPv4 and IPv6 ranges arrays
// https://jsonnet.org/ref/stdlib.html#std-filterMap
local ipv6Adresses = std.filterMap(isIpv6, toIpv6, googlebot);
local ipv4Adresses = std.filterMap(isIpv4, toIpv4, googlebot);
// Merge all IPv4 and IPv6 addresses in one array
// https://jsonnet.org/ref/stdlib.html#std-flattenArrays
local allAdresses = std.flattenArrays([ipv4Adresses, ipv6Adresses]);
{
  patterns: {
    // This pattern only matches addresses
    // that do not belong to Google's advertised ranges
    ipnogoogle: {
      type: 'ip',
      ignorecidr: allAdresses,
    },
  },
  streams: {
    nginx: {
      // `cmd` can be omitted if it is already specified in another configuration file
      cmd: ['tail', '-Fn0', '/var/log/nginx/access.log'],
      filters: {
        googleimpersonators: {
          regex: [
            @'<ipnogoogle> .* "[^"]*Googlebot[^"]*"$',
          ],
          actions: {
            // ban! report!
          }
        }
      }
    }
  }
}