The current implementation of Spamikaze has only a primitive way of avoiding false positives: it queries one or more WhiteLists of known mail servers (e.g. whitelist.surriel.com) and does not list those IP addresses. Maintaining such a whitelist by hand is a lot of work, and the result is still bound to be incomplete or inaccurate.
Spamikaze's goal is to block only those IP addresses that send out a lot of spam and little legitimate email; that way the users of Spamikaze-powered DNSBLs would get little spam while losing very little legitimate email. There are various ideas on how to identify both spammy IP addresses (that should be blocked) and IP addresses that are the source of lots of legitimate email (and should not be blocked). Please add your idea to this list, so we can discuss them all and decide what to do:
Rik's idea
This method should be best for large sites, or DNSBLs that get a reasonable number of queries.
- For each IP address, measure:
- The number of spamtrap mails recently received.
- The total number of emails received (if we assume that the number of DNSBL queries about an IP address corresponds to the number of emails the DNSBL users received from it, we can simply count the DNSBL queries about this IP address).
- For the database as a whole:
- Calculate the average ratio of (spamtrap mails / total emails) - "spamtrap ratio".
- Calculate the standard deviation.
- For each IP address in the database:
- Blocklist if the IP address has a spamtrap ratio higher than average + standard deviation.
- Auto-whitelist if the IP address has a spamtrap ratio lower than average - standard deviation.
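The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify`, the verdict strings and the input layout are made up for the example, and "average" is taken to mean the unweighted mean of the per-IP ratios (the text does not say whether a per-IP mean or a database-wide total is meant).

```python
from statistics import mean, pstdev


def classify(counts):
    """Map each IP to a verdict based on its spamtrap ratio.

    counts maps IP -> (spamtrap_mails, total_mails). Both the layout and
    the verdict names are assumptions made for this sketch.
    """
    # Spamtrap ratio per IP; IPs without any counted mail are skipped.
    ratios = {ip: spam / total
              for ip, (spam, total) in counts.items() if total > 0}
    if not ratios:
        return {}

    avg = mean(ratios.values())
    dev = pstdev(ratios.values())  # population standard deviation

    verdicts = {}
    for ip, ratio in ratios.items():
        if ratio > avg + dev:
            verdicts[ip] = "blocklist"
        elif ratio < avg - dev:
            verdicts[ip] = "whitelist"
        else:
            verdicts[ip] = "undecided"  # near the average
    return verdicts
```

One thing the sketch makes visible: when most IP addresses have a ratio close to zero, average minus standard deviation can be negative, in which case no address is ever auto-whitelisted by this rule alone.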
enhancements
- If the ratio is near the average, follow the user's preferences:
- An aggressive list may still want to blocklist, if a spamtrap mail was received recently.
- A more cautious list may not want to block these IP addresses.
- A third option would be to be cautious for IP addresses that are on one of the WhiteLists, and aggressive for other IP addresses.
- The score of an IP address can be modified depending on reverse DNS - missing or dynamic-looking reverse DNS can get a host blocked faster than reverse DNS that looks like a mail server, etc.
- The IP address will automatically expire from the database if it stops sending mail to spamtraps, since the legitimate mail will cause the spamtrap ratio to go below the average.
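The three ways of handling a near-average ratio could be expressed as a small policy function. The policy names ("aggressive", "cautious", "mixed") and the function signature are hypothetical and chosen only for this sketch.

```python
def block_near_average(policy, recent_spamtrap_hit, on_whitelist):
    """Decide whether to blocklist an IP whose ratio is near the average.

    policy names are assumptions for this sketch, not Spamikaze settings.
    """
    if policy == "aggressive":
        # Block as soon as a spamtrap mail was received recently.
        return recent_spamtrap_hit
    if policy == "cautious":
        # Never block on a near-average ratio.
        return False
    if policy == "mixed":
        # Cautious for whitelisted IPs, aggressive for everyone else.
        return recent_spamtrap_hit and not on_whitelist
    raise ValueError(f"unknown policy: {policy!r}")
```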
optimizations
- only track spamtrap ratio for IP addresses that have sent spamtrap mail recently
- for other IP addresses we only count queries
- the ratio for other IP addresses will be zero, since they did not send any mail to spamtraps
- this way we only need to track the ratio for a small subset of all IP addresses
SELECT ip FROM blocklist WHERE blocklist.queries > 0 AND (1.0 * blocklist.spam / blocklist.queries) > (SELECT 1.0 * SUM(spam) / SUM(queries) FROM statistics);
- only count DNS queries from a certain time interval, e.g. between a day and a month ago
- if a host suddenly starts spewing spam out of nowhere, it will be listed quickly
- if a host has been sending out legitimate email steadily for weeks, it will not get listed easily
- when rotating/expiring queries from the query count tables:
- calculate and store the total spamtrap mail count and query count in a statistics table
- recalculate the queries count for the IP addresses that sent spamtrap email recently
- if the ratio for a listed IP address now drops below the average, the IP address will drop out of the blocklist
- expire the hosts that sent out spamtrap mail
- a fixed time after the last spamtrap mail was received
- low overhead expiry, easier than recalculating ratios
the statistics table would contain:
- total DNS queries and spamtrap emails per day
- can be expired together with DNS query history
- from this we derive the threshold for blacklisting IP addresses
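The time-window and fixed-time expiry optimizations could look roughly like this. The day-to-month window comes from the text above; the seven-day expiry period and all names (`queries_in_window`, `is_expired`, the constants) are assumptions made for the sketch.

```python
from datetime import datetime, timedelta

WINDOW_NEWEST = timedelta(days=1)   # ignore queries newer than a day
WINDOW_OLDEST = timedelta(days=30)  # ignore queries older than a month
EXPIRE_AFTER = timedelta(days=7)    # hypothetical expiry period


def queries_in_window(query_times, now):
    """Count DNS queries made between a day and a month before `now`."""
    return sum(1 for t in query_times
               if WINDOW_NEWEST <= now - t <= WINDOW_OLDEST)


def is_expired(last_spamtrap_mail, now):
    """True once a fixed time has passed since the last spamtrap mail."""
    return now - last_spamtrap_mail > EXPIRE_AFTER
```

Because queries from the last day are not counted, a host that starts sending spam out of nowhere has few or no queries in the window and therefore a high ratio, while a host with weeks of steady legitimate mail has a large query count to offset an occasional spamtrap hit.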
lonki's idea
This method should be best for small (or even personal) Spamikaze installations.
- Use SpamAssassin to find out which IP addresses send ham.
- Use that data to build up a whitelist.
- The sample size is probably too small for statistical analysis.
Nico's idea
A greylisting method reduces the amount of spam tremendously, with almost no false positives.
Such a greylist system also produces a record of servers that deliver large amounts of mail over a period of time, as well as servers that make an initial delivery attempt, get temporarily rejected and never retry to the same recipient. The latter are obvious candidates for blacklisting.
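A minimal sketch of the record such a greylist could keep, assuming the usual greylisting key of (client IP, envelope sender, envelope recipient). The class name, the five-minute retry delay and the 24-hour give-up period are hypothetical values for illustration, not part of any existing greylisting package.

```python
from collections import Counter
from datetime import datetime, timedelta

MIN_RETRY_DELAY = timedelta(minutes=5)  # hypothetical
GIVE_UP_AFTER = timedelta(hours=24)     # hypothetical


class Greylist:
    def __init__(self):
        self.first_seen = {}        # triplet -> time of first attempt
        self.passed = set()         # triplets that retried successfully
        self.delivered = Counter()  # IP -> number of accepted mails

    def check(self, ip, sender, recipient, now):
        """Return True to accept, False to reject temporarily."""
        key = (ip, sender, recipient)
        first = self.first_seen.setdefault(key, now)
        if key in self.passed or now - first >= MIN_RETRY_DELAY:
            self.passed.add(key)
            self.delivered[ip] += 1
            return True
        return False

    def never_retried(self, now):
        """IPs with an old attempt that was never retried."""
        return {key[0] for key, first in self.first_seen.items()
                if key not in self.passed and now - first > GIVE_UP_AFTER}
```

In this sketch, `delivered` gives the servers that deliver a lot of mail over time (whitelist candidates), and `never_retried` gives the servers that gave up after the temporary rejection (blacklist candidates).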