The current implementation of Spamikaze has only a primitive way of avoiding false positives: querying one or more WhiteLists of known mail servers (e.g. whitelist.surriel.com) and not listing those IP addresses. Maintaining such a whitelist by hand is a lot of work, and the result is bound to be incomplete, inaccurate, or both.
The goal of Spamikaze should be to block only those IP addresses that send out a lot of spam and little legitimate email; that way the users of Spamikaze-powered DNSBLs would get little spam while losing very little legitimate email. There are various ideas on how to identify both spammy IP addresses (which should be blocked) and IP addresses that are the source of lots of legitimate email (which should not be blocked). Please add your idea to this list, so we can discuss them all and decide what to do:
Rik's idea
This method should be best for large sites, or DNSBLs that get a reasonable number of queries.
enhancements
If the ratio is near the average, follow the user's preferences:
An aggressive list may still want to blocklist, if a spamtrap mail was received recently.
A more cautious list may not want to block these IP addresses.
A third option would be to be cautious for IP addresses that are on one of the WhiteLists, and aggressive for other IP addresses.
The score of an IP address can be modified depending on reverse DNS: missing or dynamic-looking reverse DNS can get a host blocked faster than reverse DNS that looks like a mail server.
The IP address will automatically expire from the database if it stops sending mail to spamtraps, since the legitimate mail will cause the spamtrap ratio to go below the average.
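The enhancements above can be sketched as a small decision function. This is only an illustration, not Spamikaze code: the names `IpStats`, `Policy` and `should_block`, the `margin` that defines "near the average", and the `rdns_penalty` multiplier are all assumptions made for this example. The ratio is taken to be spamtrap mails divided by DNSBL queries for the IP address, as in the SQL query in the optimizations section.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    AGGRESSIVE = "aggressive"            # block near-average IPs with recent spamtrap mail
    CAUTIOUS = "cautious"                # never block near-average IPs
    WHITELIST_AWARE = "whitelist_aware"  # cautious for whitelisted IPs, aggressive otherwise


@dataclass
class IpStats:
    spamtrap_mails: int         # mails from this IP that hit a spamtrap
    dnsbl_queries: int          # DNSBL queries seen for this IP
    recent_spamtrap: bool       # a spamtrap mail was received recently
    whitelisted: bool = False   # IP is on one of the WhiteLists
    rdns_penalty: float = 1.0   # assumed multiplier, > 1.0 for missing or dynamic-looking rDNS


def should_block(stats, average_ratio, policy, margin=0.25):
    """Return True if the IP should be listed (hypothetical sketch)."""
    if stats.dnsbl_queries == 0:
        # Assumption: without any queries there is nothing to compare against.
        return False
    score = (stats.spamtrap_mails / stats.dnsbl_queries) * stats.rdns_penalty
    if score > average_ratio * (1 + margin):
        return True   # clearly above the average: block
    if score < average_ratio * (1 - margin):
        return False  # clearly below the average: do not block
    # Near the average: follow the list operator's preference.
    if policy is Policy.AGGRESSIVE:
        return stats.recent_spamtrap
    if policy is Policy.CAUTIOUS:
        return False
    return stats.recent_spamtrap and not stats.whitelisted
```

With an average ratio of 0.01, an IP address with 10 spamtrap mails per 1000 queries falls in the "near the average" band, so the outcome depends only on the chosen policy and on whether the address is whitelisted.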
optimizations
only track spamtrap ratio for IP addresses that have sent spamtrap mail recently
the ratio for other IP addresses will be zero, since they did not send any mail to spamtraps
this way we only need to track the ratio for a small subset of all IP addresses
SELECT blocklist.ip FROM blocklist WHERE blocklist.queries > 0 AND (1.0 * blocklist.spam / blocklist.queries) > (SELECT 1.0 * SUM(statistics.spam) / SUM(statistics.queries) FROM statistics);
only count DNS queries from a certain time interval, e.g. between a day and a month ago
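The optimizations above can be tried out against an in-memory SQLite database. The sketch below is an assumption-heavy illustration: the `blocklist` and `statistics` tables only have the columns the query needs, the `query_log` table with a Unix timestamp column `ts` is hypothetical, and the helper names are invented for this example. The average is computed in a subquery, and rows without any queries are skipped to avoid division by zero.

```python
import sqlite3

DAY = 86400  # seconds


def make_demo_db():
    """Create a hypothetical schema with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE blocklist (ip TEXT PRIMARY KEY, spam INTEGER, queries INTEGER);
        CREATE TABLE statistics (spam INTEGER, queries INTEGER);
        CREATE TABLE query_log (ip TEXT, ts INTEGER);
    """)
    conn.executemany("INSERT INTO blocklist VALUES (?, ?, ?)", [
        ("192.0.2.1", 50, 1000),  # ratio 0.05, above the average
        ("192.0.2.2", 1, 1000),   # ratio 0.001, below the average
        ("192.0.2.3", 5, 0),      # no queries seen, skipped
    ])
    conn.execute("INSERT INTO statistics VALUES (100, 10000)")  # average 0.01
    return conn


def find_blockable(conn):
    """IPs whose spamtrap ratio is above the overall average."""
    rows = conn.execute("""
        SELECT ip FROM blocklist
        WHERE queries > 0
          AND 1.0 * spam / queries >
              (SELECT 1.0 * SUM(spam) / SUM(queries) FROM statistics)
        ORDER BY ip
    """)
    return [row[0] for row in rows]


def queries_in_window(conn, ip, now, min_age=DAY, max_age=30 * DAY):
    """Count logged queries for ip that are between min_age and max_age old."""
    row = conn.execute(
        "SELECT COUNT(*) FROM query_log WHERE ip = ? AND ts BETWEEN ? AND ?",
        (ip, now - max_age, now - min_age),
    ).fetchone()
    return row[0]
```

Because only IP addresses that recently sent spamtrap mail appear in `blocklist`, the comparison touches a small subset of all addresses; every other address implicitly has a ratio of zero.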
lonki's idea
This method should be best for small (or even personal) Spamikaze installations.
Use SpamAssassin to find out which IP addresses send ham.
The sample size is probably too small for statistical analysis.
Using a greylisting method reduces the amount of spam tremendously, with almost no false positives.
Such a greylist system also produces a record of servers that deliver large amounts of mail over a period of time, as well as of servers that initially try to send mail, are temporarily rejected, and never try again to the same recipient. Such servers are obvious candidates for blacklisting.
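A minimal sketch of the bookkeeping such a greylist could keep is shown below. It is not tied to any particular greylisting implementation: the `Greylist` class, its method names, the 300 second minimum delay and the one day give-up period are all assumptions chosen for illustration.

```python
from collections import defaultdict


class Greylist:
    """Toy greylist keyed on (ip, sender, recipient) triplets (hypothetical)."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay         # seconds a new triplet must wait
        self.first_seen = {}               # triplet -> time of first attempt
        self.passed = set()                # triplets that retried successfully
        self.delivered = defaultdict(int)  # ip -> number of accepted mails

    def attempt(self, ip, sender, recipient, now):
        """Return True to accept the mail, False to reject it temporarily."""
        key = (ip, sender, recipient)
        if key in self.passed:
            self.delivered[ip] += 1
            return True
        first = self.first_seen.get(key)
        if first is None:
            self.first_seen[key] = now     # first contact: remember and reject
            return False
        if now - first < self.min_delay:
            return False                   # retried too quickly: still rejected
        self.passed.add(key)
        self.delivered[ip] += 1
        return True

    def abandoned_ips(self, now, give_up_after=86400):
        """IPs with a triplet that was rejected and never retried successfully."""
        return sorted({
            key[0] for key, first in self.first_seen.items()
            if key not in self.passed and now - first > give_up_after
        })
```

The `delivered` counter gives the record of servers that deliver mail over time, while `abandoned_ips` yields the candidates for blacklisting. A real installation would probably also want to exclude IP addresses that have delivered mail for other triplets.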

