Spam detection is an extremely well-studied problem, and there’s a large body of knowledge for us to draw on. While the state of the art in spam filtering has advanced, one of the earliest and simplest techniques still generally performs well: Bayesian filtering.
Bayesian filtering: the theory
A disclaimer: I’m not a credentialed statistician or expert on this topic. My apologies for any errors in explanation; they are inadvertent.
The idea behind Bayesian filtering is that the presence of a specific word or phrase tells you something about the probability that a given message is spam.
If you have a set of messages that are labeled as spam or non-spam, you can easily compute the probability for a single word. Take the number of messages that contain the word and are spam, and divide it by the total number of messages that contain the word:
P(spam | word) = (messages that contain the word and are spam) / (all messages that contain the word)
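To make the single-word calculation concrete, here is a minimal Python sketch. The function name `word_spam_probability`, the whitespace tokenization, and the four sample messages are all assumptions made up for illustration; none of them come from the original post.

```python
def word_spam_probability(word, messages):
    """Return the fraction of messages containing `word` that are spam.

    `messages` is a list of (text, is_spam) pairs. Returns None when no
    message contains the word, since the ratio is undefined in that case.
    """
    # Collect the spam label of every message that contains the word.
    labels = [is_spam for text, is_spam in messages
              if word in text.lower().split()]
    if not labels:
        return None
    # True counts as 1, so the sum is the number of spam messages.
    return sum(labels) / len(labels)


# Hypothetical labeled messages, invented for this example.
messages = [
    ("win free money now", True),
    ("free tickets inside", True),
    ("are you free for lunch", False),
    ("meeting notes attached", False),
]

# "free" appears in three messages, two of which are spam.
print(word_spam_probability("free", messages))
```

With this toy data, two of the three messages containing "free" are spam, so the word leans toward spam without being conclusive on its own.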
In most cases, no single word is going to be a very effective predictor, and so the real value comes in combining the probabilities for a great many words. I’ll skip the mathematical explanation, but the bottom line is that by taking a mapping of words to emails that are known to be spam or not, you can compute a likelihood that a given new message is spam. If the probability is greater than a threshold, that email is flagged as spam.
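The post skips the math for combining words, so the following is only a sketch of one commonly used rule, not necessarily what the author implemented. It treats words as independent and computes p = (p1·p2·…) / (p1·p2·… + (1−p1)·(1−p2)·…) in log space. The function names, the clamping to 0.01–0.99, the neutral 0.5 for unseen words, the `THRESHOLD` of 0.9, and the training data are all assumptions for illustration.

```python
import math
from collections import Counter

THRESHOLD = 0.9  # assumed cutoff; the post does not give a value


def train(messages):
    """Count, per word, how many spam and non-spam messages contain it."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in messages:
        words = set(text.lower().split())  # count each word once per message
        (spam_counts if is_spam else ham_counts).update(words)
    return spam_counts, ham_counts


def word_probability(word, spam_counts, ham_counts):
    """Fraction of messages containing `word` that are spam, kept off 0 and 1."""
    total = spam_counts[word] + ham_counts[word]
    if total == 0:
        return 0.5  # assumption: unseen words are treated as neutral
    return min(0.99, max(0.01, spam_counts[word] / total))


def spam_probability(text, spam_counts, ham_counts):
    """Combine per-word probabilities, assuming words are independent."""
    probs = [word_probability(w, spam_counts, ham_counts)
             for w in set(text.lower().split())]
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1 - p) for p in probs)
    diff = log_ham - log_spam
    if diff > 700:  # avoid overflow in exp for overwhelmingly non-spam text
        return 0.0
    return 1 / (1 + math.exp(diff))


# Hypothetical training data, invented for this example.
training = [
    ("win free money now", True),
    ("free money waiting claim now", True),
    ("are you free for lunch", False),
    ("meeting notes attached", False),
    ("lunch meeting moved", False),
]
spam_counts, ham_counts = train(training)

for text in ("free money now", "lunch meeting notes"):
    p = spam_probability(text, spam_counts, ham_counts)
    print(text, "-> flagged as spam" if p > THRESHOLD else "-> not flagged")
```

Working in log space avoids multiplying many small numbers together, and the clamping keeps a single word that has only ever appeared in spam from deciding the outcome by itself.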
An interesting way to use Bayesian filtering to pick out which posts on Twitter need a quick reply. It would be interesting to try this on FL; for example, predicting which threads will turn into the hottest debates based on the words in the subject line. It should be possible to get that fairly accurate.
Behind the Scenes: Twitter, Part 2 - Lessons from email [37signals.com]