Comparison of Statistical Spam Detection Techniques
Abstract
Spam (unsolicited and undesirable email) has become a significant problem for email users. This study investigated the current state of the art in statistical spam filtering. Established methods, inspired by the work of Paul Graham, were examined, and new techniques were introduced and tested. A base configuration of a spam filter program was implemented; it achieved high accuracy while maintaining a low rate of false positives. One main objective of this study was to develop a new weighted token probability function, which performed well. When tested separately, header and phrase weights gave mixed results. Further tests examined the effect of initial training set size: the filter achieved adequate accuracy on all three test corpora with small initial training sets, and even performed well with no initial training data, depending on the training method used. Three post-classification training methods and various other techniques were also studied.
Collections
- OSU Theses