Spam e-mail has been studied for some time, and most solutions are based on spam e-mail classification and filtering. However, the content of spam e-mails drifts as new concepts and social events emerge. As a result, many spam classifiers perform well when their models are first established but deteriorate over time. A learning mechanism is therefore required to adjust the classification parameters for both new and old e-mails. In addition, because spam is so widespread, spam e-mails outnumber legitimate e-mails. Consequently, most classifiers achieve high recall for spam e-mails but low recall for legitimate e-mails. Building on the Bayesian algorithm, we propose an incremental forgetting weighted algorithm with a misclassification cost mechanism, which extracts features using IGICF (Information Gain and Inverse Class Frequency), to address concept drift and data skew in spam e-mail classification. We implemented the algorithm and tested the effectiveness of the mechanism in detail.
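The abstract does not give the update rule, so the following is only a minimal sketch of how forgetting and misclassification cost could be combined in an incremental Bayesian classifier. It assumes a multinomial naive Bayes model whose counts are multiplied by a decay factor before each new batch, and a decision threshold set by a cost ratio. The class name `ForgettingNaiveBayes` and the parameters `decay`, `legit_cost`, and `alpha` are hypothetical, and IGICF feature selection is not shown.

```python
import math
from collections import defaultdict


class ForgettingNaiveBayes:
    """Illustrative sketch only, not the paper's algorithm."""

    def __init__(self, decay=0.9, legit_cost=5.0, alpha=1.0):
        self.decay = decay            # assumed forgetting factor in (0, 1]
        self.legit_cost = legit_cost  # assumed cost of flagging legitimate mail as spam
        self.alpha = alpha            # Laplace smoothing constant
        self.class_weight = {"spam": 0.0, "legit": 0.0}
        self.token_weight = {"spam": defaultdict(float),
                             "legit": defaultdict(float)}
        self.vocab = set()

    def update(self, batch):
        """Incrementally learn from a batch of (tokens, label) pairs."""
        # Fade old evidence first so that recent mail dominates the estimates.
        for label in self.class_weight:
            self.class_weight[label] *= self.decay
            for tok in self.token_weight[label]:
                self.token_weight[label][tok] *= self.decay
        # Add the new batch at full weight.
        for tokens, label in batch:
            self.class_weight[label] += 1.0
            for tok in tokens:
                self.token_weight[label][tok] += 1.0
                self.vocab.add(tok)

    def _log_score(self, tokens, label):
        """Smoothed log prior plus log likelihood for one class."""
        total_docs = sum(self.class_weight.values())
        prior = (self.class_weight[label] + self.alpha) / (total_docs + 2 * self.alpha)
        total_tokens = sum(self.token_weight[label].values())
        denom = total_tokens + self.alpha * max(len(self.vocab), 1)
        score = math.log(prior)
        for tok in tokens:
            weight = self.token_weight[label].get(tok, 0.0)
            score += math.log((weight + self.alpha) / denom)
        return score

    def predict(self, tokens):
        """Flag as spam only when the log odds exceed the log cost ratio."""
        log_odds = self._log_score(tokens, "spam") - self._log_score(tokens, "legit")
        return "spam" if log_odds > math.log(self.legit_cost) else "legit"


# Usage: train on one batch, then let a later batch shift the model.
clf = ForgettingNaiveBayes(decay=0.5, legit_cost=1.5)
clf.update([(["free", "prize"], "spam"), (["meeting", "notes"], "legit")])
print(clf.predict(["free", "prize"]))
clf.update([(["free"], "legit")])  # the token "free" now appears in legitimate mail
print(clf.predict(["free"]))
```

In this sketch, a `decay` below 1 lets older e-mails lose influence as new batches arrive, which targets concept drift, while a `legit_cost` above 1 raises the evidence needed to call a message spam, which is one way to protect recall on the minority legitimate class.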
Lecture Notes in Computer Science 7802, pp. 314-324