Controlling Filter Complexity: Statistical-Algorithmic Hybrid Classification
Last Update: Saturday, April 3 2004

jonathan@nuclearelephant.com

Download the White Paper (LNCS)



Abstract


Present-day language classifiers bear the responsibility of maintaining accuracy in the midst of ever-increasing sample complexity. In spam filtering, many types of intentional attack have been introduced, such as obfuscation, word list injection, and sample flooding. As the complexity of the text being classified continues to multiply, many developers are torn between increasing the complexity of their filters and the lesson from CS class that computer science is about controlling complexity, not creating it. At the rate complexity is rising, filters will become (and have already begun to become) so resource-intensive that they lose scalability, eventually leading to a second conflict of interest: the point where fighting spam becomes more expensive than managing it.

This paper boldly suggests that there is a better alternative to growing the feature set of filters to match the spam they are trying to fight: employing algorithms designed to increase the quality of existing data rather than its quantity, to improve the quality of rule sets rather than their number, and to reduce the feature set rather than enlarge it. We will discuss several present-day approaches, their results, and why this approach must be employed at some point if filters are to avoid becoming obsolete and too expensive to run.



 All Website Content © 2004 Jonathan A. Zdziarski. All Rights Reserved.
Reproduction prohibited without permission