![]() |
Projects |
|
Concept Identification using Chained Tokens As presented at the MIT Spam Conference 2004 Last Update: Saturday, Feb 07 2004
jonathan@nuclearelephant.com
ABSTRACT From the article Bayesian Noise Reduction: Statistical Feature Omission for Advanced Language Analysis... "Modern statistical filters perform their language classification functions based on the available data. All of the present-day algorithms widely used are inherently sound and accurate ones, however regardless of which algorithm you use to perform statistical combination, a great deal of the filter's accuracy is related directly to the quality of data provided to the algorithm." Chained Tokens (also known as multi-word tokens and n-grams) is a simple data processing algorithm designed to provide more specific (and much better) data for the existing statistical algorithms to work with. As the name implies, this algorithm is based on the concept of chaining adjacent tokens together to make new tokens. This approach creates 2n-2(ish) additional data points to work with. Chained Tokens, for the purpose of this article, are tokens that are combined with their adjacent partners to form two new tokens. Chained tokens aren't a replacement for individual tokens, but rather a complement to be used in conjunction with them for better analysis. For example, if we are analyzing an email with the following phrase CALL NOW, IT'S FREE! ...there are four tokens created under standard analysis: Using chained tokens, we create three additional tokens as well: A few basic rules apply to the implementation of chained tokens: This article focuses less on the approach itself (since it is a simple one) and more on the benefits of implementing such an algorithm. 1.0 Case Study Analysis Now that we have an understanding of chained tokens, we will explore their purpose in statistical filtering. When I first coded chained tokens into DSPAM, I wasn't certain how effective they would be, but come to find they were extremely effective and far more useful than individual tokens alone. The reasoning behind chained tokens is this: given two tokens, A and B, chained tokens enable a filter to not only calculate the interest of A and B individually, but also the interest of AB, which may or may not be more interesting than the individual tokens. If AB is truly more interesting than A and B individually, then we have a potentially more context-specific probability, which can help identify both spams and innocent emails better resulting in better spam identification and fewer false positives. CRM114, an excellent language processor, uses a similar approach called "sparse binary polynomial hashing", however with a few differences. For one, CRM114 will calculate a chain of up to five tokens whereas the Chained Tokens approach calculates only a tokens adjacent neighbors. Secondly, CRM114 will allow word skipping in-between tokens. These are both excellent functions for general language classification, however for the sake of scalability and speed, I chose a modified approach taking into consideration the amount of disk space and processing power to do this in a scaled, real-time environment and on a per-user basis. When weighing the benefits, I decided that a slightly scaled down approach would be almost as effective without the additional expenses - at least in the setting of spam. If you look at most spams, any identifying language patterns can usually be ascertained with only two adjacent tokens. For example, "Play Lotto", "Hot Girls", or "remove from". While they can in fact be linked together with other tokens, it only takes the two tokens together to get a very specific subset of data related to the email. If someone was reading your email out loud using one or two interesting pairs of adjacent words, you would very easily be able to ascertain whether or not the message is spam. All the more should a statistical filter be able to do it with 15 pairs. Also, since spams are frequently different, it makes sense to keep the data in as small chunks as possible while being descriptive enough for an accurate comparison lest you run the risks of having such a specialized set of data that you miss variations. With all these things considered, I believe Chained Tokens strike a very good balance between analysis and verboseness in the setting of spam. 1.1 Pattern Identification Let's take a look at some examples of chained tokens used in a real-life environment. Below are some individual tokens accompanied by their chained counterpart. The above example shows that the words 'all' and 'the' are fairly uninteresting words (defining interesting as the words distance from a neutral 0.5), since they appear in a significant amount of messages, both spam and innocent. Why throw such great information away when we can combine them as adjacent partners? These words alone may not even be interesting enough to play a role in the calculation of a message, however we see that together, the phrase "all the" is a much more interesting token. This chained token has a much better chance of showing up in the calculation of such a message, improving its chances of being tagged an innocent message (based on the past corpus). Now look at a set of tokens with the opposite effect... The word color is a fairly innocent word in this user's dictionary, as it can be used to describe a number of things - a font color or a shirt color. The hex code for black (#000000) on the other hand, is a very neutral token for very different reasons, obviously used in some innocent HTML-based emails the user receives. If both of these tokens were used in the calculation of a message, you would end up with a rather mediocre result between the two. Even worse, the word color is likely to show up in the calculation without the hex code, since the hex code is less interesting, and could skew the statistics toward innocent. We see, however, that the combination of the two of these individual tokens creates a very guilty chained token that would dramatically change the output of the calculation. You may ask why it wasn't neutral, since a white font color can be part of innocent email as well. True, however the HTML generators used for spams are slightly different than those used for legitimate messages. Microsoft Outlook words its HTML code a bit differently (and with different case) so as to provide a different token altogether. For example, a legitimate email client may use COLOR*#000000 or COLOR*WHITE. Also keep in mind that the chained token is going to appear most when setting the font color (black is a very popular color for big bold text in spams). Since the hex code alone was neutral, we can draw the conclusion that the innocent messages used the color in a background setting, or some other area of their HTML code that did not use the color tag just prior. For example: As we see the background color is set several times in both innocent messages and spams. I'm not a strong HTML-based email user myself, so differentiating HTML code from actual conversation may not be as necessary. Many users are, however, big on HTML email and would benefit from Chained Tokens. 1.2 Differentiation We've learned so far that chained tokens can be used to improve the accurate identification of both innocent messages and spams, and play a significant role in the calculation. Let's take a look at another use for chained tokens. A content header of multipart/mixed is generally an innocent content header because it is used by so many email clients when sending HTML email, attachments, and etcetera. The tokens mixed and multipart individually, as the data reflects, will show up in both spams and nonspams almost equally. This is because there are several other content types using the individual tokens, such as multipart/alternative. If it wasn't for Chained Tokens, we would have discarded all this information, however we are able to catch the multipart/mixed and multipart/alternative tokens as a result of this chaining. So now we realize that chained tokens can give us a better subset of data to work with that we would otherwise be ignoring. Another good example of this is found in mailing lists: Many spams and mailing list emails both use the word unsubscribe, however only mailing lists generally use the phrase "unsubscribe from"...and all this time spammers have been trying to obfuscate their bogus unsubscribe algorithm meaning it is very unlikely you'll find a spam with this phrase - at least until the spammers get a little brighter. 1.3 HTML Classification We've already looked at how chained tokens can be used to identify HTML tags as opposed to text used in conversation. Yet another use I found in chained tokens is the example below, which exhibits how well chained tokens can function to identify the different patterns of HTML code used by both spammers and legitimate users. The 'font face' token was found to be much more innocent than the two neutral tokens alone. This is due in part to the many other font tags used in spams (such as font size) that separate these two tokens. Many generators use their own type of formatting, and so a spammer's generator might prefer "FONT COLOR" first, then FACE later while outlook prefers "FONT FACE". If you're looking at individual tokens, you're going to miss all this, because every HTML message uses FONT, FACE, and COLOR...but many do in a unique order. 1.4 Contextual Analysis Now we will move onto one of my favorite uses for chained tokens, contextual analysis. Yes, Virginia, chained tokens can be used to identify language patterns! Informal discussion, canned messages, or hot & heavy, chained tokens do a great job of performing natural language analysis. You can't tell what context the message is in simply by looking at the word 'that', however when combined with other words you can tell the discussion is informal. Since the words 'that sent' generally isn't a canned term (due to its informal nature and poor grammatical pretense), DSPAM was able to identify it and provide us with better insight about its use. Here are some more examples that were picked at random. Some are innocent, others are guilty, and others are in between. 1.5 Other uses So what else are chained tokens good for? I also noticed that they functioned well in more accurately identifying legitimate email addresses in the From and To headers. This helps to create a better 'Whitelist' for innocent messages. This same identification is also used to identify the X-Mailer more accurately (For Example, Microsoft Outlook was more innocent than Microsoft Outlook Express, and the chained token allowed us to realize that with one token instead of two. This brings us to our next useful function for chained tokens, and that is: De-duplication. DSPAM will only calculate the most 15 interesting tokens into a probability. If you eliminate the somewhat-interesting words by replacing them with very-interesting words, you have a more accurate calculation that ends up playing a very significant role in that very small percentage of emails (such as list digests with embedded advertisements) where limiting a calculation to 15 tokens could result in "the side with the eighth token randomly sorted on the list wins" by bringing it back into statistics land. 2.0 Supporting Data Now we'll discuss the different pieces of supporting data (good and bad) related to Chained Tokens. 2.1 Administrative Concerns The only administrative concern that arose from using chained tokens was the disk space consumed by the additional token records in each user's database. CPU was NOT a concern as we were still able to process messages in 0.05-0.15s. The average user's dictionary, without using chained tokens, seems to lurk around 500k to 1MB in size. This means that an ISP of 30,000 customers will require 15GB-30GB of additional disk space to accommodate an implementation without chained tokens (a very acceptable requirement). The average size of a dictionary using chained tokens, however, was originally around 10-20MB, requiring several additional disks for a very large scale implementation (still acceptable, but with less enthusiasm). This introduced some alternative solutions to help manage this space concern:
Given that disk is dirt cheap, and that an email system capable of supporting 30,000 users is already a significant in cost, I found the additional disk space requirement for Chained Tokens to be acceptable for a serious ISP who wants to provide very accurate filtering services. The average implementation of DSPAM will most likely not run 30,000 users, but a mere 1,000 users for a large company. It should be very comforting for these types of administrators to know they will only need a few gigabytes of disk space to accommodate a chained tokens implementation for the whole company. 2.2 Statistical Data Statistical filters are already very good at identifying predictible email, so it made no sense to run the standard corpus-train test to prove the worthiness of Chained Tokens. Instead, I decided to take a large sample of email from the SpamAssassin corpus...email that has never before been seen by the test users...and run it through to see just how many unknown emails we could classify. The results were very good:
As you can see from the matrix above, Chained Tokens actually did hurt accuracy with one user by 3 messages, but more than made up for it in correctly classifying innocent messages which would have otherwise become false positives. SUMMARY In summary, chained tokens have proven their worth both in the test lab and in actual production implementations. Their key features provide more accurate classification by means of:
Chained Tokens has been implemented in DSPAM for several versions now and has dramatically improved accuracy over other filters without such an implementation. |
|
All Website Content © 2004 Jonathan A. Zdziarski. All Rights Reserved. |
| Reproduction prohibited without permission |