Adapted from an article in http://www.wikipedia.org. Links may be out of date.
E-mail has become the subject of much abuse, in the form of both spamming and E-mail worm programs. Both of these flood the in-boxes of E-mail users with junk E-mails, wasting their time and money, and often carrying offensive, fraudulent, or damaging content. This article describes the efforts being made to stop E-mail abuse and ensure that E-mail continues to be usable in the face of these threats.
Defense against spam
There are a number of services and software systems that mail sites and users can use to reduce the load of spam on their systems and mailboxes. Some of these depend upon rejecting email from Internet sites known or likely to send spam. Others rely on automatically analyzing the content of email messages and weeding out those which resemble spam. These two approaches are sometimes termed blocking and filtering.
Blocking and filtering each have their advocates and advantages. While both reduce the amount of spam delivered to users’ mailboxes, blocking does much more to alleviate the bandwidth cost of spam, since spam can be rejected before the message is transmitted to the recipient’s mail server. Filtering tends to be more thorough, since it can examine all the details of a message. Many modern spam filtering systems take advantage of machine learning techniques, which vastly improve their accuracy over manual methods. However, some people find filtering intrusive to privacy, and many mail administrators prefer blocking to deny access to their systems from sites tolerant of spammers.
Spam blocking and filtering techniques
DNSBLs
DNS-based Blackhole Lists, or DNSBLs, are a blocking technique, whereby a site publishes lists of IP addresses via the DNS, in such a way that mail servers can easily be set to reject mail from those addresses. There are literally scores of DNSBLs, each of which reflects different policies: some list sites known to emit spam; others list open mail relays or proxies; others, such as SPEWS, list ISPs known to support spam.
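As a rough illustration of how a mail server consults such a list: the usual convention is to reverse the octets of the connecting IPv4 address, append the list's DNS zone, and look the resulting name up; an answer means the address is listed. The sketch below assumes that convention, and the zone name dnsbl.example.org is a placeholder, not a real list.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    # Validate the address, then reverse its octets: 192.0.2.99 -> 99.2.0.192
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers for this address.

    Any resolution failure (including a network problem) is treated
    as "not listed" in this sketch.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# Hypothetical zone name, for illustration only.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

A real mail server would normally do this lookup during the SMTP session, so that a listed sender can be refused before the message body is transferred.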
Content-based filtering
Until recently, content filtering techniques relied on mail administrators specifying lists of words or regular expressions disallowed in mail messages. Thus, if a site receives spam advertising “herbal Viagra”, the administrator might place these words in the filter configuration. The mail server would then reject any message containing the phrase.
Content-based filtering can also filter based on content other than the words and phrases that make up the text of the message. Primarily, this means looking at the headers of the email, the part of the message that contains information about the message, and not the text of the message. Spammers will often spoof headers in order to hide their identities, or to try to make the email look more legitimate than it is; many of these spoofing methods can be detected. Also, spam sending software often produces headers that violate the RFC 2822 standard on how email headers are supposed to be formed.
Disadvantages of this static filtering are threefold: First, it is time-consuming to maintain. Second, it is prone to false positives. Third, these false positives are not equally distributed: manual content filtering is prone to reject legitimate messages on topics related to products advertised in spam. A system administrator who attempts to reject spam messages which advertise mortgage refinancing may easily inadvertently block legitimate mail on the same subject.
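A minimal sketch of such a static filter makes the false-positive problem concrete. The pattern list and the function name are hypothetical, chosen to match the examples in the text.

```python
import re

# Hypothetical administrator-maintained list of disallowed patterns.
BLOCKED_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"herbal\s+viagra", r"mortgage\s+refinanc\w*")
]


def is_rejected(body: str) -> bool:
    """Reject a message if any blocked pattern appears anywhere in its body."""
    return any(pattern.search(body) for pattern in BLOCKED_PATTERNS)


print(is_rejected("Get HERBAL   Viagra today"))
# A legitimate message on the same topic is rejected too (a false positive).
print(is_rejected("Can we talk about our mortgage refinancing paperwork?"))
print(is_rejected("Lunch on Friday?"))
```

The filter has no way to tell an advertisement from a genuine conversation about the same subject, and every new spam phrase means another manual entry in the list.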
Finally, spammers can change the phrases and spellings they use, or employ methods to try to trip up phrase detectors. This means more work for the administrator. However, it also has some advantages for the spam fighter. If the spammer starts spelling “Viagra” as “V1agra” or “Via_gra”, it makes it harder for the spammer’s intended audience to read their messages. If they try to trip up the phrase detector, by, for example, inserting an invisible-to-the-user HTML comment in the middle of a word (“Via<!-- -->gra”), this sleight of hand is itself easily detectable, and is a good indication that the message is spam. And if they send spam that consists entirely of images, so that anti-spam software can’t analyze the words and phrases in the message, the fact that it is image-only can be detected.
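Both of these tricks can be checked for mechanically. The following is a simplified sketch using regular expressions rather than a full HTML parser; the function names are illustrative, and a production filter would need to handle malformed markup more carefully.

```python
import re

# An HTML comment with word characters directly on both sides,
# as in a word that has been split by an invisible comment.
COMMENT_IN_WORD = re.compile(r"\w<!--.*?-->\w", re.DOTALL)
ANY_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
ANY_TAG = re.compile(r"<[^>]*>")
IMG_TAG = re.compile(r"<img\b", re.IGNORECASE)


def has_comment_inside_word(html: str) -> bool:
    """True if an HTML comment is embedded in the middle of a word."""
    return COMMENT_IN_WORD.search(html) is not None


def is_image_only(html: str) -> bool:
    """True if the body contains an <img> tag and no visible text at all."""
    without_comments = ANY_COMMENT.sub("", html)
    visible_text = ANY_TAG.sub("", without_comments).strip()
    return IMG_TAG.search(without_comments) is not None and visible_text == ""


print(has_comment_inside_word("Buy Via<!-- x -->gra now"))
print(is_image_only('<html><body><img src="offer.gif"></body></html>'))
```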
Statistical filtering
Statistical filtering was first proposed in 1998 by Mehran Sahami, et al., at the AAAI-98 Workshop on Learning for Text Categorization. A statistical filter is a kind of text classification system, and a number of machine learning researchers have turned their attention to the problem. Statistical filtering was popularized by Paul Graham’s influential 2002 article, which used Naive Bayesian classification to predict whether messages are spam or not — based on collections of spam and nonspam (“ham”) email submitted by users. See http://www.paulgraham.com/antispam.html and http://research.microsoft.com/~horvitz/junkfilter.htm.
Statistical filtering, once set up, requires no maintenance per se: instead, users mark messages as spam or nonspam and the filtering software learns from these judgements. Thus, a statistical filter does not reflect its author’s or administrator’s biases as to content, but it does reflect the user’s biases as to content; a biochemist who is researching Viagra won’t have messages containing the word “Viagra” flagged as spam, because “Viagra” will show up often in his or her legitimate messages. It can also respond quickly to changes in spam content, without administrative intervention.
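To make the learning step concrete, here is a small multinomial naive Bayes classifier with add-one smoothing. It is a textbook sketch, not the exact formula from Graham's article or from any of the programs named in this section, and the training messages are invented for the example.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy spam/ham classifier trained on user-labelled messages."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Record one message that the user marked as 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text: str) -> float:
        """Estimate the probability that a message is spam."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Class prior, smoothed so an untrained filter returns 0.5.
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in vocab:
                    continue  # words never seen in training carry no evidence
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Convert the two log scores into a normalised probability.
        top = max(log_score.values())
        spam = math.exp(log_score["spam"] - top)
        ham = math.exp(log_score["ham"] - top)
        return spam / (spam + ham)


f = NaiveBayesFilter()
f.train("cheap viagra buy now", "spam")
f.train("buy cheap pills now", "spam")
f.train("viagra study results attached", "ham")
f.train("meeting notes attached", "ham")
f.train("viagra trial data", "ham")
print(round(f.spam_probability("buy cheap pills"), 2))
print(round(f.spam_probability("viagra study data"), 2))
```

Because this user's legitimate mail mentions “viagra” more often than the spam does, the word alone does not push a message toward the spam class, which is the biochemist scenario described above.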
Spammers have attempted to fight statistical filtering by invisibly inserting many random but valid words into their messages, making it more likely that the filter will classify the message as neutral; they make the words invisible by giving them a very tiny font, by making the words the same color as the background, or both. However, the countermeasures seem to have been largely ineffective.
Software programs that implement statistical filtering include Bogofilter, the e-mail programs Mozilla and Mozilla Thunderbird, and later revisions of SpamAssassin. Another interesting project is CRM114, which hashes phrases and performs Bayesian classification on them.
POPFile (http://popfile.sourceforge.net) also uses Bayesian filtering, and can sort mail into as many categories as you want (family, friends, co-workers, spam, and so on).
Checksum-based filtering
A checksum-based filter takes advantage of the fact that, for any individual spammer, the messages he or she sends out will be mostly identical; the only differences are typically web bugs and places where the text contains the recipient’s name or email address. The filter strips out everything that might vary between messages, reduces what remains to a checksum, and compares that checksum to a database of checksums of messages that email recipients have reported as spam (some people have a button on their email client which they can click to nominate a message as being spam); if the checksum is in the database, the message is likely to be spam.
The advantage of this type of filtering is that it lets ordinary users help identify spam, and not just administrators, thus vastly increasing the pool of spam fighters. The disadvantage is that spammers can insert unique invisible gibberish — known as hashbusters — into the middle of each of their messages, thus making each message unique and having a different checksum. This leads to an arms race between the developers of the checksum software and the developers of the spam-generating software.
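The strip-then-hash idea can be sketched as follows. The normalisation rules here (dropping URLs, email addresses and the recipient's name) are assumptions made for the example; real systems of this kind use their own, more robust, fuzzy checksums.

```python
import hashlib
import re

URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def message_checksum(body: str, recipient_name: str = "") -> str:
    """Hash a message after removing the parts likely to vary per recipient."""
    text = body.lower()
    text = URL.sub("", text)      # tracking links often carry a per-recipient id
    text = EMAIL.sub("", text)    # the recipient's own address
    if recipient_name:
        text = text.replace(recipient_name.lower(), "")
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ChecksumDatabase:
    """Shared set of checksums of messages that users reported as spam."""

    def __init__(self):
        self.reported = set()

    def report(self, body: str, recipient_name: str = "") -> None:
        self.reported.add(message_checksum(body, recipient_name))

    def is_known_spam(self, body: str, recipient_name: str = "") -> bool:
        return message_checksum(body, recipient_name) in self.reported


db = ChecksumDatabase()
db.report("Dear Alice, buy now at http://spam.example/t?id=1 alice@example.com", "Alice")
print(db.is_known_spam("Dear Bob, buy now at http://spam.example/t?id=2 bob@example.net", "Bob"))
# A hashbuster (random gibberish) changes the checksum and evades this naive version.
print(db.is_known_spam("Dear Bob, buy now xqzvk at http://spam.example/t?id=2", "Bob"))
```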
Checksum-based filtering systems include:
- Distributed Checksum Clearinghouse
- Vipul’s Razor
Protocol extensions
A number of proposals and specifications have been written to extend the SMTP protocol to avoid spam, including:
- Sender Permitted From (SPF)
- Trusted Email Open Standard (TEOS)
- Tripoli protocol
Messages certified as not being spam
There are several third-party organizations which guarantee that certain messages aren’t spam, and have the means to prevent spammers from fraudulently using their system, by fining or suing them, for example. Administrators can use this to let through messages that would otherwise be filtered or blocked as spam, thus reducing the false positive rate.
Organizations that implement such systems include:
- Habeas Sender Warranted Email
- Bonded Sender
Heuristic filtering
Heuristic filtering, such as is implemented in the program SpamAssassin, uses some or all of the various tests for spam mentioned above, and assigns a numerical score to each test. Each message is scanned for these patterns, and the applicable scores tallied up. If the total is above a fixed value, the message is rejected or flagged as spam. By ensuring that no single spam test by itself can flag a message as spam, the false positive rate can be greatly reduced. (http://www.spamassassin.org/)
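The scoring scheme can be sketched in a few lines. The rules, weights and threshold below are invented for illustration and are not SpamAssassin's actual rule set; the only property carried over from the text is that no single rule is heavy enough to cross the threshold on its own.

```python
import re

# (rule name, score, test) triples. All values are hypothetical.
RULES = [
    ("suspicious_phrase", 2.5, lambda m: "herbal viagra" in m.lower()),
    ("comment_inside_word", 3.0,
     lambda m: re.search(r"\w<!--.*?-->\w", m, re.DOTALL) is not None),
    ("missing_date_header", 1.5,
     lambda m: re.search(r"^Date:", m, re.MULTILINE | re.IGNORECASE) is None),
]
THRESHOLD = 5.0


def score_message(message: str):
    """Return the total score and the names of the rules that matched."""
    hits = [(name, score) for name, score, test in RULES if test(message)]
    return sum(score for _, score in hits), [name for name, _ in hits]


def is_spam(message: str) -> bool:
    total, _ = score_message(message)
    return total > THRESHOLD


spam = "Subject: offer\n\nHerbal Viagra! Get Via<!-- -->gra today"
ham = ("Date: Mon, 1 Jan 2024 10:00:00 +0000\nSubject: study\n\n"
       "Our herbal viagra trial results are attached.")
print(score_message(spam), is_spam(spam))
print(score_message(ham), is_spam(ham))
```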
Tarpits and Honeypots
A tarpit is any server software which intentionally responds pathologically slowly to client commands. A honeypot is a server which attempts to attract attacks. Some mail administrators operate tarpits to impede spammers’ attempts at sending messages, and honeypots to detect the activity of spammers. By running a tarpit which appears to be an open mail relay, or which treats acceptable mail normally and known spam slowly, a site can slow down the rate at which spammers can inject messages into the mail facility.
One tarpit design is the teergrube, whose name is simply German for “tarpit.” This is an ordinary SMTP server which intentionally responds very slowly to commands. Such a system will bog down SMTP client software, as further commands cannot be sent until the server acknowledges the earlier ones. Several SMTP MTAs, including Postfix, have a teergrube capacity built in: when confronted with a client session which causes errors such as spam rejections, they will slow down their responding. (http://www.iks-jena.de/mitarb/lutz/usenet/teergrube.en.html; http://www.postfix.org/rate.html)
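One simple way to picture the slowdown is a reply delay that grows with the number of errors a client session has caused. The doubling schedule and the numbers below are assumptions for illustration only; they are not the algorithm used by Postfix or any other MTA.

```python
def reply_delay(error_count: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before answering the next SMTP command.

    Well-behaved sessions are answered immediately; each further error
    doubles the delay, up to a fixed cap.
    """
    if error_count <= 0:
        return 0.0
    return min(cap, base * 2 ** (error_count - 1))


for errors in (0, 1, 3, 20):
    print(errors, reply_delay(errors))
```

A server would sleep for this long before sending each reply, so a client that keeps triggering rejections spends most of its session waiting.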
Another design for tarpits directly controls the TCP/IP protocol stack, holding the spammer’s network socket open without allowing any traffic over it. By reducing the TCP window size to zero, but continuing to acknowledge packets, the spammer’s process may be tied up indefinitely. This design is more difficult to implement than the former. Aside from anti-spam purposes, it has also been used to absorb attacks from network worms. (http://www.hackbusters.net/)
A third design is simply an imitation MTA which gives the appearance of being an open mail relay. Spammers who probe systems for open relay will find such a host and attempt to send mail through it, wasting their time. Such a system may simply discard the spam attempts, submit them to DNSBLs, or store them for analysis. It may also selectively deliver relay test messages to give a stronger appearance of open relay. SMTP honeypots of this sort have been suggested as a way that end-users can interfere with spammers’ activities. (http://jackpot.uk.net/; http://llama.whoi.edu/smtpot.py)
Spammers also abuse open proxies, so open proxy honeypots (proxypots) are used as well. (http://world.std.com/~pacman/proxypot.html) Ron Guilmette reported in 2003 that he had succeeded in getting over 100 spammer accounts terminated in under three months, using his network (of unspecified size) of proxypots.
Unlike most other anti-spam techniques, tarpits and honeypots work at the relay (or proxy) level. They work by targeting spammer behavior rather than targeting spam content.
Note also that there is some terminological confusion. Some people refer to spamtraps as honeypots. In this context a spamtrap is an email address created specifically to attract spam. These run at the destination level rather than at the relay or proxy level.
Challenge-response systems
Another method which may be used by internet service providers (or by specialized services) to combat spam is to require unknown senders to pass various tests before their messages are delivered. These strategies are termed challenge-response systems or C/R, and are currently controversial among email programmers and system administrators.
One example of a challenge-response system is a “captcha” test, in which a mail sender is required to view an image containing a word or phrase, and respond with that word or phrase in text. The purpose of this is to ensure that automated systems (incapable of reading the image) cannot transmit email.
Critics of C/R systems have raised several issues regarding their usefulness as an email defense:
- Some kinds of C/R system, such as captchas, discriminate against the disabled. A blind person can send and receive textual email (using a braille terminal, for instance), but cannot see an image and read text from it. A blurry image intended to defeat optical character recognition software may be impossible for sighted but visually-impaired persons.
- C/R systems interact badly with mailing list software. If a person subscribed to a mailing list begins to use C/R software, posters to the mailing list may be confronted by large numbers of challenge messages. Many regard these as junk mail equal in annoyance to actual spam. In response, some C/R advocates have suggested that a C/R user must simply “whitelist” mailing lists to which they subscribe — instructing the C/R software not to challenge their messages.
- Some C/R systems interact badly with other C/R systems. If two persons both use C/R and one emails the other, the two C/R systems may become trapped in a loop, each challenging the other, neither one willing to deliver the challenge messages — or the original message.
- A person who disseminates his email address in order that others may easily contact him should not (critics say) subsequently challenge those persons’ messages. For instance, if a person gives a new acquaintance his email address, that acquaintance should expect to be able to send email to that address without “jumping through hoops” laid by a C/R system. Many C/R critics consider it rude to give someone your email address, then require them to play along with C/R software before they can send you mail.
- Spammers and viruses send forged messages — email with other people’s addresses in the From headers. A C/R system challenging a forged message will send its challenge to the uninvolved person whose address the spammer put in the spam. This effectively doubles the amount of unwanted email being distributed. Indeed, some argue that using a C/R system means sending unsolicited, bulk email (challenges to forged spam) to all those people whose addresses are forged in spam.
Nevertheless, users report that C/R systems are extremely effective at eliminating spam, even for addresses that receive hundreds of spam messages per day. With C/R systems, the only spam that gets delivered is spam that has been personally authorized by the spammer.
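The basic flow of a C/R system, including the whitelist that advocates suggest for mailing lists, can be sketched as below. The class and method names are hypothetical, and actually sending the challenge (and deciding what counts as a correct response) is left out.

```python
import secrets


class ChallengeResponseGate:
    """Hold mail from unknown senders until they answer a challenge."""

    def __init__(self, whitelist=()):
        self.whitelist = set(whitelist)
        self.pending = {}  # challenge token -> (sender, held messages)
        self.inbox = []

    def receive(self, sender: str, message: str):
        """Deliver mail from known senders; otherwise hold it and return a token."""
        if sender in self.whitelist:
            self.inbox.append((sender, message))
            return None
        # Reuse an outstanding challenge instead of challenging the sender twice.
        for token, (held_sender, held) in self.pending.items():
            if held_sender == sender:
                held.append(message)
                return token
        token = secrets.token_hex(8)
        self.pending[token] = (sender, [message])
        return token

    def answer(self, token: str) -> bool:
        """Called when a challenge is answered: release held mail, trust the sender."""
        if token not in self.pending:
            return False
        sender, held = self.pending.pop(token)
        self.whitelist.add(sender)
        self.inbox.extend((sender, message) for message in held)
        return True


gate = ChallengeResponseGate(whitelist={"list@example.org"})
gate.receive("list@example.org", "digest")          # delivered, no challenge
token = gate.receive("stranger@example.net", "hi")  # held, challenge issued
gate.answer(token)                                   # sender passes the test
print(gate.inbox)
```

Note that the sketch trusts the sender address it is given; with a forged address, the challenge would go to an uninvolved person, which is exactly the objection raised in the last item of the list above.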
Spam tips for users
Aside from installing client-side filtering software, end users can protect themselves from the brunt of spam’s impact in numerous other ways.
Address munging
One way that spammers obtain email addresses to target is to trawl the Web and Usenet for strings which look like addresses. Thus, if one’s address is never listed on these forums, spammers cannot find it this way. Posting anonymously, or with an entirely faked name and address, is one way to avoid this “address harvesting”. Users who want to receive legitimate email regarding their posts or Web sites can alter their addresses in some way that humans can figure out but harvesting software cannot (yet). For instance, joe@example.net might post as joeNOS@PAM.example.net, or display his email address as an image instead of text. This is called address munging, from the jargon word “mung” meaning to break.
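The text-based variant from the example above is easy to express in code. This sketch hard-codes the NOS@PAM marker from the example; the function names are illustrative.

```python
def munge(address: str) -> str:
    """Turn joe@example.net into joeNOS@PAM.example.net."""
    local, domain = address.split("@", 1)
    return f"{local}NOS@PAM.{domain}"


def unmunge(munged: str) -> str:
    """What a human reader does by hand: remove the NOS...PAM marker."""
    return munged.replace("NOS@PAM.", "@", 1)


print(munge("joe@example.net"))
print(unmunge("joeNOS@PAM.example.net"))
```

A harvester that collects the munged string verbatim ends up with an address at a host that does not exist, while a person can see what to remove.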
Address munging does not, however, evade so-called “dictionary attacks” in which the spammer generates a number of likely-to-exist addresses out of names and common words. For instance, if there is someone with the address adam@example.com, where ‘example.com’ is a popular ISP or mail provider, it is likely that he frequently receives spam.
Disposable e-mail addresses
Many email users sometimes need to give an address to a site without complete assurance that the site will not spam, or leak the address to spammers. One way to mitigate the risk of spam from such sites is to provide a disposable email address — a temporary address which forwards email to your real account, but which you can disable or abandon whenever you see fit.
A number of services, such as Spamgourmet (http://www.spamgourmet.com/), provide disposable address forwarding. Addresses can be manually disabled, can expire after a given time interval, or can expire after a certain number of messages have been forwarded.
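The three ways an alias can stop working (manual disabling, a time limit, a message limit) can be modelled with a small record type. This is a sketch of the idea only, with hypothetical field names and addresses; it does not describe how Spamgourmet or any other service is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DisposableAddress:
    alias: str
    forward_to: str
    max_messages: Optional[int] = None   # expire after this many forwards
    expires_at: Optional[float] = None   # expire at this Unix timestamp
    disabled: bool = False               # switched off by the owner
    forwarded: int = 0

    def is_active(self, now: float) -> bool:
        if self.disabled:
            return False
        if self.expires_at is not None and now >= self.expires_at:
            return False
        if self.max_messages is not None and self.forwarded >= self.max_messages:
            return False
        return True

    def accept(self, now: float) -> Optional[str]:
        """Return the real address to forward to, or None if the alias is dead."""
        if not self.is_active(now):
            return None
        self.forwarded += 1
        return self.forward_to


shop = DisposableAddress("shop.2@alias.example", "joe@example.net", max_messages=2)
print([shop.accept(now=0.0) for _ in range(3)])
```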
Defeating Web bugs and JavaScript
Many modern mail programs incorporate Web browser functionality, such as the display of HTML and images. This can easily expose the user to pornographic or otherwise offensive images in spam. In addition, spam written in HTML can contain JavaScript programs to direct the user’s Web browser to an advertised page, or to make the spam message difficult or impossible to close or delete. In some cases, spam messages have contained attacks upon security vulnerabilities in the HTML renderer, using these holes to install spyware. (Some computer viruses are borne by the same mechanisms.) Also, the images can be used to find out whether a spam message is actually read and seen by a user.
Users can defend against these methods by using mail clients which do not automatically display HTML, images or attachments, or by configuring their clients not to display these by default.
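A mail client (or a cautious user with a script) can inspect an HTML body before rendering it. The sketch below uses Python's standard html.parser module to list images that would be fetched from a remote server, which is how web bugs report back, and to note whether the message contains a script. The class and function names are illustrative.

```python
from html.parser import HTMLParser


class RemoteContentFinder(HTMLParser):
    """Collect remote image URLs and note any <script> element."""

    def __init__(self):
        super().__init__()
        self.remote_images = []
        self.has_script = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script":
            self.has_script = True
        elif tag == "img":
            src = attributes.get("src") or ""
            if src.lower().startswith(("http://", "https://")):
                self.remote_images.append(src)


def inspect_html(body: str) -> RemoteContentFinder:
    finder = RemoteContentFinder()
    finder.feed(body)
    finder.close()
    return finder


body = ('<p>Hi</p><img src="http://tracker.example/bug.gif?id=42" width="1" height="1">'
        '<script>location="http://ad.example"</script>')
result = inspect_html(body)
print(result.remote_images, result.has_script)
```

A client could refuse to load anything in remote_images until the user asks for it, and ignore scripts entirely.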
Avoiding responding to spam
It is well established that some spammers regard responses to their messages — even responses which say “Don’t spam me” — as confirmation that an email address refers validly to a reader. Likewise, many spam messages contain Web links or addresses which the user is directed to follow to be removed from the spammer’s mailing list. In several cases, spam-fighters have tested these links and addresses and confirmed that they do not lead to the recipient address’s removal — if anything, they lead to more spam.
In Usenet, it is widely considered even more important to avoid responding to spam. Many ISPs have software that seeks out and destroys duplicate messages. Often someone sees a spam message and responds to it before it is cancelled by their server. This can have the effect of reposting the spammer’s spam for them, and since the reply is not an exact duplicate, the reposted copy will actually last longer.
In late 2003, the Federal Communications Commission (FCC) launched a public relations campaign to encourage email users to simply never respond to a spam email — ever. This campaign stemmed from the tendency of casual email users to reply to spam, in order to complain about the spam and ask the spammer to stop sending spam. This has the effect of alerting spammers to the existence of a person who actually reads spam email, and it has the effect of increasing spam rather than stopping it.
Reporting spam
The majority of ISPs explicitly forbid their users from spamming, and eject from their service users who are found to have spammed. Tracking down a spammer’s ISP and reporting the offense often leads to the spammer’s service being terminated. Unfortunately, it can be difficult to track down the spammer — and while there are some online tools to assist, they are not always accurate.
Two such online tools are SpamCop and Network Abuse Clearinghouse. Both provide automated or semi-automated means to report spam to ISPs. Some spam-fighters regard them as inaccurate compared to what an expert in the email system can do; however, most email users are not experts.
Defense against email worms
In the past several years, scores of worm programs have used email systems as a conduit for infection. The worm program transmits itself in an email message, usually as a MIME attachment. In order to infect a computer, the executable worm attachment must be opened. In almost all cases, this means the user must click on the attachment. The worm also requires a software environment compatible with its programming.
Email users can defend against worms in a number of ways, including:
- Avoiding email client software which supports executable attachments. The clients most frequently targeted by email worms are Microsoft Outlook and Outlook Express, both of which can easily be made to open executable attachments. However, other Windows-based email software is not immune to worms.
- Using an operating system which does not provide an environment compatible with present worms. Essentially all current email worms affect only the Microsoft Windows operating system. They cannot execute on Macintosh, Unix, Linux, or other operating systems. In some cases, it is conceivable that a worm could be written for one of these systems; however, various security features militate against it.
- Using up-to-date anti-virus software to detect incoming worms and quarantine or delete them before they can take effect.
- Being skeptical of unsolicited email attachments. Since worms and other email-borne malware arrive in this form, some email users simply refuse to open attachments that the sender has not given them advance notice of.
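As one illustration of the last two points, a script or mail gateway can flag attachments whose filenames end in an executable extension before a user has the chance to click on them. The extension list below is an assumption for the example and is far from complete; the code uses Python's standard email package.

```python
import os
from email.message import EmailMessage

# Hypothetical, incomplete list of extensions treated as executable.
RISKY_EXTENSIONS = {".exe", ".scr", ".pif", ".bat", ".com", ".vbs"}


def risky_attachments(msg) -> list:
    """Return the filenames of attachments with an executable extension."""
    names = []
    for part in msg.walk():
        filename = part.get_filename()
        if filename and os.path.splitext(filename)[1].lower() in RISKY_EXTENSIONS:
            names.append(filename)
    return names


msg = EmailMessage()
msg["Subject"] = "holiday photos"
msg.set_content("See attached.")
msg.add_attachment(b"MZ", maintype="application", subtype="octet-stream",
                   filename="photo.jpg.exe")
msg.add_attachment(b"data", maintype="image", subtype="jpeg", filename="photo.jpg")
print(risky_attachments(msg))
```

Checking only the final extension catches the common double-extension disguise (photo.jpg.exe), but it is no substitute for up-to-date anti-virus software.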
External links
- Coalition Against Unsolicited Commercial Email (http://www.cauce.org)
- Reporting Spam (http://spamcop.net)
- Spam Control Tools (http://www.samspade.org/ssw/)
- California lawyer who sues spammers (http://www.timothywalton.com)
- Address Munging FAQ: Spam-Blocking Your Email Address (http://members.aol.com/emailfaq/mungfaq.html)
- Challenge/Response at the SMTP Level (http://jamesthornton.com/writing/challenge-response-at-smtp-level.html): It may be possible to implement a challenge/response spam protection system using custom SMTP Delivery Status Notifications (DSNs) that include an HTML confirmation hyperlink.
Tools to reduce the impact of spam
- Mozilla (http://www.mozilla.org) and the soon to be released Thunderbird (http://texturizer.net/thunderbird/index.html): e-mail programs (“clients”) with a Bayesian filter, i.e. a filter that keeps learning and is therefore able to adapt to the constantly changing forms of spam
- Disposable e-mail accounts, various types for registering on web sites etc.
  - E4ward.com (http://www.e4ward.com) You can use your own domain name or e4ward.com for your aliases
  - Sneakemail (http://sneakemail.com) original disposable email address service
  - spamgourmet (http://www.spamgourmet.com) expire after a number of emails, but can be reset or ignored for some senders
  - jetable (http://jetable.org/) expiring in 1-8 days
- Making it harder to harvest e-mail addresses
  - hide email addresses (http://ourworld.compuserve.com/homepages/jamesday/antispam/obfuscate/index.htm) on web sites from harvesting tools
- Tools to filter out spam
  - SpamPal (http://www.spampal.org/) free (really) Windows filter with lots of filtering methods. Client or server-side filtering
  - Bogofilter (http://bogofilter.sourceforge.net/) Bayesian filter
  - Spambayes (http://spambayes.sourceforge.net/) Bayesian filter especially designed for use with Microsoft Outlook
  - SpamAssassin (http://spamassassin.org/) heuristic filter
  - TMDA, a challenge/response system
  - Checksum-based filter:
    - Distributed Checksum Clearinghouse (http://www.rhyolite.com/anti-spam/dcc/)
    - Vipul’s razor (http://razor.sourceforge.net/)
- Other tools
  - Services which guarantee messages as not being spam:
    - Habeas Sender Warranted Email (http://www.habeas.com/)
    - Bonded Sender (http://bondedsender.com/)
- Protocols for reducing spam
  - Spam-proofing the mail system (http://lwn.net/Articles/63578/); Linux Weekly News; December 17, 2003.