Spam Goes Literary
The e-mail came from an unknown address and the contents were equally inscrutable. "If they fire, Watson, have no compunction about shooting them down."
What did this person want?
The answer lay in an advertisement attached to the e-mail: "Now listen! This stock could help you make huge amounts of money in weeks!"
It was spam, but a modern literary version. Spammers are still hawking stocks and selling schemes to shrink your waistline or make other things grow bigger. But they're now enlisting great writers in their efforts: Charles Dickens, George Bernard Shaw, or in this case, Sir Arthur Conan Doyle. The lines in the e-mail were from Sherlock Holmes.
Greg Newby immediately guessed that. He's the director of Project Gutenberg, a nonprofit that has posted the full text of books online since the very early days of the Internet. Newby says people sometimes contact him to complain about these kinds of e-mails. But they're not his fault.
"No, we don't send spam," he says. "We're not doing anything other than trying to give away good literature."
A better person to blame (or thank) would be Paul Graham. He's not a spammer; he's a programmer famous for creating one of the first really good spam filters.
In 2002, he was trying to write a little program to separate spam from ordinary e-mail. It did what you'd expect: it looked for keywords like "click," as in "click here to buy our product." Graham says the results were less than spectacular.
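A keyword filter of this kind can be sketched in a few lines of Python. This is an illustration only, not Graham's program: the keyword list and the function name keyword_filter are assumptions made for the example. It also shows how a one-character substitution slips past an exact keyword match.

```python
# Hypothetical keyword list; the article only mentions "click".
SPAM_KEYWORDS = {"click"}


def keyword_filter(message: str) -> bool:
    """Return True if any word in the message is a known spam keyword."""
    words = message.lower().split()
    # Strip common punctuation so "click!" still matches "click".
    return any(word.strip(".,!?\"'") in SPAM_KEYWORDS for word in words)


print(keyword_filter("Click here to buy our product"))  # True
print(keyword_filter("Cl1ck here to buy our product"))  # False: "cl1ck" is not on the list
```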
"For one thing, spammers could just replace the 'I' in click with a '1' and you'd be out of luck," he says. "And they did in fact start doing that."
Graham tried something different. He wrote a program to find out how to best separate spam from real e-mail. To train it, he fed it a good helping of spam and a separate sample of real e-mail.
The program looked at each word and counted how many times it appeared in spam and in legitimate mail. It found, for instance, that words like "lunch" tend to be in legitimate e-mails. And words like "Viagra" or "cl1ck" are more likely to be in spam.
"This was 50 lines of code," Graham said. "It took me a day to write."
He ran this simple filter on his incoming e-mail. It evaluated all the words in each e-mail, and calculated an overall probability that the e-mail was spam.
Remarkably, it caught more than 99 percent of new spam, and let all his real e-mail through.
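The word-counting approach described above can be sketched roughly as follows. This is not Graham's 50 lines: it is a simplified naive-Bayes-style illustration, and the function names, the add-one smoothing, and the four tiny training messages are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(spam_messages, real_messages):
    """Count how often each word appears in spam and in real mail."""
    spam_counts = Counter(w for m in spam_messages for w in tokenize(m))
    real_counts = Counter(w for m in real_messages for w in tokenize(m))
    return spam_counts, real_counts


def word_spam_probability(word, spam_counts, real_counts):
    """Estimate how strongly a single word points toward spam."""
    # Add-one smoothing (an assumption) keeps unseen words away from 0 or 1.
    spam_rate = (spam_counts[word] + 1) / (sum(spam_counts.values()) + 2)
    real_rate = (real_counts[word] + 1) / (sum(real_counts.values()) + 2)
    return spam_rate / (spam_rate + real_rate)


def spam_probability(message, spam_counts, real_counts):
    """Combine the per-word estimates into one overall probability."""
    log_odds = 0.0
    for word in set(tokenize(message)):
        p = word_spam_probability(word, spam_counts, real_counts)
        log_odds += math.log(p) - math.log(1 - p)
    # Numerically stable conversion from log-odds back to a probability.
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1 + e)


# Made-up training data, far smaller than any real mailbox.
spam = ["cl1ck here to buy Viagra", "cl1ck now, this stock will make money"]
real = ["lunch tomorrow at noon?", "notes from the lunch meeting"]
spam_counts, real_counts = train(spam, real)

print(spam_probability("cl1ck here for Viagra", spam_counts, real_counts))
print(spam_probability("lunch at noon", spam_counts, real_counts))
```

With this toy data, the first message scores above 0.5 and the second below it. A real filter would train on thousands of messages and pick its cutoff with more care.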
"I was so delighted," Graham said. "It got practically all my spam the first time try."
Reformulating 'Paradise Lost'
And this is why the spammers have had to resort to literature. Filters like the one Graham wrote are everywhere now. In order to get past them, spammers try to make the text of their e-mails look more like something you'd actually write.
These spammers mine Web sites that post the full text of books, like Project Gutenberg, which, along with its affiliates, has more than 250,000 books online.
Spammers also need each e-mail to look different so the filters can't pick up on particular passages. Sometimes the spam-making programs do this by rearranging sentences. Other times they compose fake sentences out of pairs of words that tend to occur together.
This is called "Markov Chaining," after the Russian mathematician Andrey Markov. Graham says it explains the word salads you may see in spam.
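A minimal sketch of this word-pair technique, assuming a simple first-order chain over whole words. The function names and the sample sentence are invented for illustration; a spammer's tool would presumably be fed the text of a whole book instead.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the source."""
    words = text.split()
    followers = defaultdict(list)
    for current, following in zip(words, words[1:]):
        followers[current].append(following)
    return followers


def generate(followers, start, length, seed=None):
    """Walk the chain so every adjacent pair of words occurs in the source."""
    rng = random.Random(seed)
    output = [start]
    while len(output) < length:
        options = followers.get(output[-1])
        if not options:  # dead end: nothing ever follows this word
            break
        output.append(rng.choice(options))
    return " ".join(output)


# Invented sample text standing in for a full book.
source = "the cat sat on the mat and the dog sat on the log"
chain = build_chain(source)
print(generate(chain, "the", 8, seed=1))
```

Each adjacent pair in the output appears somewhere in the source, so the result looks locally plausible even though the sentence as a whole means nothing.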
"'Half lost on my firmness gains to more glad heart or violent and from forage drives a glimmering of all sun new begun,'" Graham quotes. "Every pair of words in there actually occurs in Paradise Lost," he says.
The filtering technique Graham pioneered still works pretty well, he says, because the great authors of old used different words than those that usually appear in modern e-mail. The word "Bolshevism" turns out to be a very good indicator of spam.
Greg Newby at Project Gutenberg says one spam caught his interest. It contained these mysterious lines:
"No civilians, no outsiders, this is a secret operation all the way. It is now, but it goes public in seven days. We just need to make up a cover story."
He says he'd love to read the book, but can't figure out where the text is from.
Copyright 2022 NPR. To see more, visit https://www.npr.org.