Pinging Your Blogs to Search Engines

Monday, July 18, 2005

RSS spam and how to deal with it

RSS feeds are starting to appear everywhere as people realize how much traffic they can bring in. However, just like any other attention getter, RSS feeds face a growing threat from spammers: RSS spam.

What exactly is RSS spam?

RSS spam is an RSS feed whose articles usually contain confusing paragraphs of text, written to slip past Bayesian spam filters. At a glance the articles look like authentic RSS articles, but if you read them, you can tell they make no sense and contain nothing but keyword-rich filler.

Clicking any of the article links takes you to a similarly spammy web page filled with keyword-rich junk content and, of course, Google AdSense ads. How else would the spammers make money?

I thought I would jot down some thoughts on possible ways to handle RSS spam on the RSS search engines like ReadABlog http://www.readablog.com and the others.

The first method is the typical blacklisting of spammy keywords like viagra, replica watches, etc. But RSS spam tends to be much more variable, covering many topics just to get Google clicks from unsuspecting users, so a fixed keyword list will miss much of it.
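As a rough sketch of the blacklist idea, the snippet below checks an article's text against a small set of terms. The `BLACKLIST` contents and the `blacklisted_terms` helper are my own illustrative names, not anything an RSS search engine actually uses.

```python
import re

# Hypothetical blacklist; a real one would be much longer and need constant upkeep.
BLACKLIST = {"viagra", "replica watches"}

def blacklisted_terms(text, blacklist=BLACKLIST):
    """Return the blacklisted terms found in text as whole words or phrases."""
    lowered = text.lower()
    hits = set()
    for term in blacklist:
        # Word boundaries keep "viagra" from matching inside unrelated words.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            hits.add(term)
    return hits
```

Calling `blacklisted_terms(article_text)` on each article and flagging the feed when the result is non-empty is the whole method, which is also why it is so easy for a spammer to sidestep.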

A better method would be to implement a Bayesian spam filter trained on known "good" RSS feeds and known "bad" RSS feeds. This may provide better results, as spam feeds are almost certain to contain keywords that mark them as spam. But what about the feeds which are created specifically to bypass these filters?
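To make the Bayesian idea concrete, here is a minimal naive Bayes sketch trained on "good" and "bad" feed text. The class name, the tokenizer, and the add-one smoothing are my own assumptions for illustration; a production filter would need far more training data and tuning.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesFeedFilter:
    """Toy naive Bayes classifier over feed text (illustrative only)."""

    def __init__(self):
        self.word_counts = {"good": Counter(), "bad": Counter()}
        self.doc_counts = {"good": 0, "bad": 0}

    def train(self, text, label):
        """Add one example; label must be "good" or "bad"."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        """Estimate the probability that text belongs to the "bad" class."""
        vocab = set(self.word_counts["good"]) | set(self.word_counts["bad"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("good", "bad"):
            total_words = sum(self.word_counts[label].values())
            # Smoothed class prior, so an untrained class never yields log(0).
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing with one extra slot for unseen words.
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Normalize in log space to avoid underflow on long texts.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["bad"] / (weights["good"] + weights["bad"])
```

After a few `train(text, "good")` and `train(text, "bad")` calls, `spam_probability(article_text)` gives a score between 0 and 1. Feeds written as word salad to dilute the spammy keywords will pull that score toward the middle, which is exactly the bypass problem.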

A third method could be a combination check. The first part would scan the RSS feed URL and the article URLs for spammy patterns. Lots of dashes in a URL tend to indicate a spam web site, such as viagra-pills-low-viagra-cost.com. The second part would fetch the target web page of each article link in the feed and see whether a Google AdSense script is present in the resulting page. A check on the width, height, and position of the AdSense box would probably sharpen the result. Spam web pages usually contain a large Google AdSense box at the top of the page (near the top of the HTML code) and almost always use the two largest AdSense size options. Legitimate web sites rarely use such large AdSense boxes, and if they do, the boxes are usually located further down the page.
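The combination check could be sketched roughly as follows. It works on a URL string and on HTML that has already been fetched. The thresholds, the function names, and the assumption that the Google ad code shows up as a `googlesyndication.com` script with `google_ad_width` and `google_ad_height` variables are all mine, so treat this as a starting point rather than a tested detector.

```python
import re
from urllib.parse import urlparse

MAX_HOST_DASHES = 2          # hypothetical: more dashes than this looks suspicious
LARGE_AD_AREA = 300 * 250    # hypothetical cutoff for a "large" ad box, in pixels
TOP_OF_PAGE_FRACTION = 0.25  # hypothetical: ad code within the first quarter of the HTML

def dash_count(url):
    """Count dashes in the host name; the URL must include a scheme such as http://."""
    host = urlparse(url).hostname or ""
    return host.count("-")

def ad_signals(html):
    """Look for the assumed Google ad markup and report its size and position."""
    marker = html.find("googlesyndication.com")
    if marker == -1:
        return {"has_ad": False, "near_top": False, "large": False}
    width = re.search(r"google_ad_width\s*=\s*(\d+)", html)
    height = re.search(r"google_ad_height\s*=\s*(\d+)", html)
    area = int(width.group(1)) * int(height.group(1)) if width and height else 0
    return {
        "has_ad": True,
        "near_top": marker / len(html) < TOP_OF_PAGE_FRACTION,
        "large": area >= LARGE_AD_AREA,
    }

def spam_score(url, html):
    """Return how many of the three signals fired (0 to 3)."""
    signals = ad_signals(html)
    return (
        int(dash_count(url) > MAX_HOST_DASHES)
        + int(signals["near_top"])
        + int(signals["large"])
    )
```

A feed could then be flagged when, say, most of its article links score 2 or higher. Counting signals rather than requiring all of them is a design choice on my part: any single signal on its own will hit plenty of legitimate sites.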

Perhaps an even better solution would be something similar to what is being attempted for email with SPF, whose result is recorded in the "Received-SPF" header. SPF is a way of telling whether an email is really coming from the domain the sender claims. An update to the RSS schema might allow a way of signing and authenticating an RSS feed, although I am not sure how this could be done without bringing in a third party to validate the feed, similar to how software is signed through Verisign and the like.

These are just some ideas on RSS spam checking. The problem is only going to grow as RSS becomes more popular.


