Tuesday, June 12, 2007

Beating Scraper Sites

I've gotten a few emails recently asking me about scraper sites and how to beat them. I'm not sure anything is 100% effective, but you can probably use them to your advantage (somewhat). If you're unsure about what scraper sites are:

A scraper site is a website that pulls all of its information from other websites using web scraping. In essence, no part of a scraper site is original. A search engine is not an example of a scraper site. Sites such as Yahoo and Google gather content from other websites and index it so you can search the index for keywords. Search engines then display snippets of the original site content which they have scraped in response to your search.

In the last few years, due to the advent of the Google AdSense web advertising program, scraper sites have proliferated at an amazing rate for spamming search engines. Open content sources, such as Wikipedia, are a common source of material for scraper sites.

from the main article at Wikipedia.org

Note that having a vast array of scraper sites hosting your content may lower your rankings in Google, since you can sometimes be perceived as spam. So I recommend doing everything you can to prevent that from happening. You won't be able to stop every one, but you'll be able to benefit from the ones you don't stop.

Things you can do:

Include links to other posts on your site in your posts.

Include your blog name and a link to your blog in your posts.

Manually whitelist the good spiders (Google, MSN, Yahoo, etc.).

Manually blacklist the bad ones (scrapers).

Automatically block all-at-once page requests.

Automatically block visitors that disobey robots.txt.

Use a spider trap: you have to be able to block access to your site by IP address. This is done through .htaccess (I do hope you're using Apache on a Linux server). Create a new page that will log the IP address of anyone who visits it (don't set up banning yet, if you see where this is going). Then set up your robots.txt with a "Disallow" rule for that page. Next you must place a link to the page in one of your pages, but hidden, where a normal user will not click it. Use a table set to display:none or something. Now, wait a few days, as the good spiders (Google etc.) have a cache of your old robots.txt and could accidentally get themselves banned. Wait until they have the new one before you turn on the autobanning. Track this progress on the page that collects IP addresses. When you feel good (and have added all the major search spiders to your whitelist for extra protection), change that page to log and autoban each IP that views it, and redirect them to a dead-end page. That should take care of quite a few of them.
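
To make the spider trap steps a bit more concrete, here is a rough sketch in Python. It is not a drop-in implementation: the names TRAP_PATH, WHITELIST and SpiderTrap are made up for illustration, the whitelist uses a placeholder network rather than any real search engine's addresses, and the "Deny from" lines assume Apache 2.2-style .htaccess syntax, which may not match your server. It only shows the logic: log first, ban later, and never ban whitelisted spiders.

```python
import ipaddress

# Hypothetical path of the hidden trap page (an assumption, pick your own).
TRAP_PATH = "/trap.html"

# Placeholder network from a reserved documentation range. In practice this
# would hold the address ranges of the spiders you trust.
WHITELIST = [ipaddress.ip_network("192.0.2.0/24")]


def robots_txt(trap_path=TRAP_PATH):
    """Return a robots.txt body telling well-behaved spiders to skip the trap."""
    return f"User-agent: *\nDisallow: {trap_path}\n"


def is_whitelisted(ip):
    """True if the IP falls inside any whitelisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in WHITELIST)


class SpiderTrap:
    def __init__(self, autoban=False):
        self.autoban = autoban  # leave False during the waiting period
        self.seen = []          # every IP that requested the trap page
        self.banned = set()     # IPs banned once autoban is switched on

    def hit(self, ip):
        """Log a visit to the trap page; return True if the IP got banned."""
        self.seen.append(ip)
        if self.autoban and not is_whitelisted(ip):
            self.banned.add(ip)
            return True
        return False

    def htaccess_lines(self):
        """Render bans as Apache 2.2-style 'Deny from' lines (assumed syntax)."""
        return [f"Deny from {ip}" for ip in sorted(self.banned)]
```

You would call hit() from whatever serves the trap page, review seen for a few days, then set autoban to True and append the output of htaccess_lines() to your .htaccess.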
