Category: Advanced
Content scraping, or web scraping, is the act of extracting data from websites. Content scraping involves copying only the data that is visible to you and is not necessarily the same as hacking. However, copying that visible data might still raise copyright issues.
In this tutorial, I will give a general introduction to how a content scraper works and what steps you can take to avoid them.
In general, opening web pages manually and copy-pasting the data also counts as web scraping. However, for a large amount of data that monotonous work takes up considerable time, so scripts can be written to do it instead.
Pre-requisites:
- HTML, CSS
- Python (or a rough idea of how things work in Python)
Tools we will use:
- Firebug
- Python (libraries: BeautifulSoup, urllib)
The loophole in WordPress that we will take advantage of in this tutorial is the way it assigns URLs to posts. If you hover over your pages, you will find that the links are of the form "http://www.mysite.com/?p=1", where the '1' is a number that gets incremented. That is, however, only the default permalink structure. It can be of some other type, but you can still generate or grab the URLs from the pages.
Getting the webpage:
[code]
# importing the libraries
from urllib import urlopen
from bs4 import BeautifulSoup

# sending the HTTP request
webpage = urlopen('http://www.mysite.com/?p=1').read()

# making the soup! yummy 😉
soup = BeautifulSoup(webpage)
[/code]
Extracting the content:
Now that we have the soup, we need to extract the content we require. Primarily, we want the title and the content. To extract them, you need to search for them in the soup using their CSS properties. Press F12 to get Firebug running. Alternatively, you can right-click on the title and select "Inspect element with Firebug".
The HTML element which houses the title is an "h1" with the CSS class "entry-title". Here is how you get it from the soup.
[code]title = soup.find("h1", class_="entry-title").text[/code]
To find the body content, you would take a similar approach.
Note that this time the content is a paragraph (“<p>”) housed under a div with the CSS class “entry-content”. So, extract it using
[code]body = soup.find("div", class_="entry-content").text[/code]
Note that I am just extracting the text. You can make your program a bit more complex and extract the whole HTML and download the images in the posts too!
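If you want to go that route, here is a rough sketch of how it could look, assuming the post images sit inside the "entry-content" div; the 'images' folder and the filename handling are just illustrative choices, not part of the original code.
[code]
# Rough sketch: keep the full HTML of the post and download its images.
# The 'images' folder and filename handling are illustrative only.
import os
from urllib import urlretrieve

content_div = soup.find("div", class_="entry-content")
body_html = str(content_div)  # the whole HTML instead of just the text

if not os.path.exists("images"):
    os.makedirs("images")

for img in content_div.find_all("img"):
    img_url = img.get("src")
    if img_url:
        urlretrieve(img_url, os.path.join("images", os.path.basename(img_url)))
[/code]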
Now that you have the title and body, you can save them anywhere you wish, or just print them out for fun. Yet another way would be to save them in an .xls file using the Python module xlwt.
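For instance, a minimal sketch with xlwt could look like this (the file name 'posts.xls' and the sheet layout are just examples):
[code]
# Minimal sketch: write the scraped title and body into an .xls file.
# The file name and sheet layout here are arbitrary examples.
import xlwt

workbook = xlwt.Workbook()
sheet = workbook.add_sheet("Posts")
sheet.write(0, 0, "Title")
sheet.write(0, 1, "Body")
sheet.write(1, 0, title)
sheet.write(1, 1, body)
workbook.save("posts.xls")
[/code]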
To get all the data off the site, put all of that code in a function and pass the page number in a loop. To demonstrate the use of page numbers, here is a code snippet.
[code]
def get_data(page_no):
    webpage = urlopen('http://www.mysite.com/?p=' + str(page_no)).read()
    # ...make the soup and extract the title and body as shown above...

for i in range(1, 100):
    get_data(i)
[/code]
Do you think you are done? Well, not quite. One small thing remains: you need a check to skip pages which do not exist. For example, posts which were unpublished or deleted would leave gaps in the sequence of page numbers, and WordPress would redirect you to a 404 (page not found) page. Here is the check.
[code]
def get_data(page_no):
    webpage = urlopen('http://www.mysite.com/?p=' + str(page_no)).read()
    soup = BeautifulSoup(webpage)
    # found a 404 page
    if soup.title.text == "Page not found | <blog_name>":
        return False
[/code]
Replace <blog_name> with the actual name of the blog in the code.
There you go. That is how content scrapers get data off your site. They might use more advanced tools like Mechanize or PhantomJS to emulate modern-day browsers while sending those requests, but the general idea remains the same.
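To give you an idea, even without those tools a scraper can make its requests look like they come from a regular browser simply by spoofing the User-Agent header. Here is a rough sketch using urllib2; the User-Agent string is just an example.
[code]
# Sketch: send the request with a browser-like User-Agent header.
# The User-Agent string below is only an example.
import urllib2
from bs4 import BeautifulSoup

request = urllib2.Request(
    'http://www.mysite.com/?p=1',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:30.0) Gecko/20100101 Firefox/30.0'}
)
webpage = urllib2.urlopen(request).read()
soup = BeautifulSoup(webpage)
[/code]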
How to find content scrapers:
Before we go on and catch them, let's think for a moment about why someone would want your content. The short answer is that your content is great. It has such high quality that someone can't keep their hands off it.
The Google Search way: It might not sound very exciting to you, but this is a very effective and efficient way of finding if your content is out there somewhere. Note that you must perform the search the following way.
Copy a paragraph of your content and put it within double quotes before searching in Google. Putting it in double quotes means that Google would search for EXACTLY that content and nothing less. It avoids giving unnecessary search results even if you are writing on a popular topic.
Google Webmasters: If you use Google Webmasters, go to Traffic > Links to Your Site under your site’s settings.
If someone is scraping your content, chances are that they should be among the top ones in the list.
Feedburner: If you use Feedburner for your RSS feeds, you can check if there are any such possible scraping attempts under uncommon uses.
What to do with content scrapers?
Now that you have searched for possible content scrapers and probably even found someone, what would you do with them?
Ask them to take it down: The first step is naturally to ask them politely. When you visit their site, there should be a link to contact the author or the site owner, or at the very least a contact email for the site owner or webmaster. Do not forget to mention the links to your original posts in the email and ask them to take down the plagiarised content.
The Hard Way: If they do not reply to your emails or refuse to take the content down, you can file a DMCA (Digital Millennium Copyright Act) report with their host. Note that their host probably has no idea that they are hosting plagiarised content, so you need to explain your situation very patiently. If you are not able to find the host of the site, you can do a Whois lookup.
Blocking them:
You can go another step ahead and block their IP addresses in your server so that they can’t access your site again. You can block IP addresses in your .htaccess file by adding the following.
[code]
Deny from 192.168.121.156
[/code]
You can do one more thing and redirect the scrapers; they would not know what hit them!
[code]
RewriteCond %{REMOTE_ADDR} 192\.168\.121\.
RewriteRule .* http://google.com [R,L]
[/code]
It is interesting to note that instead of Google, you can redirect them to some government website. You can make an RSS feed full of junk and serve them that, or include huge images to slow their process down. Best of all, you can redirect them right back to their own server, causing an infinite loop. You can read more on this in this great article by Jeff Starr.
Note that you need to restart Apache after making these changes for them to take effect.
Take advantage of content scrapers:
Internal Linking: Internal linking, in general, helps increase your blog traffic and reduces the bounce rate. However, it also helps you in case your content is copied. On the new site, the article or post would still link back to posts on your own site. Since these scraping processes are automated, it is generally very difficult for the scraper to remove or change the internal links.
Putting Ads: In general, when you put ads in your posts, they either get copied directly by the scraper or the JavaScript that generates the ads gets copied. Either way, when the ad appears on the scraper's site and someone clicks on it, YOU get the benefit! Advertising is evil after all!
My experience with Plagiarism:
My blog was just a year old when, one fine day, I got an anonymous comment from some well-wisher saying that the content of that post had been copied elsewhere, with a link to the plagiarised content (now removed). I was shocked to find that the post had been copied word for word! I contacted the author but did not receive any reply in the next 24 hours.
I chose the hard way and sent a mail to the hosting site with cc’s to the author and the blog owners. Within an hour, the author acknowledged my mail, apologized and removed the content.