Tuesday 30 December 2014

How To Access Information About PDF Data Scraping?

Scraping can be described as a technique in which a computer program extracts data from the output of another program. Simply put, it is a process of automatically gathering information from the Internet, which can be found in various sources, including HTML files, PDF documents and others. The relevant information is then collected and placed into a database or spreadsheet, allowing users to retrieve it later.

Most websites today present their text in plain HTML, which can be viewed and read straight from the source code. However, there are companies that instead choose Adobe PDF, or Portable Document Format. This file type can be viewed using the free Adobe Acrobat software and is supported by virtually all operating systems. There are many advantages to creating PDF files: the document looks exactly the same even if you open it on another computer, which makes the format well suited to business documents and data sheets. Of course there are drawbacks too. One of them is that the text is often converted into an image, and in that case copying and pasting the content becomes a problem.

That is why some people have started scraping information from PDFs. PDF scraping is simply the process of getting data out of PDF files. To start scraping information from PDFs, you must choose a tool specially designed for this process. Finding a tool that scrapes PDFs effectively is not easy, however, because it has to access exactly the same data without running into the problems described above.

If you look around, though, you will come across suitable programs. You do not need to know programming to use them: you simply specify your preferences and the software does the rest. There are also companies you can contact that will do the work for you, because they have the right tools. If you choose to do things yourself, you will find it difficult and complicated compared to having professionals do the work, as they can finish it in no time. PDF scraping is a process of collecting information that can already be found on the Internet, and so does not amount to copyright infringement.

I hope you now understand how to scrape data in its various forms. If anything is unclear, visit one of the sites mentioned below in the author box. We offer a variety of data services, such as HTML scraping, web content scraping, email ID scraping, property data scraping, LinkedIn data scraping, hotel data scraping, pharmaceutical data scraping, business contact scraping, data scraping for universities and more. If you have any doubts, please feel free to ask us without hesitation. We will certainly be of help to you. Thank you.

Source:http://www.articlesbase.com/outsourcing-articles/how-to-access-information-about-pdf-data-scraping-5293692.html

Sunday 28 December 2014

Most Of The Recommended Web Scraping Data Into Business

Traditional Web search engines index the websites they visit, depending on how the pages were collected. Their main disadvantage is that they do not provide a way to extract the specific information you need.

In modern times, however, the concept of website scraping has emerged. Scraping gathers all the relevant information and data contained in any website that can be found on the Internet, together with how it appears.

Organizations and individuals have recognized that web scraping lets them gather information quickly and effectively. Data that could not otherwise be collected in a structured form can be accessed without having to contend with endless cut and paste.

The information can then be arranged into whatever type of document is required. Compared with the tools used by traditional search engines, scraping combines broad reach with more sophisticated selection, returning only the information that matches the criteria specified.

The software also makes it easy to gather news reports and to run price comparisons and other analyses. This is the main reason a growing number of companies and agencies that work on the Internet treat website scraping as a requirement.

Data Scraping Services, a reliable company based in India, provides offshore website scraping solutions to its customers. It can handle your web search, data scraping, data mining, data conversion, data extraction and web data scraping requirements.

Data Scraping Services is an India-based internet scraping solutions provider and a trusted and reliable outsourcing partner. It offers high quality, accurate, manual internet data scraping and web scraping services at the lowest rates in the industry.

Data Scraping Services is an India-based firm with expertise in outsourced data entry, data processing, Internet search and website data scraping. It has offered a wide variety of data entry, data conversion, document scanning and data scraping services at the lowest rates in the industry since 2005. The services we offer cover the following areas: data entry, data mining, web search, data conversion, data processing, website scraping, and the harvesting and collection of internet and email data.

Data Scraping Services follows a standard process to deliver the highest quality web search, data mining and website scraping services, and runs its web search, data mining and data conversion projects to documented quality standards.

The industries that most often need data scraped include lawyers, doctors, hospitals, students, schools, universities, chiropractors, dentists, hotels, property and real estate, pubs, bars, night clubs, restaurants and IT professionals. The most common sources for scraping business databases and email addresses are online business directories, LinkedIn, Twitter, Facebook, other social networking sites and Google search.

Data Scraping Services is a trusted and reliable worldwide provider of data processing, data scraping, website data scraping, data mining, data extraction and business development database services. We have already scraped some popular online business directories. We only scrape the publicly available data in any business directory.

Source:http://www.articlesbase.com/outsourcing-articles/most-of-the-recommended-web-scraping-data-into-business-5697814.html

Wednesday 24 December 2014

Data Mining Explained

Overview

Data mining is the crucial process of extracting implicit and possibly useful information from data. It uses analytical and visualization techniques to explore and present information in a format which is easily understandable by humans.

Data mining is widely used in a variety of profiling practices, such as fraud detection, marketing research, surveys and scientific discovery.

In this article I will briefly explain some of the fundamentals and its applications in the real world.

Herein I will not discuss related processes of any sorts, including Data Extraction and Data Structuring.

The Effort

Data Mining has found its application in various fields such as financial institutions, health-care & bio-informatics, business intelligence, social networks data research and many more.

Businesses use it to understand consumer behavior, analyze buying patterns of clients and expand its marketing efforts. Banks and financial institutions use it to detect credit card frauds by recognizing the patterns involved in fake transactions.

The Knack

There is definitely a knack to Data Mining, as there is with any other field of web research activity. That is why it is referred to as a craft rather than a science. A craft is the skilled practice of an occupation.

One point I would like to make here is that data mining solutions offer an analytical perspective on the performance of a company based on its historical data, but one also needs to consider unknown external events and deceitful activities. On the flip side, it is all the more critical, especially for regulatory bodies, to forecast such activities in advance and take the necessary measures to prevent such events in the future.

In Closing

There are many important niches of Web Data Research that this article has not covered. But I hope that it will give you a starting point to drill down further into this subject, if you want to do so!

Should you have any queries, please feel free to mail me. I would be pleased to answer each of your queries in detail.

Source: http://ezinearticles.com/?Data-Mining-Explained&id=4341782

Monday 22 December 2014

GScholarXScraper: Hacking the GScholarScraper function with XPath

Kay Cichini recently wrote a word-cloud R function called GScholarScraper on his blog which, when given a search string, will scrape the associated search results returned by Google Scholar, across pages, and then produce a word-cloud visualisation.

This was of interest to me because around the same time I posted an independent Google Scholar scraper function, get_google_scholar_df(), which does a similar job to the scraping part of Kay’s function using XPath (whereas he had used Regular Expressions). My function works as follows: given a Google Scholar URL, it extracts as much information as it can from each search result on that page into different columns of a dataframe structure.

In the comments of his blog post I figured it’d be fun to hack his function to provide an XPath alternative, GScholarXScraper. Essentially it’s still the same function he wrote and therefore full credit should go to Kay on this one as he fully deserves it – I certainly had no previous idea how to make a word cloud, plus I hadn’t used the tm package in ages (to the point where I’d forgotten most of it!). The main changes I made were as follows:

    Restructure the internal code of GScholarScraper into a series of local functions which each do a separate job (this made it easier for me to hack because I understood what was doing what and why).

    As far as possible, strip out Regular Expressions and replace them with XPath alternatives (made possible via the XML package). Hence the change of name to GScholarXScraper. Basically, apart from a little messing about with the generation of the URLs, I just copied over my get_google_scholar_df() function and removed the Regular Expression alternatives. I’m not saying one is better than the other but for me personally, I find XPath shorter and quicker to code, though either is a good approach for web scraping like this (note to self: I really need to learn more about regular expressions!) :)

•    Vectorise a few of the loops I saw (it surprises me how second nature this has become to me – I used to find the *apply family of functions rather confusing but thankfully not so much any more!).
•    Make use of getURL from the RCurl package (I was getting some multibyte string problems originally when using readLines but this approach automatically fixed them for me).
•    Add an option to make a word-cloud from either the “title” or the “description” fields of the Google Scholar search results.
•    Added stemming via the Rstem package because I couldn’t get the Snowball package to install with my version of Java. This was important to me because I was getting word clouds with variations of the same word in them, e.g. “game”, “games”, “gaming”.
•    Forced use of URLencode() on generation of URLs to automatically avoid problems with search terms like “Baldur’s Gate” which would otherwise fail (see the short sketch below).
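
To illustrate that last point, here is a minimal sketch of the idea; note that the exact query-string pattern used by GScholarXScraper is not shown in this post, so the URL template below is an assumption for illustration only:

# Minimal sketch: URL-encode a search term before pasting it into a query URL.
# The scholar.google.com query pattern is an assumption, not taken from the function itself.
search.str <- "Baldur's Gate"
encoded    <- URLencode(search.str, reserved = TRUE)   # "Baldur%27s%20Gate"
url        <- paste0("http://scholar.google.com/scholar?q=", encoded)
print(url)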

I think that’s pretty much everything I added. Anyway, here’s how it works (link to full code at end of post):

# #EXAMPLE 1: Display word cloud based on the title field of each Google Scholar search result returned
# GScholarXScraper(search.str = "Baldur's Gate", field = "title", write.table = FALSE, stem = TRUE)
#
# # word freq
# # game game 71
# # comput comput 22
# # video video 13
# # learn learn 11
# # [TRUNC...]
# #
# #
# # Number of titles submitted = 210
# #
# # Number of results as retrieved from first webpage = 267
# #
# # Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles

[Image: word cloud generated from the 'title' field]

I think that’s kind of cool and corresponds to what I would expect for a search about the legendary Baldur’s Gate computer role playing game :) The following is produced if we look at the ‘description’ field instead of the ‘title’ field:

# # EXAMPLE 2: Display word cloud based on the description field of each Google Scholar search result returned
GScholarXScraper(search.str = "Baldur's Gate", field = "description", write.table = FALSE, stem = TRUE)
#
# # word freq
# # page page 147
# # gate gate 132
# # game game 130
# # baldur baldur 129
# # roleplay roleplay 21
# # [TRUNC...]
# #
# # Number of titles submitted = 210
# #
# # Number of results as retrieved from first webpage = 267
# #
# # Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles

[Image: word cloud generated from the 'description' field]

Not bad. I could see myself using the text mining and word cloud functionality with other projects I’ve been playing with such as Facebook, Google+, Yahoo search pages, Google search pages, Bing search pages… could be fun!

Many thanks again to Kay for making his code publicly available so that I could play with it and improve my programming skill set.

Code:

Full code for GScholarXScraper can be found here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/GScholarXScraper/GScholarXScraper

Original GScholarScraper code is here: https://docs.google.com/document/d/1w_7niLqTUT0hmLxMfPEB7pGiA6MXoZBy6qPsKsEe_O0/edit?hl=en_US

Full code for just the XPath scraping function is here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/googleScholarXScraper/googleScholarXScraper.R

Source:http://www.r-bloggers.com/gscholarxscraper-hacking-the-gscholarscraper-function-with-xpath/

Wednesday 17 December 2014

Online Data Entry and Data Mining Services

Data entry job involves transcribing a particular type of data into some other form. It can be either online or offline. The input data may include printed documents like Application forms, survey forms, registration forms, handwritten documents etc.

Data entry is an inevitable part of the work of any organization; one way or another, every organization needs it. The skills required vary depending on the nature of the job: in some cases data is entered from hard-copy formats, and in other cases it is entered directly into a web portal. Online data entry jobs generally require the data to be entered into an online database.

For a supermarket, a data associate might be required to enter the goods sold and the new goods received on a particular day in order to keep the stock in order. By doing this, the concerned managers also get an idea of the sale particulars of each commodity as they require. In another example, in an office the account executive might be required to input the day-to-day expenses into the online accounting database in order to keep the accounts in order.

The aim of the data mining process is to collect information from reliable online sources as per the customer's requirement and convert it into a structured format for further use. The major sources for data mining are internet search engines like Google, Yahoo, Bing, AOL, MSN etc. Many search engines, such as Google and Bing, provide customized results based on the user's activity history. Based on our keyword search, the search engine lists the websites from which we can gather the details we require.

Data such as company name, contact person, company profile, contact phone number or email ID is collected from online sources for marketing activities. Once the data has been gathered from the online sources into a structured format, the marketing team can start its promotions by calling or emailing the people concerned, which may result in new customers. So data mining plays a vital role in today's business expansion. By outsourcing data entry and its related work, you can save the costs that would otherwise be incurred in setting up the necessary infrastructure and hiring employees.

Source:http://ezinearticles.com/?Online-Data-Entry-and-Data-Mining-Services&id=7713395

Monday 15 December 2014

RAM Scraping a New Old Favorite For Hackers

Some of the best stories involve a conflict with an old enemy: a friend-turned-foe, long thought dead, returning from the grave for violent retribution; an ancient order of dark siders from the distant reaches of the galaxy, hiding in plain sight and waiting to seize power for themselves; a dark lord thought destroyed millennia ago, only to rise again and seek his favorite piece of jewelry.  The list goes on.

Granted, 2011 isn’t quite “millennia,” and this story isn’t meant for entertainment, but the old foe in this instance is nonetheless dangerous in its own right.  That is the year when RAM scraping malware first made major headlines: originating as an advanced version of the Trackr malware, controlled through a botnet, it was discovered in the compromised Point of Sale (POS) systems of a university and several hotels.  And while it seemed recently that this method had dwindled in popularity, the Target and other retail breaches saw it return with a vengeance.  With 110 million Target customers having their information compromised, it was easily one of the largest incidents involving memory scrapers.

How does it work?  First, the malware has to be introduced into the POS network, which can happen via any machine that is connected to the network, or unsecured wireless networks.  Even with firewalls, an infected laptop could serve as a vector.  Once installed, the malware can hide in the shadows, employing encryption or antivirus-avoiding tools to prevent its identification until it’s ready to strike.  Then, when a customer’s card gets used at a POS machine, the data contained within—name, card number, security code, etc.—gets sent to the system memory.  “There is that opportunity to steal the credit card information when it is in memory, perhaps even before your payment has even been authorized, and the data hasn't even been written to the hard drive yet,” says security researcher Graham Cluley.

So, why not encrypt the system’s memory, when it’s at its most vulnerable?  Not that simple, sadly: “No matter how strong your encryption is, if the system needs to process data or process the code, everything needs to be decrypted in memory,” Chris Elisan, principal malware scientist at security firm RSA, explained to Dark Reading.

There are certain steps a company can take, of course, and should take, to reduce the risk.  Strong passwords to access the POS machines, firewalls to isolate the POS network from the Internet, disabling remote access to POS systems, to name a few.  All the same, while these measures are vital and should be used, I don’t think, in light of recent breaches, they are sufficient.  Now, I wrote a short time ago about the impending October 2014 deadline imposed by the credit card industry, regarding the systematic switch to chipped credit card technology; adopting this standard will definitely assist in eradicating this problem.  But, until such a time when a widespread implementation of new systems comes about, always be vigilant to protect your data from attack, because what’s old is new again, and a colossal data breach is a story consumers are liable to seek financial restitution for.

Source:http://www.netlib.com/blog/application-security/RAM-Scraping-a-New-Old-Favorite-For-Hackers.asp

Saturday 13 December 2014

Microfinance Data Scraping

I went to DataKind’s New York Datadive last November and met the Microfinance Information Exchange (MIX), a group that ‘delivers data services, analysis, research and business information on the institutions that provide financial services to the world’s poor’. They wanted to see whether web-scraping could save them from manually gathering data. So fellow divers and I showed MIX the utility of web-scraping. Over the course of a day, about six people scraped data about microfinance institutions from a bunch of websites, saving MIX an estimated year of manual data entry.

Over the past few months, I worked further with MIX to study who has access to what sorts of financial services. DataKind just put up our blog post about the project. Read the post, or just look at the map and explore the data.

Source:https://blog.scraperwiki.com/2012/05/microfinance-data-scraping/

Thursday 11 December 2014

Content Scraping Reuses Blog Posts without Permission

What do popular blogs and websites such as Social Media Examiner, Copy Blogger, CNN.com, Mashable, and Type A Parent have in common? No, it’s not just traffic and a loyal online community: each was a victim of the content scraping site “BuzzMyFx.” Although most bloggers fall victim to content scrapers at least once, the offending website was such an extreme case that the backlash against it was fast and furious. Thanks to the quick action of many angry bloggers, BuzzMyFx was taken down in a matter of days.

If you’re not familiar with content scraping sites and aren’t sure why they’re bad and what you can do if you fall prey, read on. Not knowing what steps you can take to remove your content from a scraping site can mean someone else is profiting from your hard work.

What is content scraping?

Content scraping is when a blog or website pulls in other bloggers’ content without permission, in many cases passing it off as their own. Instead of stocking their sites with unique content, they steal entire blog posts. Some do leave the original authors’ bylines, but there are plenty that don’t provide attribution at all. This is not a good thing at all.

If you don’t care about someone taking your content and putting it on their blogs and websites without your permission, you should. These sites are stealing traffic, search engine rankings, and even advertising revenue from bloggers. Moreover, by ignoring scraping sites you’re giving the message that this practice is OK.

It’s not OK.

How was BuzzMyFx different?

BuzzMyFx was a little different from your usual scrapers. Bloggers didn’t just find their content had been posted on this site, they learned their entire blogs — down to the design and comments — had been cloned. Plus, any bloggers checking to see if their blogs were being cloned immediately found themselves being scraped as well. Dozens, if not hundreds of blogs were affected. However, bloggers didn’t take this incident sitting down. They spread the word and contacted the site’s host en masse. Thanks to their swift action, and the high number of complaints, the site was removed quickly.

How can I tell if my content is being scraped?

Fortunately for content creators, scrapers are a lazy bunch. Because their sites are automated, and they don’t check or read the content being pulled, they don’t take many precautions to ensure the people they scrape from don’t find their sites. In fact, they may not even care. Fortunately, this makes it easy to learn if your content is being stolen.

    Link to your own articles — When you write a blog post and link to other (of your own) blog posts within that post, it’s not only good SEO. Because of these interlinks, you will also get pingbacks whenever someone steals your content: you’re alerted when someone links to your content, so when scraped content is published with your links intact, you’ll get that alert.

    Google Alerts — If your name, blog’s name, or other unique keywords are set up as Google Alerts, you’ll receive an e-mail every time content is published with these keywords.

    Analytics — When people click on your links that are in scraped content, it will show up as referring traffic in your analytics program. You should always check referring traffic so you can thank the referring site owner, but also to make sure no one is stealing your content.

What steps can I take to remove my content from a scraper?

If you find your content is being stolen, know you have several options. First, you’ll need to find out who owns the scraping site. You can do this with a WHOIS domain lookup, which will enable you to search for the website’s details, including the name of the webmaster, contact info, and the name of the site’s host.

Keep in mind that sometimes the website’s owner will pay extra to have his or her name kept private, but you will always be able to find the name of the host. Once you have this information, you can take the necessary steps to have your content removed.

    Contact the site’s owner personally: Your first step should always be a polite request to remove your content immediately. Let the website owner know he or she is in violation of the Digital Millennium Copyright Act (DMCA), and you will take the necessary steps to report him if he doesn’t comply.

    Contact the site’s host: If you can’t find the name of the person who owns the site, or if he won’t comply with your takedown request, contact the website’s host. You’ll have to prove your content is being stolen. As the host can be held liable for allowing the content theft, it’s in their best interest to contact the website owner and request removal.

    Contact Google: You can contact Google and fill out a form to have them remove the website from their search engines.

    Spread the word: Let all your blogging friends know about content scrapers when you come across them. The more people who take action against content scrapers, the less likely they are to do it again.

Contacting the webmaster with a takedown notice doesn’t have to be an intimidating process, either. The website Plagiarism Today has a wonderful set of stock letters to use to contact webmasters, web hosts, and even Google. All you have to do is insert the necessary information.

Content scrapers and cloners may try to steal your content, but you don’t have to let them. Stand up for what’s yours.

Source: http://www.dummies.com/how-to/content/content-scraping-reuses-blog-posts-without-permiss.html

Thursday 4 December 2014

Finding & Removing Spam Blogs Who Scrape Content Onto Free Hosted Blogs

The more popular you become in the blogging world, the more crap you have to deal with!
Content scraping is one chore that can be dealt with swiftly once you understand what to do.
This post contains links which you can use to quickly and easily report content scrapers and spam blogs.
Please share this post and help clean up spam blogs and punish content scrapers.
The first step is to find which of your URLs have had their content scraped and then get the scraper’s spam blog removed.

Some of the tools I use to do this are:

    Google Webmaster Tools
    Google Alerts


Finding Scraped Content
Log in to your Google Webmaster Tools account and go to Traffic > Links to Your Site.
You should see something like this:
[Image: Webmaster Tools – Links to Your Site]

The first domain is a site which has copied and embedded my homepage, which I have already dealt with.
The second site is a search engine.
The third domain is the one I want to deal with.

A common method scrapers use is to post the scraped content from your rss feed on to a free hosted blog like WordPress.com or blogger.com.

Once you click the WordPress.com link in Webmaster Tools, you’ll find all the URLs which have been scraped.
[Image: list of links to your site]

There are 32 URLs which have been linked to, so it’s simply a matter of clicking each of your links and finding the culprits.

The first link is my homepage which has been linked to by legit domains like WordPress developers.
The others are mainly linked to by spam blogs who have scraped the content and used a free hosted service which in this case is WordPress.com.
[Image: WordPress.com links to your site]
Reporting & Removing Spam Blogs

Once you have the URLs of the content scraping blogs, as seen in the screenshot above:

    Fill in this basic form to report spam to WordPress.com
    Fill in this form to report copyright content to WordPress.com
    Use this form to report Blogspot and Blogger.com content which has been scraped.
    Fill in one of these forms to remove content from Google

Google Alerts

It’s very easy to set up a Google alert to find your post titles when they get scraped.
If you’ve set up the WordPress SEO plugin correctly, you should have included your site title at the end of all your post titles.
Then all you need to do is set up a Google alert for your site title and you’ll be notified every time a scraper links to your content.

Link Notifications

You may also receive a pingback or trackback if you have this feature enabled in your discussion settings.

[Image: link notification example]
RSS Feed Links


Most content scrapers use automated software to scrape the content from RSS feeds.
Make sure you configure your Reading settings so only a summary is displayed.
[Image: Reading Settings – feed summary option]

Next step is to configure the settings in Yoast’s SEO plugin so links back to your site are included in all RSS feed post summaries.

[Image: RSS feed links settings]

This will help search engines identify you and your domain as the original author of the content.
There are other services like Copyscape and DMCA which can help you protect your site’s content if you’re prepared to pay a premium.
That’s it folks.
It’s easy to find and get spam sites removed once you know what to do.
Hope you don’t have to deal with this garbage too often.
Ever found out your content has been scraped?
What did you do about it?

Source: http://wpsites.net/blogging/content-scraping-monitoring-and-prevention-tips/

Sunday 30 November 2014

What you have to know before requesting web scraping services?

Before you request web scraping services you have to know what your needs are (what data you need, how it should be structured and where it can be found).

Step 1: Define what data you need
Data needs depend on your purpose: if you want to find new customers, you probably need contact data for players in your industry; if you want to study your competitors, you first need to define who they are. Only after that can you select the data sources (websites, feeds or other electronic sources) for this extraction.

In many cases, search engines like Google, Bing and Yahoo are used to discover and define data sources.

Step 2: Structure of the data
The data structure is directly linked to the usage purpose. In many cases it is a table in which each row represents an entity and each cell of that row represents a property of this entity. In other cases the structure is a chart or another graphic representation built with the data extracted from a web source.
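
As a small illustration of the "row = entity, cell = property" idea, a scraped result set might end up looking like the hypothetical table below (sketched here in R; all names and values are invented for the example):

# Hypothetical example: each row is one entity (a company), each cell one property of it.
competitors <- data.frame(
  name    = c("Acme Ltd", "Globex Corp"),
  website = c("http://acme.example", "http://globex.example"),
  phone   = c("+1 555 0100", "+1 555 0199"),
  stringsAsFactors = FALSE
)
competitors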

Step 3: Frequency of data extraction

In many cases a one-time data extraction is enough. In other cases, when you need a regular report, periodic extractions are required.

If you have defined all of the above points, you are ready to request a quote and a cost estimate through this contact form.

Source: http://thewebminer.com/blog/2013/08/

Thursday 27 November 2014

Webscraping using readLines and RCurl

There is a massive amount of data available on the web. Some of it is in the form of precompiled, downloadable datasets which are easy to access. But the majority of online data exists as web content such as blogs, news stories and cooking recipes. With precompiled files, accessing the data is fairly straightforward; just download the file, unzip if necessary, and import into R. For “wild” data however, getting the data into an analyzable format is more difficult. Accessing online data of this sort is sometimes referred to as “webscraping”. Two R facilities, readLines() from the base package and getURL() from the RCurl package make this task possible.

readLines

For basic webscraping tasks the readLines() function will usually suffice. readLines() allows simple access to webpage source data on non-secure servers. In its simplest form, readLines() takes a single argument – the URL of the web page to be read:

web_page <- readLines("http://www.interestingwebsite.com")

As an example of a (somewhat) practical use of webscraping, imagine a scenario in which we wanted to know the 10 most frequent posters to the R-help listserve for January 2009. Because the listserve is on a secure site (i.e. it has https:// rather than http:// in the URL) we can't easily access the live version with readLines(). So for this example, I've posted a local copy of the list archives on this site.

One note: by itself, readLines() can only acquire the data. You'll need to use grep(), gsub() or equivalents to parse the data and keep what you need.

# Get the page's source
web_page <- readLines("http://www.programmingr.com/jan09rlist.html")
# Pull out the appropriate line
author_lines <- web_page[grep("<I>", web_page)]
# Delete unwanted characters in the lines we pulled out
authors <- gsub("<I>", "", author_lines, fixed = TRUE)
# Present only the ten most frequent posters
author_counts <- sort(table(authors), decreasing = TRUE)
author_counts[1:10]
[webscrape results]


We can see that Gabor Grothendieck was the most frequent poster to R-help in January 2009.

The RCurl package

To get more advanced http features such as POST capabilities and https access, you'll need to use the RCurl package. To do webscraping tasks with the RCurl package use the getURL() function. After the data has been acquired via getURL(), it needs to be restructured and parsed. The htmlTreeParse() function from the XML package is tailored for just this task. Using getURL() we can access a secure site so we can use the live site as an example this time.

# Install the RCurl package if necessary
install.packages("RCurl", dependencies = TRUE)
library("RCurl")
# Install the XML package if necessary
install.packages("XML", dependencies = TRUE)
library("XML")
# Get first quarter archives
jan09 <- getURL("https://stat.ethz.ch/pipermail/r-help/2009-January/date.html", ssl.verifypeer = FALSE)
jan09_parsed <- htmlTreeParse(jan09)
# Continue on similar to above
...

For basic webscraping tasks readLines() will be enough and avoids overcomplicating the task. For more difficult procedures, or for tasks requiring other http features, getURL() or other functions from the RCurl package may be required. For more information on cURL visit the project page here.

Source: http://www.r-bloggers.com/webscraping-using-readlines-and-rcurl-2/

Wednesday 26 November 2014

Screen scrapers: To program or to purchase?

Companies today use screen scraping tools for a variety of purposes, including collecting competitive information, capturing product specs, moving data between legacy and new systems, and keeping inventory or price lists accurate.

Because of their popularity and reputation as being extremely efficient tools for quickly gathering applicable display data, screen scraping tools or browser add-ons are a dime a dozen: some free, some low cost, and some part of a larger solution. Alternatively, you can build your own if you are (or know) a programming whiz. Each tool has its potential pros and cons, however, to keep in mind as you determine which type of tool would best fit your business need.

Program-your-own screen scraper

Pros:

    Using in-house resources doesn't require additional budget

Cons:

    Properly creating scripts to automate screen scraping can take a significant amount of time initially, and continues to take time in order to maintain the process. If, for instance, objects from which you're gathering data move on a web page, the entire process will either need to be re-automated, or someone with programming acumen will have to edit the script every time there is a change.

    It's questionable whether or not this method actually saves time and resources

Free or cheap scrapers

Pros:

    Here again, budget doesn't ever enter the picture, and you can drive the process yourself.

    Some tools take care of at least some of the programming heavy lifting required to screen scrape effectively

Cons:

    Many inexpensive screen scrapers require that you get up to speed on their programming language—a time-consuming process that negates the idea of efficiency that prompted the purchase.

Screen scraping as part of a full automation solution

Pros:

    In the amount of time it takes to perform one data extraction task, you have a completely composed script that the system writes for you

    It's the easiest to use out of all of the options

    Screen scraping is only part of the package; you can leverage automation software to automate nearly any task or process including tasks in Windows, Excel automation, IT processes like uploads, backups, and integrations, and business processes like invoice processing.

    You're likely to get buy-in for other automation projects (and visibility for the efficiency you're introducing to the organization) if you pick a solution with a clear and scalable business purpose, not simply a tool to accomplish a single task.

Cons:

    This option has the highest price tag because of its comprehensive capabilities.

Looking for more information?

Here are some options to dig deeper into screen scraping, and deciding on the right tool for you:

 Watch a couple demos of what screen scraping looks like with an automation solution driving the process.

 Read our web data extraction guide for a complete overview.

 Try screen scraping today by downloading a free trial.

Source: https://www.automationanywhere.com/screen-scrapers

Sunday 23 November 2014

Data Mining Outsourcing in a Better and Unique Approach

Data mining outsourcing services are ideal for clarity in various decision making processes.  It is the ultimate goal of any organization and business to increase on its profits as well as strengthen the bond with its customers. Equipping the business in such a way that it’s very easy to detect frauds and manage risks in a convenient manner is equally important. Volumes of data that are irrelevant or cannot be used when raw needs to be converted to a more useful form.  The data mining outsourcing services can greatly help you to analyze and interpret data in a more diligent way.

Entrusting this service to reliable, experienced and qualified hands is very important. Your research or engineering project can be easily and conveniently handled by experienced staff who guarantee you an accuracy level of about 98% and a massive reduction in operating costs. The quality of work is unsurpassed and the presentation is done in a format that is easy and simple for you. The project is completed in a very short time, sparing you delays and ensuring on-time completion of your projects. To enjoy a successful outsourcing experience, you need to bank on well-known and reliable expertise.

The time to rely on data mining outsourcing services is when you do not have reliable, experienced expertise in your business. Statistics indicate that it’s very easy to lose business intelligence or expose customers’ private data through this process. However, companies which offer a secure outsourcing process are on the increase as a result of intense competition. It’s an opportunity to develop the potential of your sourced data and improve your business in all fields.

The potential applications of data mining are infinite, but the major ones are in marketing research and scientific projects. It is done on both large and small quantities of data by experienced staff well known for their analytical procedures, to guarantee you accurate and easy-to-use information. Data mining outsourcing services are the perfect way to profitability.

Source:http://www.e-edge.biz/Data_Mining_Outsourcing_in_a_Better_and_Unique_Approach.html

Wednesday 19 November 2014

Is It Time to End Screen Scraping?

As the industry works to improve the way online banking information is shared with personal financial management apps, a debate is brewing over whether to end the decades-old practice of screen scraping.

Proponents of the popular method say it is a valuable supplement to direct data feeds that may be incomplete or out-of-date. But screen scraping also raises risk concerns, since like other data collection methods it requires consumers to cough up their banking credentials.

"I have not talked to a bank that hasn't confirmed it's a growing problem in their organization," said Jim Routh, the chairman of the products and services committee at Financial Services Information Sharing and Analysis Center.

Financial institutions worry that data aggregators may not take all the appropriate security precautions. According to the FS-ISAC, an industry organization, startups are entering the aggregation market without making security a higher priority.

Routh, who is Aetna's chief information security officer and a former global head of application and mobile security for JPMorgan Chase, said the upstarts do some things well, but "protecting credentials isn't necessarily high on their priorities." The problem is worsened by data aggregators that collect marketing data, such as the device a consumer is using, to understand their behaviors across channels, he said.

The FS-ISAC has proposed creating a standard application programming interface to share information from bank accounts. The API would serve as the conduit for data when consumers wish to use a web or mobile app to receive push bill reminders, to verify their bank accounts or for numerous other PFM use cases.

The proposed API would also be designed to reduce the storage of financial data. But if the industry embraces the model, it would be harder for aggregators to do screen-scraping.

For years, PFM companies have used this tool to obtain customers' banking account information. With consumers' permission, aggregators log in with the customer's user name and password to grab financial data and use it to populate the mobile or web app of the customer's choice — whether or not the bank supports the technique.

Yodlee, which works with more than 300 banks as well as startups, argues that there is a place and a need for aggregators to collect data through various techniques to provide the best customer experience.

Brian Costello, vice president of operations and security at Yodlee, said his company uses a combination of methods to gather customer account data. If it couldn't get data from a direct feed, it could also screen scrape.

If the industry moved to embracing only one data exchange method, Yodlee could be more vulnerable to the problem of receiving outdated information from the banks.

When a bank changes an annual percentage rate, if it doesn't update the data feed it sends to the aggregator right away, the PFM services that rely on that data will appear stale. (Services like Credit Karma, Mint and Wallaby, for example, rely on aggregation technology to recommend financial products to consumers according to price, among other things.)

Proper maintenance of data feeds, of course, takes time and money — resources many banks are short on. But delays could also result from the bankers' dilemma: On the one hand, they want to let customers aggregate their accounts to gather intelligence on their competitors. On the other hand, they may have reservations about their rivals collecting that same data in the battle for wallet share.

"Banks are under tremendous pressure to retain and obtain more clients," said Costello.

Screen scraping also has maintenance requirements, though. The FS-ISAC white paper draft said the approach "requires some coordination from the FI to allow what appears to be an automated attack against their application. To avoid blocking the aggregator's attempt to screen scrape the financial institution's application with this or other current security controls, a whitelist of aggregator IPs are set up and maintained by the FIs."

Like Costello, Marc West, president of digital channels at Fiserv, said a combination of data collection methods is better than a standard data exchange approach that might fail to extract the necessary information. Any data feed, said West, offers a limited set of data and information, while a scrape can enable a custom data extract.

But Aetna's Routh said moving to a real-time API model would improve a recurring issue caused by screen scraping: customer service hiccups. A consumer may call the company behind the personal financial app when a link to an account is broken. The PFM provider might tell him to call the bank, when the problem could lie with the aggregator not knowing of an update to the bank's code.

“The consumer gets in the middle of a customer service issue that is thorny at best and unsolvable at worst,” Routh said. “Unfortunately that happens more frequently than anyone would like it to happen.”

The new model, then, is "inevitable" in Routh's point of view because of the risk and economics involved. "This won't happen overnight," he said. "It needs some legs."

Kristin Moyer, a research vice president in industry advisory services and banking and investment services at Gartner, said she expects more banks to embrace APIs as a way to compete in a digital world.

Already financial institutions like Capital One, Agricole Bank and Fidor Bank are piloting and testing the OAuth specification, which lets banks keep ownership of the customer log-in data but requires them to make available an API. (The FS-ISAC is also promoting OAuth 2.0 as a way to strengthen aggregation security.)

"It's something we will see a lot more of in the next two to three years," said Moyer. "It's an exciting time…I think the use of APIs will enable us as an industry [to do things] that we never really imagined possible before."

LESSONS ABROAD

The move away from screen scraping has already happened in some countries that lack a data exchange standard. Regulators in Poland, for example, recently recommended the practice halt. Responding to the guidance, mBank is one of the banks that changed its aggregation roadmap.

The bank, which spun off from BRE Bank, had been piloting a PFM service with friends and family and has now suspended the pilot. It had, however, already made use of aggregation technology so consumers, who weren't customers of the bank, could get loan decisions from mBank within half an episode of "Modern Family." Indeed, the bank would screen scrape consumers' external bank accounts to make a loan decision within five to 15 minutes. Now, loan decisions have to be made at a branch or for a smaller dollar amount after a consumer sends the bank a copy of an electronic statement.

"Right now we have to put it on the shelf. We haven't killed it. We want to resurrect it," said Michal Panowicz, senior director at mBank.

Overall, he sounds calm about the setback. "This is a regulator decision," said Panowicz. "We have to respect that. …We have to live with them on good footing."

But that doesn't mean it has given up on aggregation. Payday lenders can continue to screen scrape financial data in order to make loan decisions in Poland — which makes it an uneven playing field.

"We will try to convey the logic that [screen scraping] cannot be stopped," said Panowicz.

He views it as a longer term game for something he believes is valuable to consumers. mBank like other banks wants to realize the true aggregation dream: letting customers quickly switch bank accounts and products if they wish.

"To be honest, it's the most exciting part about aggregation... to move accounts to us without spending a minute of physical labor," he said.

Source:http://www.americanbanker.com/news/technology/is-it-time-to-end-screen-scraping-1071118-1.html

Monday 17 November 2014

Data Scraping Guide for SEO & Analytics

Data scraping can help you a lot in competitive analysis as well as pulling out data from your client’s website like extracting the titles, keywords and content categories.

You can quickly get an idea of which keywords are driving traffic to a website, which content categories are attracting links and user engagement, what kind of resources it will take to rank your site... and the list goes on.

 Scraping Organic Search Results

By scraping organic search results you can quickly find out your SEO competitors for a particular search term. You can determine the title tags and the keywords they are targeting.

    The easiest way to scrape organic search results is by using the SERPs Redux bookmarklet.

For example, if you scrape organic listings for the search term ‘seo tools’ using this bookmarklet, you may see the following results:

You can easily copy and paste the websites’ URLs and title tags into your spreadsheet from the text boxes.

    Pro Tip by Tahir Fayyaz:

    Just wanted to add a tip for people using the SERPs Redux bookmarklet.

    If you have data that you want to scrape spread over multiple pages, you can use AutoPager for Firefox or Chrome to load x number of pages all on one page and then scrape it all using the bookmarklet.

Scraping on page elements from a web document

Through this Excel Plugin by Niels Bosma you can fetch several on-page elements from a URL or list of URLs like:

    Title tag
    Meta description tag
    Meta keywords tag
    Meta robots tag
    H1 tag
    H2 tag
    HTTP Header
    Backlinks
    Facebook likes etc.

Scraping data through Google Docs

Google Docs provides a function known as importXML through which you can import data from web documents directly into a Google Docs spreadsheet. However, to use this function you must be familiar with XPath expressions.

    Syntax: =importXML(URL,X-path-query)

    url=> URL of the web page from which you want to import the data.

    x-path-query => A query language used to extract data from web pages.

You need to understand the following things about XPath in order to use the importXML function:

1. XPath terminology – what nodes are and the kinds of nodes, such as element nodes, attribute nodes etc.

2. Relationships between nodes – how different nodes are related to each other, like parent node, child node, siblings etc.

3. Selecting nodes – a node is selected by following a path known as the path expression.

4. Predicates – they are used to find a specific node or a node that contains a specific value. They are always embedded in square brackets.
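
For example, the predicate in the expression below restricts the match to the div element whose id attribute equals 'primary-content' (the same id that appears in the expressions later in this post) and then selects every anchor element inside it:

    //div[@id='primary-content']//a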

If you follow an XPath tutorial then it should not take you more than an hour to understand how XPath expressions work.

Understanding path expressions is easy but building them is not. That is why I use a Firebug extension named ‘X-Pather’ to quickly generate path expressions while browsing HTML and XML documents.

Since X-Pather is a Firebug extension, you first need to install Firebug in order to use it.

 How to scrape data using importXML()

Step-1: Install firebug – Through this add on you can edit & monitor CSS, HTML, and JavaScript while you browse.

Step-2: Install X-pather – Through this tool you can generate path expressions while browsing a web document. You can also evaluate path expressions.

Step-3: Go to the web page whose data you want to scrape. Select the type of element you want to scrape. For example, if you want to scrape anchor text, then select one anchor text.

Step-4: Right click on the selected text and then select ‘show in Xpather’ from the drop down menu.

Then you will see the Xpather browser from where you can copy the X-path.

Here I have selected the text ‘Google Analytics’, which is why the XPather browser is showing ‘Google Analytics’ in the content section. This is my XPath:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[1]/a

Pretty scary, huh? It can be even scarier if you try to build it manually. I want to scrape the names of all the analytics tools from this page: killer seo tools. For this I need to modify the aforesaid path expression into a formula.

This is possible only if I can determine the static and variable nodes between two or more path expressions. So I determined the path expression of another element, ‘Google Analytics Help center’ (second in the list), through X-Pather:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[2]/a

Now we can see that the node which has changed between the original and new path expression is the final ‘li’ element: li[1] to li[2]. So I can come up with the following final path expression:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]//li/a

Now all I have to do is copy-paste this final path expression as an argument to the importXML function in a Google Docs spreadsheet. The function will then extract all the names of the Google Analytics tools from my killer SEO tools page.

This is how you can scrape data using importXML.

    Pro Tip by Niels Bosma: “Anything you can do with importXML in Google docs you can do with XPathOnUrl directly in Excel.”

    To use the XPathOnUrl function you first need to install Niels Bosma’s Excel plugin; it is not a built-in function in Excel.

Note: You can also use a free tool named Scrapy for data scraping. It is an open source web scraping framework and is used to extract structured data from web pages & APIs. You need to know Python (a programming language) in order to use Scrapy.
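
If you work in R rather than Google Docs or Python, the same XPath idea can be sketched with the XML package; note that both the URL and the path expression below are placeholders for illustration, not the exact ones used elsewhere in this post:

# A rough R equivalent of the importXML approach, using the XML package.
# The URL and the XPath are placeholders for illustration only.
library(XML)
doc        <- htmlParse("http://www.example.com/killer-seo-tools/")
tool.names <- xpathSApply(doc, "//ol[2]//li/a", xmlValue)
head(tool.names)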

Scraping on-page elements of an entire website

There are two awesome tools which can help you in scraping on-page elements (title tags, meta descriptions, meta keywords etc) of an entire website. One is the evergreen and free Xenu Link Sleuth and the other is the mighty Screaming Frog SEO Spider.

What makes these tools amazing is that you can scrape the data of an entire website and download it into Excel. So if you want to know the keywords used in the title tags on all the web pages of your competitor’s website, then you know what you need to do.

Note: Save the Xenu data as a tab separated text file and then open the file in Excel.

 Scraping organic and paid keywords of an entire website

The tool that I use for scraping keywords is SEMRush. Through this awesome tool I can determine which organic and paid keywords are driving traffic to my competitor’s website and then download the whole list into Excel for keyword research. You can get more details about this tool through this post: Scaling Keyword Research & Competitive Analysis to new heights

Scraping keywords from a webpage

Through this Excel macro spreadsheet from SEOgadget you can fetch keywords from the text of a URL or list of URLs. However, you need an Alchemy API key to use this macro.

You can get the Alchemy API key from here

Scraping keywords data from Google Adwords API

If you have access to the Google Adwords API then you can install this plugin from the SEOgadget website. This plugin creates a series of functions designed to fetch keywords data from the Google Adwords API, like:

getAdWordAvg()- returns average search volume from the adwords API.

getAdWordStats() – returns local search volume and previous 12 months separated by commas

getAdWordIdeas() – returns keyword suggestions based on API suggest service.

Check out this video to know how this plug-in works

Scraping Google Adwords Ad copies of any website

I use the tool SEMRush to scrape and download the Google Adwords ad copies of my competitors into Excel and then mine keywords or just get ad copy ideas. Go to SEMRush, type the competitor website URL and then click on the ‘Adwords Ad texts’ link on the left hand side menu. Once you see the report you can download it into Excel.

Scraping backlinks of an entire website

The tool that you can use to scrape and download the backlinks of an entire website is Open Site Explorer.

Scraping Outbound links from web pages

Garrett French of Citation Labs has shared an excellent tool, OBL Scraper + Contact Finder, which can scrape outbound links and contact details from a URL or a list of URLs. This tool can help you a lot in link building. Check out this video to learn more about this awesome tool:

Scraper – Google chrome extension

This Chrome extension can scrape data from web pages and export it to Google Docs. The tool is simple to use: select the web page element/node you want to scrape, then right-click on the selected element and choose ‘scrape similar’.

Any element/node that’s similar to what you have selected will be scraped by the tool, and you can later export the results to Google Docs. One big advantage of this tool is that it reduces our dependency on building XPath expressions and makes scraping easier.

See how easy it is to scrape the names and URLs of all the analytics tools without using XPath expressions.

Source: http://www.optimizesmart.com/data-scraping-guide-for-seo/

Saturday 15 November 2014

Screenscraping from Java using jsoup – effective data gathering from websites

In a recent article I discussed screenscraping in what was, in hindsight, a fairly clumsy way (http://technology.amis.nl/blog/12786/building-java-object-graph-with-tour-de-france-results-using-screen-scraping-java-util-parser-and-assorted-facilities). While preparing for a series of articles on data visualizations, I needed statistics regarding the Olympic Games – more specifically: the overall medal count per country during the 2008 Beijing Olympic Games. This information is readily available from dozens of websites. However, I could not find one that offered the data in an easy-to-process XML or CSV format – all websites had human consumers in mind.

With screenscraping, we use a programmatic facility to consume content that is intended to be displayed on screen to human users, and we subsequently process that content by extracting the required data from it. Some web pages are easier to scrape than others – this depends on the richness of the HTML (the poorer the better for scraping), the required interactivity (JavaScript, AJAX – the less the better) and the structure used to present the data (tables, frequently despised by web developers, work rather well).

I came across a tool for screenscraping from Java, called jsoup – http://jsoup.org/. It turned out to be so incredibly easy to use that I thought I should share it.

Getting going with jsoup is as easy as can be:

1. download jsoup-1.6.1.jar (or whatever the latest version is) from http://jsoup.org/download

2. add this jar as a dependency in your project and/or application CLASSPATH

3. make use of jsoup in the code that does the screenscraping.

A simple example of code that uses jsoup (more examples on: http://jsoup.org/cookbook/):

One of the websites offering the overall medal count is http://www.databaseolympics.com/games/gamesyear.htm?g=26. The page looks as follows:

Image

Well, more importantly, the page looks like this:

Image

This means in terms of screenscraping: I will find the medal count for each country inside a TABLE element with style class pt8. Each country has a TR element. Only the first TR element does not represent a country score, as it is the table header. The first TD element in the TR represents the country. The name of the country can be retrieved as the text content from the A element in the TD. The next TD elements contain the numbers of medals in Gold, Silver, Bronze and Total.

The corresponding Java code with jsoup boils down to:

import java.io.IOException;
import java.sql.SQLException;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class OlympicMedalMirrorProcessor {

    // the medal table URL shown above, without the ?g=26 query string
    static final String baseUrl = "http://www.databaseolympics.com/games/gamesyear.htm";

    public static void main(String[] args) throws IOException, SQLException, InterruptedException {

        Document doc = Jsoup.connect(OlympicMedalMirrorProcessor.baseUrl + "?g=26").get();
        String title = doc.title();
        System.out.println(title);

        // the medal counts live in the TABLE element with style class 'pt8'
        Element table = doc.select("table.pt8").get(0);
        Elements trs = table.select("tr");
        Iterator<Element> trIter = trs.iterator();
        boolean firstRow = true;
        while (trIter.hasNext()) {

            Element tr = trIter.next();
            // the first TR is the table header, not a country score
            if (firstRow) {
                firstRow = false;
                continue;
            }
            Elements tds = tr.select("td");
            Iterator<Element> tdIter = tds.iterator();
            int tdCount = 1;
            String country = null;
            Integer gold = null;
            Integer silver = null;
            Integer bronze = null;
            Integer total = null;
            // process one country row: name, then gold/silver/bronze/total counts
            while (tdIter.hasNext()) {

                Element td = tdIter.next();
                switch (tdCount++) {
                case 1:
                    country = td.select("a").text();
                    break;
                case 2:
                    gold = Integer.parseInt(td.text());
                    break;
                case 3:
                    silver = Integer.parseInt(td.text());
                    break;
                case 4:
                    bronze = Integer.parseInt(td.text());
                    break;
                case 5:
                    total = Integer.parseInt(td.text());
                    break;
                }
            }
            System.out.println(country + ": gold " + gold + " silver " + silver +
                               " bronze " + bronze + " total " + total);
        } // table rows
    }
}

Source:http://technology.amis.nl/2011/08/03/screenscraping-from-java-using-jsoup-effective-data-gathering-from-websites/

Thursday 13 November 2014

Scraping Data: Site-specific Extractors vs. Generic Extractors

Scraping is becoming a rather mundane job, with every other organization getting its feet wet with it for its own data gathering needs. Enough crawlers have been built – some open-sourced, others internal to organizations as in-house utilities. Although crawling might seem like a simple technique at the outset, doing it at a large scale is the real deal. You need to have a distributed stack set up to handle huge volumes of data, to provide data in a low-latency model and also to deal with failovers. This is still achievable after crossing the initial tech barrier and via continuous optimizations. (P.S. Not underestimating this part, because it still needs a team of engineers monitoring the stats and scratching their heads at times.)

Social Media Scraping

Focused crawls on a predefined list of sites

However, you enter completely new territory if your goal is to generate clean and usable data sets from these crawls, i.e. “extract” data in a format that your DB can process and that aids in generating insights. There are two ways of tackling this:

a. site-specific extractors which give desired results

b. generic extractors that result in a few surprises

Assuming you still do focused crawls on a predefined list of sites, let’s go over specific scenarios when you have to pick between the two-

1. Mass-scale crawls; high-level metadata - Use generic extractors when you have a large-scale crawling requirement on a continuous basis. Large-scale would mean having to crawl sites in the range of hundreds of thousands. Since the web is a jungle and no two sites share the same template, it would be impossible to write an extractor for each. However, you have to settle for just the document-level information from such crawls (the URL, meta keywords, blog or news titles, author, date and article content), which is still enough information to be happy with if your requirement is analyzing the sentiment of the data.


A generic extractor case

Generic extractors don’t yield accurate results and often mess up the datasets, rendering them unusable. The reason is that programmatically distinguishing relevant data from irrelevant data is a challenge. For example, how would the extractor know to skip pages that contain a list of blog posts and only extract the ones with the complete article? Delineating article content from the title on a blog page is not easy either.
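
To make the distinction concrete, here is a rough sketch of the kind of document-level extraction a generic extractor performs, written in Python with requests and lxml. It is an illustration only, not PromptCloud's implementation:

import requests
import lxml.html as lh

def generic_extract(url):
    # Pull only the document-level fields that most pages expose in the same way
    html = lh.fromstring(requests.get(url, timeout=10).text)
    keywords = html.xpath("//meta[@name='keywords']/@content")
    description = html.xpath("//meta[@name='description']/@content")
    return {
        "url": url,
        "title": html.findtext(".//title"),
        "meta_keywords": keywords[0] if keywords else None,
        "meta_description": description[0] if description else None,
    }

Fields like the article body, author and date can also be pulled generically, but only with heuristics, which is where the inaccuracies described above creep in.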

To summarize, below is what to expect of a generic extractor.

Pros-

minimal manual intervention

low on effort and time

can work on any scale

Cons-

Data quality compromised

inaccurate and incomplete datasets

lesser details suited only for high-level analyses

Suited for gathering- blogs, forums, news

Uses- Sentiment Analysis, Brand Monitoring, Competitor Analysis, Social Media Monitoring.

2. Low/mid-scale crawls; detailed datasets - If precise extraction is the mandate, there’s no getting away from site-specific extractors. But realistically this is doable only if your scope of work is limited, i.e. a few hundred sites or less. Using site-specific extractors, you can extract any number of fields from any nook or corner of the web pages (a rough sketch follows the lists below). Most of the time, most pages on a website share similar templates; if not, they can still be accommodated using site-specific extractors.


Designing extractor for each website

Pros-

High data quality

Better data coverage on the site

Cons-

High on effort and time

Site structures keep changing from time to time and maintaining these requires a lot of monitoring and manual intervention

Only for limited scale

Suited for gathering - any data from any domain on any site be it product specifications and price details, reviews, blogs, forums, directories, ticket inventories, etc.

Uses- Data Analytics for E-commerce, Business Intelligence, Market Research, Sentiment Analysis
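
And for contrast, the site-specific approach described above boils down to a handful of hand-written selectors per site. A rough sketch, where the URL and the XPath expressions are placeholders tailored to one hypothetical site template:

import requests
import lxml.html as lh

def extract_product(url):
    # Placeholder selectors written against one specific (hypothetical) template
    html = lh.fromstring(requests.get(url, timeout=10).text)
    return {
        "name": html.xpath("string(//h1[@class='product-title'])").strip(),
        "price": html.xpath("string(//span[@class='price'])").strip(),
        "specs": [li.text_content().strip()
                  for li in html.xpath("//ul[@class='specs']/li")],
    }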

Conclusion

Quite obviously you need both kinds of extractors handy to take care of the various use cases. The only way generic extractors can work for detailed datasets is if everyone employs standard data formats on the web (read our post on standard data formats here). However, given how far the internet has penetrated the masses and the variety of things folks like to do on the web, this is overly futuristic.

So while site-specific extractors are going to be around for quite some time, the challenge now is to tweak the generic ones to work better. At PromptCloud, we have added ML components to make them smarter and they have been working well for us so far.

What have your challenges been? Do drop in your comments.

Source: https://www.promptcloud.com/blog/scraping-data-site-specific-extractors-vs-generic-extractors/

Monday 10 November 2014

My Experience in Choosing a Web Scraping Service

Recently I decided to outsource a web scraping project to another company. I typed “web scraping service” in Google, chose six services from the first two search result pages and sent the project specifications to all of them to get quotes. Eventually I decided to go another way and did not order the services, but my experience may be useful for others who want to entrust web scraping jobs to third party services.

If you are interested in price comparisons only and not ready to read the whole story just scroll down.

A list of web scraping services I sent my project to:

    www.datahen.com - Canadian web scraping service with nice web design
    webdata-scraping.com - Indian service by Keval Kothari
    www.iwebscraping.com - India based web scraping company (same as www.3idatascraping.com)
    scrapinghub.com - A scraping service founded by creators of Scrapy
    web-scraper.com - Yet another web scraping service
    grepsr.com - A scraping service that we already reviewed two years ago


Sending the request

All the services except scrapinghub.com have quite simple forms for the description of the project requirements. Basically, you just need to give your contact details and a project description in any form. Some of them are pretty (like datahen.com), some of them are more ascetic (like web-scraper.com), but all of them allow you to send your requirements to developers.

Scrapinghub.com has quite a long form, but most of the fields are optional and all the questions are quite natural. If you really know what you need, then it won’t be hard to answer all of them; moreover, they actually help you describe your needs in detail.

Note that in the context of the project I didn’t request a scraper itself; I only asked to receive data on a weekly basis.

Getting responses

Since I sent my request on Sunday it would have been OK not to receive responses the same day, but I got the first response in 3 hours! It was from web-scraper.com and stated that this project would cost me $250 monthly. Simple and clear. Thank you, Thang!

Right after that, I received the second response. This time it was Keval from webdata-scraping.com. He had some questions regarding the project. Then after two days he wrote me that it would be hard to scrape some of my data with the software he uses, and that he would try to use a custom scraper. After that he disappeared… ((

Then on Monday I received Cost & ETA details from datahen.com. It looked quite professional and contained not only the price but also a time estimate. They were ready to create such a scraper in 3-4 days for $249 and then maintain it for just $65/month.

On the same day I received a quote from iwebscraping.com. It was $60 per week. Everything is fine, but I’d like to mention that it wasn’t the last letter from them. After I replied to them (right after receiving the quote), I received a reminder letter from them every other day for about a week. So be ready for aggressive marketing if you ask them for a quote )).

Finally, two days after requesting a quote, I got a response from scrapinghub.com. Paul Tremberth wrote me that they were ready to build a scraper for $1200 and then maintain it for $300/month.

It is interesting that I never received an answer from grepsr.com! Two years ago it was the first web scraping service we came across on the web, but now they simply ignored my request! Or perhaps they didn’t receive it somehow? Anyway, I had no time to investigate.

So what?

Let us put everything together. Out of six web scraping services I received four quotes with the following prices:

Service             Setup fee     Monthly fee

web-scraper.com     -             $250
datahen.com         $249          $65
iwebscraping.com    -             $240
scrapinghub.com     $1200         $300


From this table you can see that scrapinghub.com appears to be the most expensive service among those compared.

EDIT: That $300/month gives you as much support and development as needed to fix a 5M multi-site web crawler, for example. If you need a cheaper solution you can use their Autoscraping tool, which is free, and it would have cost around $2/month to crawl at my requested rates.

The average cost of monthly scraping is about $250, but from a long term perspective datahen.com may save you money due to their low monthly fee.

That’s it! If I had enough money available it would be interesting to compare all these services in operation and provide you a more complete report, but this is all I have for now.

If you have anything to share about your experience in using similar services, please contribute to this post by commenting on it below. Cheers!

Source: http://scraping.pro/choosing-web-scraping-service/

Saturday 8 November 2014

Why People Hesitate To Try Data Mining

What is hindering a number of people from venturing into the promising world of data mining? Despite so much encouragement, promotion, testimonials, and evidence of the benefits of online data collection, still only a handful take up the challenge and really gain the payoffs it has to offer.

It may sound unthinkable that such an opportunity for success has been neglected by many. It may also seem absurd that many well-meaning individuals are kept from enjoying the benefits of the blessings of the 21st century.

The Causes

After considerable observation and analysis of the human psyche, one can understand the underlying reasons behind the hesitance to try the profitable data mining service. The most common reasons why people are afraid to try new technology, or why they remain passive and uninvolved, are fear, lack of knowledge, and pride.

Fear. The most paralyzing of human emotions is fear. It can, to some extent, cause a person to be insane, unprofitable, sick, and lost. Although fear is a normal reaction to certain stimuli and a natural feeling experienced by humans, it must always be monitored and controlled.  Usually, people share common fears, such as: fear of change; fear of anything new; and fear of the unknown.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/people-hesitate-try-data-mining/

Wednesday 5 November 2014

Why Web Scraping is Indispensable

The 21st century has opened the gates to hidden treasures and unlimited access to information globally, without the constraints of time and space, through Internet technology. Along with this development comes the necessity for each business or company to get as much information as possible in order to thrive amid the ever-increasing demand for new innovations, comparisons, and trends.

Web scraping has consequently become an indispensable option for obtaining all the needed data as quickly and efficiently as possible. In this view, data mining appears to be the best and only way to answer the present demand for updates, data, coping, foreknowledge, analysis, and evaluation. Indeed, information has inevitably become a valuable commodity and the most sought-after product among online and offline entrepreneurs.

Need for Data

The increasing need for new data makes it possible for experts to become increasingly creative in accessing information worldwide. The more knowledge one has, the better his or her chances of growing and surviving. There seems to be no other time in human existence when data has been as major a source of revenue as it is today.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-indispensable/

Thursday 11 September 2014

Scraping webdata from a website that loads data in a streaming fashion

I'm trying to scrape some data off of the FEC.gov website using Python for a project of mine. Normally I use Python mechanize and beautifulsoup to do the scraping.

I've been able to figure out most of the issues but can't seem to get around a problem. It seems like the data is streamed into the table and mechanize.Browser() just stops listening.

So here's the issue: if you visit http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A ... you get the first 500 contributors whose last name starts with A and have given money to candidate P80003338 ... however, if you use browser.open() at that URL, all you get is the first ~5 rows.

I'm guessing it's because mechanize isn't letting the page fully load before the .read() is executed. I tried putting a time.sleep(10) between the .open() and .read() but that didn't make much difference.

And I checked, there's no JavaScript or AJAX on the website (or at least none is visible when you use 'view-source'). So I don't think it's a JavaScript issue.

Any thoughts or suggestions? I could use Selenium or something similar but that's something that I'm trying to avoid.

-Will

2 Answers

Why not use an HTML parser like lxml with XPath expressions?

I tried

>>> import lxml.html as lh
>>> data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
>>> name = data.xpath('/html/body/table[2]/tr[5]/td[1]/a/text()')
>>> name
[' AABY, TRYGVE']
>>> name = data.xpath('//table[2]/*/td[1]/a/text()')
>>> len(name)
500
>>> name[499]
' AHMED, ASHFAQ'
>>>



Similarly, you can create an XPath expression of your choice to work with.
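
For example, building on the expressions above, a slightly fuller sketch that walks every data row of that table (the cell layout beyond the name column is an assumption, so treat it as illustrative):

import lxml.html as lh

data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
# tr[5] was the first data row above, so skip the first four rows here
for row in data.xpath('//table[2]/tr')[4:]:
    cells = [td.text_content().strip() for td in row.xpath('td')]
    if cells:
        print(cells)   # the first cell holds the contributor name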


Source: http://stackoverflow.com/questions/9435512/scraping-webdata-from-a-website-that-loads-data-in-a-streaming-fashion

Monday 8 September 2014

How can I circumvent page view limits when scraping web data using Python?

I am using Python to scrape US postal code population data from http://www.city-data.com, through this directory: http://www.city-data.com/zipDir.html. The specific pages I am trying to scrape are individual postal code pages with URLs like this: http://www.city-data.com/zips/01001.html. All of the individual zip code pages I need to access have this same URL format, so my script simply does the following for postal_code in range:

    Creates the URL for a given postal code
    Tries to get a response from the URL
    If (2) succeeds, checks the HTTP status code of that URL
    If the status is 200, retrieves the HTML and scrapes the data into a list
    If the status is not 200, passes and counts the error (not a valid postal code/URL)
    If there is no response from the URL because of an error, passes that postal code and counts the error
    At the end of the script, prints the counter variables and a timestamp

The problem is that I run the script and it works fine for ~500 postal codes, then suddenly stops working and returns repeated timeout errors. My suspicion is that the site's server is limiting the page views coming from my IP address, preventing me from completing the amount of scraping that I need to do (all 100,000 potential postal codes).

My question is as follows: Is there a way to confuse the site's server, for example using a proxy of some kind, so that it will not limit my page views and I can scrape all of the data I need?

Thanks for the help! Here is the code:

##POSTAL CODE POPULATION SCRAPER##

import requests

import re

import datetime

def zip_population_scrape():

    """
    This script will scrape population data for postal codes in range
    from city-data.com.
    """
    postal_code_data = [['zip','population']] #list for storing scraped data

    #Counters for keeping track:
    total_scraped = 0
    total_invalid = 0
    errors = 0


    for postal_code in range(1001,5000):

        #This if statement is necessary because the postal code can't start
        #with 0 in order for the for statement to iterate successfully
        if postal_code <10000:
            postal_code_string = str(0)+str(postal_code)
        else:
            postal_code_string = str(postal_code)

        #all postal code URLs have the same format on this site
        url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'

        #try to get current URL
        try:
            response = requests.get(url, timeout = 5)
            http = response.status_code

            #print current for logging purposes
            print url +" - HTTP:  " + str(http)

            #if valid webpage:
            if http == 200:

                #save html as text
                html = response.text

                #extra print statement for status updates
                print "HTML ready"

                #try to find two substrings in HTML text
                #add the substring in between them to list w/ postal code
                try:           

                    found = re.search('population in 2011:</b> (.*)<br>', html).group(1)

                    #add to # scraped counter
                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])

                    #print statement for logging
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                #if substrings not found, try searching for others
                #and doing the same as above   
                except AttributeError:
                    found = re.search('population in 2010:</b> (.*)<br>', html).group(1)

                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."

            #if http =404, zip is not valid. Add to counter and print log        
            elif http == 404:
                total_invalid +=1

                print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."

            #other http codes: add to error counter and print log
            else:
                errors +=1

                print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."

        #if get url fails by connection error, add to error count & pass
        except requests.exceptions.ConnectionError:
            errors +=1
            print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
            pass

        #if get url fails by timeout error, add to error count & pass
        except requests.exceptions.Timeout:
            errors +=1
            print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."
            pass


    #print final log/counter data, along with timestamp finished
    now= datetime.datetime.now()
    print now.strftime("%Y-%m-%d %H:%M")
    print str(total_scraped) + " total zips scraped."
    print str(total_invalid) + " total unavailable zips."
    print str(errors) + " total errors."



Source: http://stackoverflow.com/questions/25452798/how-can-i-circumvent-page-view-limits-when-scraping-web-data-using-python

Sunday 7 September 2014

Web data scraping (online news comments) with Scrapy (Python)

Since you seem like the try-first ask-question later type (that's a very good thing), I won't give you an answer, but a (very detailed) guide on how to find the answer.

The thing is, unless you are a yahoo developer, you probably don't have access to the source code you're trying to scrape. That is to say, you don't know exactly how the site is built and how your requests to it as a user are being processed on the server-side. You can, however, investigate the client-side and try to emulate it. I like using Chrome Developer Tools for this, but you can use others such as FF firebug.

So first off, we need to figure out what's going on. The way it works is: you click on 'show comments' and it loads the first ten, then you need to keep clicking for the next ten comments each time. Notice, however, that all this clicking isn't taking you to a different link, but fetches the comments live, which is a very neat UI but for our case requires a bit more work. I can tell two things right away:

    They're using javascript to load the comments (because I'm staying on the same page).
    They load them dynamically with AJAX calls each time you click (meaning instead of loading the comments with the page and just showing them to you, with each click it does another request to the database).

Now let's right-click and inspect element on that button. It's actually just a simple span with text:

<span>View Comments (2077)</span>

By looking at that we still don't know how that's generated or what it does when clicked. Fine. Now, keeping the devtools window open, let's click on it. This opened up the first ten. But in fact, a request was being made for us to fetch them. A request that chrome devtools recorded. We look in the network tab of the devtools and see a lot of confusing data. Wait, here's one that makes sense:

http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1

See? _xhr and then get_comments. That makes a lot of sense. Going to that link in the browser gave me a JSON object (looks like a python dictionary) containing all the ten comments which that request fetched. Now that's the request you need to emulate, because that's the one that gives you what you want. First let's translate this to some normal request that a human can read:

go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}
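
Once the parameters are known, that request can be reproduced in a few lines of Python with the requests library. A sketch built directly from the URL and parameters above:

import requests

url = 'http://news.yahoo.com/_xhr/contentcomments/get_comments/'
params = {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}

response = requests.get(url, params=params)
comments = response.json()   # the JSON object containing these ten comments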

Now it's just a matter of trial-and-error. However, a few things to note here:

    Obviously the count is what decides how many comments you're getting. I tried changing it to 100 to see what happens and got a bad request. And it was nice enough to tell me why - "Offset should be multiple of total rows". So now we understand how to use offset.

    The content_id is probably something that identifies the article you are reading. Meaning you need to fetch that from the original page somehow. Try digging around a little, you'll find it.

    Also, you obviously don't want to fetch 10 comments at a time, so it's probably a good idea to find a way to fetch the number of total comments somehow (either find out how the page gets it, or just fetch it from within the article itself).

    Using the devtools you have access to all client-side scripts. So by digging you can find that that link to /get_comments/ is kept within a javascript object named YUI. You can then try to understand how it is making the request, and try to emulate that (though you can probably figure it out yourself).

    You might need to overcome some security measures. For example, you might need a session-key from the original article before you can access the comments. This is used to prevent direct access to some parts of the sites. I won't trouble you with the details, because it doesn't seem like a problem in this case, but you do need to be aware of it in case it shows up.

    Finally, you'll have to parse the JSON object (python has excellent built-in tools for that) and then parse the html comments you are getting (for which you might want to check out BeautifulSoup).

As you can see, this will require some work, but despite all I've written, it's not an extremely complicated task either.

So don't panic.

It's just a matter of digging and digging until you find gold (also, having some basic WEB knowledge doesn't hurt).

Then, if you face a roadblock and really can't go any further, come back here to SO, and ask again. Someone will help you.


Source: http://stackoverflow.com/questions/20218855/web-data-scraping-online-news-comments-with-scrapy-python