Sunday 30 November 2014

What do you need to know before requesting web scraping services?

Before you request web scraping services, you need to know what your needs are: what data you need, how it should be structured, and where it can be found.

Step 1: Define what data you need
Data needs depend on your purpose. If you want to find new customers, you probably need contact data for players in your industry. If you want to study your competitors, you first need to define who they are. Only then can you select the data sources (websites, feeds or other electronic sources) for the extraction.

In many cases, search engines like Google, Bing and Yahoo are used to discover and define data sources.

Step 2: Structure of data
The data structure is directly linked to the usage purpose. In many cases it is a table, where each row represents an entity and each cell of that row represents a property of that entity. In other cases the data structure is a chart or another graphic representation built with data extracted from a web source.
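
For example (an illustrative layout, not taken from the original post), a contact-data table might look like this, with one row per company and one column per property:

Company        Website               Email                  Phone
Acme Ltd       www.acme.example      sales@acme.example     +1 555 0100
Globex Corp    www.globex.example    info@globex.example    +1 555 0101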

Step 3: Frequency of data extraction

In many cases a one-time data extraction is enough. In other cases, when you need a regular report, periodic extractions are required.

If you have defined all of the above points, you are ready to request a quote and a cost estimate through the contact form.

Source: http://thewebminer.com/blog/2013/08/

Thursday 27 November 2014

Webscraping using readLines and RCurl

There is a massive amount of data available on the web. Some of it is in the form of precompiled, downloadable datasets which are easy to access. But the majority of online data exists as web content such as blogs, news stories and cooking recipes. With precompiled files, accessing the data is fairly straightforward; just download the file, unzip if necessary, and import into R. For “wild” data however, getting the data into an analyzable format is more difficult. Accessing online data of this sort is sometimes referred to as “webscraping”. Two R facilities, readLines() from the base package and getURL() from the RCurl package, make this task possible.

readLines

For basic webscraping tasks the readLines() function will usually suffice. readLines() allows simple access to webpage source data on non-secure servers. In its simplest form, readLines() takes a single argument – the URL of the web page to be read:

web_page <- readLines("http://www.interestingwebsite.com")

As an example of a (somewhat) practical use of webscraping, imagine a scenario in which we wanted to know the 10 most frequent posters to the R-help listserve for January 2009. Because the listserve is on a secure site (i.e. it has https:// rather than http:// in the URL) we can't easily access the live version with readLines(). So for this example, I've posted a local copy of the list archives on this site.

One note: by itself, readLines() can only acquire the data. You'll need to use grep(), gsub() or their equivalents to parse the data and keep what you need.

# Get the page's source
web_page <- readLines("http://www.programmingr.com/jan09rlist.html")
# Pull out the appropriate lines
author_lines <- web_page[grep("<I>", web_page)]
# Delete unwanted characters in the lines we pulled out
authors <- gsub("<I>", "", author_lines, fixed = TRUE)
# Present only the ten most frequent posters
author_counts <- sort(table(authors), decreasing = TRUE)
author_counts[1:10]
[webscrape results]


We can see that Gabor Grothendieck was the most frequent poster to R-help in January 2009.

The RCurl package

To get more advanced HTTP features such as POST capabilities and https access, you'll need to use the RCurl package. To do webscraping tasks with the RCurl package, use the getURL() function. After the data has been acquired via getURL(), it needs to be restructured and parsed. The htmlTreeParse() function from the XML package is tailored for just this task. Because getURL() can access a secure site, we can use the live site as an example this time.

# Install the RCurl package if necessary
install.packages("RCurl", dependencies = TRUE)
library("RCurl")
# Install the XML package if necessary
install.packages("XML", dependencies = TRUE)
library("XML")
# Get first quarter archives
jan09 <- getURL("https://stat.ethz.ch/pipermail/r-help/2009-January/date.html", ssl.verifypeer = FALSE)
jan09_parsed <- htmlTreeParse(jan09)
# Continue on similar to above
...
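
As a rough sketch of how the parsing might continue (my own illustration, not part of the original post, and assuming the archive page wraps author names in <I> tags as in the earlier example), you could re-parse the page with internal nodes and query it with XPath:

# Re-parse so that XPath queries can be run against the document
jan09_doc <- htmlTreeParse(jan09, asText = TRUE, useInternalNodes = TRUE)
# Pull the text of every <i> element (assumed here to hold the author names)
author_nodes <- xpathSApply(jan09_doc, "//i", xmlValue)
# Count and display the ten most frequent posters, as before
author_counts <- sort(table(author_nodes), decreasing = TRUE)
author_counts[1:10]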

For basic webscraping tasks readLines() will be enough and avoids overcomplicating the task. For more difficult procedures, or for tasks requiring other HTTP features, getURL() or other functions from the RCurl package may be required. For more information on cURL visit the project page here.

Source: http://www.r-bloggers.com/webscraping-using-readlines-and-rcurl-2/

Wednesday 26 November 2014

Screen scrapers: To program or to purchase?

Companies today use screen scraping tools for a variety of purposes, including collecting competitive information, capturing product specs, moving data between legacy and new systems, and keeping inventory or price lists accurate.

Because of their popularity and reputation as extremely efficient tools for quickly gathering applicable display data, screen scraping tools and browser add-ons are a dime a dozen: some free, some low cost, and some part of a larger solution. Alternatively, you can build your own if you are (or know) a programming whiz. Each approach has potential pros and cons to keep in mind, however, as you determine which type of tool would best fit your business need.

Program-your-own screen scraper

Pros:

    Using in-house resources doesn't require additional budget

Cons:

    Properly creating scripts to automate screen scraping can take a significant amount of time initially, and continues to take time in order to maintain the process. If, for instance, objects from which you're gathering data move on a web page, the entire process will either need to be re-automated, or someone with programming acumen will have to edit the script every time there is a change.

    It's questionable whether or not this method actually saves time and resources

Free or cheap scrapers

Pros:

    Here again, budget doesn't ever enter the picture, and you can drive the process yourself.

    Some tools take care of at least some of the programming heavy lifting required to screen scrape effectively

Cons:

    Many inexpensive screen scrapers require that you get up to speed on their programming language—a time-consuming process that negates the idea of efficiency that prompted the purchase.

Screen scraping as part of a full automation solution

Pros:

    In the amount of time it takes to perform one data extraction task, you have a completely composed script that the system writes for you

    It's the easiest to use out of all of the options

    Screen scraping is only part of the package; you can leverage automation software to automate nearly any task or process including tasks in Windows, Excel automation, IT processes like uploads, backups, and integrations, and business processes like invoice processing.

    You're likely to get buy-in for other automation projects (and visibility for the efficiency you're introducing to the organization) if you pick a solution with a clear and scalable business purpose, not simply a tool to accomplish a single task.

Cons:

    This option has the highest price tag because of its comprehensive capabilities.

Looking for more information?

Here are some options to dig deeper into screen scraping, and deciding on the right tool for you:

 Watch a couple demos of what screen scraping looks like with an automation solution driving the process.

 Read our web data extraction guide for a complete overview.

 Try screen scraping today by downloading a free trial.

Source: https://www.automationanywhere.com/screen-scrapers

Sunday 23 November 2014

Data Mining Outsourcing in a Better and Unique Approach

Data mining outsourcing services are ideal for bringing clarity to various decision-making processes. It is the ultimate goal of any organization and business to increase its profits as well as strengthen the bond with its customers. Equipping the business so that it is easy to detect fraud and manage risks in a convenient manner is equally important. Volumes of data that are irrelevant or cannot be used in raw form need to be converted into a more useful form. Data mining outsourcing services can greatly help you analyze and interpret data in a more diligent way.

Entrusting this service to reliable, experienced and qualified hands is very important. Your research or engineering project can be easily and conveniently handled by experienced staff who guarantee an accuracy level of about 98% and a massive reduction in operating costs. The quality of work is unsurpassed and the results are presented in a format that is easy and simple for you to use. The work is done in a very short time, sparing you delays and ensuring on-time completion of your projects. To enjoy a successful outsourcing experience, you need to bank on well-known and reliable expertise.

The only time to rely on data mining outsourcing services is when you do not have reliable, experienced expertise in your business. Statistics indicate that it is very easy to lose business intelligence or expose customers' privacy through this process. However, companies offering a secure outsourcing process are on the increase as a result of intense competition. It is an opportunity to develop the potential of your sourced data and improve your business in all fields.

The potential applications of data mining are nearly infinite, though the major ones are in marketing research and scientific projects. It is done on both large and small quantities of data by experienced staff well known for their best analytical procedures, to guarantee you accurate and easy-to-use information. Data mining outsourcing services are the only perfect way to profitability.

Source:http://www.e-edge.biz/Data_Mining_Outsourcing_in_a_Better_and_Unique_Approach.html

Wednesday 19 November 2014

Is It Time to End Screen Scraping?

As the industry works to improve the way online banking information is shared with personal financial management apps, a debate is brewing over whether to end the decades-old practice of screen scraping.

Proponents of the popular method say it is a valuable supplement to direct data feeds that may be incomplete or out-of-date. But screen scraping also raises risk concerns, since like other data collection methods it requires consumers to cough up their banking credentials.

"I have not talked to a bank that hasn't confirmed it's a growing problem in their organization," said Jim Routh, the chairman of the products and services committee at Financial Services Information Sharing and Analysis Center.

Financial institutions worry that data aggregators may not take all the appropriate security precautions. According to the FS-ISAC, an industry organization, startups are entering the aggregation market without making security a higher priority.

Routh, who is Aetna's chief information security officer and a former global head of application and mobile security for JPMorgan Chase, said the upstarts do some things well, but "protecting credentials isn't necessarily high on their priorities." The problem is worsened by data aggregators that collect marketing data, such as the device a consumer is using, to understand their behaviors across channels, he said.

The FS-ISAC has proposed creating a standard application programming interface to share information from bank accounts. The API would serve as the conduit for data when consumers wish to use a web or mobile app to receive push bill reminders, to verify their bank accounts or for numerous other PFM use cases.

The proposed API would also be designed to reduce the storage of financial data. But if the industry embraces the model, it would be harder for aggregators to do screen-scraping.

For years, PFM companies have used this tool to obtain customers' banking account information. With consumers' permission, aggregators log in with the customer's user name and password to grab financial data and use it to populate the mobile or web app of the customer's choice — whether or not the bank supports the technique.

Yodlee, which works with more than 300 banks as well as startups, argues that there is a place and a need for aggregators to collect data through various techniques to provide the best customer experience.

Brian Costello, vice president of operations and security at Yodlee, said his company uses a combination of methods to gather customer account data. If it couldn't get data from a direct feed, it could also screen scrape.

If the industry moved to embracing only one data exchange method, Yodlee could be more vulnerable to the problem of receiving outdated information from the banks.

When a bank changes an annual percentage rate, if it doesn't update the data feed it sends to the aggregator right away, the PFM services that rely on that data will appear stale. (Services like Credit Karma, Mint and Wallaby, for example, rely on aggregation technology to recommend financial products to consumers according to price, among other things.)

Proper maintenance of data feeds, of course, takes time and money — resources many banks are short on. But delays could also result from the bankers' dilemma: On the one hand, they want to let customers aggregate their accounts to gather intelligence on their competitors. On the other hand, they may have reservations about their rivals collecting that same data in the battle for wallet share.

"Banks are under tremendous pressure to retain and obtain more clients," said Costello.

Screen scraping also has maintenance requirements, though. The FS-ISAC white paper draft said the approach "requires some coordination from the FI to allow what appears to be an automated attack against their application. To avoid blocking the aggregator's attempt to screen scrape the financial institution's application with this or other current security controls, a whitelist of aggregator IPs are set up and maintained by the FIs."

Like Costello, Marc West, president of digital channels at Fiserv, said a combination of data collection methods is better than a standard data exchange approach that might fail to extract the necessary information. Any data feed, said West, offers a limited set of data and information, while a scrape can enable a custom data extract.

But Aetna's Routh said moving to a real-time API model would improve a recurring issue caused by screen scraping: customer service hiccups. A consumer may call the company behind the personal financial app when a link to an account is broken. The PFM provider might tell him to call the bank, when the problem could lie with the aggregator not knowing of an update to the bank's code.

"The consumer gets in the middle of a customer service issue that is thorny at best and unsolvable at worst," Routh said. "Unfortunately that happens more frequently than anyone would like to it happen.

The new model, then, is "inevitable" in Routh's point of view because of the risk and economics involved. "This won't happen overnight," he said. "It needs some legs."

Kristin Moyer, a research vice president in industry advisory services and banking and investment services at Gartner, said she expects more banks to embrace APIs as a way to compete in a digital world.

Already financial institutions like Capital One, Agricole Bank and Fidor Bank are piloting and testing the OAuth specification, which lets banks keep ownership of the customer log-in data but requires them to make available an API. (The FS-ISAC is also promoting OAuth 2.0 as a way to strengthen aggregation security.)

"It's something we will see a lot more of in the next two to three years," said Moyer. "It's an exciting time…I think the use of APIs will enable us as an industry [to do things] that we never really imagined possible before."

LESSONS ABROAD

The move away from screen scraping has already happened in some countries that lack a data exchange standard. Regulators in Poland, for example, recently recommended the practice halt. Responding to the guidance, mBank is one of the banks that changed its aggregation roadmap.

The bank, which spun off from BRE Bank, had been piloting a PFM service with friends and family and has now suspended the pilot. It had, however, already made use of aggregation technology so consumers, who weren't customers of the bank, could get loan decisions from mBank within half an episode of "Modern Family." Indeed, the bank would screen scrape consumers' external bank accounts to make a loan decision within five to 15 minutes. Now, loan decisions have to be made at a branch or for a smaller dollar amount after a consumer sends the bank a copy of an electronic statement.

"Right now we have to put it on the shelf. We haven't killed it. We want to resurrect it," said Michal Panowicz, senior director at mBank.

Overall, he sounds calm about the setback. "This is a regulator decision," said Panowicz. "We have to respect that. …We have to live with them on good footing."

But that doesn't mean it has given up on aggregation. Payday lenders can continue to screen scrape financial data in order to make loan decisions in Poland — which makes it an uneven playing field.

"We will try to convey the logic that [screen scraping] cannot be stopped," said Panowicz.

He views it as a longer term game for something he believes is valuable to consumers. mBank like other banks wants to realize the true aggregation dream: letting customers quickly switch bank accounts and products if they wish.

"To be honest, it's the most exciting part about aggregation... to move accounts to us without spending a minute of physical labor," he said.

Source:http://www.americanbanker.com/news/technology/is-it-time-to-end-screen-scraping-1071118-1.html

Monday 17 November 2014

Data Scraping Guide for SEO & Analytics

Data scraping can help you a lot in competitive analysis as well as pulling out data from your client’s website like extracting the titles, keywords and content categories.

You can quickly get an idea of which keywords are driving traffic to a website, which content categories are attracting links and user engagement, what kind of resources it will take to rank your site… and the list goes on.

 Scraping Organic Search Results

By scraping organic search results you can quickly find out your SEO competitors for a particular search term. You can determine the title tags and the keywords they are targeting.

    The easiest way to scrape organic search results is by using the SERPs Redux bookmarklet.

For example, if you scrape organic listings for the search term ‘seo tools’ using this bookmarklet, you may see the following results:

You can easily copy-paste the website URLs and title tags into your spreadsheet from the text boxes.

    Pro Tip by Tahir Fayyaz:

    Just wanted to add a tip for people using the SERPs Redux bookmarklet.

    If the data you want to scrape is spread over multiple pages, you can use AutoPager for Firefox or Chrome to load any number of pages on a single page and then scrape them all using the bookmarklet.

Scraping on page elements from a web document

Through this Excel Plugin by Niels Bosma you can fetch several on-page elements from a URL or list of URLs like:

    Title tag
    Meta description tag
    Meta keywords tag
    Meta robots tag
    H1 tag
    H2 tag
    HTTP Header
    Backlinks
    Facebook likes etc.

Scraping data through Google Docs

Google Docs provides a function known as importXML, through which you can import data from web documents directly into a Google Docs spreadsheet. However, to use this function you must be familiar with XPath expressions.

    Syntax: =importXML(URL, X-path-query)

    URL => the URL of the web page from which you want to import the data.

    X-path-query => the XPath expression that selects the data you want from that page.
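
    For example (a hypothetical formula, not from the original post), the following would pull every level-two heading from a page:

    =importXML("http://www.example.com", "//h2")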

You need to understand the following things about XPath in order to use the importXML function:

1. XPath terminology – what nodes are and the different kinds of nodes, such as element nodes and attribute nodes.

2. Relationships between nodes – how different nodes are related to each other: parent nodes, child nodes, siblings and so on.

3. Selecting nodes – a node is selected by following a path known as a path expression.

4. Predicates – used to find a specific node or a node that contains a specific value. They are always enclosed in square brackets.
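
For example (my own illustration, using the same notation as the expressions later in this post), //li[1]/a selects the link inside the first li element of its parent, while //div[@id='primary-content'] selects the div whose id attribute equals 'primary-content'.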

If you follow an XPath tutorial, it should not take you more than an hour to understand how XPath expressions work.

Understanding path expressions is easy, but building them is not. That is why I use a Firebug extension named ‘X-Pather‘ to quickly generate path expressions while browsing HTML and XML documents.

Since X-Pather is a Firebug extension, you first need to install Firebug in order to use it.

 How to scrape data using importXML()

Step-1: Install Firebug – through this add-on you can edit and monitor CSS, HTML and JavaScript while you browse.

Step-2: Install X-Pather – through this tool you can generate path expressions while browsing a web document. You can also evaluate path expressions.

Step-3: Go to the web page whose data you want to scrape. Select the type of element you want to scrape. For example, if you want to scrape anchor text, select one anchor text.

Step-4: Right click on the selected text and then select ‘show in Xpather’ from the drop down menu.

Then you will see the Xpather browser from where you can copy the X-path.

Here I have selected the text ‘Google Analytics’, which is why the X-Pather browser is showing ‘Google Analytics’ in the content section. This is my XPath:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[1]/a

Pretty scary, huh? It can be even scarier if you try to build it manually. I want to scrape the names of all the analytics tools from this page: killer SEO tools. For this I need to turn the aforesaid path expression into a formula.

This is possible only if I can determine the static and variable nodes between two or more path expressions. So I determined the path expression of another element, ‘Google Analytics Help center’ (second in the list), through X-Pather:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[2]/a

Now we can see that the node which has changed between the original and the new path expression is the final ‘li’ element: li[1] to li[2]. So I can come up with the following final path expression:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]//li/a

Now all I have to do is copy-paste this final path expression as an argument to the importXML function in a Google Docs spreadsheet. The function will then extract the names of all the Google Analytics tools from my killer SEO tools page.
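
In spreadsheet terms that boils down to something like the following (the URL is a placeholder here, since the original post links to the page rather than printing its address):

    =importXML("http://www.example.com/killer-seo-tools", "/html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]//li/a")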

This is how you can scrape data using importXML.

    Pro Tip by Niels Bosma: “Anything you can do with importXML in Google docs you can do with XPathOnUrl directly in Excel.”

    To use the XPathOnUrl function you first need to install Niels Bosma’s Excel plugin. It is not a built-in Excel function.

Note: You can also use a free tool named Scrapy for data scraping. It is an open-source web scraping framework used to extract structured data from web pages and APIs. You need to know Python (a programming language) in order to use Scrapy.

Scraping on-page elements of an entire website

There are two awesome tools which can help you in scraping on-page elements (title tags, meta descriptions, meta keywords etc) of an entire website. One is the evergreen and free Xenu Link Sleuth and the other is the mighty Screaming Frog SEO Spider.

What makes these tools amazing is that you can scrape the data of an entire website and download it into Excel. So if you want to know the keywords used in the title tags on all the web pages of your competitor’s website, then you know what you need to do.

Note: Save the Xenu data as a tab separated text file and then open the file in Excel.

 Scraping organic and paid keywords of an entire website

The tool that I use for scraping keywords is SEMRush. Through this awesome tool I can determine which organic and paid keywords are driving traffic to my competitor’s website and then download the whole list into Excel for keyword research. You can get more details about this tool through this post: Scaling Keyword Research & Competitive Analysis to new heights

Scraping keywords from a webpage

Through this Excel macro spreadsheet from seogadget you can fetch keywords from the text of a URL (or list of URLs). However, you need an Alchemy API key to use this macro.

You can get the Alchemy API key from here

Scraping keywords data from Google Adwords API

If you have access to Google Adwords API then you can install this plugin from seogadget website. This plugin creates a series of functions designed to fetch keywords data from the Google Adwords API like:

getAdWordAvg() – returns the average search volume from the AdWords API.

getAdWordStats() – returns the local search volume and the previous 12 months, separated by commas.

getAdWordIdeas() – returns keyword suggestions based on the API's suggest service.

Check out this video to know how this plug-in works

Scraping Google Adwords Ad copies of any website

I use the tool SEMRush to scrape and download the Google Adwords ad copies of my competitors into Excel and then mine keywords or just get ad copy ideas. Go to SEMRush, type the competitor's website URL and then click on the ‘Adwords Ad texts’ link in the left-hand menu. Once you see the report you can download it into Excel.

Scraping back links of an entire website

The tool that you can use to scrape and download the back links of an entire website is: open site explorer

Scraping Outbound links from web pages

Garrett French of citation Labs has shared an excellent tool: OBL Scraper+Contact Finder which can scrape outbound links and contact details from a URL or URL list. This tool can help you a lot in link building. Check out this video to know more about this awesome tool:

Scraper – Google chrome extension

This Chrome extension can scrape data from web pages and export it to Google Docs. The tool is simple to use: select the web page element/node you want to scrape, then right-click on the selected element and select ‘scrape similar’.

Any element/node that is similar to what you have selected will be scraped by the tool, and you can later export the result to Google Docs. One big advantage of this tool is that it reduces our dependency on building XPath expressions and makes scraping easier.

See how easy it is to scrape the names and URLs of all the analytics tools without using XPath expressions.

Source: http://www.optimizesmart.com/data-scraping-guide-for-seo/

Saturday 15 November 2014

Screenscraping from Java using jsoup – effective data gathering from websites

In a recent article I discussed screenscraping in what was, in hindsight, a fairly clumsy way (http://technology.amis.nl/blog/12786/building-java-object-graph-with-tour-de-france-results-using-screen-scraping-java-util-parser-and-assorted-facilities). While preparing for a series of articles on data visualizations, I needed statistics on the Olympic Games – more specifically, the overall medal count per country during the 2008 Beijing Olympic Games. This information is readily available from dozens of websites. However, I could not find one that offered the data in an easy-to-process XML or CSV format – all websites had human consumers in mind.

With screenscraping, we use a programmatic facility to consume content that is intended to be displayed on screen to human users, and subsequently process that content by extracting the required data from it. Some web pages are easier to scrape than others – this depends on the richness of the HTML (the poorer the better for scraping), the required interactivity (JavaScript, AJAX – the less the better) and the structure used to present the data (tables, frequently despised by web developers, work rather well).

I came across a tool for screenscraping from Java, called jsoup – http://jsoup.org/. It turned out to be so incredibly easy to use that I thought I should share it.

Getting going with jsoup is as easy as can be:

1. download jsoup-1.6.1.jar (or whatever the latest version is) from http://jsoup.org/download

2. add this jar as a dependency in your project and/or application CLASSPATH

3. make use of jsoup in the code that does the screenscraping.

A simple example of code that uses jsoup (more examples on: http://jsoup.org/cookbook/):

One of the websites offering the overall medal count is http://www.databaseolympics.com/games/gamesyear.htm?g=26. The page renders the medal table for human visitors; what matters for scraping, however, is how that table appears in the page's HTML source.

In terms of screenscraping this means: I will find the medal count for each country inside a TABLE element with style class pt8. Each country has its own TR element. Only the first TR element does not represent a country score, as it is the table header. The first TD element in each TR represents the country; the name of the country can be retrieved as the text content of the A element in that TD. The next TD elements contain the numbers of medals for Gold, Silver, Bronze and Total.

The corresponding Java code with jsoup boils down to:

import java.io.IOException;
import java.sql.SQLException;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class OlympicMedalMirrorProcessor {

    // medal table page; "?g=26" (appended below) selects the 2008 Games
    private static final String baseUrl = "http://www.databaseolympics.com/games/gamesyear.htm";

    public static void main(String[] args) throws IOException, SQLException, InterruptedException {

        Document doc = Jsoup.connect(OlympicMedalMirrorProcessor.baseUrl + "?g=26").get();
        String title = doc.title();
        System.out.println(title);
        Element table = doc.select("table.pt8").get(0);
        Elements trs = table.select("tr");
        Iterator trIter = trs.iterator();
        boolean firstRow = true;
        while (trIter.hasNext()) {


            Element tr = (Element)trIter.next();
            if (firstRow) {
                firstRow = false;
                continue;
            }
            Elements tds = tr.select("td");
            Iterator tdIter = tds.iterator();
            int tdCount = 1;
            String country = null;
            Integer gold = null;
            Integer silver = null;
            Integer bronze = null;
            Integer total = null;
            // process new line
            while (tdIter.hasNext()) {

                Element td = (Element)tdIter.next();
                switch (tdCount++) {
                case 1:
                    country = td.select("a").text();
                    break;
                case 2:
                    gold = Integer.parseInt(td.text());
                    break;
                case 3:
                    silver = Integer.parseInt(td.text());
                    break;
                case 4:
                    bronze = Integer.parseInt(td.text());
                    break;
                case 5:
                    total = Integer.parseInt(td.text());
                    break;
                }

            }
            System.out.println(country + ": gold " + gold + " silver " + silver + " bronze " + bronze + " total " +
                               total);
        } //table rows
    }
}

Source:http://technology.amis.nl/2011/08/03/screenscraping-from-java-using-jsoup-effective-data-gathering-from-websites/

Thursday 13 November 2014

Scraping Data: Site-specific Extractors vs. Generic Extractors

Scraping is becoming a rather mundane job, with every other organization getting its feet wet with it for its own data gathering needs. Plenty of crawlers have been built – some open-sourced, others internal to organizations for in-house utilities. Although crawling might seem like a simple technique at the outset, doing it at a large scale is the real deal. You need to have a distributed stack set up to handle huge volumes of data, to provide data in a low-latency model and to deal with fail-overs. This is still achievable after crossing the initial tech barrier and via continuous optimizations. (P.S. Not underestimating this part, because it still needs a team of engineers monitoring the stats and scratching their heads at times.)

Social Media Scraping

Focused crawls on a predefined list of sites

However, you land in completely new territory if your goal is to generate clean and usable data sets from these crawls, i.e. to “extract” data in a format that your DB can process and that aids in generating insights. There are two ways of tackling this:

a. site-specific extractors which give desired results

b. generic extractors that result in a few surprises

Assuming you still do focused crawls on a predefined list of sites, let’s go over specific scenarios when you have to pick between the two-

1. Mass-scale crawls; high-level meta data - Use generic extractors when you have a large-scale crawling requirement on a continuous basis. Large-scale would mean having to crawl sites in the range of hundreds of thousands. Since the web is a jungle and no two sites share the same template, it would be impossible to write an extractor for each. However, you have to settle for just the document-level information from such crawls, like the URL, meta keywords, blog or news title, author, date and article content – which is still enough information to be happy with if your requirement is analyzing the sentiment of the data.


A generic extractor case

Generic extractors don’t always yield accurate results and often mess up the datasets, rendering them unusable. The reason is that programmatically distinguishing relevant data from irrelevant data is a challenge. For example, how would the extractor know to skip pages that contain a list of blog posts and only extract the ones with the complete article? Delineating the article content from the title on a blog page is not easy either.

To summarize, below is what to expect of a generic extractor.

Pros-

minimal manual intervention

low on effort and time

can work on any scale

Cons-

Data quality compromised

inaccurate and incomplete datasets

fewer details, suited only for high-level analyses

Suited for gathering- blogs, forums, news

Uses- Sentiment Analysis, Brand Monitoring, Competitor Analysis, Social Media Monitoring.

2. Low/mid-scale crawls; detailed datasets - If precise extraction is the mandate, there’s no getting away from site-specific extractors. But realistically this is doable only if your scope of work is limited, i.e. a few hundred sites or fewer. Using site-specific extractors, you can extract any number of fields from any nook or corner of the web pages. Most of the time, most pages on a website share similar templates; if not, they can still be accommodated with site-specific extractors (a minimal sketch of the two approaches follows the lists below).


Designing extractor for each website

Pros-

High data quality

Better data coverage on the site

Cons-

High on effort and time

Site structures keep changing from time to time and maintaining these requires a lot of monitoring and manual intervention

Only for limited scale

Suited for gathering - any data from any domain on any site be it product specifications and price details, reviews, blogs, forums, directories, ticket inventories, etc.

Uses- Data Analytics for E-commerce, Business Intelligence, Market Research, Sentiment Analysis
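
To make the contrast between the two approaches concrete, here is a minimal sketch (my own illustration, written in R with the XML package used earlier on this blog; the field names and XPath expressions in the site-specific function are hypothetical):

library(XML)

# Generic extraction: only document-level fields that (almost) every page has
generic_extract <- function(url) {
  doc <- htmlParse(url)
  list(
    url           = url,
    title         = xpathSApply(doc, "//title", xmlValue),
    meta_keywords = xpathSApply(doc, "//meta[@name='keywords']", xmlGetAttr, "content")
  )
}

# Site-specific extraction: XPath written against one site's template
# (these expressions are made up for illustration and must be updated
#  whenever the target site changes its markup)
product_extract <- function(url) {
  doc <- htmlParse(url)
  list(
    name  = xpathSApply(doc, "//div[@id='product']/h1", xmlValue),
    price = xpathSApply(doc, "//span[@class='price']", xmlValue),
    specs = xpathSApply(doc, "//table[@class='specs']//td", xmlValue)
  )
}

The generic function scales to any number of sites but stays shallow; the site-specific one pulls exactly the fields you care about, at the cost of selectors that break whenever the template changes.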

Conclusion

Quite obviously you need both kinds of extractors handy to take care of the various use cases. The only way generic extractors can work for detailed datasets is if everyone employs standard data formats on the web (read our post on standard data formats here). However, given the internet's penetration to the masses and the variety of things folks like to do on the web, this is overly futuristic.

So while site-specific extractors are going to be around for quite some time, the challenge now is to tweak the generic ones to work better. At PromptCloud, we have added ML components to make them smarter and they have been working well for us so far.

What have your challenges been? Do drop in your comments.

Source: https://www.promptcloud.com/blog/scraping-data-site-specific-extractors-vs-generic-extractors/

Monday 10 November 2014

My Experience in Choosing a Web Scraping Service

Recently I decided to outsource a web scraping project to another company. I typed “web scraping service” in Google, chose six services from the first two search result pages and sent the project specifications to all of them to get quotes. Eventually I decided to go another way and did not order the services, but my experience may be useful for others who want to entrust web scraping jobs to third party services.

If you are interested in price comparisons only and not ready to read the whole story just scroll down.

A list of web scraping services I sent my project to:

    www.datahen.com - Canadian web scraping service with nice web design
    webdata-scraping.com - Indian service by Keval Kothari
    www.iwebscraping.com - India based web scraping company (same as www.3idatascraping.com)
    scrapinghub.com - A scraping service founded by creators of Scrapy
    web-scraper.com - Yet another web scraping service
    grepsr.com - A scraping service that we already reviewed two years ago


Sending the request

All the services except scrapinghub.com have quite simple forms for the description of the project requirements. Basically, you just need to give your contact details and a project description in any form. Some of them are pretty (like datahen.com), some of them are more ascetic (like web-scraper.com), but all of them allow you to send your requirements to developers.

Scrapinghub.com has quite a long form, but most of the fields are optional and all the questions are quite natural. If you really know what you need, it won’t be hard to answer them; moreover, they actually help you describe your needs in detail.

Note that in the context of the project I didn’t request a scraper itself; I asked to receive data on a weekly basis only.

Getting responses

Since I sent my request on Sunday it would have been OK not to receive responses the same day, but I got the first response in 3 hrs! It was from web-scraper.com and stated that this project would cost me $250 monthly. Simple and clear. Thank you, Thang!

Right after that, I received the second response. This time it was Keval from webdata-scraping.com. He had some questions regarding the project. Then after two days he wrote me that it would be hard to scrape some of my data with the software he uses, and that he would try to use a custom scraper. After that he disappeared… ((

Then on Monday I received cost & ETA details from datahen.com. The quote looked quite professional and contained not only the price but also a time estimate. They were ready to create such a scraper in 3-4 days for $249 and then maintain it for just $65/month.

On the same day I received a quote from iwebscraping.com. It was $60 per week. Everything was fine, but I’d like to mention that it wasn’t the last letter from them. After I replied (right after receiving the quote), I received a reminder letter from them every other day for about a week. So be ready for aggressive marketing if you ask them for a quote )).

Finally, two days after requesting a quote, I got a response from scrapinghub.com. Paul Tremberth wrote me that they were ready to build a scraper for $1200 and then maintain it for $300/month.

Interestingly, I never received an answer from grepsr.com! Two years ago it was the first web scraping service we came across on the web, but now they simply ignored my request! Or perhaps they didn’t receive it somehow? Anyway, I had no time to investigate.

So what?

Let us put everything together. Out of the six web scraping services, I received four quotes with the following prices:

Service             Setup fee    Monthly fee

web-scraper.com     -            $250
datahen.com         $249         $65
iwebscraping.com    -            $240
scrapinghub.com     $1200        $300


From this table you can see that scrapinghub.com appears to be the most expensive service among those compared.

EDIT: That $300/month gives you as much support and development as is needed to fix a 5M multi-site web crawler, for example. If you need a cheaper solution you can use their Autoscraping tool, which is free, and it would have cost around $2/month to crawl at my requested rates.

The average cost of monthly scraping is about $250, but from a long-term perspective datahen.com may save you money due to its low monthly fee.

That’s it! If I had enough money available, it would be interesting to compare all these services in operation and provide you with a more complete report, but this is all I have for now.

If you have anything to share about your experience in using similar services, please contribute to this post by commenting on it below. Cheers!

Source: http://scraping.pro/choosing-web-scraping-service/

Saturday 8 November 2014

Why People Hesitate To Try Data Mining

What is hindering so many people from venturing into the promising world of data mining? Despite so much encouragement, promotion, testimonials, and evidence of the benefits of online data collection, still only a handful take up the challenge and really gain the payoffs it has to offer.

It may sound unthinkable that such an opportunity for success has been neglected by many. It may also seem absurd that many well-meaning individuals are kept from enjoying the blessings of the 21st century.

The Causes

After considerable observation and analysis of the human psyche, one can understand the underlying reasons behind the hesitance to try profitable data mining services. The most common reasons why people are afraid to try new technology, or why they remain passive and uninvolved, are fear, lack of knowledge, and pride.

Fear. The most paralyzing of human emotions is fear. It can, to some extent, cause a person to be insane, unprofitable, sick, and lost. Although fear is a normal reaction to certain stimuli and a natural feeling experienced by humans, it must always be monitored and controlled.  Usually, people share common fears, such as: fear of change; fear of anything new; and fear of the unknown.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/people-hesitate-try-data-mining/

Wednesday 5 November 2014

Why Web Scraping is Indispensable

The 21st century has opened the gates to hidden treasures and unlimited access to information globally, without the constraints of time and space, through Internet technology. Along with this development comes the necessity for each business or company to get as much information as possible in order to thrive amid the ever-increasing demand for new innovations, comparisons, and trends.

Web scraping has consequently become an indispensable way to obtain all the needed data as quickly and efficiently as possible. In this view, data mining appears to be the best and only way to answer the present demand for updates, data, coping, foreknowledge, analysis, and evaluation. Indeed, information has inevitably become a valuable commodity and the most sought-after product among online and offline entrepreneurs.

Need for Data

The increasing need for new data drives experts to become increasingly creative in accessing information worldwide. The more knowledge one has, the better his or her chances of growing and surviving. There seems to be no other time in human existence when data has been as major a source of revenue as it is today.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-indispensable/