Wednesday 28 December 2016

Data Mining - Retrieving Information From Data

Data Mining - Retrieving Information From Data

Data mining definition is the process of retrieving information from data. It has become very important now days because data that is processed is usually kept for future reference and mainly for security purposes in a company. Data transforms is processed into information and it is mostly used in different ways depending on what information one is extracting and from where the person is extracting the information.

It is commonly used in marketing, scientific information and research work, fraud detection and surveillance and many more and most of this work is done using a computer. This definition can come in different terms data snooping, data fishing and data dredging all this refer to data mining but it depends in which department one is. One must know data mining definition so that he can be in a position to make data.

The method of data mining has been there for so many centuries and it is used up to date. There were early methods which were used to identify data mining there are mainly two: regression analysis and bayes theorem. These methods are never used now days because a lot of people have advanced and technology has really changed the entire system.

With the coming up or with the introduction of computers and technology, it becomes very fast and easy to save information. Computers have made work easier and one can be able to expand more knowledge about data crawling and learn on how data is stored and processed through computer science.

Computer science is a course that sharpens one skill and expands more about data crawling and the definition of what data mining means. By studying computer science one can be in a position to know: clustering, support vector machines and decision trees there are some of the units that are found on computer science.

It's all about all this and this knowledge must be applied here. Government institutions, small scale business and supermarkets use data.

The main reason most companies use data mining is because data assist in the collection of information and observations that a company goes through in their daily activity. Such information is very vital in any companies profile and needs to be checked and updated for future reference just in case something happens.

Businesses which use data crawling focus mainly on return of investments, and they are able to know whether they are making a profit or a loss within a very short period. If the company or the business is making a profit they can be in a position to give customers an offer on the product in which they are selling so that the business can be a position to make more profit in an organization, this is very vital in human resource departments it helps in identifying the character traits of a person in terms of job performance.

Most people who use this method believe that is ethically neutral. The way it is being used nowadays raises a lot of questions about security and privacy of its members. Data mining needs good data preparation which can be in a position to uncover different types of information especially those that require privacy.

A very common way in this occurs is through data aggregation.

Data aggregation is when information is retrieved from different sources and is usually put together so that one can be in a position to be analyze one by one and this helps information to be very secure. So if one is collecting data it is vital for one to know the following:

    How will one use the data that he is collecting?
    Who will mine the data and use the data.
    Is the data very secure when am out can someone come and access it.
    How can one update the data when information is needed
    If the computer crashes do I have any backup somewhere.

It is important for one to be very careful with documents which deal with company's personal information so that information cannot easily be manipulated.

source : http://ezinearticles.com/?Data-Mining---Retrieving-Information-From-Data&id=5054887

Friday 16 December 2016

One of the Main Differences Between Statistical Analysis and Data Mining

One of the Main Differences Between Statistical Analysis and Data Mining

Two methods of analyzing data that are common in both academic and commercial fields are statistical analysis and data mining. While statistical analysis has a long scientific history, data mining is a more recent method of data analysis that has arisen from Computer Science. In this article I want to give an introduction to these methods and outline what I believe is one of the main differences between the two fields of analysis.

Statistical analysis commonly involves an analyst formulating a hypothesis and then testing the validity of this hypothesis by running statistical tests on data that may have been collected for the purpose. For example, if an analyst was studying the relationship between income level and the ability to get a loan, the analyst may hypothesis that there will be a correlation between income level and the amount of credit someone may qualify for.

The analyst could then test this hypothesis with the use of a data set that contains a number of people along with their income levels and the credit available to them. A test could be run that indicates for example that there may be a high degree of confidence that there is indeed a correlation between income and available credit. The main point here is that the analyst has formulated a hypothesis and then used a statistical test along with a data set to provide evidence in support or against that hypothesis.

Data mining is another area of data analysis that has arisen more recently from computer science that has a number of differences to traditional statistical analysis. Firstly, many data mining techniques are designed to be applied to very large data sets, while statistical analysis techniques are often designed to form evidence in support or against a hypothesis from a more limited set of data.

Probably the mist significant difference here, however, is that data mining techniques are not used so much to form confidence in a hypothesis, but rather extract unknown relationships may be present in the data set. This is probably best illustrated with an example. Rather than in the above case where a statistician may form a hypothesis between income levels and an applicants ability to get a loan, in data mining, there is not typically an initial hypothesis. A data mining analyst may have a large data set on loans that have been given to people along with demographic information of these people such as their income level, their age, any existing debts they have and if they have ever defaulted on a loan before.

A data mining technique may then search through this large data set and extract a previously unknown relationship between income levels, peoples existing debt and their ability to get a loan.

While there are quite a few differences between statistical analysis and data mining, I believe this difference is at the heart of the issue. A lot of statistical analysis is about analyzing data to either form confidence for or against a stated hypothesis while data mining is often more about applying an algorithm to a data set to extract previously unforeseen relationships.

Source:http://ezinearticles.com/?One-of-the-Main-Differences-Between-Statistical-Analysis-and-Data-Mining&id=4578250

Wednesday 7 December 2016

Data Mining vs Screen-Scraping

Data Mining vs Screen-Scraping

Data mining isn't screen-scraping. I know that some people in the room may disagree with that statement, but they're actually two almost completely different concepts.

In a nutshell, you might state it this way: screen-scraping allows you to get information, where data mining allows you to analyze information. That's a pretty big simplification, so I'll elaborate a bit.

The term "screen-scraping" comes from the old mainframe terminal days where people worked on computers with green and black screens containing only text. Screen-scraping was used to extract characters from the screens so that they could be analyzed. Fast-forwarding to the web world of today, screen-scraping now most commonly refers to extracting information from web sites. That is, computer programs can "crawl" or "spider" through web sites, pulling out data. People often do this to build things like comparison shopping engines, archive web pages, or simply download text to a spreadsheet so that it can be filtered and analyzed.

Data mining, on the other hand, is defined by Wikipedia as the "practice of automatically searching large stores of data for patterns." In other words, you already have the data, and you're now analyzing it to learn useful things about it. Data mining often involves lots of complex algorithms based on statistical methods. It has nothing to do with how you got the data in the first place. In data mining you only care about analyzing what's already there.

The difficulty is that people who don't know the term "screen-scraping" will try Googling for anything that resembles it. We include a number of these terms on our web site to help such folks; for example, we created pages entitled Text Data Mining, Automated Data Collection, Web Site Data Extraction, and even Web Site Ripper (I suppose "scraping" is sort of like "ripping"). So it presents a bit of a problem-we don't necessarily want to perpetuate a misconception (i.e., screen-scraping = data mining), but we also have to use terminology that people will actually use.

Source: http://ezinearticles.com/?Data-Mining-vs-Screen-Scraping&id=146813

Saturday 3 December 2016

An Easy Way For Data Extraction

An Easy Way For Data Extraction

There are so many data scraping tools are available in internet. With these tools you can you download large amount of data without any stress. From the past decade, the internet revolution has made the entire world as an information center. You can obtain any type of information from the internet. However, if you want any particular information on one task, you need search more websites. If you are interested in download all the information from the websites, you need to copy the information and pate in your documents. It seems a little bit hectic work for everyone. With these scraping tools, you can save your time, money and it reduces manual work.

The Web data extraction tool will extract the data from the HTML pages of the different websites and compares the data. Every day, there are so many websites are hosting in internet. It is not possible to see all the websites in a single day. With these data mining tool, you are able to view all the web pages in internet. If you are using a wide range of applications, these scraping tools are very much useful to you.

The data extraction software tool is used to compare the structured data in internet. There are so many search engines in internet will help you to find a website on a particular issue. The data in different sites is appears in different styles. This scraping expert will help you to compare the date in different site and structures the data for records.

And the web crawler software tool is used to index the web pages in the internet; it will move the data from internet to your hard disk. With this work, you can browse the internet much faster when connected. And the important use of this tool is if you are trying to download the data from internet in off peak hours. It will take a lot of time to download. However, with this tool you can download any data from internet at fast rate.There is another tool for business person is called email extractor. With this toll, you can easily target the customers email addresses. You can send advertisement for your product to the targeted customers at any time. This the best tool to find the database of the customers.

However, there are some more scraping tolls are available in internet. And also some of esteemed websites are providing the information about these tools. You download these tools by paying a nominal amount.

Source: http://ezinearticles.com/?An-Easy-Way-For-Data-Extraction&id=3517104

Friday 18 November 2016

How Xpath Plays Vital Role In Web Scraping Part 2

How Xpath Plays Vital Role In Web Scraping Part 2

Here is a piece of content on  Xpaths which is the follow up of How Xpath Plays Vital Role In Web Scraping

Let’s dive into a real-world example of scraping amazon website for getting information about deals of the day. Deals of the day in amazon can be found at this URL. So navigate to the amazon (deals of the day) in Firefox and find the XPath selectors. Right click on the deal you like and select “Inspect Element with Firebug”:

If you observe the image below keenly, there you can find the source of the image(deal) and the name of the deal in src, alt attribute’s respectively.

So now let’s write a generic XPath which gathers the name and image source of the product(deal).

  //img[@role=”img”]/@src  ## for image source
  //img[@role=”img”]/@alt   ## for product name

In this post, I’ll show you some tips we found valuable when using XPath in the trenches.

If you have an interest in Python and web scraping, you may have already played with the nice requests library to get the content of pages from the Web. Maybe you have toyed around using Scrapy selector or lxml to make the content extraction easier. Well, now I’m going to show you some tips I found valuable when using XPath in the trenches and we are going to use both lxml and Scrapy selector for HTML parsing.

Avoid using expressions which contains(.//text(), ‘search text’) in your XPath conditions. Use contains(., ‘search text’) instead.

Here is why: the expression .//text() yields a collection of text elements — a node-set(collection of nodes).and when a node-set is converted to a string, which happens when it is passed as argument to a string function like contains() or starts-with(), results in the text for the first element only.

from scrapy import Selector
html_code = “””<a href=”#”>Click here to go to the <strong>Next Page</strong></a>”””
sel = Selector(text=html_code)
xp = lambda x: sel.xpath(x).extract()           # Let’s type this only once
print xp(‘//a//text()’)                                       # Take a peek at the node-set
[u’Click here to go to the ‘, u’Next Page’]   # output of above command
print xp(‘string(//a//text())’)                           # convert it to a string
  [u’Click here to go to the ‘]                           # output of the above command

Let’s do the above one by using lxml then you can implement XPath by both lxml or Scrapy selector as XPath expression is same for both methods.

lxml code:

from lxml import html
html_code = “””<a href=”#”>Click here to go to the <strong>Next Page</strong></a>””” # Parse the text into a tree
parsed_body = html.fromstring(html_code)  # Perform xpaths on the tree
print parsed_body(‘//a//text()’)                      # take a peek at the node-set
[u’Click here to go to the ‘, u’Next Page’]   # output
print parsed_body(‘string(//a//text())’)              # convert it to a string
[u’Click here to go to the ‘]                    # output

A node converted to a string, however, puts together the text of itself plus of all its descendants:

>>> xp(‘//a[1]’)  # selects the first a node
[u'<a href=”#”>Click here to go to the <strong>Next Page</strong></a>’]

>>> xp(‘string(//a[1])’)  # converts it to string
[u’Click here to go to the Next Page’]

Beware of the difference between //node[1] and (//node)[1]//node[1] selects all the nodes occurring first under their respective parents and (//node)[1] selects all the nodes in the document, and then gets only the first of them.

from scrapy import Selector

html_code = “””<ul class=”list”>
<li>1</li>
<li>2</li>
<li>3</li>
</ul>

<ul class=”list”>
<li>4</li>
<li>5</li>
<li>6</li>
</ul>”””

sel = Selector(text=html_code)
xp = lambda x: sel.xpath(x).extract()

xp(“//li[1]”) # get all first LI elements under whatever it is its parent

[u'<li>1</li>’, u'<li>4</li>’]

xp(“(//li)[1]”) # get the first LI element in the whole document

[u'<li>1</li>’]

xp(“//ul/li[1]”)  # get all first LI elements under an UL parent

[u'<li>1</li>’, u'<li>4</li>’]

xp(“(//ul/li)[1]”) # get the first LI element under an UL parent in the document

[u'<li>1</li>’]

Also,

//a[starts-with(@href, ‘#’)][1] gets a collection of the local anchors that occur first under their respective parents and (//a[starts-with(@href, ‘#’)])[1] gets the first local anchor in the document.

When selecting by class, be as specific as necessary.

If you want to select elements by a CSS class, the XPath way to do the same job is the rather verbose:

*[contains(concat(‘ ‘, normalize-space(@class), ‘ ‘), ‘ someclass ‘)]

Let’s cook up some examples:

>>> sel = Selector(text='<p class=”content-author”>Someone</p><p class=”content text-wrap”>Some content</p>’)

>>> xp = lambda x: sel.xpath(x).extract()

BAD: because there are multiple classes in the attribute

>>> xp(“//*[@class=’content’]”)

[]

BAD: gets more content than we need

 >>> xp(“//*[contains(@class,’content’)]”)

     [u'<p class=”content-author”>Someone</p>’,
     u'<p class=”content text-wrap”>Some content</p>’]

GOOD:

>>> xp(“//*[contains(concat(‘ ‘, normalize-space(@class), ‘ ‘), ‘ content ‘)]”)
[u'<p class=”content text-wrap”>Some content</p>’]

And many times, you can just use a CSS selector instead, and even combine the two of them if needed:

ALSO GOOD:

>>> sel.css(“.content”).extract()
[u'<p class=”content text-wrap”>Some content</p>’]

>>> sel.css(‘.content’).xpath(‘@class’).extract()
[u’content text-wrap’]

Learn to use all the different axes.

It is handy to know how to use the axes, you can follow through these examples.

In particular, you should note that following and following-sibling are not the same thing, this is a common source of confusion. The same goes for preceding and preceding-sibling, and also ancestor and parent.

Useful trick to get text content

Here is another XPath trick that you may use to get the interesting text contents: 

//*[not(self::script or self::style)]/text()[normalize-space(.)]

This excludes the content from the script and style tags and also skip whitespace-only text nodes.

Tools & Libraries Used:

Firefox
Firefox inspect element with firebug
Scrapy : 1.1.1
Python : 2.7.12
Requests : 2.11.0

 Have questions? Comment below. Please share if you found this helpful.

Source: http://blog.datahut.co/how-xpath-plays-vital-role-in-web-scraping-part-2/

Thursday 3 November 2016

Data Mining Process - Why Outsource Data Mining Service?

Data Mining Process - Why Outsource Data Mining Service?

Overview of Data Mining and Process:

Data mining is one of the unique techniques for investigating information to extract certain data patterns and decide to outcome of existing requirements. Data mining is widely use in client research, services analysis, market research and so on. It is totally based on mathematical algorithm and analytical skills to drive the desired results from the huge database collection.

Information mining is mostly used by financial analyzer, business and professional organization and also there are many growing area of business that are get maximum advantages of data extract with use of data warehouses in their small to large level of businesses.

Most of functionalities which are used in information collecting process define as under:

* Retrieving Data

* Analyzing Data

* Extracting Data

* Transforming Data

* Loading Data

* Managing Databases

Most of small, medium and large levels of businesses are collect huge amount of data or information for analysis and research to develop business. Such kind of large amount will help and makes it much important whenever information or data required.

Why Outsource Data Online Mining Service?

Outsourcing advantages of data mining services:
o Almost save 60% operating cost
o High quality analysis processes ensuring accuracy levels of almost 99.98%
o Guaranteed risk free outsourcing experience ensured by inflexible information security policies and practices
o Get your project done within a quick turnaround time
o You can measure highly skilled and expertise by taking benefits of Free Trial Program.
o Get the gathered information presented in a simple and easy to access format

Thus, data or information mining is very important part of the web research services and it is most useful process. By outsource data extraction and mining service; you can concentrate on your co relative business and growing fast as you desire.

Outsourcing web research is trusted and well known Internet Market research organization having years of experience in BPO (business process outsourcing) field.

If you want to more information about data mining services and related web research services, then contact us.

Source: http://ezinearticles.com/?Data-Mining-Process---Why-Outsource-Data-Mining-Service?&id=3789102

Tuesday 18 October 2016

Scraping Yelp Data and How to use?

Scraping Yelp Data and How to use?

We get a lot of requests to scrape data from Yelp. These requests come in on a daily basis, sometimes several times a day. At the same time we have not seen a good business case for a commercial project with scraping Yelp.

We have decided to release a simple example Yelp robot which anyone can run on Chrome inside your computer, tune to your own requirements and collect some data. With this robot you can save business contact information like address, postal code, telephone numbers, website addresses etc.  Robot is placed in our Demo space on Web Robots portal for anyone to use, just sign up, find the robot and use it.

How to use it:

    Sign in to our portal here.
    Download our scraping extension from here.
    Find robot named Yelp_us_demo in the dropdown.
    Modify start URL to the first page of your search results. For example: http://www.yelp.com/search?find_desc=Restaurants&find_loc=Arlington,+VA,+USA
    Click Run.
    Let robot finish it’s job and download data from portal.

Some things to consider:

This robot is placed in our Demo space – therefore it is accessible to anyone. Anyone will be able to modify and run it, anyone will be able to download collected data. Robot’s code may be edited by someone else, but you can always restore it from sample code below. Yelp limits number of search results, so do not expect to scrape more results than you would normally see by search.

In case you want to create your own version of such robot, here it’s full code:

// starting URL above must be the first page of search results.
// Example: http://www.yelp.com/search?find_desc=Restaurants&find_loc=Arlington,+VA,+USA

steps.start = function () {

   var rows = [];

   $(".biz-listing-large").each (function (i,v) {
     if ($("h3 a", v).length > 0)
       {
        var row = {};
        row.company = $(".biz-name", v).text().trim();
        row.reviews =$(".review-count", v).text().trim();
        row.companyLink = $(".biz-name", v)[0].href;
        row.location = $(".secondary-attributes address", v).text().trim();
        row.phone = $(".biz-phone", v).text().trim();
        rows.push (row);
      }
   });

   emit ("yelp", rows);
   if ($(".next").length === 1) {
     next ($(".next")[0].href, "start");
   }
 done();
};

Source: https://webrobots.io/scraping-yelp-data/

Wednesday 28 September 2016

Easy Web Scraping using PHP Simple HTML DOM Parser Library

Easy Web Scraping using PHP Simple HTML DOM Parser Library

Web scraping is only way to get data from website when  website don’t provide API to access it’s data. Web scraping involves following steps to get data:

    Make request to web page
    Parse/Extract data that you want to scrape from website.
    Store data for final output (excel, csv,mysql database etc).

Web scraping can be implemented in any language like PHP, Java, .Net, Python and any language that allows to make web request to get web page content (HTML text) in to variable. In this article I will show you how to use Simple HTML DOM PHP library to do web scraping using PHP.
PHP Simple HTML DOM Parser

Simple HTML DOM is a PHP library to parse data from webpages, in short you can use this library to do web scraping using PHP and even store data to MySQL database.  Simple HTML DOM has following features:

    The parser library is written in PHP 5+
    It requires PHP 5+ to run
    Parser supports invalid HTML parsing.
    It allows to select html tags like Jquery way.
    Supports Xpath and CSS path based web extraction
    Provides both the way – Object oriented way and procedure way to write code

Scrape All Links

<?php
include "simple_html_dom.php";

//create object
$html=new simple_html_dom();

//load specific URL
$html->load_file("http://www.google.com");

// This will Find all links
foreach($html->find('a') as $element)
   echo $element->href . '<br>';

?>

Scrape images

<?php
include "simple_html_dom.php";

//create object
$html=new simple_html_dom();

//load specific url
$html->load_file("http://www.google.com");

// This will Find all links
foreach($html->find('img') as $element)
   echo $element->src . '<br>';

?>

This is just little idea how you can do web scraping using PHP.Keep in mind that Xpath can make your job simple and fast. You can find all methods available in SimpleHTMLDom documentation page.

Source: http://webdata-scraping.com/web-scraping-using-php-simple-html-dom-parser-library/

Monday 19 September 2016

Things to take care while doing Web Scraping!!!

Things to take care while doing Web Scraping!!!

In the present day and age, web scraping word becomes most popular in data science. Basically web scraping is extracting the information from the websites using pre-written programs and web scraping scripts. Many organizations have successfully used web site scraping to build relevant and useful database that they use on a daily basis to enhance their business interests. This is the age of the Big Data and web scraping is one of the trending techniques in the data science.

Throughout my journey of learning web scraping and implementing many successful scraping projects, I have come across some great experiences we can learn from.  In this post, I’m going to discuss some of the approaches to take and approaches to avoid while executing web scraping.

User Proxies: Anonymously scraping data from websites

One should not scrape website with a single IP Address. Because when you repeatedly request the web page for web scraping, there is a chance that the remote web server might block your IP address preventing further request to the web page. To overcome this situation, one should scrape websites with the help of proxy servers (anonymous scraping). This will minimize the risk of getting trapped and blacklisted by a website. Use of Proxies to hide your identity (network details) to remote web servers while scraping data. You may also use a VPN instead of proxies to anonymously scrape websites.

Take maximum data and store it.

Do not follow “process the web page as it comes from the remote server”. Instead take all the information and store it to disk. This approach will be useful when your scraping algorithm breaks in the middle. In this case you don’t have to start scraping again. Never download the same content more than once as you are just wasting bandwidth. Try and download all content to disk in one go and then do the processing.

Follow strict rules in parsing:

Check various rules while parsing the information from the web site. For example if you expect a value to be a date then check that it’s really a date. This may greatly improve the quality of information. When you get unexpected data, then the algorithm need to be changed accordingly.

Respect Robots.txt

Robots.txt specifies the set of rules that should be followed by web crawlers and robots. I strongly advise you to consider and adjust your crawler to fully respect robots.txt. Robots.txt contains instructions on the exact pages that you are allowed to crawl, user-agent, and the requisite intervals between page requests. Following to these instructions minimizes the chance of getting blacklisted and banned from website owner.

Use XPath Smartly

XPath is a nice option to select elements of the HTML document more flexibly than CSS Selectors.  Be careful about HTML structure change through page to page so one xpath you made may be failed to extract data on another page due to changes in HTML structure.

Obey Website TOC:

Some websites make it absolutely apparent in their terms and conditions that they are particularly against to web scraping activities on their content. This can make you vulnerable against possible ethical and legal implications.

Test sample scrape and verify the data with actual scrape

Once you are done with web scraping project set up, you need to test it for sometimes. Check the extracted data. If something is not good, find out the cause and make changes accordingly and finally come to a perfect web scraping project.

Source: http://webdata-scraping.com/things-take-care-web-scraping/

Tuesday 6 September 2016

How to Use Microsoft Excel as a Web Scraping Tool

How to Use Microsoft Excel as a Web Scraping Tool

Microsoft Excel is undoubtedly one of the most powerful tools to manage information in a structured form. The immense popularity of Excel is not without reasons. It is like the Swiss army knife of data with its great features and capabilities. Here is how Excel can be used as a basic web scraping tool to extract web data directly into a worksheet. We will be using Excel web queries to make this happen.

Web queries is a feature of Excel which is basically used to fetch data on a web page into the Excel worksheet easily. It can automatically find tables on the webpage and would let you pick the particular table you need data from. Web queries can also be handy in situations where an ODBC connection is impossible to maintain apart from just extracting data from web pages. Let’s see how web queries work and how you can scrape HTML tables off the web using them.
Getting started

We’ll start with a simple Web query to scrape data from the Yahoo! Finance page. This page is particularly easier to scrape and hence is a good fit for learning the method. The page is also pretty straightforward and doesn’t have important information in the form of links or images. Here is the URL we will be using for the tutorial:

http://finance.yahoo.com/q/hp?s=GOOG

To create a new Web query:

1. Select the cell in which you want the data to appear.
2. Click on Data-> From Web
3. The New Web query box will pop up as shown below.

4. Enter the web page URL you need to extract data from in the Address bar and hit the Go button.
5. Click on the yellow-black buttons next to the table you need to extract data from.

6. After selecting the required tables, click on the Import button and you’re done. Excel will now start downloading the content of the selected tables into your worksheet.

Once you have the data scraped into your Excel worksheet, you can do a host of things like creating charts, sorting, formatting etc. to better understand or present the data in a simpler way.
Customizing the query

Once you have created a web query, you have the option to customize it according to your requirements. To do this, access Web query properties by right clicking on a cell with the extracted data. The page you were querying appears again, click on the Options button to the right of the address bar. A new pop up box will be displayed where you can customize how the web query interacts with the target page. The options here lets you change some of the basic things related to web pages like the formatting and redirections.

Apart from this, you can also alter the data range options by right clicking on a random cell with the query results and selecting Data range properties. The data range properties dialog box will pop up where you can make the required changes. You might want to rename the data range to something you can easily recognize like ‘Stock Prices’.

Auto refresh

Auto-refresh is a feature of web queries worth mentioning, and one which makes our Excel web scraper truly powerful. You can make the extracted data to be auto-refreshing so that your Excel worksheet will update the data whenever the source website changes. You can set how often you need the data to be updated from the source web page in data range options menu. The auto refresh feature can be enabled by ticking the box beside ‘Refresh every’ and setting your preferred time interval for updating the data.
Web scraping at scale

Although extracting data using Excel can be a great way to scrape html tables from the web, it is nowhere close to a real web scraping solution. This can prove to be useful if you are collecting data for your college research paper or you are a hobbyist looking for a cheap way to get your hands on some data. If data for business is your need, you will definitely have to depend on a web scraping provider with expertise in dealing with web scraping at scale. Outsourcing the complicated process that web scraping will also give you more room to deal with other things that need extra attention such as marketing your business.

Source: https://www.promptcloud.com/blog/how-to-use-excel-to-scrape-websites

Monday 29 August 2016

How Web Scraping can Help you Detect Weak spots in your Business

How Web Scraping can Help you Detect Weak spots in your Business

Business intelligence is not a new term. Businesses have always been employing experts for analysing the progress, market and industry trends to keep their growth graph going up. Now that we have big data and the tool to gather this data – Web scraping, business intelligence has become even more fruitful. In fact, business intelligence has become a necessary thing to survive now that the competition is fierce in every industry. This is the reason why most enterprises depend on web scraping solutions to gather the data relevant to their businesses. This data is highly insightful and dependable enough to make critical business decisions. Business intelligence from web scraping is definitely a game changer for companies as it can supply relevant and actionable data with minimal effort.

Most businesses have weak spots that are being overlooked or hidden from the plain sight. These weak spots, if left unnoticed can gradually result in the downfall of your company. Here is how you can use data acquired through web scraping to detect weak spots in your business and strengthen them.

Competitor analysis

Many a times, you can find out the flaws in your business by keeping a close watch on your competitors. Competitor analysis is something that we owe to web scraping as the level of competitive intelligence that you can derive from web scraping has never been achievable in the past. With crawling forums and social media sites where your target audience is, you can easily find out if your competitor is leveraging something you have overlooked. Competitor analysis is all about staying updated to each and every action by your competitors, so that you can always be prepared for their next strategic move. If your competitors are doing better than you, this data can be used to make a comparison between your business and theirs which would give you insights on where you lack.

Brand monitoring on Social media

With social media platforms acting like platforms where businesses and customers can interact with each other, the data available on these sites are increasingly becoming relevant to businesses. Any issues in your business operations will also reflect on your customer sentiments. Social media is a goldmine of sentiment data that can help you detect issues within your company. By analysing the posts that mention your brand or product on social media sites, you can identify what department of your company is functioning well and what isn’t.

For example, if you are an Ecommerce portal and many users are complaining about delivery issues from your company on social media, you might want to switch to a better logistics partner who does a better job. The ability to identify such issues at the earliest is extremely important and that’s where web scraping becomes a life saver. With social media scraping, monitoring your brand on social media is easy like never before and the chances of minor issues escalating to bigger ones is almost non-existent. Brand monitoring is extremely crucial if you are a business operating in the online space. Social media scraping solutions are provided by many leading web scraping companies, which totally eliminates the technical complications associated with the process for you.

Finding untapped opportunities

There are always new and untapped markets and opportunities that are relevant to your business. Finding them is not going to be an easy task with manual and outdated methods of research. Web scraping can fill this gap and help you find opportunities that your company can make use of to leverage your reach and progress. Sometimes, targeting the right audience makes all the difference that you’ve been trying to make. By using web crawling to find mentions of your relevant keywords on the web, you can easily stay updated on your niche and fill in to any new untapped markets. Web crawling for keywords is better explained in our previous blog.

Bottom line

It is not a cakewalk to stay ahead in the competition considering how competitive every industry has become in this digital age. It is crucial to find the weak spots and untapped opportunities of your business before someone else does. Of course, you can always use some help from the technology when you need it. Web scraping is clearly the best way to find and gather data that would help you figure these out. With web crawling solutions that can completely take care of this niche process, nothing is stopping you from using the data and insights that the web has in stock for your business.

Source: https://www.promptcloud.com/blog/web-scraping-detect-weak-spots-business

Why Healthcare Companies should look towards Web Scraping

Why Healthcare Companies should look towards Web Scraping

The internet is a massive storehouse of information which is available in the form of text, media and other formats. To be competitive in this modern world, most businesses need access to this storehouse of information. But, all this information is not freely accessible as several websites do not allow you to save the data. This is where the process of Web Scraping comes in handy.

Web scraping is not new—it has been widely used by financial organizations, for detecting fraud; by marketers, for marketing and cross-selling; and by manufacturers for maintenance scheduling and quality control. Web scraping has endless uses for business and personal users. Every business or individual can have his or her own particular need for collecting data. You might want to access data belonging to a particular category from several websites. The different websites belonging to the particular category display information in non-uniform formats. Even if you are surfing a single website, you may not be able to access all the data at one place.

The data may be distributed across multiple pages under various heads. In a market that is vast and evolving rapidly, strategic decision-making demands accurate and thorough data to be analyzed, and on a periodic basis. The process of web scraping can help you mine data from several websites and store it in a single place so that it becomes convenient for you to a alyze the data and deliver results.

In the context of healthcare, web scraping is gaining foothold gradually but qualitatively. Several factors have led to the use of web scraping in healthcare. The voluminous amount of data produced by healthcare industry is too complex to be analyzed by traditional techniques. Web scraping along with data extraction can improve decision-making by determining trends and patterns in huge amounts of intricate data. Such intensive analyses are becoming progressively vital owing to financial pressures that have increased the need for healthcare organizations to arrive at conclusions based on the analysis of financial and clinical data. Furthermore, increasing cases of medical insurance fraud and abuse are encouraging healthcare insurers to resort to web scraping and data extraction techniques.

Healthcare is no longer a sector relying solely on person to person interaction. Healthcare has gone digital in its own way and different stakeholders of this industry such as doctors, nurses, patients and pharmacists are upping their ante technologically to remain in sync with the changing times. In the existing setup, where all choices are data-centric, web scraping in healthcare can impact lives, educate people, and create awareness. As people no more depend only on doctors and pharmacists, web scraping in healthcare can improve lives by offering rational solutions.

To be successful in the healthcare sector, it is important to come up with ways to gather and present information in innovative and informative ways to patients and customers. Web scraping offers a plethora of solutions for the healthcare industry. With web scraping and data extraction solutions, healthcare companies can monitor and gather information as well as track how their healthcare product is being received, used and implemented in different locales. It offers a safer and comprehensive access to data allowing healthcare experts to take the right decisions which ultimately lead to better clinical experience for the patients.

Web scraping not only gives healthcare professionals access to enterprise-wide information but also simplifies the process of data conversion for predictive analysis and reports. Analyzing user reviews in terms of precautions and symptoms for diseases that are incurable till date and are still undergoing medical research for effective treatments, can mitigate the fear in people. Data analysis can be based on data available with patients and is one way of creating awareness among people.

Hence, web scraping can increase the significance of data collection and help doctors make sense of the raw data. With web scraping and data extraction techniques, healthcare insurers can reduce the attempts of frauds, healthcare organizations can focus on better customer relationship management decisions, doctors can identify effective cure and best practices, and patients can get more affordable and better healthcare services.

Web scraping applications in healthcare can have remarkable utility and potential. However, the triumph of web scraping and data extraction techniques in healthcare sector depends on the accessibility to clean healthcare data. For this, it is imperative that the healthcare industry think about how data can be better recorded, stored, primed, and scraped. For instance, healthcare sector can consider standardizing clinical vocabulary and allow sharing of data across organizations to heighten the benefits from healthcare web scraping practices.

Healthcare sector is one of the top sectors where data is multiplying exponentially with time and requires a planned and structured storage of data. Continuous web scraping and data extraction is necessary to gain useful insights for renewing health insurance policies periodically as well as offer affordable and better public health solutions. Web scraping and data extraction together can process the mammoth mounds of healthcare data and transform it into information useful for decision making.

To reduce the gap between various components of healthcare sector-patients, doctors, pharmacies and hospitals, healthcare organizations and websites will have to tap the technology to collect data in all formats and present in a usable form. The healthcare sector needs to overcome the lag in implementing effective web scraping and data extraction techniques as well as intensify their pace of technology adoption. Web scraping can contribute enormously to the healthcare industry and facilitate organizations to methodically collect data and process it to identify inadequacies and best practices that improve patient care and reduce costs.

Source: https://www.promptcloud.com/blog/why-health-care-companies-should-use-web-scraping

How to use Social Media Scraping to be your Competitors’ Nightmare

How to use Social Media Scraping to be your Competitors’ Nightmare

Big data and competitive intelligence have been in the limelight for quite some time now. The almost magical power of big data to help a company make just the right decisions have been talked about a lot. When it comes to big data, the kind of benefits that a business can get totally depends upon the sources they acquire it from. Social media is one of the best sources from where you can get data that helps your business in a multitude of ways. Now that every business is deep rooted on the internet, social media data becomes all the more relevant and crucial. Here is how you can use data scraped from social media sites to get an edge in the competition.

Keeping watch on your competitors

Social media is the best place to watch your competitors’ activity and take counter initiatives to keep up or take over them. If you want to know what your competitors are up to, a social media scraping setup for scraping the posts that mention your competitors’ brand/product names can do the trick. This can also be used to learn a thing or two from their activities on social media so that you can take respective measures to stay ahead of them. For example, you could know if your competitor is running a special promotional offer at the moment and come up with something better than theirs to keep up. This can do wonders if you are in a highly competitive industry like Ecommerce where the competition is intense. If you are not using some help from web scraping technology to keep a close watch on your competitors, you could easily get left over in this fast-paced business scene.

Solving customer issues at the earliest

Customers are vocal about their experience with different products and services on social media sites these days. If you have a customer whose issue was left unsolved, there is a good chance that he/she will take it to the social media to vent the frustration. Watching out for such instances and giving them prompt support should be something you should do if you want to retain these customers and stop them from ruining your brand’s image. By scraping social media sites for posts that mention your product/service, you can easily find out if there are such grievances from customers. This can make sure to an extent that you don’t let unhappy customers stay that way, which eventually hurts your business in the long run. Customers can make or break your company, so using social media scraping to serve the customers better can help you succeed eventually.

Sentiment analysis

Social media data can play a good job at helping you understand user sentiments. With the help of social media scraping, a business can get the big picture about general perception of their brand by their users. This can go a long way since this level of feedback can help you fix unnoticed issues with your company and service quickly. By rectifying them, you can make your brand more appealing to the customers. Sentiment analysis will provide you with the opportunity to transform your business into how customers want it to be. Social media scraping is the one and only way to have access to this user sentiment data which can help you optimize your business for the customers.

Web crawling for social media data

When social media data possess so much value to businesses, it makes sense to look for efficient ways to gather and use this data. Manually scrolling through millions of tweets doesn’t make sense, this is why you should use social media scraping to aggregate the relevant data for your business. Besides, web scraping technologies make it possible to handle huge amounts of data with ease. Since the size of data is huge when it comes to business related requirements, web scraping is the only scalable solution worth considering. To make things even simpler, there are reliable web scraping solutions that offer social media scraping services for brand monitoring.

Bottom line

Since social media has become an integral part of online businesses, the data available on these sites possess immense value to companies in every industry. Social media scraping can be used for brand monitoring and gaining competitive intelligence that can be used to optimize your business model for maximum effectiveness. This will in turn make your company stand out from the competition and the added advantage of insights gained from social media data will help you to take over your competitors.

Source: https://www.promptcloud.com/blog/social-media-scraping-for-competitive-intelligence

Wednesday 17 August 2016

ERP Data Conversions - Best Practices and Steps

ERP Data Conversions - Best Practices and Steps

Every company who has gone through an ERP project has gone through the painful process of getting the data ready for the new system. The process of executing this typically goes through the following steps:

(1) Extract or define

(2) Clean and transform

(3) Load

(4) Validate and verify

This process is typically executed multiple times (2 - 5+ times depending on complexity) through an ERP project to ensure that the good data ends up in the new system. If the data is either incorrect, not well enough cleaned or adjusted or loaded incorrectly in to the new system it can cause serious problems as the new system is launched.

(1) Extract or define

This involves extracting the data from legacy systems, which are to be decommissioned. In some cases the data may not exist in a legacy system, as the old process may be spreadsheet-based and has to be created from scratch. Typically this involves creating some extraction programs or leveraging existing reports to get the data in to a format which can be put in to a spreadsheet or a data management application.

(2) Data cleansing

Once extracted it normally reviewed is for accuracy by the business, supported by the IT team, and/or adjusted if incorrect or in a structure which the new ERP system does not understand. Depending on the level of change and data quality this can represent a significant effort involving many business stakeholders and required to go through multiple cycles.

(3) Load data to new system

As the data gets structured to a format which the receiving ERP system can handle the load programs may also be build to handle certain changes as part of the process of getting the data converted in to the new system. Data is loaded in to interface tables and loaded in to the new system's core master data and transactions tables.

When loading the data in to the new system the inter-dependency of the different data elements is key to consider and validate the cross dependencies. Exceptions are dealt with and go in to lessons learned and to modify extracts, data cleansing or load process in to the next cycle.

(4) Validate and verify

The final phase of the data conversion process is to verify the converted data through extracts, reports or manually to ensure that all the data went in correctly. This may also include both internal and external audit groups and all the key data owners. Part of the testing will also include attempting to transact using the converted data successfully.

The topmost success factors or best practices to execute a successful conversion I would prioritize as follows:

(1) Start the data conversion early enough by assessing the quality of the data. Starting too late can result in either costly project delays or decisions to load garbage and "deal with it later" resulting in an increase in problems as the new system is launched.

(2) Identify and assign data owners and customers (often forgotten) for the different elements. Ensure that not only the data owners sign-off on the data conversions but that also the key users of the data are involved in reviewing the selection criteria's, data cleansing process and load verification.

(3) Run sufficient enough rounds of testing of the data, including not only validating the loads but also transacting with the converted data.

(4) Depending on the complexity, evaluate possible tools beyond spreadsheets and custom programming to help with the data conversion process for cleansing, transformation and load process.

(5) Don't under-estimate the effort in cleansing and validating the converted data.

(6) Define processes and consider other tools to help how the accuracy of the data will be maintained after the system goes live.

Source: http://ezinearticles.com/?ERP-Data-Conversions---Best-Practices-and-Steps&id=7263314

Monday 8 August 2016

Difference between Data Mining and KDD

Difference between Data Mining and KDD

Data, in its raw form, is just a collection of things, where little information might be derived. Together with the development of information discovery methods(Data Mining and KDD), the value of the info is significantly improved.

Data mining is one among the steps of Knowledge Discovery in Databases(KDD) as can be shown by the image below.KDD is a multi-step process that encourages the conversion of data to useful information. Data mining is the pattern extraction phase of KDD. Data mining can take on several types, the option influenced by the desired outcomes.

Knowledge Discovery in Databases Steps
Data Selection

KDD isn’t prepared without human interaction. The choice of subset and the data set requires knowledge of the domain from which the data is to be taken. Removing non-related information elements from the dataset reduces the search space during the data mining phase of KDD. The sample size and structure are established during this point, if the dataset can be assessed employing a testing of the info.
Pre-processing

Databases do contain incorrect or missing data. During the pre-processing phase, the information is cleaned. This warrants the removal of “outliers”, if appropriate; choosing approaches for handling missing data fields; accounting for time sequence information, and applicable normalization of data.
Transformation

Within the transformation phase attempts to reduce the variety of data elements can be assessed while preserving the quality of the info. During this stage, information is organized, changed in one type to some other (i.e. changing nominal to numeric) and new or “derived” attributes are defined.
Data mining

Now the info is subjected to one or several data-mining methods such as regression, group, or clustering. The information mining part of KDD usually requires repeated iterative application of particular data mining methods. Different data-mining techniques or models can be used depending on the expected outcome.
Evaluation

The final step is documentation and interpretation of the outcomes from the previous steps. Steps during this period might consist of returning to a previous step up the KDD approach to help refine the acquired knowledge, or converting the knowledge in to a form clear for the user.In this stage the extracted data patterns are visualized for further reviews.
Conclusion

Data mining is a very crucial step of the KDD process.

For further reading aboud KDD and data mining ,please check this link.

Source: http://nocodewebscraping.com/difference-data-mining-kdd/

Thursday 4 August 2016

What's difference between web scraping and data mining?

What's difference between web scraping and data mining?

Data mining: automatically searching large stores of data for patterns. How you get the data is irrelevant, only how you analyze it. Data mining involves the use of complex statistical algorithms.

Screen/web scraping is a method for extracting textual characters from screens so that they could be analyzed. Commonly, it is used to extract characters from websites (web scraping), though not exclusively. This method for gathering data is direct, either through looking at websites' html code or visual abstraction techniques.

Web scraping could be a source for data mining but it doesn't have to be because your data may not come from the web.

Data Mining can take any source of data and if that process requires data available from the public web then web scraping could be one of the methods to get such data.
You can also perform web scraping. without mining it later.

The reality is that a lot of data today IS on the web and a lot of data mining does use web related data.

Web scraping is getting data from web. Data mining is getting knowledge from data.

Source: https://www.quora.com/Whats-difference-between-web-scraping-and-data-mining

Sunday 31 July 2016

Tips for scraping business directories

Tips for scraping business directories

Are you looking to scrape business directories to generate leads?

Here are a few tips for scraping business directories.

Web scraping is not rocket science. But there are good and bad and worst ways of doing it.

Generating sales qualified leads is always a headache. The old school ways are to buy a list from sites like Data.com. But they are quite expensive.

Scraping business directories can help generate sales qualified leads. The following tips can help you scrape data from business directories efficiently.

1) Choose a good framework to write the web scrapers. This can help save a lot of time and trouble. Python Scrapy is our favourite, but there are other non-pythonic frameworks too.

2) The business directories might be having anti-scraping mechanisms. You have to use IP rotating services to do the scrape. Using IP rotating services, crawl with multiple changing IP addresses which can cover your tracks.

3) Some sites really don’t want you to scrape and they will block the bot. In these cases, you may need to disguise your web scraper as a human being. Browser automation tools like selenium can help you do this.

4) Web sites will update their data quite often. The scraper bot should be able to update the data according to the changes. This is a hard task and you need professional services to do that.

One of the easiest ways to generate leads is to scrape from business directories and use enrich them. We made Leadintel for lead research and enrichment.

Source: http://blog.datahut.co/tips-for-scraping-business-directories/

Tuesday 12 July 2016

Python 3 web-scraping examples with public data

Someone on the NICAR-L listserv asked for advice on the best Python libraries for web scraping. My advice below includes what I did for last spring’s Computational Journalism class, specifically, the Search-Script-Scrape project, which involved 101-web-scraping exercises in Python.

Best Python libraries for web scraping

For the remainder of this post, I assume you’re using Python 3.x, though the code examples will be virtually the same for 2.x. For my class last year, I had everyone install the Anaconda Python distribution, which comes with all the libraries needed to complete the Search-Script-Scrape exercises, including the ones mentioned specifically below:
The best package for general web requests, such as downloading a file or submitting a POST request to a form, is the simply-named requests library (“HTTP for Humans”).

Here’s an overly verbose example:

import requests
base_url = 'http://maps.googleapis.com/maps/api/geocode/json'
my_params = {'address': '100 Broadway, New York, NY, U.S.A',
             'language': 'ca'}
response = requests.get(base_url, params = my_params)
results = response.json()['results']
x_geo = results[0]['geometry']['location']
print(x_geo['lng'], x_geo['lat'])
# -74.01110299999999 40.7079445

For the parsing of HTML and XML, Beautiful Soup 4 seems to be the most frequently recommended. I never got around to using it because it was malfunctioning on my particular installation of Anaconda on OS X.
But I’ve found lxml to be perfectly fine. I believe both lxml and bs4 have similar capabilities – you can even specify lxml to be the parser for bs4. I think bs4 might have a friendlier syntax, but again, I don’t know, as I’ve gotten by with lxml just fine:

import requests
from lxml import html
page = requests.get("http://www.example.com").text
doc = html.fromstring(page)
link = doc.cssselect("a")[0]
print(link.text_content())
# More information...
print(link.attrib['href'])
# http://www.iana.org/domains/example

The standard urllib package also has a lot of useful utilities – I frequently use the methods from urllib.parse. Python 2 also has urllib but the methods are arranged differently.

Here’s an example of using the urljoin method to resolve the relative links on the California state data for high school test scores. The use of os.path.basename is simply for saving the each spreadsheet to your local hard drive:

from os.path import basename
from urllib.parse import urljoin
from lxml import html
import requests
base_url = 'http://www.cde.ca.gov/ds/sp/ai/'
page = requests.get(base_url).text
doc = html.fromstring(page)
hrefs = [a.attrib['href'] for a in doc.cssselect('a')]
xls_hrefs = [href for href in hrefs if 'xls' in href]
for href in xls_hrefs:
  print(href) # e.g. documents/sat02.xls
  url = urljoin(base_url, href)
  with open("/tmp/" + basename(url), 'wb') as f:
    print("Downloading", url)
    # Downloading http://www.cde.ca.gov/ds/sp/ai/documents/sat02.xls
    data = requests.get(url).content
    f.write(data)

And that’s about all you need for the majority of web-scraping work – at least the part that involves reading HTML and downloading files.
Examples of sites to scrape

The 101 scraping exercises didn’t go so great, as I didn’t give enough specifics about what the exact answers should be (e.g. round the numbers? Use complete sentences?) or even where the data files actually were – as it so happens, not everyone Googles things the same way I do. And I should’ve made them do it on a weekly basis, rather than waiting till the end of the quarter to try to cram them in before finals week.

The Github repo lists each exercise with the solution code, the relevant URL, and the number of lines in the solution code.

The exercises run the gamut of simple parsing of static HTML, to inspecting AJAX-heavy sites in which knowledge of the network panel is required to discover the JSON files to grab. In many of these exercises, the HTML-parsing is the trivial part – just a few lines to parse the HTML to dynamically find the URL for the zip or Excel file to download (via requests)…and then 40 to 50 lines of unzipping/reading/filtering to get the answer. That part is beyond what typically considered “web-scraping” and falls more into “data wrangling”.

I didn’t sort the exercises on the list by difficulty, and many of the solutions are not particulary great code. Sometimes I wrote the solution as if I were teaching it to a beginner. But other times I solved the problem using the style in the most randomly bizarre way relative to how I would normally solve it – hey, writing 100+ scrapers gets boring.

But here are a few representative exercises with some explanation:
1. Number of datasets currently listed on data.gov

I think data.gov actually has an API, but this script relies on finding the easiest tag to grab from the front page and extracting the text, i.e. the 186,569 from the text string, "186,569 datasets found". This is obviously not a very robust script, as it will break when data.gov is redesigned. But it serves as a quick and easy HTML-parsing example.
29. Number of days until Texas’s next scheduled execution

Texas’s death penalty site is probably one of the best places to practice web scraping, as the HTML is pretty straightforward on the main landing pages (there are several, for scheduled and past executions, and current inmate roster), which have enough interesting tabular data to collect. But you can make it more complex by traversing the links to collect inmate data, mugshots, and final words. This script just finds the first person on the scheduled list and does some math to print the number of days until the execution (I probably made the datetime handling more convoluted than it needs to be in the provided solution)
3. The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days

The analytics.usa.gov site is a great place to practice AJAX-data scraping. It’s a very simple and robust site, but either you are aware of AJAX and know how to use the network panel (and in this case, locate ie.json, or you will have no clue how to scrape even a single number on this webpage. I think the difference between static HTML and AJAX sites is one of the tougher things to teach novices. But they pretty much have to learn the difference given how many of today’s websites use both static and dynamically-rendered pages.
6. From 2010 to 2013, the change in median cost of health, dental, and vision coverage for California city employees

There’s actually no HTML parsing if you assume the URLs for the data files can be hard coded. So besides the nominal use of the requests library, this ends up being a data-wrangling exercise: download two specific zip files, unzip them, read the CSV files, filter the dictionaries, then do some math.
90. The currently serving U.S. congressmember with the most Twitter followers

Another example with no HTML parsing, but probably the most complicated example. You have to download and parse Sunlight Foundation’s CSV of Congressmember data to get all the Twitter usernames. Then authenticate with Twitter’s API, then perform mulitple batch lookups to get the data for all 500+ of the Congressional Twitter usernames. Then join the sorted result with the actual Congressmember identity. I probably shouldn’t have assigned this one.
HTML is not necessary

I included no-HTML exercises because there are plenty of data programming exercises that don’t have to deal with the specific nitty-gritty of the Web, such as understanding HTTP and/or HTML. It’s not just that a lot of public data has moved to JSON (e.g. the FEC API) – but that much of the best public data is found in bulk CSV and database files. These files can be programmatically fetched with simple usage of the requests library.

It’s not that parsing HTML isn’t a whole boatload of fun – and being able to do so is a useful skill if you want to build websites. But I believe novices have more than enough to learn from in sorting/filtering dictionaries and lists without worrying about learning how a website works.

Besides analytics.usa.gov, the data.usajobs.gov API, which lists federal job openings, is a great one to explore, because its data structure is simple and the site is robust. Here’s a Python exercise with the USAJobs API; and here’s one in Bash.

There’s also the Google Maps geocoding API, which can be hit up for a bit before you run into rate limits, and you get the bonus of teaching geocoding concepts. The NYTimes API requires creating an account, but you not only get good APIs for some political data, but for content data (i.e. articles, bestselling books) that is interesting fodder for journalism-related analysis.

But if you want to scrape HTML, then the Texas death penalty pages are the way to go, because of the simplicity of the HTML and the numerous ways you can traverse the pages and collect interesting data points. Besides the previously mentioned Texas Python scraping exercise, here’s one for Florida’s list of executions. And here’s a Bash exercise that scrapes data from Texas, Florida, and California and does a simple demographic analysis.

If you want more interesting public datasets – most of which require only a minimal of HTML-parsing to fetch – check out the list I talked about in last week’s info session on Stanford’s Computational Journalism Lab.

Source URL :  http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/

Monday 11 July 2016

Python 3 web-scraping examples with public data

Someone on the NICAR-L listserv asked for advice on the best Python libraries for web scraping. My advice below includes what I did for last spring’s Computational Journalism class, specifically, the Search-Script-Scrape project, which involved 101-web-scraping exercises in Python.

Best Python libraries for web scraping

For the remainder of this post, I assume you’re using Python 3.x, though the code examples will be virtually the same for 2.x. For my class last year, I had everyone install the Anaconda Python distribution, which comes with all the libraries needed to complete the Search-Script-Scrape exercises, including the ones mentioned specifically below:
The best package for general web requests, such as downloading a file or submitting a POST request to a form, is the simply-named requests library (“HTTP for Humans”).

Here’s an overly verbose example:

import requests
base_url = 'http://maps.googleapis.com/maps/api/geocode/json'
my_params = {'address': '100 Broadway, New York, NY, U.S.A',
             'language': 'ca'}
response = requests.get(base_url, params = my_params)
results = response.json()['results']
x_geo = results[0]['geometry']['location']
print(x_geo['lng'], x_geo['lat'])
# -74.01110299999999 40.7079445

For the parsing of HTML and XML, Beautiful Soup 4 seems to be the most frequently recommended. I never got around to using it because it was malfunctioning on my particular installation of Anaconda on OS X.
But I’ve found lxml to be perfectly fine. I believe both lxml and bs4 have similar capabilities – you can even specify lxml to be the parser for bs4. I think bs4 might have a friendlier syntax, but again, I don’t know, as I’ve gotten by with lxml just fine:

import requests
from lxml import html
page = requests.get("http://www.example.com").text
doc = html.fromstring(page)
link = doc.cssselect("a")[0]
print(link.text_content())
# More information...
print(link.attrib['href'])
# http://www.iana.org/domains/example

The standard urllib package also has a lot of useful utilities – I frequently use the methods from urllib.parse. Python 2 also has urllib but the methods are arranged differently.

Here’s an example of using the urljoin method to resolve the relative links on the California state data for high school test scores. The use of os.path.basename is simply for saving the each spreadsheet to your local hard drive:

from os.path import basename
from urllib.parse import urljoin
from lxml import html
import requests
base_url = 'http://www.cde.ca.gov/ds/sp/ai/'
page = requests.get(base_url).text
doc = html.fromstring(page)
hrefs = [a.attrib['href'] for a in doc.cssselect('a')]
xls_hrefs = [href for href in hrefs if 'xls' in href]
for href in xls_hrefs:
  print(href) # e.g. documents/sat02.xls
  url = urljoin(base_url, href)
  with open("/tmp/" + basename(url), 'wb') as f:
    print("Downloading", url)
    # Downloading http://www.cde.ca.gov/ds/sp/ai/documents/sat02.xls
    data = requests.get(url).content
    f.write(data)

And that’s about all you need for the majority of web-scraping work – at least the part that involves reading HTML and downloading files.
Examples of sites to scrape

The 101 scraping exercises didn’t go so great, as I didn’t give enough specifics about what the exact answers should be (e.g. round the numbers? Use complete sentences?) or even where the data files actually were – as it so happens, not everyone Googles things the same way I do. And I should’ve made them do it on a weekly basis, rather than waiting till the end of the quarter to try to cram them in before finals week.

The Github repo lists each exercise with the solution code, the relevant URL, and the number of lines in the solution code.

The exercises run the gamut of simple parsing of static HTML, to inspecting AJAX-heavy sites in which knowledge of the network panel is required to discover the JSON files to grab. In many of these exercises, the HTML-parsing is the trivial part – just a few lines to parse the HTML to dynamically find the URL for the zip or Excel file to download (via requests)…and then 40 to 50 lines of unzipping/reading/filtering to get the answer. That part is beyond what typically considered “web-scraping” and falls more into “data wrangling”.

I didn’t sort the exercises on the list by difficulty, and many of the solutions are not particulary great code. Sometimes I wrote the solution as if I were teaching it to a beginner. But other times I solved the problem using the style in the most randomly bizarre way relative to how I would normally solve it – hey, writing 100+ scrapers gets boring.

But here are a few representative exercises with some explanation:
1. Number of datasets currently listed on data.gov

I think data.gov actually has an API, but this script relies on finding the easiest tag to grab from the front page and extracting the text, i.e. the 186,569 from the text string, "186,569 datasets found". This is obviously not a very robust script, as it will break when data.gov is redesigned. But it serves as a quick and easy HTML-parsing example.
29. Number of days until Texas’s next scheduled execution

Texas’s death penalty site is probably one of the best places to practice web scraping, as the HTML is pretty straightforward on the main landing pages (there are several, for scheduled and past executions, and current inmate roster), which have enough interesting tabular data to collect. But you can make it more complex by traversing the links to collect inmate data, mugshots, and final words. This script just finds the first person on the scheduled list and does some math to print the number of days until the execution (I probably made the datetime handling more convoluted than it needs to be in the provided solution)
3. The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days

The analytics.usa.gov site is a great place to practice AJAX-data scraping. It’s a very simple and robust site, but either you are aware of AJAX and know how to use the network panel (and in this case, locate ie.json, or you will have no clue how to scrape even a single number on this webpage. I think the difference between static HTML and AJAX sites is one of the tougher things to teach novices. But they pretty much have to learn the difference given how many of today’s websites use both static and dynamically-rendered pages.
6. From 2010 to 2013, the change in median cost of health, dental, and vision coverage for California city employees

There’s actually no HTML parsing if you assume the URLs for the data files can be hard coded. So besides the nominal use of the requests library, this ends up being a data-wrangling exercise: download two specific zip files, unzip them, read the CSV files, filter the dictionaries, then do some math.
90. The currently serving U.S. congressmember with the most Twitter followers

Another example with no HTML parsing, but probably the most complicated example. You have to download and parse Sunlight Foundation’s CSV of Congressmember data to get all the Twitter usernames. Then authenticate with Twitter’s API, then perform mulitple batch lookups to get the data for all 500+ of the Congressional Twitter usernames. Then join the sorted result with the actual Congressmember identity. I probably shouldn’t have assigned this one.
HTML is not necessary

I included no-HTML exercises because there are plenty of data programming exercises that don’t have to deal with the specific nitty-gritty of the Web, such as understanding HTTP and/or HTML. It’s not just that a lot of public data has moved to JSON (e.g. the FEC API) – but that much of the best public data is found in bulk CSV and database files. These files can be programmatically fetched with simple usage of the requests library.

It’s not that parsing HTML isn’t a whole boatload of fun – and being able to do so is a useful skill if you want to build websites. But I believe novices have more than enough to learn from in sorting/filtering dictionaries and lists without worrying about learning how a website works.

Besides analytics.usa.gov, the data.usajobs.gov API, which lists federal job openings, is a great one to explore, because its data structure is simple and the site is robust. Here’s a Python exercise with the USAJobs API; and here’s one in Bash.

There’s also the Google Maps geocoding API, which can be hit up for a bit before you run into rate limits, and you get the bonus of teaching geocoding concepts. The NYTimes API requires creating an account, but you not only get good APIs for some political data, but for content data (i.e. articles, bestselling books) that is interesting fodder for journalism-related analysis.

But if you want to scrape HTML, then the Texas death penalty pages are the way to go, because of the simplicity of the HTML and the numerous ways you can traverse the pages and collect interesting data points. Besides the previously mentioned Texas Python scraping exercise, here’s one for Florida’s list of executions. And here’s a Bash exercise that scrapes data from Texas, Florida, and California and does a simple demographic analysis.

If you want more interesting public datasets – most of which require only a minimal of HTML-parsing to fetch – check out the list I talked about in last week’s info session on Stanford’s Computational Journalism Lab.

Source URL :  http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/

Sunday 10 July 2016

Web Data Scraping: Practical Uses

Whether in the form of media, text or data in diverse other formats—the internet serves to be a huge storehouse of the world’s information. While browsing for commercial or business needs alike, users are exposed to numerous web pages that contain data in just about every form. Even though access to such data is extremely critical for garnering success in the contemporary world, unfortunately most of it is not open. More often than not, business websites restrict the accessibility options to such data and do not allow visitors to save or display them for reuse on their local storage devices, or onto their own websites.  This is where web data extraction tools come in handy.

Read on for a closer look into some of the common areas of data scraping usage.

• Gathering of data from diverse sources for analysis: In case a business necessitates the collection and analysis of data specific to certain categories from multiple websites, then it helps refer to web data integration experts or those related to the field of data scraping linked with categories like industrial equipment, real estate, automobiles, marketing, business contacts, electronic gadgets and so forth.

• Collection of data in different formats: Different websites are known to publish information and structured data in different formats. So, it may not be possible for organizations to see all the required data a one place, at any given time. Data scrapers allow the extraction of information spanning across multiple pages under various sections, on to a single database or spreadsheet.  This makes it easy for users to analyze (or visualize) the data.

• Helps Research: Data is an important and integral part of all kinds of research – marketing, academic or scientific. A data scraper helps in gathering structured data with ease.

• Market analysis for businesses: Companies that cater to products or services connected to specific domains require comprehensive data of products and services that are of similar kind, and which have a tendency of appearing in the market on a daily basis.

Web scraping software solutions from reputed companies are successful in keeping a constant watch on this kind of data and allow users to get access required information from diverse sources – all at the click of a button.
Go for data extraction to take your business to the next levels of success – you will not be disappointed.

Source URL : http://www.3idatascraping.com/web-data-scraping-practical-uses.php

Thursday 7 July 2016

Data Scraping - What Are Hand-Scraped Hardwood Floors and What Are the Benefits?

If you love the look of hardwood flooring with lots of character, then you may want to check out hand-scraped hardwood flooring. Hand-scraped wood provides a warm vintage look, providing the floor instant character. These types of scraped hardwoods are suitable for living rooms, dining rooms, hallways and bedrooms. But what exactly is hand-scraped hardwood flooring?

Well, it is literally what you think it is. Hand-scraped hardwood flooring is created by hand using specialized wood working tools to make each board unique and giving an overall "old worn" appearance.

At Innovation Builders we offer solid wood floors finished on site with an actual hand-scraping technique followed by stain and sealer. Solid wood floors are installed by an expert team of technicians who work each board with skilled craftsman-like attention to detail. Following the scraping procedure the floor is stained by hand with a customer selected stain color, and then protected with multiple coats of sealing and finishing polyurethane. This finishing process of staining, sealing and coating the wood floors contributes to providing the look and durability of an old reclaimed wood floor, but with today's tough, urethane finishes.

There are many, many benefits to hand-scraped wood flooring. Overall, these floors are extremely durable and hard wearing, providing years of trouble-free use. These wood floors remain looking newer for longer because the texture that the process provides hides the typical dents, dings and scratches that other floors can't hide so easily. That's great news for households with kids, dogs, and cats.

These types of wood flooring have another unique advantage as well. When you do scratch these floors during their lifetime, the scratches are easily repaired. As long as the scratch isn't too deep you can make them practically disappear without ever having to hire a professional. It's simple to hide the scratch by using a color-matched stain marker or repair kit that is readily available through local flooring distributors. These features make hand-scraped hardwood flooring a lot more durable and hassle-free to maintain than other types of wood flooring.

The expert processes utilized in the creation of these floors provides a custom look of worn wood with deep color and subtle highlights. When the light hits the wood at different times during the day, it provides an understated but powerful effect of depth and beauty. They instantly offer your rooms a rustic look full of character, allowing your home to become a warm and inviting environment. The rustic look of this wood provides a texture, style and rustic appeal that cannot be matched by any other type of flooring.

Hand-Scraped Hardwood Flooring is a floor that says welcome and adds a touch of elegance to any home. If you are looking to buy a new home and you haven't had the opportunity to see or feel hand scraped hardwoods, stop in any of the model homes at Innovation Builders in Keller, North Richland Hills or Grand Prairie, Texas and check it out!

Source URL :   http://yellowpagesdatascraping.blogspot.in/2015/06/data-scraping-what-are-hand-scraped.html

Saturday 18 June 2016

Increasing Accessibility by Scraping Information From PDF

You may have heard about data scraping which is a method that is being used by computer programs in extracting data from an output that comes from another program. To put it simply, this is a process which involves the automatic sorting of information that can be found on different resources including the internet which is inside an html file, PDF or any other documents. In addition to that, there is the collection of pertinent information. These pieces of information will be contained into the databases or spreadsheets so that the users can retrieve them later.

Most of the websites today have text that can be accessed and written easily in the source code. However, there are now other businesses nowadays that choose to make use of Adobe PDF files or Portable Document Format. This is a type of file that can be viewed by simply using the free software known as the Adobe Acrobat. Almost any operating system supports the said software. There are many advantages when you choose to utilize PDF files. Among them is that the document that you have looks exactly the same even if you put it in another computer so that you can view it. Therefore, this makes it ideal for business documents or even specification sheets. Of course there are disadvantages as well. One of which is that the text that is contained in the file is converted into an image. In this case, it is often that you may have problems with this when it comes to the copying and pasting.

This is why there are some that start scraping information from PDF. This is often called PDF scraping in which this is the process that is just like data scraping only that you will be getting information that is contained in your PDF files. In order for you to begin scraping information from PDF, you must choose and exploit a tool that is specifically designed for this process. However, you will find that it is not easy to locate the right tool that will enable you to perform PDF scraping effectively. This is because most of the tools today have problems in obtaining exactly the same data that you want without personalizing them.

Nevertheless, if you search well enough, you will be able to encounter the program that you are looking for. There is no need for you to have programming language knowledge in order for you to use them. You can easily specify your own preferences and the software will do the rest of the work for you. There are also companies out there that you can contact and they will perform the task since they have the right tools that they can use. If you choose to do things manually, you will find that this is indeed tedious and complicated whereas if you compare this to having professionals do the job for you, they will be able to finish it in no time at all. Scraping information from PDF is a process where you collect the information that can be found on the internet and this does not infringe copyright laws.

 Source  URL : http://ezinearticles.com/?Increasing-Accessibility-by-Scraping-Information-From-PDF&id=4593863

Thursday 12 May 2016

Web Scraping to Create Open Data

Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from
copyright, patents or other mechanisms of control.

My first experience with open data was in the year 2010. I wanted to create a better app for Bicing, the local bike sharing system in
Barcelona. Their website was a nightmare to use and I was tired of needing to walk to each station, trying to guess which ones had bicycles.
There was no app for Android, other than a couple of unofficial attempts that didn’t work at all.

I began as most would; I searched the internet and found a library named python-bicing that was somehow able to retrieve station and
bike information. This was my first time using Python and, after some investigation, I learned what the code was doing: accessing the
official website, parsing the JavaScript that generated their buggy map and giving back a nice chunk of Python objects that represented
bike share stations.

This I learned was called web scraping. It was like I had figured out a magic trick that would allow me to always be able to access the data I
needed without having to rely on faulty websites.

The rise of OpenBicing and CityBikes

Shortly after, I launched OpenBicing, an Android app for the local bike sharing system in Barcelona, together with a backend that used
python-bicing. I also shared a public API that provided this information so that nobody else had to do the dirty work ever again.

Since other cities were having the same problem, we expanded the scope of the project worldwide and renamed it CityBikes. That was 6
years ago.

To date, CityBikes is the most comprehensive and widely used open API for bike sharing information, with support for over 400 cities
worldwide. Our API processes around 10 requests per second and we scrape each of the 418 feeds about every three minutes. Making our
core library available for anyone to contribute has been crucial in maintaining and adding coverage for all of the supported systems.

The open data fallacy

We are usually regarded as “an open data project” even though less than 10% of our feeds come from properly licensed, documented and
machine-readable feeds. The remaining 90% is composed of 188 feeds that are machine-readable, but not licensed nor documented and
230 that are entirely maintained by scraping HTML pages.

North American BikeShare Association) recently published GBFS (General Bikeshare Feed Specification). This is clearly a step in the right
direction, but I can’t help but look at the almost 60% of services we currently support through scraping and wonder how long it will take the
remaining organizations to release their information, if ever. This is even more the case considering these numbers aren’t even taking into
account worldwide coverage.

Over the last few years there has been a progression by transportation companies and city councils toward providing their information as
“open data”. Directive 2003/98/EC encourages EU member states to release information regarding public services.

Yet, in most cases, there’s little action in enforcing Public Private Partnerships (PPP) to release their public information under a non-
restrictive license or even to transfer ownership of the data to city councils to be included in their open data portals.

Even with the increasing number of companies and institutions interested in participating in open data, by no means should we consider
open data a reality or something to be taken for granted. I firmly believe in the future and benefits of open data, I have seen them
happening all around CityBikes, but as technologists we need to stress the fact that the data is not out there yet.

The benefits of open data

When I started this project, I sought to make a difference in Barcelona. Now you can find tons of bike sharing apps that use our API on all
major platforms. It doesn’t matter that these are not our own apps. They are solving the same problem we were trying to fix, so their
success is our success.

Besides popular apps like Moovit or CityMapper, there are many neat projects out there, some of which are published under free software
licenses. Ideally, a city council could create a customization of any of these apps for their own use.

Most official applications for bike sharing systems have terrible ratings. The core business of transportation companies is running a service,

so they have no real motivation to create an engaging UI or innovate further. In some cases, the city council does not even own the rights to
the data, being completely at the mercy of the company providing the transportation service.

Open data over apps

When providing public services, city councils and companies often get lost in what they should offer as an aid to the service. They focus on
a nice map or a flashy application, rather than providing the data behind these service aids. Maps, apps, and websites have a limited focus
and usually serve a single purpose. On the other hand, data is malleable and the purest form of representation. While you can’t create
something new from looking and playing with a static map (except, of course, if you scrape it), data can be used to create countless
different iterations. It can even provide a bridge that will allow anyone to participate, improve and build on top of these public services.

Wrap Up

At this point, you might wonder why I care so much about bike sharing. To me it’s not about bike sharing anymore. CityBikes is just too
good of an open data metaphor, a simulation in which public information is freely accessible to everyone. It shows the benefits of open
data and the deficiencies that arise from the lack thereof.

We shouldn’t have to create open data by scraping websites. This information should be already available, easily accessed and provided in
a machine-readable format from the original providers, be they city councils or transportation companies. However, until there’s another
option, we’ll always have scraping.


Source : https://blog.scrapinghub.com/2016/03/30/web-scraping-to-create-open-data/