Friday 3 July 2015

SFTW: Scraping data with Google Refine

For the first Something For The Weekend of 2012 I want to tackle a common problem when you’re trying to scrape a collection of webpage: they have some sort of structure in their URL like this, where part of the URL refers to the name or code of an entity:     http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237521

  tp://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237629

    ttp://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237823

In this instance, you can see that the URL is identical apart from a 7 digit code at the end: the ID of the school the data refers to.

There are a number of ways you could scrape this data. You could use Google Docs and the =importXML formula, but Google Docs will only let you use this 50 times on any one spreadsheet (you could copy the results and select Edit > Paste Special > Values Only and then use the formula a further 50 times if it’s not too many – here’s one I prepared earlier).

And you could use Scraperwiki to write a powerful scraper – but you need to understand enough coding to do so quickly (here’s a demo I prepared earlier).

A middle option is to use Google Refine, and here’s how you do it.

Assembling the ingredients

With the basic URL structure identified, we already have half of our ingredients. What we need  next is a list of the ID codes that we’re going to use to complete each URL.

An advanced search for “list seed number scottish schools filetype:xls” brings up a link to this spreadsheet (XLS) which gives us just that.

The spreadsheet will need editing: remove any rows you don’t need. This will reduce the time that the scraper will take in going through them. For example, if you’re only interested in one local authority, or one type of school, sort your spreadsheet so that you can delete those above or below them.

Now to combine  the ID codes with the base URL.

Bringing your data into Google Refine

Open Google Refine and create a new project with the edited spreadsheet containing the school IDs.

At the top of the school ID column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top call this ‘URL’.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

“http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=”+value

(Type in the quotation marks yourself – if you’re copying them from a webpage you may have problems)

The ‘value’ bit means the value of each cell in the column you just selected. The plus sign adds it to the end of the URL in quotes.

In the Preview window you should see the results – you can even copy one of the resulting URLs and paste it into a browser to check it works. (On one occasion Google Refine added .0 to the end of the ID number, ruining the URL. You can solve this by changing ‘value’ to value.substring(0,7) – this extracts the first 7 characters of the ID number, omitting the ‘.0') UPDATE: in the comment Thad suggests “perhaps, upon import of your spreadsheet of IDs, you forgot to uncheck the importer option to Parse as numbers?”

Click OK if you’re happy, and you should have a new column with a URL for each school ID.

Grabbing the HTML for each page

Now click on the top of this new URL column and select Edit column > Add column by fetching URLs…

In the New column name box at the top call this ‘HTML’.

All you need in the Expression window is ‘value’, so leave that as it is.

Click OK.

Google Refine will now go to each of those URLs and fetch the HTML contents. As we have a couple thousand rows here, this will take a long time – hours, depending on the speed of your computer and internet connection (it may not work at all if either isn’t very fast). So leave it running and come back to it later.

Extracting data from the raw HTML with parseHTML

When it’s finished you’ll have another column where each cell is a bunch of HTML. You’ll need to create a new column to extract what you need from that, and you’ll also need some GREL expressions explained here.

First you need to identify what data you want, and where it is in the HTML. To find it, right-click on one of the webpages containing the data, and search for a key phrase or figure that you want to extract. Around that data you want to find a HTML tag like <table class=”destinations”> or <div id=”statistics”>. Keep that open in another window while you tweak the expression we come onto below…

Back in Google Refine, at the top of the HTML column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top give it a name describing the data you’re going to pull out.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

value.parseHtml().select(“table.destinations”)[0].select(“tr”).toString()

(Again, type the quotation marks yourself rather than copying them from here or you may have problems)

I’ll break down what this is doing:

value.parseHtml()

parse the HTML in each cell (value)

.select(“table.destinations”)

find a table with a class (.) of “destinations” (in the source HTML this reads <table class=”destinations”>. If it was <div id=”statistics”> then you would write .select(“div#statistics”) – the hash sign representing an ‘id’ and the full stop representing a ‘class’.

[0]

This zero in square brackets tells Refine to only grab the first table – a number 1 would indicate the second, and so on. This is because numbering (“indexing”) generally begins with zero in programming.

.select(“tr”)

Now, within that table, find anything within the tag <tr>

.toString()

And convert the results into a string of text.

The results of that expression in the Preview window should look something like this:

<tr> <th></th> <th>Abbotswell School</th> <th>Aberdeen City</th> <th>Scotland</th> </tr> <tr> <th>Percentage of pupils</th> <td>25.5%</td> <td>16.3%</td> <td>22.6%</td> </tr>

This is still HTML, but a much smaller and manageable chunk. You could, if you chose, now export it as a spreadsheet file and use various techniques to get rid of the tags (Find and Replace, for example) and split the data into separate columns (the =SPLIT formula, for example).

Or you could further tweak your GREL code in Refine to drill further into your data, like so:

value.parseHtml().select(“table.destinations”)[0].select(“td”)[0].toString()

Which would give you this:

<td>25.5%</td>

Or you can add the .substring function to strip out the HTML like so (assuming that the data you want is always 5 characters long):

value.parseHtml().select(“table.destinations”)[0].select(“td”)[0].toString().substring(5,10)

When you’re happy, click OK and you should have a new column for that data. You can repeat this for every piece of data you want to extract into a new column.

Then click Export in the upper right corner and save as a CSV or Excel file.

Source: http://onlinejournalismblog.com/2012/01/13/sftw-scraping-data-with-google-refine/

Tuesday 9 June 2015

Scraping Services - Assuring Scraping Success with Proxy Data Scraping

Have you ever heard of "Data Scraping?" Data Scraping is the process of collecting useful data that has been placed in the public domain of the internet (private areas too if conditions are met) and storing it in databases or spreadsheets for later use in various applications. Data Scraping technology is not new and many a successful businessman has made his fortune by taking advantage of data scraping technology.

Sometimes website owners may not derive much pleasure from automated harvesting of their data. Webmasters have learned to disallow web scrapers access to their websites by using tools or methods that block certain ip addresses from retrieving website content. Data scrapers are left with the choice to either target a different website, or to move the harvesting script from computer to computer using a different IP address each time and extract as much data as possible until all of the scraper's computers are eventually blocked.

Thankfully there is a modern solution to this problem. Proxy Data Scraping technology solves the problem by using proxy IP addresses. Every time your data scraping program executes an extraction from a website, the website thinks it is coming from a different IP address. To the website owner, proxy data scraping simply looks like a short period of increased traffic from all around the world. They have very limited and tedious ways of blocking such a script but more importantly -- most of the time, they simply won't know they are being scraped.

You may now be asking yourself, "Where can I get Proxy Data Scraping Technology for my project?" The "do-it-yourself" solution is, rather unfortunately, not simple at all. Setting up a proxy data scraping network takes a lot of time and requires that you either own a bunch of IP addresses and suitable servers to be used as proxies, not to mention the IT guru you need to get everything configured properly. You could consider renting proxy servers from select hosting providers, but that option tends to be quite pricey but arguably better than the alternative: dangerous and unreliable (but free) public proxy servers.

There are literally thousands of free proxy servers located around the globe that are simple enough to use. The trick however is finding them. Many sites list hundreds of servers, but locating one that is working, open, and supports the type of protocols you need can be a lesson in persistence, trial, and error. However if you do succeed in discovering a pool of working public proxies, there are still inherent dangers of using them. First off, you don't know who the server belongs to or what activities are going on elsewhere on the server. Sending sensitive requests or data through a public proxy is a bad idea. It is fairly easy for a proxy server to capture any information you send through it or that it sends back to you. If you choose the public proxy method, make sure you never send any transaction through that might compromise you or anyone else in case disreputable people are made aware of the data.

A less risky scenario for proxy data scraping is to rent a rotating proxy connection that cycles through a large number of private IP addresses. There are several of these companies available that claim to delete all web traffic logs which allows you to anonymously harvest the web with minimal threat of reprisal. Companies such as offer large scale anonymous proxy solutions, but often carry a fairly hefty setup fee to get you going.

The other advantage is that companies who own such networks can often help you design and implementation of a custom proxy data scraping program instead of trying to work with a generic scraping bot. After performing a simple Google search, I quickly found one company (www.ScrapeGoat.com) that provides anonymous proxy server access for data scraping purposes. Or, according to their website, if you want to make your life even easier, ScrapeGoat can extract the data for you and deliver it in a variety of different formats often before you could even finish configuring your off the shelf data scraping program.

Whichever path you choose for your proxy data scraping needs, don't let a few simple tricks thwart you from accessing all the wonderful information stored on the world wide web!

Source: http://ezinearticles.com/?Assuring-Scraping-Success-with-Proxy-Data-Scraping&id=248993

Tuesday 2 June 2015

Twitter Scraper Python Library

I wanted to save the tweets from Transparency Camp. This prompted me to turn Anna‘s basic Twitter scraper into a library. Here’s how you use it.

Import it. (It only works on ScraperWiki, unfortunately.)

from scraperwiki import swimport

search = swimport('twitter_search').search

Then search for terms.

search(['picnic #tcamp12', 'from:TCampDC', '@TCampDC', '#tcamp12', '#viphack'])

A separate search will be run on each of these phrases. That’s it.

A more complete search

Searching for #tcamp12 and #viphack didn’t get me all of the tweets because I waited like a week to do this. In order to get a more complete list of the tweets, I looked at the tweets returned from that first search; I searched for tweets referencing the users who had tweeted those tweets.

from scraperwiki.sqlite import save, select

from time import sleep

# Search by user to get some more

users = [row['from_user'] + ' tcamp12' for row in \

select('distinct from_user from swdata where from_user where user > "%s"' \

% get_var('previous_from_user', ''))]

for user in users:

    search([user], num_pages = 2)

    save_var('previous_from_user', user)

    sleep(2)

By default, the search function retrieves 15 pages of results, which is the maximum. In order to save some time, I limited this second phase of searching to two pages, or 200 results; I doubted that there would be more than 200 relevant results mentioning a particular user.

The full script also counts how many tweets were made by each user.

Library

Remember, this is a library, so you can easily reuse it in your own scripts, like Max Richman did.

Source: https://scraperwiki.wordpress.com/2012/07/04/twitter-scraper-python-library/

Friday 29 May 2015

Data Scraping Services - Things to take care while doing Web Scraping!!!

In the present day and age, web scraping word becomes most popular in data science. Basically web scraping is extracting the information from the websites using pre-written programs and web scraping scripts. Many organizations have successfully used web site scraping to build relevant and useful database that they use on a daily basis to enhance their business interests. This is the age of the Big Data and web scraping is one of the trending techniques in the data science.

Throughout my journey of learning web scraping and implementing many successful scraping projects, I have come across some great experiences we can learn from.  In this post, I’m going to discuss some of the approaches to take and approaches to avoid while executing web scraping.

User Proxies: Anonymously scraping data from websites

One should not scrape website with a single IP Address. Because when you repeatedly request the web page for web scraping, there is a chance that the remote web server might block your IP address preventing further request to the web page. To overcome this situation, one should scrape websites with the help of proxy servers (anonymous scraping). This will minimize the risk of getting trapped and blacklisted by a website. Use of Proxies to hide your identity (network details) to remote web servers while scraping data. You may also use a VPN instead of proxies to anonymously scrape websites.

Take maximum data and store it.

Do not follow “process the web page as it comes from the remote server”. Instead take all the information and store it to disk. This approach will be useful when your scraping algorithm breaks in the middle. In this case you don’t have to start scraping again. Never download the same content more than once as you are just wasting bandwidth. Try and download all content to disk in one go and then do the processing.

Follow strict rules in parsing:

Check various rules while parsing the information from the web site. For example if you expect a value to be a date then check that it’s really a date. This may greatly improve the quality of information. When you get unexpected data, then the algorithm need to be changed accordingly.

Respect Robots.txt

Robots.txt specifies the set of rules that should be followed by web crawlers and robots. I strongly advise you to consider and adjust your crawler to fully respect robots.txt. Robots.txt contains instructions on the exact pages that you are allowed to crawl, user-agent, and the requisite intervals between page requests. Following to these instructions minimizes the chance of getting blacklisted and banned from website owner.

Use XPath Smartly

XPath is a nice option to select elements of the HTML document more flexibly than CSS Selectors.  Be careful about HTML structure change through page to page so one xpath you made may be failed to extract data on another page due to changes in HTML structure.

Obey Website TOC:

Some websites make it absolutely apparent in their terms and conditions that they are particularly against to web scraping activities on their content. This can make you vulnerable against possible ethical and legal implications.

Test sample scrape and verify the data with actual scrape

Once you are done with web scraping project set up, you need to test it for sometimes. Check the extracted data. If something is not good, find out the cause and make changes accordingly and finally come to a perfect web scraping project.

Source: http://webdata-scraping.com/things-take-care-web-scraping/

Tuesday 26 May 2015

Web Scraping Services : What are the ethics of web scraping?

Someone recently asked: "Is web scraping an ethical concept?" I believe that web scraping is absolutely an ethical concept. Web scraping (or screen scraping) is a mechanism to have a computer read a website. There is absolutely no technical difference between an automated computer viewing a website and a human-driven computer viewing a website. Furthermore, if done correctly, scraping can provide many benefits to all involved.

There are a bunch of great uses for web scraping. First, services like Instapaper, which allow saving content for reading on the go, use screen scraping to save a copy of the website to your phone. Second, services like Mint.com, an app which tells you where and how you are spending your money, uses screen scraping to access your bank's website (all with your permission). This is useful because banks do not provide many ways for programmers to access your financial data, even if you want them to. By getting access to your data, programmers can provide really interesting visualizations and insight into your spending habits, which can help you save money.

That said, web scraping can veer into unethical territory. This can take the form of reading websites much quicker than a human could, which can cause difficulty for the servers to handle it. This can cause degraded performance in the website. Malicious hackers use this tactic in what’s known as a "Denial of Service" attack.

Another aspect of unethical web scraping comes in what you do with that data. Some people will scrape the contents of a website and post it as their own, in effect stealing this content. This is a big no-no for the same reasons that taking someone else's book and putting your name on it is a bad idea. Intellectual property, copyright and trademark laws still apply on the internet and your legal recourse is much the same. People engaging in web scraping should make every effort to comply with the stated terms of service for a website. Even when in compliance with those terms, you should take special care in ensuring your activity doesn't affect other users of a website.

One of the downsides to screen scraping is it can be a brittle process. Minor changes to the backing website can often leave a scraper completely broken. Herein lies the mechanism for prevention: making changes to the structure of the code of your website can wreak havoc on a screen scraper's ability to extract information. Periodically making changes that are invisible to the user but affect the content of the code being returned is the most effective mechanism to thwart screen scrapers. That said, this is only a set-back. Authors of screen scrapers can always update them and, as there is no technical difference between a computer-backed browser and a human-backed browser, there's no way to 100% prevent access.

Going forward, I expect screen scraping to increase. One of the main reasons for screen scraping is that the underlying website doesn't have a way for programmers to get access to the data they want. As the number of programmers (and the need for programmers) increases over time, so too will the need for data sources. It is unreasonable to expect every company to dedicate the resources to build a programmer-friendly access point. Screen scraping puts the onus of data extraction on the programmer, not the company with the data, which can work out well for all involved.

Source: https://quickleft.com/blog/is-web-scraping-ethical/

Friday 22 May 2015

5 tips for scraping big websites

Scraping bigger websites can be a challenge if done the wrong way.

Bigger websites would have more data, more security and more pages. We’ve learned a lot from our years of crawling such large complex websites, and these tips could help solve some of your challenges

1. Cache pages visited for scraping

When scraping big websites, its always a good idea to cache the data you have already downloaded. So you don’t have to put load on the website again, in case you have start over again or that page is required again during scraping. Its effortless to cache to a key value store like redis.But Databases and filesystem caches are also good.

2. Don’t flood the website with large number of parallel requests , take it slow

Big websites posses algorithms to detect webscraping, large number of parallel requests from the same ip address will identify you as a Denial Of Service Attack on their website, and blacklist your IPs immediately. A better idea is to time your requests properly one after the other, giving it some human behavior. Oh!.. but scraping like that will take you ages. So balance requests using the average response time of the websites, and play around with the number of parallel requests to the website to get the right number.

3. Store the URLs that you have already fetched

You may want to keep a list of URLs you have already fetched, in a database or a key value store. What would you do if you scraper crashes after scraping 70% of the website. If you need to complete the remaining 30%, with out this list of URLs, you’ll waste a lot of time and bandwidth. Make sure you store this list of URLs

somewhere permanent, till you have all the required data. This could also be combined with the cache. This way, you’ll be able to resume scraping.

4. Split scraping into different phases

Its easier and safer if you split the scraping into multiple smaller phases. For example, you could split scraping a huge site into two. One for gathering links to the pages from which you need to extract data and another for downloading these pages to scrape content.

5. Take only whats required

Don’t grab or follow every link unless its required. You can define a proper navigation scheme to make the scraper visit only the pages required. Its always tempting to grab everything, but its just a waste of bandwidth, time and storage.

Source: http://learn.scrapehero.com/5-tips-for-scraping-big-websites/

Wednesday 20 May 2015

How to Generate Sales Leads Using Web Scraping Services

The first stage of any selling process is what is popularly known as “lead generation”. This phase is what most businesses place at the apex of their sales concerns. It is a driving force that governs decision-making at its highest levels, and influences business strategy and planning. If you are about to embark on an outbound sales campaign and are in the process of looking for leads, you would acknowledge the fact that lead generation process is of extreme importance for any business.

Different lead generation techniques have been used over and over again by companies around the world to satiate this growing business need. Newer, more innovative methods have also emerged to help marketers in this process. One such method of lead generation that is fast catching on, and is poised to play a big role for businesses in the coming years, is web scraping. With web scraping, you can easily get access to multiple relevant and highly customized leads – a perfect starting point for any marketing, promotional or sales campaign.

The prominence of Web Scraping in overall marketing strategy

At present, levels of competition have risen sky high for most businesses. For success, lead generation and gaining insight about customer behavior and preferences is an essential business requirement. Web scraping is the process of scraping or mining the internet for information. Different tools and techniques can be used to harvest information from multiple internet sources based on relevance, and the structured and organized in a way that makes sense to your business. Companies that provide web scraping services essentially use web scrapers to generate a targeted lead database that your company can then integrate into its marketing and sales strategies and plans.

The actual process of web scraping involves creating scraping scripts or algorithms which crawl the web for information based on certain preset parameters and options. The scraping process can be customized and tuned towards finding the kind of data that your business needs. The script can extract data from websites automatically, collate and put together a meaningful collection of leads for business development.

Lead Generation Basics

At a very high level, any person who has the resources and the intent to purchase your product or service qualifies as a lead. In the present scenario, you need to go far deeper than that. Marketers need to observe behavior patterns and purchasing trends to ensure that a particular person qualifies as a lead. If you have a group of people you are targeting, you need to decide who the viable leads will be, acquire their contact information and store it in a database for further action.

List buying used to be a popular way to get leads, but their efficacy has dwindled over time. Web scraping is the fast coming up as a feasible lead generation technique, allowing you to find highly focused and targeted leads in short amounts of time. All you need is a service provider that would carry out the data mining necessary for lead generation, and you end up with a list of actionable leads that you can try selling to.

How Web Scraping makes a substantial difference

With web scraping, you can extract valuable predictive information from websites. Web scraping facilitates high quality data collection and allows you to structure marketing and sales campaigns better. To drive sales and maximize revenue, you need strong, viable leads. To facilitate this, you need critical data which encompasses customer behavior, contact details, buying patterns and trends, willingness and ability to spend resources, and a myriad of other aspects critical to ascertain the potential of an entity as a rewarding lead. Data mining through web scraping can be a great way to get to these factors and identifying the leads that would make a difference for your business.

Crawling through many different web locales using different techniques, web scraping services pick up a wealth of information. This highly relevant and specialized information instantly provides your business with actionable leads. Furthermore, this exercise allows you to fine-tune your data management processes, make more accurate and reliable predictions and projections, arrive at more effective, strategic and marketing decisions and customize your workflow and business development to better suit the current market.

The Process and the Tools

Lead generation, being one of the most important processes for any business, can prove to be an expensive proposition if not handled strategically. Companies spend large amounts of their resources acquiring viable leads they can sell to. With web scraping, you can dramatically cut down the costs involved in lead generation and take your business forward with speed and efficiency. Here are some of the time-tested web scraping tools which can come in handy for lead generation –

•    Website download software – Used to copy entire websites to local storage. All website pages are downloaded and the hierarchy of navigation and internal links preserve. The stored pages can then be viewed and scoured for information at any later time.     Web scraper – Tools that crawl through bulk information on the internet, extracting specific, relevant data using a set of pre-defined parameters.

•    Data grabber – Sifts through websites and databases fast and extracts all the information, which can be sorted and classified later.

•    Text extractor – Can be used to scrape multiple websites or locations for acquiring text content from websites and web documents. It can mine data from a variety of text file formats and platforms.

With these tools, web scraping services scrape websites for lead generation and provide your business with a set of strong, actionable leads that can make a difference.

Covering all Bases

The strength of web scraping and web crawling lies in the fact that it covers all the necessary bases when it comes to lead generation. Data is harvested, structured, categorized and organized in such a way that businesses can easily use the data provided for their sales leads. As discussed earlier, cold and detached lists no longer provide you with enough actionable leads. You need to look at various factors and consider them during your lead generation efforts –

•    Contact details of the prospect

•    Purchasing power and purchasing history of the prospect

•    Past purchasing trends, willingness to purchase and history of buying preferences of the prospect

•    Social markers that are indicative of behavioral patterns

•    Commercial and business markers that are indicative of behavioral patterns

•    Transactional details

•    Other factors including age, gender, demography, social circles, language and interests

All these factors need to be taken into account and considered in detail if you have to ensure whether a lead is viable and actionable, or not. With web scraping you can get enough data about every single prospect, connect all the data collected with the help of onboarding, and ascertain with conviction whether a particular prospect will be viable for your business.

Let us take a look at how web scraping addresses these different factors –

1. Scraping website’s

During the scraping process, all websites where a particular prospect has some participation are crawled for data. Seemingly disjointed data can be made into a sensible unit by the use of onboarding- linking user activities with their online entities with the help of user IDs. Documents can be scanned for participation. E-commerce portals can be scanned to find comments and ratings a prospect might have delivered to certain products. Service providers’ websites can be scraped to find if the prospect has given a testimonial to any particular service. All these details can then be accumulated into a meaningful data collection that is indicative of the purchasing power and intent of the prospect, along with important data about buying preferences and tastes.

2. Social scraping

According to a study, most internet users spend upwards of two hours every day on social networks. Therefore, scraping social networks is a great way to explore prospects in detail. Initially, you can get important identification markers like names, addresses, contact numbers and email addresses. Further, social networks can also supply information about age, gender, demography and language choices. From this basic starting point, further details can be added by scraping social activity over long periods of time and looking for activities which indicate purchasing preferences, trends and interests. This exercise provides highly relevant and targeted information about prospects can be constructively used while designing sales campaigns.

3. Transaction scraping

Through the scraping of transactions, you get a clear idea about the purchasing power of prospects. If you are looking for certain income groups or leads that invest in certain market sectors or during certain specific periods of time, transaction scraping is the best way to harvest meaningful information. This also helps you with competition analysis and provides you with pointers to fine-tune your marketing and sales strategies.

Using these varied lead generation techniques and finding the right balance and combination is key to securing the right leads for your business. Overall, signing up for web scraping services can be a make or break factor for your business going forward. With a steady supply of valuable leads, you can supercharge your sales, maximize returns and craft the perfect marketing maneuvers to take your business to an altogether new dimension.

Source: https://www.promptcloud.com/blog/how-to-generate-sales-leads-using-web-scraping-services/

Sunday 17 May 2015

Metadata Scraping Service

As mentioned in Robert's last blog post we set up a scraping service which supports users working with citations by extracting automatically references from digital library or publisher websites. We use a very similar service in BibSonomy to support our users while posting a new reference. However, the service is independent from BibSonomy. Our main goal is to make the metadata of other websites easily accessible to every user who needs bibliographic metadata. Therefore we offer the extracted information in BibTeX format. Most tools allow to import BibTeX so it should be very easy for everyone to get the data into his own tool. The service is running under the following URL:

http://scraper.bibsonomy.org/

Currently we support more than 60 different websites (here the full list) and we are working on further extensions. In the near future we will make the source code of our scrapers publicly available under GPL and we hope that other people will find it useful and start to help us by implementing their own scrapers.

How does the service work?

In principle there are two ways to use the service. One uses a so

called bookmarklet and the other is simply based on the URL. If you

have a webpage of a supported site e.g. from ACM digital library the

following page:

Logsonomy - social information retrieval with logdata

then you can copy this URL into the form on the service homepage and the service will return you the extracted BibTeX information. As this is not a very convenient way to access the data we provide a ScrapePublication button. This button is a small piece of JavaScript and can be copied to the toolbar of the browser. By pressing this button while visiting a digital library webpage the URL will be automatically copied and sent to the scraping service and the metadata is extracted.

The service has three options which can be used to customize it and to make it useful for other systems. Obviously one parameter is the URL itself which is used by the bookmarklet, too. The next is the selection parameter which allows to send text to the service and the last parameter allows to change the output format from html to plain BibTeX. This last parameter makes integration with other systems very simple.

If needed we can provide the metadata in other formats as well but currently we support only BibTeX.

Source: http://blog.bibsonomy.org/2008/11/metadata-scraping-service.html

Wednesday 6 May 2015

4 Web Scraping Tools To Save You Time On Data Extraction

Either you are working on a product website, struggling to add live data feed to your app or merely need to pull out a huge amount of online data for analysis, an accurate web scraping tool can save you loads of time and keep you sane. Here are four powerful web scraping tools to save you from copy-pasting or spending time on writing your own scripts.

1. Uipath

Either you are working on a product website, struggling to add live data feed to your app or merely need to pull out a huge amount of online data for analysis, an accurate web scraping tool can save you loads of time and keep you sane. Here are four powerful web scraping tools to save you from copy-pasting or spending time on writing your own scripts.

1. Uipath

Uipath specializes in developing various process automation software including web scraping and screen scraping software for desktop and web. Uipath web scraper is perfect for non-coders and easily surpasses most common data extraction challenges including page navigation, digging through flash and even scraping PDF files. All you need to do is open the web scraping wizard and simply highlight the data you need to extract. The tool will scrape all the data following this pattern at all pages you’ve chosen and sort it accordingly. You can add as many items for scraping as you like and have them sorted in respective columns. As a result, you receive a neat Excel or CSV document with all the data eliminated from duplicates.

Moreover, Uipath isn’t just about scraping. This software can be used not only for extracting data, but to manipulate the interface of another app, thus establishing data transfers among the two of them. Basically, this tool could be used to conduct any repetitive task a human could do, yet much faster and with higher accuracy.

Pros: You can automate form filling, clicking buttons, navigation etc. Uipath scraper is impressively accurate, fast and simple to use. It “reads” all types of data on screen (JS, HTML, Silverlight and more), plus you can train the software to emulate human actions of various complexity.

Cons: Premium software runs at a premium price. Uipath is an affordable professional solution, but may be a bit too pricey for personal use.

2. Import.io

Data Extraction

Either you are working on a product website, struggling to add live data feed to your app or merely need to pull out a huge amount of online data for analysis, an accurate web scraping tool can save you loads of time and keep you sane. Here are four powerful web scraping tools to save you from copy-pasting or spending time on writing your own scripts.

1. Uipath

Uipath specializes in developing various process automation software including web scraping and screen scraping software for desktop and web. Uipath web scraper is perfect for non-coders and easily surpasses most common data extraction challenges including page navigation, digging through flash and even scraping PDF files. All you need to do is open the web scraping wizard and simply highlight the data you need to extract. The tool will scrape all the data following this pattern at all pages you’ve chosen and sort it accordingly. You can add as many items for scraping as you like and have them sorted in respective columns. As a result, you receive a neat Excel or CSV document with all the data eliminated from duplicates.

Moreover, Uipath isn’t just about scraping. This software can be used not only for extracting data, but to manipulate the interface of another app, thus establishing data transfers among the two of them. Basically, this tool could be used to conduct any repetitive task a human could do, yet much faster and with higher accuracy.

Pros: You can automate form filling, clicking buttons, navigation etc. Uipath scraper is impressively accurate, fast and simple to use. It “reads” all types of data on screen (JS, HTML, Silverlight and more), plus you can train the software to emulate human actions of various complexity.

Cons: Premium software runs at a premium price. Uipath is an affordable professional solution, but may be a bit too pricey for personal use.

2. Import.io

Import.io offers you a free desktop app to help you scrap all the data you need from an unlimited amount of web pages. The service treats each page as a potential data source to generate API from. If the page you’ve submitted has been previously processed, you can access its API and get some of the data. In other case, Import.io will guide you through the process of creating the scraping matrix by building connectors (for navigation) or extractors (to pull out the needed data). Afterwards, you submit a request for extraction and it’s typically processed within 24 hours. All the data is private and you can schedule auto refreshments at any chosen period of time.

Pros: The service is easy-to-use with no tech skills needed. It can  pages with data (those that needed login/pass), plus it’s free. Minimalistic effective design and simple navigation comes along.

Cons: Improt.io has hard times navigating through combinations of javascript/POST and cannot navigate from one page to another (e.g. click next, second page etc).  Sometimes, it takes over 24 hours to receive the report.  Besides, it’s a browser-only app, non-compatible with other applications.

3. Kimono

Either you are working on a product website, struggling to add live data feed to your app or merely need to pull out a huge amount of online data for analysis, an accurate web scraping tool can save you loads of time and keep you sane. Here are four powerful web scraping tools to save you from copy-pasting or spending time on writing your own scripts.

1. Uipath

Uipath specializes in developing various process automation software including web scraping and screen scraping software for desktop and web. Uipath web scraper is perfect for non-coders and easily surpasses most common data extraction challenges including page navigation, digging through flash and even scraping PDF files. All you need to do is open the web scraping wizard and simply highlight the data you need to extract. The tool will scrape all the data following this pattern at all pages you’ve chosen and sort it accordingly. You can add as many items for scraping as you like and have them sorted in respective columns. As a result, you receive a neat Excel or CSV document with all the data eliminated from duplicates.

Moreover, Uipath isn’t just about scraping. This software can be used not only for extracting data, but to manipulate the interface of another app, thus establishing data transfers among the two of them. Basically, this tool could be used to conduct any repetitive task a human could do, yet much faster and with higher accuracy.

Pros: You can automate form filling, clicking buttons, navigation etc. Uipath scraper is impressively accurate, fast and simple to use. It “reads” all types of data on screen (JS, HTML, Silverlight and more), plus you can train the software to emulate human actions of various complexity.

Cons: Premium software runs at a premium price. Uipath is an affordable professional solution, but may be a bit too pricey for personal use.

2. Import.io

Import.io offers you a free desktop app to help you scrap all the data you need from an unlimited amount of web pages. The service treats each page as a potential data source to generate API from. If the page you’ve submitted has been previously processed, you can access its API and get some of the data. In other case, Import.io will guide you through the process of creating the scraping matrix by building connectors (for navigation) or extractors (to pull out the needed data). Afterwards, you submit a request for extraction and it’s typically processed within 24 hours. All the data is private and you can schedule auto refreshments at any chosen period of time.

Pros: The service is easy-to-use with no tech skills needed. It can  pages with data (those that needed login/pass), plus it’s free. Minimalistic effective design and simple navigation comes along.

Cons: Improt.io has hard times navigating through combinations of javascript/POST and cannot navigate from one page to another (e.g. click next, second page etc).  Sometimes, it takes over 24 hours to receive the report.  Besides, it’s a browser-only app, non-compatible with other applications.

3. Kimono

Kimono is a popular web scraper among app developers who prefer to power up their products with live data and no additional code. It saves you tons of time when you need to fill up your app with mashing data. Install Kimono Browser bookmarklet; highlight page elements you need to and provide some positive/negative examples to train the tool. After labeling all the data you can download it in CSV/JSON/a web endpoint format. The APIs created for your pages are stored in the cloud and you can run them on schedule. So far, Kimono is free to use with pro and enterprise solutions to be launched soon.

Pros: The tool works pretty fast and works great with scraping newsfeeds and prices. The data is rather accurate.

Cons: No page navigation available and you need to spend quite a lot of time to train Kimono before it starts to pull out the multi items data accurate enough. In general, I’d say Kimono is more of an app mash-ups creator than a full-scale web scraper.

4. Screen Scraper

Either you are working on a product website, struggling to add live data feed to your app or merely need to pull out a huge amount of online data for analysis, an accurate web scraping tool can save you loads of time and keep you sane. Here are four powerful web scraping tools to save you from copy-pasting or spending time on writing your own scripts.

1. Uipath

Uipath specializes in developing various process automation software including web scraping and screen scraping software for desktop and web. Uipath web scraper is perfect for non-coders and easily surpasses most common data extraction challenges including page navigation, digging through flash and even scraping PDF files. All you need to do is open the web scraping wizard and simply highlight the data you need to extract. The tool will scrape all the data following this pattern at all pages you’ve chosen and sort it accordingly. You can add as many items for scraping as you like and have them sorted in respective columns. As a result, you receive a neat Excel or CSV document with all the data eliminated from duplicates.

Moreover, Uipath isn’t just about scraping. This software can be used not only for extracting data, but to manipulate the interface of another app, thus establishing data transfers among the two of them. Basically, this tool could be used to conduct any repetitive task a human could do, yet much faster and with higher accuracy.

Pros: You can automate form filling, clicking buttons, navigation etc. Uipath scraper is impressively accurate, fast and simple to use. It “reads” all types of data on screen (JS, HTML, Silverlight and more), plus you can train the software to emulate human actions of various complexity.

Cons: Premium software runs at a premium price. Uipath is an affordable professional solution, but may be a bit too pricey for personal use.

2. Import.io

Import.io offers you a free desktop app to help you scrap all the data you need from an unlimited amount of web pages. The service treats each page as a potential data source to generate API from. If the page you’ve submitted has been previously processed, you can access its API and get some of the data. In other case, Import.io will guide you through the process of creating the scraping matrix by building connectors (for navigation) or extractors (to pull out the needed data). Afterwards, you submit a request for extraction and it’s typically processed within 24 hours. All the data is private and you can schedule auto refreshments at any chosen period of time.

Pros: The service is easy-to-use with no tech skills needed. It can  pages with data (those that needed login/pass), plus it’s free. Minimalistic effective design and simple navigation comes along.

Cons: Improt.io has hard times navigating through combinations of javascript/POST and cannot navigate from one page to another (e.g. click next, second page etc).  Sometimes, it takes over 24 hours to receive the report.  Besides, it’s a browser-only app, non-compatible with other applications.

3. Kimono

Kimono is a popular web scraper among app developers who prefer to power up their products with live data and no additional code. It saves you tons of time when you need to fill up your app with mashing data. Install Kimono Browser bookmarklet; highlight page elements you need to and provide some positive/negative examples to train the tool. After labeling all the data you can download it in CSV/JSON/a web endpoint format. The APIs created for your pages are stored in the cloud and you can run them on schedule. So far, Kimono is free to use with pro and enterprise solutions to be launched soon.

Pros: The tool works pretty fast and works great with scraping newsfeeds and prices. The data is rather accurate.

Cons: No page navigation available and you need to spend quite a lot of time to train Kimono before it starts to pull out the multi items data accurate enough. In general, I’d say Kimono is more of an app mash-ups creator than a full-scale web scraper.

4. Screen Scraper

Screen scraper is pretty neat and tackles a lot of difficult tasks including navigation and precise data extractions, however it requires a bit of programming/tokenization skills if you’d like to run it super smooth. Launch the software, add a proxy, start recording the list of your actions and creating extracting patterns (some coding required). Works great with HTML and Javascript, however you should test it with Citrix and other platforms. Basically, screen scraper helps you writing simple web scraping scripts and lets you download the extracted data in txt/csv/excel format.

Pros: When set correctly, there’s no data extraction tasks Screen scraper fails to handle.

Cons: The tool is pricey and you’ll have to go through documentation and have basic coding skills to use it.

Source: http://tech.co/4-web-scraping-tools-save-time-data-extraction-2015-03

Thursday 30 April 2015

Earn Money From Price Comparison Through Web Scraping

Many individuals discover the pot of gold just within their reach. They have realized that there is money in the web. Cyber technology has blessed mankind with so many benefits that makes money very possible by just some clicks on the mouse and keyboard. Building a price comparison website is an effective way of helping clients find their desired products while you as the owner earn money at the same time.

Building price comparison websites

There is indeed much money in building price comparison websites but it is not an easy task especially for a novice in maintaining a website of one’s own. Since this entails some serious programming and ample familiarity with data feeds, you have to have a good working plan. In addition, what you are venturing into is greater than the usual blogs about just anything that you can think of. Furthermore, you are stepping into the vast field of electronic marketing, therefore you must be ready.

The first point of consideration is to identify which products or services are you going to include in your website. Choose a product or service that you and a majority of clients are mostly interested in. Suppose you want you to choose sports as your theme then you can include items and prices of sports gear, clothing such as uniforms, training videos, books, and other safety stuff. You need to do some research and even a survey to determine whether the goods and services you are promoting on your website are in demand and are what most people want to know. Moreover, it is on this stage that you may need the help of experts and veterans in the field of building to be assured that you are on the right track.

In addition, be willing to change in case your chosen category is not gaining readership or visitors. Then evaluate whether you need to expand or to be more specific in your description of the products and the comparison of the prices. Make your site prominent by search engine optimization (SEO) and make sure to acknowledge also that not too many people visit a site that is not free.

Helping visitors choose the best product/services

Good marketing strategy starts with knowing who your target audience are. There is indeed a need to do a lot of planning and research in order to understand your client’s needs and preferences. Moreover, knowing them thoroughly leads to achieving 100% consumer satisfaction. When you have provided everything they need to know about certain products, they would not need to seek elsewhere which will also gain you more regular visitors. Remember that your audience are members of communities and social networks such that there is a great possibility that they would spread the word around about the good services you are offering.

If there is a need to conduct a survey in addition to research, you should resort to it. In this manner you can discover what goods and services are not yet completely exhausted by the other websites or web creators. Ample knowledge about your potential visitors and consumers will surely make you effectively provide them with adequate statistics for their needs.

Your site will then look like a complete guidebook for them that will give them the best value for their money. Therefore, it must be thoroughly filled with product details, uses, options, and prices.

Making money as affiliate of eCommerce websites

Maintaining a price comparison website gives you less worry about getting paid or having your products bought and sold because income comes in through advertising and affiliate sales. Affiliate marketing is a way of earning money online by serving as a publisher for promotion of products, services or sites of businesses. The affiliate receives rewards from businesses for each visitor or client that comes to the business website or buys its product through the efforts of the advertising and promotion that is made by the affiliate. This is the online version of the concept of agent or referral fee sales channel. Aside from website owners, bloggers as well as members of community forums can also serve as affiliates. The affiliate earns money in three ways: through pay per link; pay per sale and pay per lead.

Trust in the reliability of the product - You should have a personal belief or confidence in the product you are promoting not only because it makes you sound more convincing, but also because you need to maintain your clients and establish credibility in your blog or website. In other words, don’t just pick any product. If you cannot use them personally, they should at least have several positive reviews and no negative ones.

Maintain credibility with readers and fellow bloggers - Befriend your readers and your co-bloggers by answering their queries sincerely and quickly. Your friendly attitude can win you their trust which is a very vital element of affiliate marketing.

Do reviews - In addition to publishing price comparison, you can gain more visitors by writing about the product and do proper SEO (Search Engine Optimization). So the expected happens, the more prominent the product becomes online, the higher will be your income.

Link with friends thru social media - Your friends have friends and their friends have also friends. Just think of how powerful your social media site can be when you post your link on your account on Facebook, Twitter or MySpace and others. Since trust is built on friendship, it is easy to get clients from among your friends and their friends.

Overall, you get all pertinent information about certain products through web data mining or web scraping. All you need to do is to be keen to the needs of your clients and use web content extraction efficiently.

Source: http://www.loginworks.com/earn-money-price-comparison-web-scraping/

Tuesday 28 April 2015

Scraping a website from a windows service

Question

Hi there.  I have a windows forms application that scrapes a website to retrieve some data.  I would like to implement the same functionality as a windows service.  The reason for this is to allow the program to run 24/7 without  having a user signed in. 

To that end, my current version of the program uses a web browser control (system.windows.forms.webbrowser) to navigate the pages, click the buttons, allow scripts to do their thing, etc.  I cannot figure out a way to do the same without the web browser control, but the web browser control cannot be instantiated in a windows service (because there is no user interface in a web service).

Does anyone have any brilliant ideas on how to get around this?

Thank you very much!

Answers

Hi Andy,

There is a tool which could let you manipulate anything you want on the website. This agile HTML parser builds a read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams). More information, please check:

http://htmlagilitypack.codeplex.com/

Have a nice day.

Best regards

All replies

You are not telling if you are using a .NET Express edition or not

You are not telling which Framework

You are not realy saying what data you are getting from the web site.

So

I made an example of service that work on any Studio edition (including the Express)

to install it, I supposed that you have at least the Framework2, so you will use something similar to:

    %SystemRoot%\Microsoft.NET\Framework\v2.0.50727\installutil /i C:\Test\MyWindowService\MyWindowService\bin\Release\MyWindowService.exe

In the example, I supposed that you are downloading some file from the site

You will need a reference to Windows.Form for the timer


Imports System.ServiceProcess

Imports System.Configuration.Install

Public Class WindowsService : Inherits ServiceBase

  Private Minute As Integer = 60000

  Private WithEvents Timer As New Timer With {.Interval = 30 * Minute, .Enabled = True}

  Public Sub New()

    Me.ServiceName = "MyService"

    Me.EventLog.Log = "Application"

    Me.CanHandlePowerEvent = True

    Me.CanHandleSessionChangeEvent = True

    Me.CanPauseAndContinue = True

    Me.CanShutdown = True

    Me.CanStop = True

  End Sub


  Private Sub Timer_Tick(ByVal sender As Object, ByVal e As System.EventArgs) Handles Timer.Tick

    If IO.File.Exists("C:\MyPath.Data") Then IO.File.Delete("C:\MyPath.Data")

    My.Computer.Network.DownloadFile("http://MyURL.com", "C:\MyPath.Data", "MyUserName", "MyPassword")

    'Do Something with the data downloaded

  End Sub

End Class

<Microsoft.VisualBasic.HideModuleName()> _

Module MainModule

  Public TheServiceName As String

  Public Sub main()

    Dim TheServiceApplication As New WindowsService

    TheServiceName = TheServiceApplication.ServiceName

    ServiceBase.Run(TheServiceApplication)

  End Sub

End Module

<System.ComponentModel.RunInstaller(True)> _

Public Class WindowsServiceInstaller : Inherits Installer

  Public Sub New()

    Dim serviceProcessInstaller As ServiceProcessInstaller = New ServiceProcessInstaller()

    Dim serviceInstaller As ServiceInstaller = New ServiceInstaller()

    serviceProcessInstaller.Account = ServiceAccount.LocalSystem

    serviceProcessInstaller.Username = Nothing

    serviceProcessInstaller.Password = Nothing

    serviceInstaller.DisplayName = "My Windows Service"

    serviceInstaller.StartType = ServiceStartMode.Automatic

    serviceInstaller.ServiceName = TheServiceName

    Me.Installers.Add(serviceProcessInstaller)

    Me.Installers.Add(serviceInstaller)

  End Sub

End Class



 Hello Andy,

Thanks for your post.

What do you want to scrape from the page? HttpWebRequest class ans WebClient class may be what you need. More information, please check:

The HttpWebRequest class provides support for the properties and methods defined in WebRequest and for additional properties and methods that enable the user to interact directly with servers using HTTP.

http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx

The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI

http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx

If you have any concenrs, please feel free to follow up.

Best regards



Hi Andy,

What about this problem on your side now? If you have any concerns, please feel free to follow up.

Have a nice day.

Best regards



Hi Andy,

When you come back, if you need further assistance about this issue, please feel free to let us know. We will continue to work with this issue.

Have a nice day.

Best regards



Thank you for the reply. Sorry it has taken me so long to respond.  I did not receive any notification that someone had replied!

I am using Visual Studio 2010 Ultimate Edition and the .NET framework 4.0.  Actually, I am upgrading some old code written in VB 6.0, but I can use the latest and greatest thats available.

The application uses a browser control to go to the page, fill in values, click on UI elements, read the HTML that returns, etc.  The purpose of the application is to collection useful information regularily/automatically.

I know how to create a web service, but using the web control in such a service is problematic because the web browser control was meant to be placed on a windows form.  I am not able to create a new instance of it in a project designated as a windows service.



Andy

Thank you for the reply. Sorry it has taken me so long to respond.  I did not receive any notification that someone had replied!

I thought a web request was for web services (retrieving information from them).  I am trying to retreive useful information from a website designed for interaction by a human, such as selecting items from lists and clicking buttons.   I currently use a web browser control to programmatically do what a person would do and get the pages back which in turn get parsed.

Andy



Hi Andy,

There is a tool which could let you manipulate anything you want on the website. This agile HTML parser builds a read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams). More information, please check:

http://htmlagilitypack.codeplex.com/

Have a nice day.

Best regards



Thanks for the suggestion.  I will go to that link and see if it will work.  I will update this post with what I find.

I am writing to check the status of the issue on your side. Would you mind letting us know the result of the suggestions? If you have any concerns, please feel free to follow up.

Have a nice day.



Best regards

Hi Liliane

Thank for the follow up reply.  I don't have an answer as of yet.  Implementing this is going to take time and I haven't been given the go-ahead by my boss to spend the time to pursue it.



Hi Andy,

Never minde. You could have a try when you feel free. If you have any further questions about this issue, please feel free to let us know. We will continue to work with you on this issue.

Have a nice day.

Best regards

Source: https://social.msdn.microsoft.com/Forums/vstudio/en-US/f5d565b1-236b-43c2-90c7-f5cc3b2c341b/scraping-a-website-from-a-windows-service

Saturday 25 April 2015

Social Media Crawling & Scraping services for Brand Monitoring

Crawling social media sites for extracting information is a fairly new concept – mainly due to the fact that most of the social media networking sites have cropped up in the last decade or so. But it’s equally (if not more) important to grab this ever-expanding User-Generated-Content (UGC) as this is the data that companies are interested in the most – such as product/service reviews, feedback, complaints, brand monitoring, brand analysis, competitor analysis, overall sentiment towards the brand, and so on.

Scraping social networking sites such as Twitter, Linkedin, Google Plus, Instagram etc. is not an easy task for in-house data acquisition departments of most companies as these sites have complex structures and also restrict the amount and frequency of the data that they let out to crawlers. This kind of a task is best left to an expert, such as PromptCloud’s Social Media Data Acquisition Service – which can take care of your end-to-end requirements and provide you with the desired data in a minimal turnaround time. Most of the popular social networking sites such as Twitter and Facebook let crawlers extract data only through their own API (Application Programming Interface), so as to control the amount of information about their users and their activities.

PromptCloud respects all these restrictions with respect to access to content and frequency of hitting their servers to make sure that user information is not compromised and their experience with the site is unhindered.

Social Media Scraping Experts

At PromptCloud, we have developed an expertise in crawling and scraping social media data in real-time. Such data can be from diverse sources such as – Twitter, Linkedin groups, blogs, news, reviews etc. Popular usage of this data is in brand monitoring, trend watching, sentiment/competitor analysis & customer service, among others.

Our low-latency component can extract data on the basis of specific keywords, categories, geographies, or a combination of these. We can also take care of complexities such as multiple languages as well as tweets and profiles of specific users (based on keywords or geographies). Sample XML data can be accessed through this link – demo.promptcloud.com.

Structured data is delivered via a single REST-based API and every time new content is published, the feed gets updated automatically. We also provide data in any other preferred formats (XML, CSV, XLS etc.).

If you have a social media data acquisition problem that you want to get solved, please do get in touch with us.

Source: https://www.promptcloud.com/social-media-networking-sites-crawling-service/

Saturday 18 April 2015

How to Generate Sales Leads Using Web Scraping Services

The first stage of any selling process is what is popularly known as “lead generation”. This phase is what most businesses place at the apex of their sales concerns. It is a driving force that governs decision-making at its highest levels, and influences business strategy and planning. If you are about to embark on an outbound sales campaign and are in the process of looking for leads, you would acknowledge the fact that lead generation process is of extreme importance for any business.

Different lead generation techniques have been used over and over again by companies around the world to satiate this growing business need. Newer, more innovative methods have also emerged to help marketers in this process. One such method of lead generation that is fast catching on, and is poised to play a big role for businesses in the coming years, is web scraping. With web scraping, you can easily get access to multiple relevant and highly customized leads – a perfect starting point for any marketing, promotional or sales campaign.

The prominence of Web Scraping in overall marketing strategy

At present, levels of competition have risen sky high for most businesses. For success, lead generation and gaining insight about customer behavior and preferences is an essential business requirement. Web scraping is the process of scraping or mining the internet for information. Different tools and techniques can be used to harvest information from multiple internet sources based on relevance, and the structured and organized in a way that makes sense to your business. Companies that provide web scraping services essentially use web scrapers to generate a targeted lead database that your company can then integrate into its marketing and sales strategies and plans.

The actual process of web scraping involves creating scraping scripts or algorithms which crawl the web for information based on certain preset parameters and options. The scraping process can be customized and tuned towards finding the kind of data that your business needs. The script can extract data from websites automatically, collate and put together a meaningful collection of leads for business development.

Lead Generation Basics

At a very high level, any person who has the resources and the intent to purchase your product or service qualifies as a lead. In the present scenario, you need to go far deeper than that. Marketers need to observe behavior patterns and purchasing trends to ensure that a particular person qualifies as a lead. If you have a group of people you are targeting, you need to decide who the viable leads will be, acquire their contact information and store it in a database for further action.

List buying used to be a popular way to get leads, but their efficacy has dwindled over time. Web scraping is the fast coming up as a feasible lead generation technique, allowing you to find highly focused and targeted leads in short amounts of time. All you need is a service provider that would carry out the data mining necessary for lead generation, and you end up with a list of actionable leads that you can try selling to.

How Web Scraping makes a substantial difference


With web scraping, you can extract valuable predictive information from websites. Web scraping facilitates high quality data collection and allows you to structure marketing and sales campaigns better. To drive sales and maximize revenue, you need strong, viable leads. To facilitate this, you need critical data which encompasses customer behavior, contact details, buying patterns and trends, willingness and ability to spend resources, and a myriad of other aspects critical to ascertain the potential of an entity as a rewarding lead. Data mining through web scraping can be a great way to get to these factors and identifying the leads that would make a difference for your business.

web-scraping-service

Crawling through many different web locales using different techniques, web scraping services pick up a wealth of information. This highly relevant and specialized information instantly provides your business with actionable leads. Furthermore, this exercise allows you to fine-tune your data management processes, make more accurate and reliable predictions and projections, arrive at more effective, strategic and marketing decisions and customize your workflow and business development to better suit the current market.

The Process and the Tools

Lead generation, being one of the most important processes for any business, can prove to be an expensive proposition if not handled strategically. Companies spend large amounts of their resources acquiring viable leads they can sell to. With web scraping, you can dramatically cut down the costs involved in lead generation and take your business forward with speed and efficiency. Here are some of the time-tested web scraping tools which can come in handy for lead generation –

•    Website download software – Used to copy entire websites to local storage. All website pages are downloaded and the hierarchy of navigation and internal links preserve. The stored pages can then be viewed and scoured for information at any later time.     Web scraper – Tools that crawl through bulk information on the internet, extracting specific, relevant data using a set of pre-defined parameters.

•    Data grabber – Sifts through websites and databases fast and extracts all the information, which can be sorted and classified later.

•    Text extractor – Can be used to scrape multiple websites or locations for acquiring text content from websites and web documents. It can mine data from a variety of text file formats and platforms.

With these tools, web scraping services scrape websites for lead generation and provide your business with a set of strong, actionable leads that can make a difference.

Covering all Bases

The strength of web scraping and web crawling lies in the fact that it covers all the necessary bases when it comes to lead generation. Data is harvested, structured, categorized and organized in such a way that businesses can easily use the data provided for their sales leads. As discussed earlier, cold and detached lists no longer provide you with enough actionable leads. You need to look at various factors and consider them during your lead generation efforts –

•    Contact details of the prospect

•    Purchasing power and purchasing history of the prospect

•    Past purchasing trends, willingness to purchase and history of buying preferences of the prospect

•    Social markers that are indicative of behavioral patterns

•    Commercial and business markers that are indicative of behavioral patterns

•    Transactional details

•    Other factors including age, gender, demography, social circles, language and interests

All these factors need to be taken into account and considered in detail if you have to ensure whether a lead is viable and actionable, or not. With web scraping you can get enough data about every single prospect, connect all the data collected with the help of onboarding, and ascertain with conviction whether a particular prospect will be viable for your business.

Let us take a look at how web scraping addresses these different factors –

1. Scraping website’s


During the scraping process, all websites where a particular prospect has some participation are crawled for data. Seemingly disjointed data can be made into a sensible unit by the use of onboarding- linking user activities with their online entities with the help of user IDs. Documents can be scanned for participation. E-commerce portals can be scanned to find comments and ratings a prospect might have delivered to certain products. Service providers’ websites can be scraped to find if the prospect has given a testimonial to any particular service. All these details can then be accumulated into a meaningful data collection that is indicative of the purchasing power and intent of the prospect, along with important data about buying preferences and tastes.

2. Social scraping

According to a study, most internet users spend upwards of two hours every day on social networks. Therefore, scraping social networks is a great way to explore prospects in detail. Initially, you can get important identification markers like names, addresses, contact numbers and email addresses. Further, social networks can also supply information about age, gender, demography and language choices. From this basic starting point, further details can be added by scraping social activity over long periods of time and looking for activities which indicate purchasing preferences, trends and interests. This exercise provides highly relevant and targeted information about prospects can be constructively used while designing sales campaigns.

Check out How to use Twitter data for your business

3. Transaction scraping

Through the scraping of transactions, you get a clear idea about the purchasing power of prospects. If you are looking for certain income groups or leads that invest in certain market sectors or during certain specific periods of time, transaction scraping is the best way to harvest meaningful information. This also helps you with competition analysis and provides you with pointers to fine-tune your marketing and sales strategies.

get-results-from-your-lead-generation-campaign

Using these varied lead generation techniques and finding the right balance and combination is key to securing the right leads for your business. Overall, signing up for web scraping services can be a make or break factor for your business going forward. With a steady supply of valuable leads, you can supercharge your sales, maximize returns and craft the perfect marketing maneuvers to take your business to an altogether new dimension.

Source: https://www.promptcloud.com/blog/how-to-generate-sales-leads-using-web-scraping-services/

Tuesday 7 April 2015

How to Build Data Warehouses using Web Scraping

Businesses all over the world are facing an avalanche of information which needs to be collated, organized, analyzed and utilized in an appropriate fashion. Moreover, with each increasing year there is a perceived shortening of the turnaround time for businesses to take decisions based on information they have assimilated. Data Extractors, therefore, have evolved with a more significant role in modern day businesses than just mere collectors or scrapers of unstructured data. They cleanse structure and store contextual data in veritable warehouses, so as to make it available for transformation into useable information as and when the business requires. Data warehouses, therefore, are the curators of information which businesses seek to treasure and to use.

Understanding Data Warehouses

 Traditionally, Data Warehouses have been premised on the concept of getting easy access to readily available data. Modern day usage has helped it to evolve as a rich repository to store current and historical data that can be used to conduct data analysis and generate reports. As it also stores historical data, Data Warehouses are used to generate trending reports to help businesses foresee their prospects. In other words, data warehouses are the modern day crystal balls which businesses zealously pore over to foretell their future in the Industry.

Scraping Web Data for Creating Warehouses

The Web, as we know it, is a rich repository of a whole host of information. However, it is not always easy to access this information for the benefit of our businesses through manual processes. The data extractor tools, therefore, have been built to quickly and easily, scrape, cleanse and structure and store it in Data Warehouses so as to be readily available in a useable format.

Web Scraping tools are variously designed to help both programmers as well as non-programmers to retain their comfort zone while collecting data to create the data warehouses. There are several tools with point and click interfaces that ease out the process considerably. You can simply define the type of data you want and the tool will take care of the rest. Also, most tools such as these are able to store the data in the cloud and therefore do not need to maintain costly hardware or whole teams of developers to manage the repository.

Moreover, as most tools use a browser rendering technology, it helps to simulate the web viewing experience of humans thereby easing the usability aspect among business users facilitating the data extraction and storage process further.

Conclusion

The internet as we know it is stocked with valuable data most of which are not always easy to access. Web Data extraction tools have therefore gained popularity among businesses as they browse, search, navigate simulating your experience of web browsing and finally extract data fields specific to your industry and appropriate to your needs. These are stored in repositories for analysis and generation of reports. Thus evolves the need and utility of Data warehouses. As the process of data collection and organization from unstructured to structured form is automated, there is an assurance of accuracy built into the process which enhances the value and credibility of data warehouses. Web Data scraping is no doubt the value enhancers for Data warehouses in the current scenario.

Source: http://scraping-solutions.blogspot.in/2014/09/how-to-build-data-warehouses-using-web.html

Monday 30 March 2015

Collect Targeted Data from Web Using Data Extractor Tools

The use of data to enhance your business prospects is a widely acknowledged fact. It is therefore very important that you have access to relevant data and not just any data in order to further your growth prospects. Utilizing the features and benefits of Web Scraper tools can help you achieve this goal effortlessly.

Customizing Web Extraction Tools for Your Business


The Internet is a maze of information repositories and identifying the right information from the right source may pose to be a major challenge. Moreover, data incorrectly sourced may result in erroneous analysis leading to a faulty strategy and slow growth for your business.  The risk is, however, considerably mitigated by employing Web extractor tools in your business processes and leveraging the advantages they provide.

Web extraction tools are used for the singular task of extracting relevant unstructured data from specific web sites and providing business users with a set of structured useable data. They perform this vital task with the help of scripting languages like python, Ruby, or Java. The biggest advantage of utilizing Web extraction tools is its ability to be customized as per the business requirement. This is easily achieved by defining the specific seed list you wish to scrape in the crawler script. A Seed list is the series of URLs that you wish to scan in order to extract the relevant data.  Thus defined, the crawler will scan only the targeted URLs. Along with the Seed list you can also specify the following relevant information to customize the scraper tool and ensure that it delivers as per your requirement. These defining parameters include:

  •     Define the number of pages you wish the scraper to crawl
  •     Define the specific file types you want the scraper to crawl
  •     Define the type of data you would like to extract

This ensures that you can launch a focused search for the specific type of data that you wish to extract and also defines the appropriate source you want the crawler to access.

Benefits of using Targeted Data

Every business pertains to a specific domain. Its growth prospects, its revenue and its present standing are all defined by the demands and dynamics of that domain. Therefore, undertaking a study of its individual domain is one of the chief pre-requisites that your business must concentrate its efforts on in order to accelerate its growth. Moreover, through your business, you need to conduct a detailed analysis of competitive data in order to remain contextual in your specialized domain. Web Extractor tools have been equipped to understand this need and scrape pertinent data to foster growth patterns that strike the right chords. Some of the benefits leveraged from the extraction of targeted data include:
  •     Updated financial information from competitor sites on stock prices and product prices helps you to estimate and launch competitive rates for your stocks and products
  •     Studying market trends for a competitor’s products help you to position your product and plan your promotional campaigns effectively
  •     Studying analytics of competitor websites will ensure that you are able to plan your web promotions in a far more effective way
  •     Extracting data from blogs and websites that cater to your personal interests and hobby areas help you to build up your own knowledge repository which you can leverage to achieve benefits for your business as and when required.

We are leading Webdatascraping.us company and enough capable to extract website information, review scraping, contact information scraping, business directory scraping, email list scraping etc.

Friday 27 March 2015

Web Data Extraction- The most convenient and easy way to extract data from the internet

Web data extraction is the most proficient technique that will help you find the pertaining data for your existing business or any personal use. Many times, we find that experts’ copy and paste information manually from web pages or download the entire website which is a waste of time and effort.

Now with the new technique of Web data extraction you can crawl through loads and loads of web pages in order to extract particular data and at the same time save this data in the following manner
  •     CSV FILE
  •     XML FILE or
  •     Any other custom format for future use.

Below given are some instances of Web data extraction processes:
  •     Take a government portal, extracting names of citizens for a survey
  •     Search for competitor websites for product pricing and feature information
  •     Utilize web scraping to download images from a stock photography site for website design

How can Web Data Extraction serve you?

 You can extract data from any kind of websites like


Extract Data from any kind of Websites: Directories, Classified Websites, News, Websites, Blogs, Articles, and Job Portals, Search Engines, eCommerce Websites, Social Media Websites and any kind of websites whose content can be accessible. Extract Emails, Contacts, Price/Rate, Features, Contact Names, Contact Details, Full Text, Live updates, ASINs, Meta Tags, Address, Phone, Fax, Latitude & Longitude, Images, Links, Reviews, Ratings, etc. Help in Data Collection, Competitor Analysis, Research, Business Intelligence, Social Media Trend analysis, Brand Monitoring, Lead Data Collection, Website & Competitor Web Monitoring, etc. Deliver Data in any Database, Excel, CSV, Access, Text, My SQL, SQL, Oracle, etc. and in any format Custom Services of Web Data Extraction as per client need one time Data Delivery or Continued/Scheduled Data Delivery

The next is Website Data Scraping:


 Web site Data Scraping is the process of extracting data from a website by using a particular software program available from proven website only.

This extracted data can be utilized by any person and for any purposes as per their needs and wants; data extracted can be used in different industries. There are many companies providing best Website data scraping services.

It is one such field which has active developments and also shares a common objective that needs a breakthrough in the following:

    Text Processing
    Semantic Understanding
    Artificial Intelligence
    Human Computer Interactions


There are many users or end users, companies and experts that need information or data that is accessible in some or the other format. In such cases Web Data Extraction can tailor the need of extracting data from any proven source and preserve the data on a particular destination.

The source platform contains:

  •     Excel
  •     CSV
  •     MySQL and
  •     Others

Moreover, the technique of Web data extractor can also extract information from various websites like Google, Amazon, LinkedIn, EBay and many others.

It can also extract data from eCommerce shopping websites, or other social networking websites, any public websites, classifieds websites, job portal websites and any other search engine websites.

Websitedatascraping.com is enough capable to web data scraping, website data scraping, web scraping services, website scraping services, data scraping services, product information scraping and yellowpages data scraping.

Tuesday 24 March 2015

Diamond Mines And Mining Techniques

Diamonds remain to this day a mystical gem with a somewhat checkered past. The story of the Hope diamond is based on a long legend of misfortunes supposedly befalling its various and colorful owners. Segregation and mistreatment of blacks in diamond mines in Africa has long been a terrible mark on humanity in that part of the world. Although early white miners were treated better the working conditions endured by all diamond miners were less than humane. Fortunately most of today's mines while still depending on human labor have most of the heavy work done with machinery.

Contrary to popular belief diamonds are mined in areas other than Africa. Some of the better known diamond mines are these.

* Argye one of the Rio Tinto company mines is located in Western Australia.

* Diavik another Rio Tinto mine is located in Canada.

* Ekati owned by BHP Billiton and located in the Northwest Territories of Canada.

* Baken owned by Trans Hexis is located in South Africa.

* Merlin owned by Striker Resources is located in Australia.

* Orapa owned by a partnership between DeBeers and the government of Botswana is in Botswana.

* Premier owned by the De Beers Company is located in South Africa.

Diamond ore is extracted from these mines using basically four mining techniques which are based on the type of geology in which the diamond bearing material is located.

Marine Mining is the most recent development introduced about 1990. This technique is similar to deep water oil drilling. A large shaft is bored into the seabed to a depth where diamond bearing soil is located and that material is sucked to the surface. Also underwater vehicles called "crawlers" move along the seabed to scoop up diamond bearing gravel and pump it to the surface.

Placer Diamond Mining is a technique that is seen many times in movies of the old West. The diamonds are buried in river banks or mountain sides and water canons are used to wash the material down to be processed.

Hard Rock Diamond Mining is again familiar to movie fans as they watched coal miners or gold and silver miners digging their way into deep underground tunnels. Of course the technique is modern now and makes use of many specialized machines to do the heavy work.

Open-Pit diamond mining is similar to the pit coal mines of West Virginia and some western states. Overburden which is the soil covering the diamond embedded material is moved by machinery and blasting. The diamond bearing material is then moved to processing plants. This technique is common when the diamond bearing material is found close to the surface of if the geology is so unstable that tunneling is not safe or practical.

From this short article it should be apparent that diamonds take long and interesting journeys before they find a cherished spot on your finger or ear lobes.

Source: http://ezinearticles.com/?Diamond-Mines-And-Mining-Techniques&id=4800018

Monday 16 March 2015

Why Outsourcing Data Mining Services is the Leading Business Trend

Businesses usually have huge volumes of raw data that remains unprocessed. Processing data results into information. A company’s hunt for valuable information ends when it outsources its data mining process to reputable and professional data mining companies. In this way a company is able to derive more clarity and accuracy in the decision making process.

It is important too note that information is critical to the growth of a business. With the internet you are offered flexible communication and good flow of data. It is a good idea to make the data that is available readily and in a workable format where it will be useful to a business. The filtered data is deemed important to the organization and the services can be used to increase profits, ameliorating overall risks and smooth work flow.

Data mining process must engage the sorting data process through the vast data amounts of data and acquire pertinent information. Data mining is usually undertaken by professional, financial and business analysts. Nowadays, there are many growing fields that require data extraction services.

When making decisions data mining plays an important role as it enables experts to make decisions quick and in a feasible manner. The information that is processed finds wide applications for decision making that relate to e-commerce, direct marketing, health care, telecommunications, customer relationship management, financial utilities and services.

The following are the data mining services that are commonly outsourced to the professional data mining companies:

•    Data congregation. This is the process of extracting data from different websites and web pages. The common processes involved here include web scraping and screen scraping services. The data congregated is then in put into databases.

•    Collecting of contact data. This is the process of searching and collecting of information concerning contacts from different websites.

•    E-commerce data. This is data about various online stores. The information collected includes the various products and prices offered. Other information that is collected is about discounts.

•    Competitors. Information about your business competitors is quite important as it helps a business to gauge itself against other businesses. In this way a company can use this information to re-design its marketing strategies and develop its own pricing matrix.

In this era where business is hugely impacted by globalization, handling data is becoming a headache. This is where outsourcing becomes quite profitable and important to your business. Huge savings in terms money, time and infrastructure can be realized when data mining projects are customized to suit exact needs of a customer.

There are many benefits accrued when outsourcing data mining services to professional companies. The following are some of benefits that are accrued from the outsourcing process:

•    Qualified and skilled technical staff. Data mining companies employ highly competent staffs who have a successful career in IT industry and data mining. With such personnel you are assured of quality information extracted from databases and websites.

•    Improved technology. These companies have invested huge resources in terms of software and technology so as to handle the information and data in a technological way.

•    Quick turnaround time. Your data is processed in an efficient way and information presented in a timely way. These companies are able to present data in a timely manner even in tight deadlines.

•    Cost-effective prices. Nowadays there are many companies dealing with web scraping and data mining. Due to competition, these companies offer quality services at competitive prices.

•    Data safety. Data is quite critical and should not leak to your competitors. These companies are using the latest technology in ensuring that your data is not stolen by other vendors.

•    Increased market coverage. These companies serve many businesses and organizations with different data needs. By outsourcing to them you are assured of expertise dealing with your data have wide market coverage.

Outsourcing enables a company to shift its focus to the core business operations and improve its overall productivity. In fact outsourcing can be considered as a wise choice for any business. Therefore outsourcing helps businesses in managing data effectively. In this way you will be able to achieve and generate more profits. When outsourcing, it is advisable that you only consider professional companies only so as to be assured of high quality services.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/216-why-outsourcing-data-mining-services-is-the-leading-business-trend/