Thursday 30 May 2013

Scrape images from another website with php

Scraping data from another website has become very popular in recent years. However, you should make sure that you have permission from the website that you want to scrape before you do this. With that said, here’s an example of how to scrape images from a website – how you choose to process the information is up to you:
<?php
$website_url = 'http://www.somewebsite.com';

// fetch the page HTML with cURL
$curl = curl_init($website_url);
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_TIMEOUT, 2);
$html = curl_exec($curl);
curl_close($curl);

// parse the HTML; the @ suppresses warnings from malformed markup
$dom = new DOMDocument;
@$dom->loadHTML($html);

$images = $dom->getElementsByTagName('img');

$web_src_arr = array();

foreach ($images as $image) {
    $raw_img_url = $image->getAttribute('src');

    // absolute URLs can be stored as-is; relative ones need the site prefix
    if (strpos($raw_img_url, 'http://') === 0 || strpos($raw_img_url, 'https://') === 0) {
        $web_src_arr[] = $raw_img_url;
    } else {
        $web_src_arr[] = $website_url . '/' . ltrim($raw_img_url, '/');
    }
}

// you can write a function to process the data however you wish
// here's an example of calling a function that would save the images
// save_images($web_src_arr, $dest, $minWidth, $minHeight);
?>
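
The post leaves save_images() up to you. Purely as an illustration, here is a minimal sketch of what such a helper might look like - the function body (downloading with file_get_contents and size-checking with getimagesize) is my own assumption, not code from the original post:

<?php
// hypothetical helper matching the call sketched above (an assumption,
// not part of the original post)
function save_images($urls, $dest, $minWidth, $minHeight)
{
    foreach ($urls as $url) {
        $data = @file_get_contents($url);
        if ($data === false) { continue; } // skip unreachable images

        $path = $dest . '/' . basename(parse_url($url, PHP_URL_PATH));
        file_put_contents($path, $data);

        // getimagesize returns array(width, height, ...) or false
        $info = @getimagesize($path);
        if ($info === false || $info[0] < $minWidth || $info[1] < $minHeight) {
            unlink($path); // discard thumbnails, banners and anything too small
        }
    }
}

// example call: save_images($web_src_arr, '/tmp/images', 100, 100);
?>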


Source: http://www.joeydigital.net/source_code/php/scrape-images-from-another-website-with-php/

Monday 27 May 2013

Easy web scraping with PHP

Web scraping is a technique of web development where you load a web page and "scrape" the data off the page to be used elsewhere. It's not pretty, but sometimes scraping is the only way to access data or content from a web site that doesn't provide RSS or an open API.

I'm not going to discuss the legal aspects of scraping, as it may be considered copyright infringement in some situations. However, there are also perfectly legal reasons to need to scrape, like if you have permission.

To make things really easy, we're going to let the power of regular expressions do all the work for us. If you're not familiar with regular expressions, you may want to google for a tutorial. Here is the documentation for PHP regular expression syntax.

First, we start off by loading the HTML using file_get_contents. Next, we use preg_match_all with a regular expression to turn the data on the page into a PHP array.

This example will demonstrate scraping this web site's blog page to extract the most recent blog posts. This is just for demo purposes - of course, the RSS feed is much better suited for this.

// get the HTML
$html = file_get_contents("http://www.thefutureoftheweb.com/blog/");

Here is what the HTML looks like for the blog posts:

<ul id="main">
    <li>
        <h1><a href="[link]">[title]</a></h1>
        <span class="date">[date]</span>
        <div class="section">
            [content]
        </div>
    </li>
</ul>

So we will use a regular expression that looks for all the li elements and capture the content using parentheses at the appropriate places (link, title, date & content).

preg_match_all(
    '/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s',
    $html,
    $posts, // will contain the blog posts
    PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
    $link = $post[1];
    $title = $post[2];
    $date = $post[3];
    $content = $post[4];

    // do something with data
}

There's a lot going on inside that regular expression, but there are really only a few "tricks" that are used. Anytime I want to say "skip over whatever is between" I use .*?. And any time I want to say "match whatever is in here" I use (.*?). And lastly, the s at the end tells PHP to allow the dot . to match newlines. That's about all there is to it.
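
If the s modifier is new to you, this standalone check (my own example, not from the original post) shows the difference it makes:

<?php
$html = "<li>\nfirst post\n</li>";

// without s, the dot refuses to match the newlines inside the <li>
var_dump(preg_match('/<li>.*?<\/li>/', $html));  // int(0)

// with s, the dot matches newlines too, so the pattern succeeds
var_dump(preg_match('/<li>.*?<\/li>/s', $html)); // int(1)
?>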

The regular expression will only match blog posts, because they are the only <li> elements that contain an <h1>, <span class="date"> and <div class="section">.

Web scraping is highly unreliable - if the HTML structure were to change this code would break instantly. However, it's often quite easy to write this code, and usually produces a perfectly usable hack solution.



Source: http://www.thefutureoftheweb.com/blog/web-scrape-with-php-tutorial

Friday 24 May 2013

Web Scraping Evolved: APIs for Turning Webpage Content into Valuable Data

This guest post comes from Marc Mezzacca, founder of NextGen Shopping, a company dedicated to creating innovative shopping mashups. Marc’s latest venture is a social-media based coupon code website called CouponFollow that utilizes the Twitter API.

While adoption of semantic standards is increasing, the majority of the web is still home to mostly unstructured data. Search engines like Google remain focused on looking at HTML markup for clues to create richer results for their users. The creation of schema.org and similar movements has made it easier to draw valuable content from webpages.

But even with semantic standards in place, structured data requires a parser to extract information and convert it into a data-interchange format such as JSON or XML. Many libraries exist for this in several popular programming languages. But be warned: most of these parser libraries are far from polished. Most are community-produced, and thus may not be complete or up to date, as the standards are ever-changing. On the flip side, website owners who don't fully follow semantic rules can break the parser. And of course there are sites which contain no structured data formatting at all. This inconsistency causes problems for intelligent data harvesting, and can be a roadblock for new business ideas and startups.


Several companies are offering API services to make sense of this unstructured data, helping to remove these roadblocks. For example, AlchemyAPI offers a suite of data extraction APIs, including a Structured Content Scraping API, which enables structured data to be extracted based on both visual and structural traits. Another company, DiffBot, is also taking care of the “dirty work” in the cloud, allowing entrepreneurs and developers to focus on their business instead of the semantics involved in parsing. DiffBot stands out because of its unique approach: instead of looking at the data as a computer would, it looks at it visually, like a human. It first classifies the type of webpage (e.g. article, blog post, product) and then extracts what visually appears to be relevant data for that page type (article title, most relevant image, etc.).

Currently their website lists APIs for Page Classification (check out their infographic), as well as for parsing Article-type webpages. Much of the web, including discussion boards, events, and e-commerce data, remains as potential future API offerings, and it will be interesting to see which they go after next.

You can test drive the Article API on their website and see the extraction results instantly.

Source: http://blog.programmableweb.com/2012/09/13/web-scraping-evolved-apis-for-turning-webpage-content-into-valuable-data/

Friday 17 May 2013

Download all images from a website easily

Is there any way to download all images from a website automatically, without having to click through all the pages by hand? Yes there is! Extreme Picture Finder is the answer. Simply enter the website address, select the folder on your local hard disk where the downloaded images should be saved and click Start! And that's all. Now you can switch to other tasks while Extreme Picture Finder works in the background extracting, downloading and saving all those images.

The example below shows you how easy it is to download all images from a website automatically with Extreme Picture Finder and how to avoid downloading small images (like thumbnails or banners).

To make Extreme Picture Finder download images from a website you have to create a project. Simply use the menu command Project - New project... or click the Create a new project to download all images from a website button on the program toolbar and you will see the New Project Wizard window.

Now in the Starting address (URL) field type the address of a website. If this site is password-protected, then check the This site is password protected box and enter a valid username and password.

Basically, this is it. The default project settings are set to download all images from all pages of the site, so you can now click the Finish button and watch images flow to your hard disk. By the way, you can view the downloaded images while the rest of them are still being downloaded - there is a built-in image viewer with thumbnails and slideshow in Extreme Picture Finder.

How to download only big or full-size images from a website

By default Extreme Picture Finder will download all images from a website - big and small. But in most cases you need only big or full-size images. You do not want thumbnails, banners or parts of the website design. So instead of clicking the Finish button after entering the website address, click the Next button several times to reach the last step of the New Project Wizard.

Now check the Show advanced project properties box and then click the Finish button. You will see the Project properties window where all project details can be modified.

In the Project properties select the Limits - File size section. This section allows you to set the minimum and maximum file size of the images. So check the Do not download small target files, less than box and enter 25 in the corresponding edit field. You can also prevent the download of huge images by specifying the maximum file size.

Now click the OK button and Extreme Picture Finder will start downloading only big images. You can easily make those settings the default for all projects by clicking the Make these properties default for all projects... button.

Source: http://www.exisoftware.com/news/download-all-images-from-a-website.html

Monday 6 May 2013

PHP Web Page Scraping Tutorial

Web Scraping, also known as Web Harvesting and/or Web Data Extraction, is the process of extracting data from a given web site or web page. In this tutorial I will go over a way for you to extract the title of a page, as well as the meta keywords, meta description, and links. With some basic knowledge of PHP and regular expressions you can accomplish this process with ease.

First let's go over the regular expression meta characters we will be using in this tutorial.

(.*)

The dot (.) stands for any character, while the asterisk (*) stands for 0 or more occurrences of the character before it. When both are combined as (.*), you are letting the system know that you are looking for any set of characters with a length of 0 or more.
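
As a quick standalone illustration (my own example, not part of the tutorial files), here is (.*) capturing whatever sits between two fixed pieces of text:

<?php
// (.*) matches any run of characters; the parentheses capture it
preg_match('/Hello, (.*)!/', 'Hello, World!', $match);
echo $match[1]; // prints: World
?>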

As for our PHP, we will be using 3 functions in order to extract our data. The first is file_get_contents(), which fetches the desired page and returns all of its HTML content as a string. The second is preg_match(), which returns a single result when given a regular expression. The final function is preg_match_all(), which works the same as preg_match() except that it returns every match rather than just the first.
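
A short side-by-side comparison (my own example, not from the tutorial) makes that difference concrete:

<?php
$text = 'a1 b22 c333';

// preg_match stops after the first match...
preg_match('/\d+/', $text, $one);
print_r($one); // Array ( [0] => 1 )

// ...while preg_match_all collects every match
preg_match_all('/\d+/', $text, $all);
print_r($all); // Array ( [0] => Array ( [0] => 1 [1] => 22 [2] => 333 ) )
?>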

For this tutorial I have included 1 HTML page that contains our Title Tag, Meta Description, Meta Keywords and Some Links. We will be using that file for our scraping purposes.

Lets start by setting up our variable that will contain our string of html from the external file.

<?php
$file_string = file_get_contents('page_to_scrape.html');
?>

What we did above is simply get all of the contents from our file page_to_scrape.html and store them in a string. Now that we have our string, we can proceed to the next portion of the extraction.

* Hint: You can replace page_to_scrape.html with any page or link you may want to scrape. Some sites may have terms against scraping, so be sure to read the terms of use before you decide to scrape a site.

Let's start by extracting the text within our <title></title> tags. In order to accomplish this we need to use our preg_match() function. Given 3 parameters, the preg_match() function will fill an array with our result. The first parameter is our regular expression, the second is the variable containing the HTML content, and the third is the output array which will contain our results.

<?php
$file_string = file_get_contents('page_to_scrape.html');
preg_match('/<title>(.*)<\/title>/i', $file_string, $title);
$title_out = $title[1];
?>

Let me explain what I did in the above code. First, we know that we want the text from within the title tags <title></title>, so we insert (.*) between the title tags to capture whatever characters appear inside them. When using a regular expression in the preg_match() function we need to enclose the pattern within two delimiters, here forward slashes; you could use other characters such as {} instead. I append a lowercase i to the end to make the search case-insensitive. We also need to escape the forward slash in the closing title tag so that PHP does not treat it as the end of the pattern. For the second parameter I passed our variable $file_string, which we defined earlier to contain the HTML content. Lastly we pass the third parameter, which will receive an array of results. Now that we have that array, I assigned the element we want to the variable $title_out for later use.
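
For instance, with curly braces as the delimiter (a variation the tutorial mentions but does not demonstrate), the slash in the closing tag no longer needs escaping:

preg_match('{<title>(.*)</title>}i', $file_string, $title);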

Next we need to get the Meta Description and Meta Keywords. We will just do the same as what we did above and just change the HTML and output names as follows.

preg_match('/<meta name="keywords" content="(.*)" \/>/i', $file_string, $keywords);
$keywords_out = $keywords[1];
preg_match('/<meta name="description" content="(.*)" \/>/i', $file_string, $description);
$description_out = $description[1];

Finally we need to retrieve our list of links on the page. In my sample HTML document I have my links enclosed within <li></li> tags. I will use this in conjunction with the <a></a> tags to extract my data. For this we will need to use our preg_match_all() function so that we can get back more than one result. We pass through our parameters just as we did with the preg_match() function.

preg_match_all('/<li><a href="(.*)">(.*)<\/a><\/li>/i', $file_string, $links);

With the above code we now have an array assigned to $links with all of the results. Notice that I used the meta characters (.*) more than once this time. The reason is that the data will not always be the same, so we need to let the script know that any set of characters may appear in its place. Our $links array will contain the data within the href="" attribute as well as the text between the <a></a> tags.
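
Since the tutorial does not show the array layout, here is a sketch of how preg_match_all's default PREG_PATTERN_ORDER groups the results:

// $links[0] holds the full <li>...</li> matches
// $links[1] holds the captured href values
// $links[2] holds the text captured between <a> and </a>
echo $links[1][0]; // first link URL
echo $links[2][0]; // first link text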

Now that we have all of the data we want to collect, we can simply print it out as follows:

<p><strong>Title:</strong> <?php echo $title_out; ?></p>
<p><strong>Keywords:</strong> <?php echo $keywords_out; ?></p>
<p><strong>Description:</strong> <?php echo $description_out; ?></p>
<p><strong>Links:</strong> <em>(Name - Link)</em><br />
<?php
echo '<ol>';
for ($i = 0; $i < count($links[1]); $i++) {
    echo '<li>' . $links[2][$i] . ' - ' . $links[1][$i] . '</li>';
}
echo '</ol>';
?>
</p>

Attached are the files used in this tutorial. Let me know if you have any questions below.

Source: http://www.devblog.co/php-web-page-scraping-tutorial/