The Google web crawler will enter your domain and scan every page of your website, extracting page titles, descriptions, keywords, and links – then report back to Google HQ and add the information to their huge database.
Today, I’d like to teach you how to make your own basic crawler – not one that scans the whole Internet, though, but one that is able to extract all the links from a given webpage.
Generally, you should make sure you have permission before scraping random websites, as most people consider it to be a very grey legal area. Still, as I say, the web wouldn’t function without these kinds of crawlers, so it’s important you understand how they work and how easy they are to make.
To make a simple crawler, we’ll be using the most common programming language of the internet – PHP. Don’t worry if you’ve never programmed in PHP – I’ll be taking you through each step and explaining what each part does. I am going to assume an absolute basic knowledge of HTML though, enough that you understand how a link or image is added to an HTML document.
Before we start, you will need a server to run PHP. You have a number of options here:
- If you host your own blog using WordPress, you already have one, so upload the files you write via FTP and run them from there. Matt showed us some free FTP clients for Windows you could use.
- If you don’t have a web server but do have an old PC sitting around, then you could follow Dave’s tutorial here to turn an old PC into a web server.
- Just one computer? Don’t worry – Jeffry showed us how we can run a local server inside of Windows or Mac.
Getting Started
We’ll be using a helper class called Simple HTML DOM. Download this zip file, unzip it, and upload the simple_html_dom.php file contained within to your website first (in the same directory you’ll be running your programs from). It contains functions we will be using to traverse the elements of a webpage more easily. That zip file also contains today’s example code.
First, let’s write a simple program that will check if PHP is working or not. We’ll also import the helper file we’ll be using later. Make a new file in your web directory, and call it example1.php – the actual name isn’t important, but the .php ending is. Copy and paste this code into it:
<?php
include_once('simple_html_dom.php');
phpinfo();
?>
Access the file through your internet browser. If everything has gone right, you should see a big page of debug and server information printed out – all from one little line of code! It’s not really what we’re after, but at least we know everything is working.
The first and last lines simply tell the server we are going to be using PHP code. This is important because we can actually include standard HTML on the page too, and it will render just fine. The second line pulls in the Simple HTML DOM helper we will be using. The phpinfo(); line is the one that printed out all that debug info, but you can go ahead and delete that now. Notice that in PHP, any commands we have must be finished with a semicolon (;). The most common mistake of any PHP beginner is to forget that little bit of punctuation.
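To see the HTML-and-PHP mixing and the semicolon rule in action, here’s a quick illustrative snippet (the page text and variable name are made up for this example):

```php
<?php
// Each PHP statement must end with a semicolon (;).
$site = "tokyobit.com";                              // a simple string variable
echo "<p>Today we are crawling " . $site . "</p>";   // the dot (.) joins strings together
?>
```

Save it as a .php file, open it in your browser, and you’ll see an ordinary HTML paragraph – the server runs the PHP and sends only the result to the browser.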
One typical task that Google performs is to pull all the links from a page and see which sites they are endorsing. Try the following code next, in a new file if you like.
<?php
include_once('simple_html_dom.php');
$target_url = "http://www.tokyobit.com/";
$html = new simple_html_dom();
$html->load_file($target_url);
foreach($html->find('a') as $link){
	echo $link->href . "<br />";
}
?>
You should get a page full of URLs! Wonderful. Most of them will be internal links, of course. In a real world situation, Google would ignore internal links and simply look at what other websites you’re linking to, but that’s outside the scope of this tutorial.
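If you’re curious how that internal-versus-external distinction might work, here’s a rough sketch using PHP’s built-in parse_url() function. The list of links is hardcoded and made up purely for illustration – in practice you’d feed in the links your crawler found:

```php
<?php
// Made-up example links for illustration only.
$target_host = parse_url("http://www.tokyobit.com/", PHP_URL_HOST);

$links = array(
	"http://www.tokyobit.com/about",  // same host: internal
	"http://www.makeuseof.com/",      // different host: external
	"/contact",                       // relative link: internal by definition
);

$external = array();
foreach ($links as $link) {
	$host = parse_url($link, PHP_URL_HOST);
	// No host means a relative (internal) link; a different host means external.
	if ($host !== null && $host !== $target_host) {
		$external[] = $link;
	}
}

print_r($external); // only the makeuseof.com link remains
?>
```

This simple host comparison misses edge cases (subdomains, protocol-relative URLs), but it shows the basic idea.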
If you’re running on your own server, go ahead and change the $target_url variable to your own webpage or any other website you’d like to examine.
That code was quite a jump from the last example, so let’s go through in pseudo-code to make sure you understand what’s going on.
Include once the simple HTML DOM helper file.
Set the target URL as http://www.tokyobit.com.
Create a new simple HTML DOM object to store the target page
Load our target URL into that object
For each link <a> that we find on the target page
– Print out the HREF attribute
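Incidentally, if you’d like to experiment before downloading the helper file, PHP’s built-in DOMDocument class can do the same job. This sketch parses a hardcoded sample page (made up for this example), so it runs without touching the network:

```php
<?php
// A small hardcoded page, so no network access is needed.
$page = '<html><body>
<a href="http://www.tokyobit.com/">Home</a>
<a href="/archive">Archive</a>
</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($page);

// Collect the href attribute of every <a> element, in document order.
$hrefs = array();
foreach ($dom->getElementsByTagName('a') as $link) {
	$hrefs[] = $link->getAttribute('href');
}

foreach ($hrefs as $href) {
	echo $href . "<br />\n";
}
?>
```

Simple HTML DOM’s find('a') and DOMDocument’s getElementsByTagName('a') are doing the same traversal – the helper library just wraps it in a friendlier syntax.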
That’s it for today, but if you’d like a bit of a challenge – try to modify the second example so that instead of searching for links (<a> elements), it grabs images instead (<img>). Remember, the src attribute of an image specifies the URL for that image, not href.
Would you like to learn more? Let me know in the comments if you’re interested in reading a part 2 (complete with homework solution!), or even if you’d like a back-to-basics PHP tutorial – and I’ll rustle one up next time for you. I warn you though – once you get started with programming in PHP, you’ll start making plans to create the next Facebook, and all those latent desires for world domination will soon consume you. Programming is fun.