Technical Blog

How to spider an HTML page with jQuery

There are many cases where you want to parse HTML pages from your JavaScript code to extract data. jQuery provides functions that help with this.

Getting the page content

We use the jQuery.get method, which behaves like an asynchronous version of PHP's file_get_contents function.
It works this way:

$.get(url, function (data) {
  // data is the content of the URL.
});

Same Origin Policy

Because of the Same Origin Policy, calling $.get with a URL on a different origin than the page making the request fails with the following error:

XMLHttpRequest cannot load http://www.some_page.com. Origin null is not allowed by Access-Control-Allow-Origin
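
To make the notion of "origin" concrete, here is a minimal sketch of the comparison the browser applies: two URLs share an origin only when their scheme, host, and port all match. The helper name isSameOrigin is hypothetical, and the sketch assumes the standard URL constructor is available.

```javascript
// Hypothetical helper: true when both URLs have the same scheme, host, and port.
function isSameOrigin(urlA, urlB) {
  return new URL(urlA).origin === new URL(urlB).origin;
}

// Same scheme, host, and (default) port: same origin.
var samePage = isSameOrigin("http://www.example.com/a", "http://www.example.com/b/c");

// A different scheme, subdomain, or port each makes the origin different.
var otherScheme = isSameOrigin("http://www.example.com", "https://www.example.com");
var otherHost = isSameOrigin("http://www.example.com", "http://api.example.com");
var otherPort = isSameOrigin("http://www.example.com", "http://www.example.com:8080");
```

Only the first pair would be allowed by a plain $.get without help from the remote server or a proxy.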

One way around this is to use a proxy. Create the following PHP file:

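<?php
// Allow pages from any origin to read this response.
header('Access-Control-Allow-Origin: *');

// Only fetch absolute http:// URLs, so local files cannot be read.
// Note: this relays any such URL to any caller; restrict it before exposing it publicly.
if (isset($_GET['url']) && preg_match('`^http://`', $_GET['url'])) {
   echo file_get_contents($_GET['url']);
}
?>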

Then, instead of doing

$.get("http://www.example.com", ...

use it this way:

$.get("http://www.my_website/path/to/proxy.php?url=http://www.example.com", ...

The proxy file does not have to be on the same server as your JavaScript file.
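
If the target URL has its own query string, passing it unencoded would let its parameters be read as parameters of the proxy. A minimal sketch of building the request URL safely follows; the helper name buildProxyUrl and the search URL are made up for illustration, and the proxy path is the placeholder used above.

```javascript
// Hypothetical helper: append the target URL to the proxy as an encoded
// "url" parameter, so characters such as ? and & survive the trip.
function buildProxyUrl(proxyUrl, targetUrl) {
  return proxyUrl + "?url=" + encodeURIComponent(targetUrl);
}

var requestUrl = buildProxyUrl(
  "http://www.my_website/path/to/proxy.php",
  "http://www.example.com/search?q=jquery&page=2"
);

// In a page that loads jQuery, the result is then used as before:
// $.get(requestUrl, function (data) { /* data is the proxied page */ });
```

PHP decodes the parameter automatically, so $_GET['url'] in the proxy receives the original target URL.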

Parsing the data

With jQuery

We can build a jQuery structure from the downloaded page:

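$.get(url, function (data) {
  // Parse the downloaded HTML string into a jQuery object.
  var $page = $(data);
  // ... query $page here, inside the callback
});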

Then, we can obtain all the data we want using jQuery selectors and methods. For example, to get the text from all the cells with the class “example”:

var res = [];
$("td.example", $page).each(function(){
  res.push($(this).text());
});

However, this technique has a big disadvantage. If the downloaded HTML contains errors, jQuery might fail to build the structure, and we won't be able to use the selectors.

Building the structure also has side effects: the browser may start fetching resources referenced by the downloaded page, such as images, and scripts in the page can run if its nodes are later inserted into the document.

With regular expressions

The second possibility is to use regular expressions, as we get the page’s data as a string. We can extract the text from the cells with the class “example” this way:

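var res = [], m, re = /<td\b[^>]*\sclass=["'][^"']*\bexample\b[^"']*["'][^>]*>([\s\S]*?)<\/td>/gi;
while ((m = re.exec(data)) !== null) { res.push(m[1]); } // m[1] is the cell's inner HTML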
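
To see the regex approach on concrete input, here is a small self-contained sketch. The helper name extractCells and the sample markup are made up for illustration, and the pattern assumes cells are not nested inside one another.

```javascript
// Hypothetical helper: collect the inner HTML of every <td> whose class
// attribute contains the given class name.
function extractCells(html, className) {
  // Escape regex metacharacters in the class name before embedding it.
  var escaped = className.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  var re = new RegExp(
    "<td\\b[^>]*\\sclass=[\"'][^\"']*\\b" + escaped +
      "\\b[^\"']*[\"'][^>]*>([\\s\\S]*?)</td>",
    "gi"
  );
  var cells = [];
  var match;
  // With the g flag, each exec call resumes after the previous match.
  while ((match = re.exec(html)) !== null) {
    cells.push(match[1]); // capture group 1: the cell's inner HTML
  }
  return cells;
}

// Made-up sample markup for illustration.
var sample =
  "<table><tr>" +
  '<td class="example">first</td>' +
  '<td class="other">skipped</td>' +
  "<td id='c' class='big example'>second</td>" +
  "</tr></table>";

var cells = extractCells(sample, "example"); // ["first", "second"]
```

Looping with exec is used here because, with the g flag, String.prototype.match returns only the full matches and drops the capture groups.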

We can also mix both techniques: building a jQuery structure from a part of the page matched by a regex, or applying a regex to the text or HTML of a jQuery node.
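
As a rough sketch of the first combination, a regex can isolate one fragment of the page so that only that fragment is handed to jQuery. The helper names extractFirstTable and exampleCellTexts are hypothetical, the sample markup is made up, and the non-greedy pattern assumes the table contains no nested tables.

```javascript
// Hypothetical helper: return the markup of the first <table>...</table>
// in the page, or null when there is none. Assumes no nested tables.
function extractFirstTable(html) {
  var match = /<table\b[^>]*>[\s\S]*?<\/table>/i.exec(html);
  return match === null ? null : match[0];
}

// Hypothetical helper: parse only that fragment with jQuery.
// Assumes jQuery is loaded as `$`; call it from a page that includes jQuery.
function exampleCellTexts(html) {
  var fragment = extractFirstTable(html);
  if (fragment === null) {
    return [];
  }
  return $(fragment).find("td.example").map(function () {
    return $(this).text();
  }).get();
}

// Made-up sample markup for illustration.
var page =
  '<p>intro</p><table id="t"><tr><td class="example">A</td></tr></table><p>outro</p>';
var tableMarkup = extractFirstTable(page);
```

Parsing a smaller fragment means malformed markup elsewhere in the page never reaches jQuery.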

Conclusion

This article should get you started with spidering pages directly from the browser. The technique is particularly helpful when building browser extensions (Safari, Chrome, Firefox) or scripts for add-ons like GreaseMonkey. It lets you aggregate data from different sources to build more powerful applications.

  • Nivi

    Instead of using a class, can we select by title or by div?