Technical Blog

Category : Javascript

How to spider an HTML page with jQuery

There are many cases where you want to parse HTML pages from your JavaScript code to extract data. jQuery provides functions that help with this.

Getting the page content

We use the jQuery.get method, which behaves like an asynchronous version of PHP’s file_get_contents function.
It works this way:

$.get(url, function (data) {
  // data is the content of the URL.
});

Same Origin Policy

Because of the Same Origin Policy, calling $.get with a URL that is not on your own server fails with the following error:

XMLHttpRequest cannot load http://www.some_page.com. Origin null is not allowed by Access-Control-Allow-Origin

To make it work, the trick is to use a proxy. Create the following PHP file:

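<?php
// Allow pages from any origin to read this proxy's response.
header('Access-Control-Allow-Origin: *');

// Only fetch absolute http(s) URLs, never local files or other schemes.
if (isset($_GET['url']) && preg_match('`^https?://`i', $_GET['url'])) {
   echo file_get_contents($_GET['url']);
}
?>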

Then, instead of doing

$.get("http://www.example.com", ...

use it this way:

$.get("http://www.my_website/path/to/proxy.php?url=http://www.example.com", ...

The proxy file does not need to be on the same server as your JavaScript file.
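
Since the target address travels inside the proxy's own query string, it is probably safer to encode it first; otherwise any & or # in the target URL would be cut off before reaching the proxy. A minimal sketch, where proxyUrl is a hypothetical helper (not part of jQuery) and the addresses are placeholders:

```javascript
// Hypothetical helper: builds the address to pass to $.get when going through the proxy.
function proxyUrl(proxy, target) {
  // encodeURIComponent escapes ":", "/", "?", "&" and "#", so the whole
  // target URL arrives intact in $_GET['url'] on the PHP side.
  return proxy + "?url=" + encodeURIComponent(target);
}

var address = proxyUrl(
  "http://www.my_website/path/to/proxy.php",
  "http://www.example.com/list?page=2&sort=asc"
);
// address can now be given to $.get in place of the raw URL.
```

PHP decodes the query string automatically, so the proxy file itself does not need to change.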

Parsing the data

With jQuery

We can build a jQuery structure from the downloaded page:

$.get(url, function (data) {
  // Wrap the downloaded markup in a jQuery object; use $page inside this callback.
  var $page = $(data);
});

Then, we can obtain all the data we want using jQuery selectors and methods. For example, to get the text from all the cells with the class “example”:

var res = [];
$("td.example", $page).each(function(){
  res.push($(this).text());
});

However, this technique has a big disadvantage: if the HTML downloaded by $.get contains errors, jQuery might fail to build the structure, and we won’t be able to use the selectors.

In addition, if the downloaded page contains scripts or stylesheets, they will be downloaded and executed.

With regular expressions

The second possibility is to use regular expressions, as we get the page’s data as a string. We can extract the text from the cells with the class “example” this way:

var res = [], m, re = /<td[^>]*class=["'][^"']*\bexample\b[^"']*["'][^>]*>([\s\S]*?)<\/td>/gi;
while ((m = re.exec(data)) !== null) { res.push(m[1]); } // m[1] is the cell content

We can also mix both techniques: building a jQuery structure from a part of the page matched by a regex, or using a regex in a jQuery node.

Conclusion

This article should get you started with spidering pages directly from the browser. It is particularly helpful for building browser extensions (Safari, Chrome, Firefox) or scripts for add-ons like GreaseMonkey. It lets you aggregate data from different sources to build more powerful applications.

Get a color name from any RGB combination

The aim of this script is to get a color name from any RGB combination. There are over 16 million combinations, so we cannot have one name for each combination. However, we can get the nearest known color.

Demo

The color picker comes from David Durman’s website.

How it works

The hexadecimal value of a color represents a point with three coordinates: red, green and blue. The first thing we need is a data set. I use this list from Wikipedia. I parsed it with jQuery and built an array of labeled points (r, g, b and color name) that I serialized as JSON. Finally, I put it in a separate file.
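
To make the data preparation concrete, here is a rough sketch of turning hexadecimal values into labeled points and serializing them. The field names (r, g, b, label), the hexToPoint helper and the three sample entries are assumptions made for illustration; the real dataset is much longer and its exact layout may differ.

```javascript
// Convert a "#rrggbb" string into a point in RGB space.
function hexToPoint(hex) {
  var value = parseInt(hex.replace("#", ""), 16);
  return {
    r: (value >> 16) & 255, // highest byte
    g: (value >> 8) & 255,  // middle byte
    b: value & 255          // lowest byte
  };
}

// Illustrative entries only (assumed layout of the parsed list).
var rawColors = [
  { name: "Red", hex: "#ff0000" },
  { name: "Lime", hex: "#00ff00" },
  { name: "Navy", hex: "#000080" }
];

// Build the array of labeled points.
var dataset = rawColors.map(function (color) {
  var point = hexToPoint(color.hex);
  return { r: point.r, g: point.g, b: point.b, label: color.name };
});

// Serialize it so it can be stored in a separate file.
var serialized = JSON.stringify(dataset);
```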

The second step is to classify the color. We put the new point in our 3D space and locate the closest labeled point. This algorithm is called kNN (k-nearest neighbors). Here we use k=1, as we only have one element per class (color name).
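
The classification step can be sketched in a few lines. This is not the code of color_classifier.js, only a minimal 1NN search under the assumption that points are objects with r, g, b and label fields; the three labeled points are made up for the example.

```javascript
// Squared Euclidean distance; the square root is not needed to compare distances.
function squaredDistance(a, b) {
  var dr = a.r - b.r, dg = a.g - b.g, db = a.b - b.b;
  return dr * dr + dg * dg + db * db;
}

// Return the labeled point closest to `point` (1NN), or null if `points` is empty.
function nearest(points, point) {
  var best = null, bestDistance = Infinity;
  points.forEach(function (candidate) {
    var d = squaredDistance(candidate, point);
    if (d < bestDistance) {
      best = candidate;
      bestDistance = d;
    }
  });
  return best;
}

var labeledPoints = [
  { r: 255, g: 0, b: 0, label: "red" },
  { r: 0, g: 255, b: 0, label: "green" },
  { r: 0, g: 0, b: 255, label: "blue" }
];

// #aaf000 is the point (170, 240, 0).
var closest = nearest(labeledPoints, { r: 170, g: 240, b: 0 });
// closest.label holds the name of the nearest known color.
```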

Possible evolutions

The results of this technique are impacted by two main factors:

  • The dataset
  • The metric

We can change the dataset to modify the results. We can reduce it, to only use generic color names (green, brown, yellow etc.), instead of precise names (Blizzard Blue, Pakistan green, Stil de grain yellow, etc.). We can also add more data, so the density of points in the space will be higher, and the results more accurate. A lot of lists are available on the web.

Instead of using a lot of exotic color names, a solution could be to use several points for each label (color name). For instance, we could use a limited set of color names (red, green, blue, etc.) but have a lot of points associated with these labels. For example, different points like #07250b (which is, in my opinion, misclassified) and #51f665 could share the same label: green. The problem is obtaining the dataset. We could build one from pages like this one, removing the numbers from each color name so that multiple points share the same name.
With several points for each label, we could also use a 3NN classifier, for example, instead of a 1NN classifier. It should also affect the results, but I’m not sure it would really improve them.
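
As a rough sketch of this idea, the code below strips trailing numbers from color names and then classifies with a majority vote among the k closest points. The helper names (baseLabel, classifyKnn) and the six sample points are hypothetical, chosen only to illustrate numbered color names.

```javascript
// Strip a trailing number so that "green1", "green3"... share the label "green".
function baseLabel(name) {
  return name.replace(/\s*\d+$/, "").toLowerCase();
}

function squaredDistance(a, b) {
  var dr = a.r - b.r, dg = a.g - b.g, db = a.b - b.b;
  return dr * dr + dg * dg + db * db;
}

// kNN: take the k closest points and return the most frequent label among them.
function classifyKnn(points, point, k) {
  var sorted = points.slice().sort(function (a, b) {
    return squaredDistance(a, point) - squaredDistance(b, point);
  });
  var votes = {}, winner = null;
  sorted.slice(0, k).forEach(function (p) {
    votes[p.label] = (votes[p.label] || 0) + 1;
    // The strict ">" keeps the label seen first (the closer one) on a tie.
    if (winner === null || votes[p.label] > votes[winner]) {
      winner = p.label;
    }
  });
  return winner;
}

// Made-up sample in the style of a numbered color list.
var samples = [
  { name: "green1", r: 0, g: 255, b: 0 },
  { name: "green3", r: 0, g: 205, b: 0 },
  { name: "green4", r: 0, g: 139, b: 0 },
  { name: "red1", r: 255, g: 0, b: 0 },
  { name: "red4", r: 139, g: 0, b: 0 },
  { name: "blue1", r: 0, g: 0, b: 255 }
];

var groupedPoints = samples.map(function (s) {
  return { r: s.r, g: s.g, b: s.b, label: baseLabel(s.name) };
});

// #07250b is the point (7, 37, 11).
var darkGreenLabel = classifyKnn(groupedPoints, { r: 7, g: 37, b: 11 }, 3);
```

With this sample, the three nearest neighbors of #07250b are two green points and one red point, so the vote returns green.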

The metric, or distance function, is the way we compute the distance between two points. In this technique, I use the Euclidean metric, but many other distances exist. Articles can be found about color metrics.
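
To see how the metric changes the outcome, the sketch below passes the distance function as a parameter. The Manhattan distance is used here only as one simple alternative, not as a recommended color metric; nearestWith and the two candidate points are assumptions made for the example.

```javascript
// Euclidean distance: straight-line distance in RGB space.
function euclidean(a, b) {
  return Math.sqrt(
    Math.pow(a.r - b.r, 2) + Math.pow(a.g - b.g, 2) + Math.pow(a.b - b.b, 2)
  );
}

// Manhattan distance: sum of the absolute differences on each axis.
function manhattan(a, b) {
  return Math.abs(a.r - b.r) + Math.abs(a.g - b.g) + Math.abs(a.b - b.b);
}

// 1NN search with a pluggable metric; assumes `points` is not empty.
function nearestWith(metric, points, point) {
  return points.reduce(function (best, candidate) {
    return metric(candidate, point) < metric(best, point) ? candidate : best;
  });
}

var candidates = [
  { r: 30, g: 30, b: 30, label: "dark gray" },
  { r: 60, g: 0, b: 0, label: "dark red" }
];
var black = { r: 0, g: 0, b: 0 };

var byEuclidean = nearestWith(euclidean, candidates, black);
var byManhattan = nearestWith(manhattan, candidates, black);
```

For pure black, the Euclidean metric picks the dark gray point (distance about 52 against 60), while the Manhattan metric picks the dark red point (60 against 90), so the same dataset can give different names depending on the metric.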

One possibility is to change our 3D space. Here we use the RGB color space, but we could also use different spaces like CMYK, HSV etc.
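
As an illustration of switching spaces, here is the standard RGB to HSV conversion written as a small function. The name rgbToHsv and the output ranges (hue in degrees, saturation and value between 0 and 1) are choices made for this sketch.

```javascript
// Convert RGB components (0-255) to HSV:
// h in degrees [0, 360), s and v in [0, 1].
function rgbToHsv(r, g, b) {
  r /= 255;
  g /= 255;
  b /= 255;
  var max = Math.max(r, g, b);
  var min = Math.min(r, g, b);
  var delta = max - min;
  var h = 0; // hue is undefined for grays; 0 is used by convention

  if (delta !== 0) {
    if (max === r) {
      h = ((g - b) / delta) % 6;
    } else if (max === g) {
      h = (b - r) / delta + 2;
    } else {
      h = (r - g) / delta + 4;
    }
    h *= 60;
    if (h < 0) {
      h += 360;
    }
  }

  var s = max === 0 ? 0 : delta / max;
  return { h: h, s: s, v: max };
}

var hsv = rgbToHsv(170, 240, 0); // #aaf000
```

Note that hue is an angle, so 359 and 0 are neighbors; a plain Euclidean distance on (h, s, v) would treat them as far apart, and the metric would likely need to be adapted.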

Finally, we could also get more information about our classification process by displaying the decision function. Instead of displaying all the existing colors in the picker, we could display, for each pixel, the color of the closest labeled point. It would be a good way to see the dataset density and to detect zones where the density is too low.
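
A possible starting point for such a display is to sample one slice of the RGB cube and record which label wins at each sample. The sketch below only computes the grid of labels (drawing it is left out); decisionSlice, nearestLabel and the four labeled points are hypothetical.

```javascript
function squaredDistance(a, b) {
  var dr = a.r - b.r, dg = a.g - b.g, db = a.b - b.b;
  return dr * dr + dg * dg + db * db;
}

// Label of the closest point (1NN); assumes `points` is not empty.
function nearestLabel(points, point) {
  var best = points[0];
  points.forEach(function (candidate) {
    if (squaredDistance(candidate, point) < squaredDistance(best, point)) {
      best = candidate;
    }
  });
  return best.label;
}

// Sample the slice of the RGB cube with a fixed blue value every `step`
// units, and record the winning label at each sample (rows = green axis).
function decisionSlice(points, blue, step) {
  var rows = [];
  for (var g = 0; g <= 255; g += step) {
    var row = [];
    for (var r = 0; r <= 255; r += step) {
      row.push(nearestLabel(points, { r: r, g: g, b: blue }));
    }
    rows.push(row);
  }
  return rows;
}

var slicePoints = [
  { r: 0, g: 0, b: 0, label: "black" },
  { r: 255, g: 0, b: 0, label: "red" },
  { r: 0, g: 255, b: 0, label: "green" },
  { r: 255, g: 255, b: 0, label: "yellow" }
];

// 4 x 4 grid: r and g take the values 0, 85, 170 and 255.
var slice = decisionSlice(slicePoints, 0, 85);
```

Large areas sharing one label in such a grid would point to regions where the dataset is sparse.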

How to use it

First, download color_classifier.js. It requires jQuery to load the dataset, but you can avoid this dependency by moving the content of dataset.js into color_classifier.js.
The first step is to build the classifier:

window.classifier = new ColorClassifier();
get_dataset('dataset.js', function (data){
    window.classifier.learn(data);
});

Then we can classify colors:

var result_name = window.classifier.classify("#aaf000");