Language-independent keyword extraction with the Wikipedia Miner

Introduction

Extracting keywords from texts and HTML pages is a common task that opens the door to many potential applications. These include classification (what is this page’s topic?), recommender systems (identifying what a user likes in order to recommend more relevant content), search engines (what is this page about?), document clustering (how can I group different texts together?) and much more.

Most of these applications are based on only one language, usually English. However, it would be better to be able to process documents in any language. Consider, for example, a recommender system with a user who speaks French and English. In his history, he gave positive ratings to a few pages containing the keyword “Airplane”, so for the next recommendations we would boost this keyword. With a language-independent approach, we would also be able to boost pages containing “Avion”, the French term for airplane. And if the user gave positive ratings to pages in English containing “Airplane” and to pages in French containing “Avion”, we could easily merge them into the same keyword and build a language-independent user profile, used for accurate French and English recommendations.

This article shows one way to achieve good results using a simple strategy; more complex algorithms can obviously achieve better results.

How it works

This article shows how we implemented this in the Wikipedia Miner. The Wikipedia Miner is an open source keyword extraction server that relies on Wikipedia: all returned keywords are titles from the online encyclopedia. The server gives very good results, and you can find more about it in the Wikipedia Miner section of the blog.

Wikipedia Miner itself does not handle all languages at once: it requires building one database per language. Once started, though, it can process requests in any of the built languages, i.e. a single instance can serve queries in many different languages (as long as you have enough memory).

The language independence shown here is actually an “all into English” technique. The idea is that Wikipedia contains one very precious resource that you may not have noticed: interlanguage links. The language box at the bottom left of each page links most articles across hundreds of different languages, which makes a great free translation database. We will modify the database building script to store, for each non-English keyword, the associated English page title, so that both can be returned. By using the english field, we remove the language dependence.

Here is an example. I took a text about airplanes from a French website and received these keywords:

[{
    "title": "Airbus",
    "english": "Airbus",
    "weight": 0.714
}, {
    "title": "Avion",
    "english": "Airplane",
    "weight": 0.674
}, {
    "title": "Aérodynamique",
    "english": "Aerodynamics",
    "weight": 0.412
}, {
    "title": "Low cost",
    "english": "No frills",
    "weight": 0.161
}]

Implementation

There are two ways we can design the addition of the “english” field:

  1. Adding the associated English page’s title during database building
  2. Adding a translation layer (inside the server, or at a higher level in the code that sends requests to the server).

Option 1 is the easier to implement, especially because the server keeps doing all the work. Yet, option 2 can be interesting when the databases are already built and you don’t want to rebuild them.

Building the database

The database builder already contains code to generate the translation database (translations.csv), but its data is very incomplete because it doesn’t use the interlanguage links dump. To make it work well, we had to develop our own script that builds this file while respecting its structure.

To do so, we need to download the langlinks.sql file from the Wikipedia dumps. For example, for French (language code fr), you can find it at this address: https://dumps.wikimedia.org/frwiki/latest/.

Once downloaded and unzipped, it is straightforward to parse it and output lines in the following format:

 articleid,m{'en,'EnglishTitle}

For example:

 15,m{'en,'Austria}

We made sure to output only translations into English, to reduce the database size.
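
As an illustration, here is a minimal sketch of such a parsing script in PHP. The file names and the simplistic regular expression (which skips titles containing escaped quotes) are assumptions made for this example; each tuple in the dump has the form (ll_from,'ll_lang','ll_title').

<?php
// Sketch: build translations.csv from a langlinks.sql dump, keeping English links only.
// Each tuple in the dump looks like (ll_from,'ll_lang','ll_title').
$in  = fopen('langlinks.sql', 'r');
$out = fopen('translations.csv', 'w');
while (($line = fgets($in)) !== false) {
  if (preg_match_all("/\\((\\d+),'en','([^']*)'\\)/", $line, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
      // output format expected by the builder: articleid,m{'en,'EnglishTitle}
      fwrite($out, $m[1] . ",m{'en,'" . $m[2] . "}\n");
    }
  }
}
fclose($in);
fclose($out);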

You will also need to edit these two files to remove all references to translation:

  • src/main/java/org/wikipedia/miner/extraction/LabelSensesStep.java
  • src/main/java/org/wikipedia/miner/extraction/DumpExtractor.java

On the Wikipedia Miner side

Once you’ve built the database using the custom translations.csv file, you only need to integrate the English translation into the returned results. To do so, when building the result, call the Topic.getTranslation(“en”) method on the returned topics (keywords).

Conclusion

We’ve seen that we can easily modify the Wikipedia Miner to include the associated English term in its results. This makes it very easy to extract semantic information from pages in different languages and to handle them without any language distinction, which dramatically eases the internationalization of such systems.

Using curl with multibyte domain names

Internationalized Domain Names

Non-ASCII characters have been allowed in domain names since 2003, which makes URLs like http://香港大學.香港 or http://пример.испытание valid. This feature is called Internationalized Domain Names (IDN).

Those URLs are valid, but if you try to retrieve them using tools like cURL or Wget, it fails:

$ curl -XGET 香港大學.香港
curl: (6) Could not resolve host: 香港大學.香港; nodename nor servname provided, or not known

The problem is that these pieces of software don’t handle multibyte domain names, contrary to modern web browsers. Note that the problem only occurs when the host contains non-ASCII characters. URLs like http://fa.wikipedia.org/wiki/پلاک_وسیله_نقلیه, where only the path is non-ASCII, don’t need any specific processing.

To handle those addresses, the URLs need to be converted to Punycode. This is a reversible transformation that lets us use a less user-friendly ASCII equivalent.
For example, http://香港大學.香港 is transformed to http://xn--pssu7cv61af44b.xn--j6w193g
and http://пример.испытание becomes http://xn--e1afmkfd.xn--80akhbyknj4f.
Those URLs can be successfully retrieved using curl:

$ curl -XGET  --head  http://xn--pssu7cv61af44b.xn--j6w193g/
HTTP/1.1 200 OK

Application

Let’s create a simple script to handle those URLs! In order to access multibyte hostnames, we need to convert the host. Several libraries in different programming languages can perform this conversion.

For our example, I will rely on PHP’s idn_to_ascii function for simplicity’s sake. As we’ve seen earlier, only the host must be converted to Punycode. We obtain the following code:

<?php
function convert_to_ascii($url)
{
  $parts = parse_url($url);
  if (!isset($parts['host']))
    return $url; // no scheme (e.g. missing "http://") makes parse_url return no host
  // convert only if the domain name is non-ASCII
  if (mb_detect_encoding($parts['host']) != 'ASCII')
  {
    // idn_to_ascii comes from the intl extension, http_build_url from the pecl_http extension
    $parts['host'] = idn_to_ascii($parts['host']);
    return http_build_url($parts);
  }
  return $url;
}
// Call from the command line
if (isset($argv[1]))
  echo convert_to_ascii($argv[1]);

We can check the conversion:

$ php idn.php http://실례.테스트/index.php?title=대문

http://xn--9n2bp8q.xn--9t4b11yi5a/index.php?title=대문

Now that our script is ready, we can use it to download with cURL:

curl -XGET --head `php idn.php http://실례.테스트/index.php?title=대문`
HTTP/1.1 200 OK

It works! We can use this trick to download from the shell, for example in a Bash crawler, or from code in any language using the same technique.
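
As a sketch of the “from code” case, here is how the same conversion can be reused from a PHP script together with PHP’s cURL bindings (this assumes the convert_to_ascii() function from the idn.php file above is available):

<?php
// Fetch a multibyte URL from PHP: convert the host first, then download with cURL.
require 'idn.php'; // provides convert_to_ascii()

$ch = curl_init(convert_to_ascii('http://실례.테스트/index.php?title=대문'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$body = curl_exec($ch);
curl_close($ch);

if ($body !== false)
  echo strlen($body) . " bytes retrieved\n";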

More use cases

This article was based on a practical problem, but the technique can be used for other applications. In particular, it is helpful to store URLs or domain names in a canonical form in your backend, and to convert them back to Unicode when displaying them to users. All libraries provide both functions, and the conversion is reversible without any loss.
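
For example, a minimal round-trip sketch with PHP’s intl extension (idn_to_ascii and its counterpart idn_to_utf8), using the Russian test domain from earlier:

<?php
// Store the ASCII (Punycode) form in the backend, convert back to Unicode for display.
$unicode = 'пример.испытание';
$ascii   = idn_to_ascii($unicode);   // "xn--e1afmkfd.xn--80akhbyknj4f"
echo idn_to_utf8($ascii), "\n";      // prints "пример.испытание" again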
Punycode conversion is part of a larger URL processing procedure called Nameprep. Mozilla’s “Internationalized Domain Names (IDN) Support in Mozilla Browsers” is an excellent reference to understand how to handle multibyte URLs, which must be taken into consideration when you want your site to go worldwide (Japanese, Russian, Arabic…).