Technical Blog

Building a language-independent keyword-based system with the Wikipedia Miner


Extracting keywords from texts and HTML pages is a common task that opens the door to many potential applications. These include classification (what is this page's topic?), recommender systems (identifying what a user likes in order to recommend the most relevant content), search engines (what is this page about?), document clustering (how can I group related texts together?) and much more.

Most of these applications handle only one language, usually English. However, it would be better to be able to process documents in any language. Consider, for example, a recommender system whose user speaks French and English. In his history, he gave positive ratings to a few pages containing the keyword "Airplane", so for the next recommendations, we would boost this keyword. With a language-independent approach, we would also be able to boost pages containing "Avion", the French term for airplane. Conversely, if the user gave positive ratings to pages in English containing "Airplane" and to pages in French containing "Avion", we could easily merge both into the same keyword and build a language-independent user profile, usable for accurate recommendations in both French and English.
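To make this merging idea concrete, here is a minimal sketch of a language-independent user profile keyed on the English form of each keyword. The class and method names are illustrations of the approach, not part of the Wikipedia Miner:

```java
import java.util.HashMap;
import java.util.Map;

public class UserProfile {
    // Aggregated keyword weights, keyed by the English form of each keyword.
    private final Map<String, Double> weights = new HashMap<>();

    // Record a positive rating: "english" is the language-independent key,
    // so "Avion" and "Airplane" both land on the same entry.
    public void addRating(String english, double weight) {
        weights.merge(english, weight, Double::sum);
    }

    public double weightOf(String english) {
        return weights.getOrDefault(english, 0.0);
    }

    public static void main(String[] args) {
        UserProfile profile = new UserProfile();
        // A rating on an English page tagged "Airplane"...
        profile.addRating("Airplane", 0.5);
        // ...and one on a French page whose keyword "Avion" maps to "Airplane":
        profile.addRating("Airplane", 0.25);
        System.out.println(profile.weightOf("Airplane")); // 0.75
    }
}
```

Because both languages contribute to the same entry, the profile can boost matching keywords in French and English pages alike.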

This article shows one way to achieve good results with a simple strategy; more complex algorithms could certainly achieve better ones.

How it works

This article shows how we implemented this in the Wikipedia Miner. The Wikipedia Miner is an open source keyword extraction server that relies on Wikipedia: all returned keywords are titles of articles from the online encyclopedia. The server gives very good results, and you can find more about it in the Wikipedia Miner section of this blog.

The Wikipedia Miner itself does not handle all languages at once: it requires one database per language. Once started, though, a single instance can process requests in any of the languages whose databases have been built, i.e., one instance can serve queries in many different languages (as long as you have enough memory).

The language independence shown here is actually an "all into English" technique. The idea is that Wikipedia contains one very precious resource that you may not have noticed: interlanguage links. The box in the bottom left of each page links most articles across hundreds of languages, which makes a great free translation database. We will modify the database building script to store, for each non-English keyword, the title of the associated English page, so that the server returns both. By relying on the english field, we remove the language dependence.

Here is an example. I ran a text about airplanes, taken from a French website, through the server and received these keywords:

[{
    "title": "Airbus",
    "english": "Airbus",
    "weight": 0.714
}, {
    "title": "Avion",
    "english": "Airplane",
    "weight": 0.674
}, {
    "title": "Aérodynamique",
    "english": "Aerodynamics",
    "weight": 0.412
}, {
    "title": "Low cost",
    "english": "No frills",
    "weight": 0.161
}]


There are two ways we can design the addition of the “english” field:

  1. Adding the associated English page title during database building
  2. Adding a translation layer (inside the server, or at a higher level, in the code that sends requests to the server)

Option 1 is the easier to implement, especially because it keeps all the logic inside the server. Yet, option 2 can be interesting when the databases are already built and you don't want to rebuild them.
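To give an idea of what option 2 could look like, here is a minimal client-side sketch. The in-memory map stands in for a real translation database built from the interlanguage links, and the class name is just an illustration:

```java
import java.util.Map;

public class TranslationLayer {
    // Maps non-English keyword titles to their English equivalents.
    private final Map<String, String> toEnglish;

    public TranslationLayer(Map<String, String> toEnglish) {
        this.toEnglish = toEnglish;
    }

    // Returns the English equivalent of a keyword, or the original
    // title unchanged when no translation is known (e.g. "Airbus").
    public String translate(String title) {
        return toEnglish.getOrDefault(title, title);
    }

    public static void main(String[] args) {
        TranslationLayer layer =
            new TranslationLayer(Map.of("Avion", "Airplane"));
        System.out.println(layer.translate("Avion"));  // Airplane
        System.out.println(layer.translate("Airbus")); // Airbus
    }
}
```

The same lookup could equally live inside the server; the point is that it runs after extraction, so the per-language databases stay untouched.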

Building the database

The database builder already contains code to generate the translation database (translations.csv), but its data is very incomplete, because it does not use the interlanguage links dump. To make it work well, we had to develop our own script that builds this file while respecting its structure.

To do so, we need to download the langlinks.sql file from the Wikipedia dumps. For example, for French (fr code), you can find it at this address:

Once downloaded and unzipped, it is straightforward to parse and to output in this format:


For example:


We made sure to output only translations into English, to reduce the database size.
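The parsing step can be sketched as follows. The langlinks dump stores (ll_from, ll_lang, ll_title) tuples inside SQL INSERT statements, where ll_from is the source page id; since the exact translations.csv layout is not reproduced here, the two-column "pageId,title" output below is only an illustration of the idea:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LanglinksParser {
    // Matches one (ll_from, ll_lang, ll_title) tuple from an INSERT statement,
    // allowing backslash-escaped characters inside the title.
    private static final Pattern TUPLE =
        Pattern.compile("\\((\\d+),'([^']*)','((?:[^'\\\\]|\\\\.)*)'\\)");

    // Extracts "pageId,title" lines, keeping English links only.
    public static List<String> englishLinks(String insertStatement) {
        List<String> out = new ArrayList<>();
        Matcher m = TUPLE.matcher(insertStatement);
        while (m.find()) {
            if ("en".equals(m.group(2))) {
                // Unescape the SQL-quoted title before writing it out.
                String title = m.group(3).replace("\\'", "'")
                                         .replace("\\\"", "\"");
                out.add(m.group(1) + "," + title);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "INSERT INTO `langlinks` VALUES "
            + "(1,'en','Airplane'),(1,'de','Flugzeug'),(2,'en','Aerodynamics');";
        for (String csv : englishLinks(line)) {
            System.out.println(csv);
        }
        // Prints:
        // 1,Airplane
        // 2,Aerodynamics
    }
}
```

Filtering on the 'en' language code at parse time is what keeps the resulting file small.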

You will also need to edit these two files to remove all references to translation:

  • src/main/java/org/wikipedia/miner/extraction/
  • src/main/java/org/wikipedia/miner/extraction/

On the Wikipedia Miner side

Once you've built the database using the custom translations.csv file, you only need to integrate the English translation into the returned results. To do so, when building the result, call the Topic.getTranslation("en") method on each returned topic (keyword).
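As a rough sketch of that integration step, the code below builds one result entry with the extra field. The Topic interface here is only a minimal stand-in for the real Wikipedia Miner class, kept so the example is self-contained:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResultBuilder {
    // Minimal stand-in for the Wikipedia Miner's Topic class,
    // for illustration only.
    interface Topic {
        String getTitle();
        double getWeight();
        String getTranslation(String languageCode); // may return null
    }

    // Builds one keyword entry with the extra "english" field,
    // mirroring the JSON shown earlier in the article.
    static Map<String, Object> toResult(Topic topic) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("title", topic.getTitle());
        String english = topic.getTranslation("en");
        // Fall back to the original title when no interlanguage link exists
        // (as with "Airbus", which is the same in both languages).
        entry.put("english", english != null ? english : topic.getTitle());
        entry.put("weight", topic.getWeight());
        return entry;
    }
}
```

With this fallback, every keyword carries a usable english field even when the translation database has no entry for it.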


We've seen that we can easily modify the Wikipedia Miner to include the associated English term in its results. This makes it very easy to extract semantic information from pages in different languages and handle them without language distinction, which dramatically eases the internationalization of such systems.