Technical Blog

Introduction

Extracting keywords from texts and HTML pages is a common task that opens the door to a lot of potential applications. These include classification (what is this page's topic?), recommendation systems (identifying what a user likes in order to recommend more relevant content), search engines (what is this page about?), document clustering (how can I group similar texts together?) and much more.

Most of these applications work with only one language, usually English. However, it would be better to be able to process documents in any language. Consider, for example, a recommender system and a user who speaks French and English. In his history, he gave positive ratings to a few pages containing the keyword “Airplane”. For the next recommendations, we would boost this keyword. With a language-independent approach, we would also be able to boost pages containing “Avion”, the French term for airplane. And if the user gave positive ratings to pages in English containing “Airplane” and to pages in French containing “Avion”, we could easily merge them into the same keyword and build a language-independent user profile, used for accurate French and English recommendations.

This article shows one way to achieve good results using a simple strategy. More complex algorithms can of course achieve better results.

How it works

This article shows how we implemented this in the Wikipedia Miner. The Wikipedia Miner is an open source keyword extraction server that relies on Wikipedia: all returned keywords are titles from the online encyclopedia. The server gives very good results, and you can find more about it in the Wikipedia Miner section of the blog.

Wikipedia Miner itself is not language-independent: it requires one database per language. Once started, however, a single instance can process requests in any of the languages whose databases have been built (as long as you have enough memory).

The language independence shown here is actually an “all into English” technique. The idea is that Wikipedia contains one very precious resource that you may not have noticed: interlanguage links. The box in the bottom left of each page links most pages across hundreds of different languages, which makes a great free translation database. We will modify the database building script to store, for each non-English keyword, the associated English page title, so that we return both. Using the english field, we remove the language dependence.

Here is an example. I took a text about airplanes from a French website and received these keywords:

[{
    "title": "Airbus",
    "english": "Airbus",
    "weight": 0.714
}, {
    "title": "Avion",
    "english": "Airplane",
    "weight": 0.674
}, {
    "title": "Aérodynamique",
    "english": "Aerodynamics",
    "weight": 0.412
}, {
    "title": "Low cost",
    "english": "No frills",
    "weight": 0.161
}]

Implementation

There are two ways we can design the addition of the “english” field:

  1. Adding the associated English page title during database building
  2. Adding a translation layer (inside the server, or at a higher level in the code that sends requests to the server).

Option 1 is the easier to implement, especially because the server keeps doing all the work. However, option 2 can be interesting when the databases are already built and you don't want to rebuild them.

Building the database

The database builder already contains code to generate the translation database (translations.csv), but its data is very incomplete, because it doesn't use the interlanguage links. To make it work well, we had to develop our own script that builds this file while respecting its structure.

To do so, we need to download the langlinks.sql file from Wikipedia Dumps. For example, for French (FR code), you can find it at this address: https://dumps.wikimedia.org/frwiki/latest/.

Once downloaded and unzipped, it is straightforward to parse it and output lines in this format:

 articleid,m{'en,'englishtitle}

For example:

 15,m{'en,'Austria}

We made sure to output only translations into English, to reduce the database size.
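
To make this concrete, here is a minimal sketch of such a parsing step in Java. It is an illustration rather than the script we actually used, and it assumes the standard MediaWiki langlinks schema ((ll_from, ll_lang, ll_title) tuples inside the INSERT statements) as well as the translations.csv line format shown above; quote escaping is only roughly handled:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LangLinksToTranslations {
    // Matches one (ll_from, 'en', 'title') tuple inside the INSERT statements of langlinks.sql
    private static final Pattern EN_LINK =
            Pattern.compile("\\((\\d+),'en','((?:[^'\\\\]|\\\\.)*)'\\)");

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), "UTF-8"));   // langlinks.sql
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(new FileOutputStream(args[1]), "UTF-8")); // translations.csv

        String line;
        while ((line = in.readLine()) != null) {
            if (!line.startsWith("INSERT INTO"))
                continue; // skip DDL statements and comments
            Matcher m = EN_LINK.matcher(line);
            while (m.find()) {
                String pageId = m.group(1);
                String englishTitle = m.group(2).replace("\\'", "'"); // undo SQL quote escaping
                // one line per article, mirroring the example line above
                out.println(pageId + ",m{'en,'" + englishTitle + "}");
            }
        }
        out.close();
        in.close();
    }
}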

You will also need to edit these two files to remove all references to translations:

  • src/main/java/org/wikipedia/miner/extraction/LabelSensesStep.java
  • src/main/java/org/wikipedia/miner/extraction/DumpExtractor.java

On the Wikipedia Miner side

Once you've built the database using a custom translations.csv file, you only need to integrate the English translation into the returned results. To do so, when building the result, use the Topic.getTranslation("en") method on the returned topics (keywords).
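
As a rough sketch of that step (only getTitle() and getTranslation("en") are the calls discussed here; the package path of Topic is an assumption on my side, so adapt it to your source tree):

import org.wikipedia.miner.annotation.Topic;

public class EnglishFieldExample
{
    // Minimal sketch: describe one detected topic with both its original title and
    // the English title stored in translations.csv; we guard against a missing
    // translation, just in case no interlanguage link exists for the page.
    static String describe(Topic topic)
    {
        String english = topic.getTranslation("en");
        return topic.getTitle() + " -> " + (english == null ? "" : english);
    }
}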

Conclusion

We've seen that we can easily modify the Wikipedia Miner to include the associated English term in its results. This makes it very easy to extract semantic information from pages in different languages and to handle them without language distinction, which dramatically eases the internationalization of systems.


Using curl with multibyte domain names

Internationalized Domain Names

The usage of non-ASCII characters in domain names has been allowed since 2003. It makes URLs like http://香港大學.香港 or http://пример.испытание valid. This feature is called Internationalized Domain Names (IDN), standardized as IDNA.

Those URLs are valid, but if you try to retrieve them using tools like cURL or Wget, it will fail:

$ curl -XGET 香港大學.香港
curl: (6) Could not resolve host: 香港大學.香港; nodename nor servname provided, or not known

The problem is that those pieces of software don't handle multibyte domain names, unlike modern web browsers. Note that the problem only occurs when the host contains non-ASCII characters. URLs like http://fa.wikipedia.org/wiki/پلاک_وسیله_نقلیه don't need any specific processing.

To handle those addresses, the URLs need to be converted to Punycode. This is a reversible transformation that produces a less user-friendly ASCII equivalent.
For example, http://香港大學.香港 is transformed to http://xn--pssu7cv61af44b.xn--j6w193g
and http://пример.испытание becomes http://xn--e1afmkfd.xn--80akhbyknj4f.
Those urls can be successfully retrieved using curl:

$ curl -XGET  --head  http://xn--pssu7cv61af44b.xn--j6w193g/
HTTP/1.1 200 OK

Application

Let's create a simple script to handle those URLs! In order to access multibyte hostnames, we need to convert the host. Several libraries are available for this in different programming languages.

For our example, I will rely on PHP’s idn_to_ascii function for simplicity’s sake. As we’ve seen earlier, only the host must be converted to Punycode. We obtain the following code:

<?php
function convert_to_ascii($url)
{
  $parts = parse_url($url);
  if (!isset($parts['host']))
    return $url; // no host found (missing "http://"?): parse_url could not extract it
  // convert the host only if it contains non-ASCII characters
  if (mb_detect_encoding($parts['host']) != 'ASCII')
  {
    $parts['host'] = idn_to_ascii($parts['host']);
    return http_build_url($parts); // http_build_url is provided by the pecl_http extension
  }
  return $url;
}
// Call from CLI
if (isset($argv[1]))
  echo convert_to_ascii($argv[1]);

We can check the conversion:

$ php idn.php http://실례.테스트/index.php?title=대문

http://xn--9n2bp8q.xn--9t4b11yi5a/index.php?title=대문

Now that our script is ready, we can use it to download with cURL:

curl -XGET --head `php idn.php http://실례.테스트/index.php?title=대문`
HTTP/1.1 200 OK

It works! We can use this trick to download from the shell (in a bash crawler, for example), or from code in any language using the same technique.
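
For instance, on the JVM the conversion is available out of the box through java.net.IDN; a minimal sketch (the expected outputs are the ones shown earlier):

import java.net.IDN;

public class IdnExample
{
    public static void main(String[] args)
    {
        // ToASCII is applied label by label, so whole domain names can be passed directly
        System.out.println(IDN.toASCII("пример.испытание")); // xn--e1afmkfd.xn--80akhbyknj4f
        System.out.println(IDN.toASCII("香港大學.香港"));       // xn--pssu7cv61af44b.xn--j6w193g

        // the transformation is reversible
        System.out.println(IDN.toUnicode("xn--e1afmkfd.xn--80akhbyknj4f")); // пример.испытание
    }
}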

More use cases

This article was based on a practical problem, but this technique can be used for different applications. In particular, it is helpful to store URLs or domain names in a canonical form in your backend, and convert them back to Unicode when displaying them to users. All libraries provide both functions, and the conversion is reversible without any loss.
Punycode conversion is part of a larger URL processing pipeline called nameprep. Mozilla's “Internationalized Domain Names (IDN) Support in Mozilla Browsers” is an excellent reference for understanding how to handle multibyte URLs, something to take into consideration when you want your site to go worldwide (Japanese, Russian, Arabic…)


How to structure a neo4j-based system?

Neo4j is a powerful, free and easy to use graph database. It is ideal to store, for example, relationships between people, items, etc. It provides great features, including a rich API (easy graph traversal, pathfinding), high-level query DSLs (Gremlin, Cypher), a server exposing a REST API, node and relationship indexing, ACID compliance, etc.
In this article, I give pointers to help you decide which topology to use for your Neo4j projects: embedded db? server?
This choice is important, because each option has advantages, disadvantages and limitations that have to be taken into consideration before implementing the system. One important thing, however, is that whichever solution you use, the database is stored on disk in the same way. It means that you can change your architecture without having to rebuild the database.

Neo4j provides an excellent online documentation that contains everything you need to setup, configure and use this library.

5 ways to structure a Neo4j-based project

The Neo4j server

Using the provided server quickly gives you a working system that a lot of clients can access. Simply call $NEO4J_HOME/bin/neo4j start and the database is running. Then, you can use it in several ways:

  • The REST API, which allows you to insert, delete and access data easily from any programming language
  • Writing a server plugin: JVM code packaged in a JAR file placed in $NEO4J_HOME/plugins, loaded at startup by the server, which extends the REST API (see the sketch below). It's a nice way to keep the server's simplicity while still executing custom Java code. However, it has some limitations: the server must be restarted to load new plugins, and return types are limited (which can be worked around using JSON).
  • Unmanaged extensions, which must be used with caution. They give better control over the protocol, as they don't rely on Neo4j's REST API.
  • Running queries using the Gremlin or Cypher plugins

Another advantage is that it includes monitoring through a web interface.
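
To make the server plugin option more concrete, here is a minimal sketch in the spirit of the Neo4j 1.x manual (class and annotation names are from that era and may differ in your version); it simply exposes all nodes, and its method then shows up in the extension section of the REST API:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.server.plugins.Description;
import org.neo4j.server.plugins.Name;
import org.neo4j.server.plugins.PluginTarget;
import org.neo4j.server.plugins.ServerPlugin;
import org.neo4j.server.plugins.Source;
import org.neo4j.tooling.GlobalGraphOperations;

@Description("An extension to the Neo4j server returning all nodes")
public class GetAllNodes extends ServerPlugin
{
    // Becomes available once the JAR is dropped into $NEO4J_HOME/plugins
    // and the server is restarted.
    @Name("get_all_nodes")
    @Description("Get all nodes from the graph database")
    @PluginTarget(GraphDatabaseService.class)
    public Iterable<Node> getAllNodes(@Source GraphDatabaseService graphDb)
    {
        return GlobalGraphOperations.at(graphDb).getAllNodes();
    }
}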

EmbeddedDB

Using an embedded DB feels more like using a powerful graph API than a complete DataBase Management System (DBMS). Indeed, you have to write your own JVM program (your choice of Java, Scala, Clojure, Groovy, etc.) that loads the DB from disk and runs operations on it (node & relationship creation/deletion/search, queries).

The major advantage is that you have total control over the database and the program that holds it. For example, you're not limited to plugins that are called through REST.

You can use an EmbeddedDB to create your own Neo4j server (the provided Neo4j server is based on it), including any communication layer you want to handle requests: your favorite HTTP server, Thrift, etc. This way multiple clients can connect to it, even though only one program (the server) accesses the database at a time.

Creating your own server allows you to decouple the database and the application (for example, one machine runs the server and a lot of other machines access it), but you can also use the embedded DB as an integrated graph API. For example, you can include it in a standalone video game (where there's only one client for one database), storing your environment in a graph database and running shortest-path algorithms on it. In cases where only one process will ever connect to the database, an embedded DB is a good solution.
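
As an illustration, here is a minimal sketch of the embedded usage, written against the 1.x-era API (EmbeddedGraphDatabase was later superseded by GraphDatabaseFactory, and tx.finish() by tx.close(); adapt to your version):

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class EmbeddedExample
{
    public static void main(String[] args)
    {
        // Opens (or creates) the store directory; only one process can hold it at a time
        GraphDatabaseService db = new EmbeddedGraphDatabase("/tmp/neo4j-example");

        Transaction tx = db.beginTx();
        try
        {
            Node alice = db.createNode();
            alice.setProperty("name", "Alice");
            Node bob = db.createNode();
            bob.setProperty("name", "Bob");
            alice.createRelationshipTo(bob, DynamicRelationshipType.withName("KNOWS"));
            tx.success();
        }
        finally
        {
            tx.finish();
        }

        db.shutdown();
    }
}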

HighlyAvailableGraphDatabase

Neo4j HA has been designed to make the transition from single machine to multi machine operation simple, by not having to change the already existing application. (source).

This solution requires the enterprise edition of Neo4j and allows a more complex system architecture based on a cluster of machines. You can use it in EmbeddedDB or server mode.

A hybrid solution: WrappingNeoServerBootstrapper

This class allows you to keep some of the server's features (REST API, monitoring) while using an embedded database. It requires only a few more lines of code than the EmbeddedDB; an example can be found here.

A special case: the BatchInserter

When creating a Neo4j-based project, there is a high probability that you will start by importing existing data. Sources can be other existing databases, CSV files, etc. Neo4j provides an API designed to import data very quickly (way faster than using transactions, the common way to insert data). The BatchInserter is usually used at the beginning of the project to build the DB, and the resulting DB is then used by an EmbeddedDB or a server.

To give a concrete example, it allowed me to import a graph of 10 million nodes and 50 million relationships in 20 minutes on a Macbook Pro. It took hours using classical transactions.
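
In code, the usage is just a handful of calls. A minimal sketch, assuming the 1.8-era API where these classes live in org.neo4j.unsafe.batchinsert (package names moved between versions):

import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class BatchImportExample
{
    public static void main(String[] args)
    {
        // No transactions: the inserter writes directly to the store files, which is
        // why it is so fast, and why nothing else may access the DB while it runs.
        BatchInserter inserter = BatchInserters.inserter("/tmp/neo4j-import");

        Map<String, Object> props = new HashMap<String, Object>();
        props.put("name", "Alice");
        long alice = inserter.createNode(props);

        props.put("name", "Bob");
        long bob = inserter.createNode(props);

        inserter.createRelationship(alice, bob,
                DynamicRelationshipType.withName("KNOWS"), null);

        // Mandatory: flushes everything to disk; the resulting store can then be
        // opened by an EmbeddedGraphDatabase or the Neo4j server.
        inserter.shutdown();
    }
}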

Conclusion

There are three main architectures for using a Neo4j DB:

  • Using the Neo4j server: very easy to set up, and customisable with handmade plugins, REST access and query DSLs
  • EmbeddedDB (or the wrapper variant): a good choice when you have only one client that embeds the database, such as a standalone program using Neo4j as a graph API. It allows more possibilities, but it requires more work if you want to create your own server (you have to build a communication layer that handles requests).
  • HighlyAvailableGraphDatabase: when you need a fault-tolerant architecture. It is distributed and requires configuration. It runs on top of the embedded or server mode.

The BatchInserter can be used prior to any of those three to import initial data into the Neo4j database.

In most cases the Neo4j server is likely to be the best option, but the embedded DB should be selected when you need a lot of customization or simply a graph API.

Keep in mind that whatever you choose, you can switch from one solution to the other without changing your actual database on disk.


Use Wikipedia Miner as a standalone program

We have seen in a previous article that Wikipedia Miner is a great framework for building algorithms on top of Wikipedia. It makes it easy to serve requests via webservices running under Tomcat.

However, if you don't want to use Tomcat, we're going to see how to create a standalone program that uses all the power of the framework.

We assume in the article that you already have a working Wikiminer setup (databases + configuration files).

Create a basic class

Our example will be very simple, so that we can focus on creating and compiling the program rather than on implementing new functionality.

Here, we will write a program that reads page IDs from the command line and displays their associated titles.

The different steps are the following:

  • Create a WikipediaConfiguration from wikipedia.xml.
  • Create a Wikipedia object that loads the DBs from the WikipediaConfiguration.
  • Call Wikipedia methods to run the algorithms.
package fr.gauth.wikiminer;

import java.io.File;
import java.util.Scanner;

import org.wikipedia.miner.model.Page;
import org.wikipedia.miner.model.Wikipedia;
import org.wikipedia.miner.util.WikipediaConfiguration;

public class IdToTitlePrompt
{
	protected static WikipediaConfiguration getConfiguration(String args[])
	{
		if (args.length != 1)
		{
			System.out.println("Please specify path to wikipedia configuration file");
			System.exit(1);
		}

		File confFile = new File(args[0]);
		if (!confFile.canRead())
		{
			System.out.println("'" + args[0] + "' cannot be read");
			System.exit(1);
		}

		WikipediaConfiguration conf = null;
		try
		{
			conf = new WikipediaConfiguration(confFile);

			if (conf.getDataDirectory() == null
					|| !conf.getDataDirectory().isDirectory())
			{
				System.out.println("'" + args[0]
						+ "' does not specify a valid data directory");
				System.exit(1);
			}

		} catch (Exception e)
		{
			e.printStackTrace();
			System.exit(2);
		}
		return conf;
	}

	public static void main(String args[]) throws Exception
	{
		WikipediaConfiguration conf = getConfiguration(args);

        // if the 2nd argument is set to true, the preparation of the DB is threaded,
        // which allows the code to run immediately rather than waiting
        // for the DB to be cached.
		Wikipedia wikipedia = new Wikipedia(conf, true);

		Scanner sc = new Scanner(System.in);
		while (sc.hasNextInt())
		{
			int id = sc.nextInt();
			Page page = wikipedia.getPageById(id);
			System.out.println(page.getTitle());
		}
        wikipedia.close();
	}
}

Compile and execute it

To compile and run it, we will update the project's original Ant build.xml so it can create a standalone executable jar. To do so, we follow these steps:

  • Add a new custom target, that I called assembly (copy of package)
  • Join the dependencies (/lib) to the jar
  • Set the main class, so the jar is self-runnable

We obtain the following entry:

    <target name="assembly" depends="build" description="creates an executable jar with its dependencies">
		<echo>Creating the runnable standalone jar file</echo>
    	<mkdir dir="${build.dir}/jar"/>

    	<jar destfile="${build.dir}/jar/${jar.mainModule}" >
    		<fileset dir="${build.dir}/classes"/>
			<zipgroupfileset dir="lib" includes="*.jar"/>
			<manifest>
		  		<attribute name="Main-Class" value="fr.gauth.wikiminer.IdToTitlePrompt"/>
			</manifest>
    	</jar>
    </target>

Finally, we can run it using the following command:

ant assembly && java -jar  ../build/jar/wikipedia-miner.jar  ../configs/en.xml

In the shell, we can type 9232 and it will successfully display “Eiffel Tower”.


A NGram Generator in Scala

Using NGrams is very common and helpful in Natural Language Processing problems. Given the sentence “This is a sentence”, the ngrams of order 2 are: “This is”, “is a” and “a sentence”. They are very useful, for example, to classify text.

In this article, I’ll show a class that generates this list.

Generators rather than lists

Given a text, the list of NGrams can be very big. For instance, if we have N tokens, we obtain:

  • N of size 1 (all the individual tokens)
  • N – 1 of size 2
  • N – 2 of size 3

Despite this linear growth, the total can get very large: for size n there are N – n + 1 ngrams, so with 10,000 tokens and N in the range [1, 8] we get almost 80,000 ngrams.

Therefore, I use a generator rather than building the complete list of NGrams. This way, I save a lot of memory, and I can also save computation time if the caller does not iterate over the complete sequence.

The code

The NGramsGenerator class is the following:

package fr.gauth

import scala.collection.JavaConversions._
import scala.collection.mutable.{Queue => MutableQueue}
import java.util.StringTokenizer

object NGramsGenerator
{
  // each ngram is represented as a List of String
  def generate(text : String, minSize : Int, maxSize : Int) =
  {
    assert(maxSize >= minSize && minSize > 0 && maxSize > 0)

    // iterate for each token on the available ngrams
    for (ngramList <- this.generate_sub(text, minSize, maxSize);
         ngram <- ngramList)
      yield ngram.toList
  }

  // generate a list of ngrams
  def generate_sub(text : String, minSize : Int, maxSize : Int) =
  {
    val nGramsBuffer = new NGramsBuffer(minSize, maxSize)
    for (t <- this.getTokenizerIterator(text))
      yield
       {
         nGramsBuffer.addToken(t.asInstanceOf[String])
         nGramsBuffer.getNGrams
       }
  }

  // Can be overridden to use another tokenizer
  def getTokenizerIterator(text : String) =
    new StringTokenizer(text)

  // A utility class that stores a list of fixed size queues, each queue is a current ngram
  class NGramsBuffer(val minSize : Int, val maxSize : Int)
  {
    val queues = minSize.to(maxSize).foldLeft(List[SlidingFixedSizeQueue[String]]()) {
      (list, n) =>
        new SlidingFixedSizeQueue[String](n) :: list
    }

    def addToken(token : String) = queues.foreach(q => q.enqueue(token))

    // return only the buffers that are full, otherwise not enough tokens have been
    // seen yet to fill the NGram
    def getNGrams() = queues.filter(q => q.size == q.maxSize)
  }

  // Utility class to store the last maxSize encountered tokens
  class SlidingFixedSizeQueue[A](val maxSize : Int) extends scala.collection.mutable.Queue[A]
  {
    override def enqueue(elems : A*) =
    {
      elems.foreach(super.enqueue(_))

      while (this.size > this.maxSize)
        this.dequeue()
    }
  }
}

Both the tokenizer and the NGram builder are generators.
The tokenizer can differ from one application to another. For example, we can decide to keep or remove punctuation, or use an arithmetic tokenizer (so 1+3=4 is tokenized as “1”, “+”, “3”, “=”, “4”).
In this code, you can change it directly, or subclass the object and override it.

It can be used this way:

package fr.gauth

object NGramDemo {
  def main(args: Array[String]): Unit =
  {
    NGramsGenerator.generate("here is my list of words", 1, 3)
      .foreach(ngram => println("Got NGram: " + ngram.mkString(" ")))
  }
}

It will output all the NGrams where N is in the range 1, 3:

Got NGram: here
Got NGram: here is
Got NGram: is
Got NGram: here is my
Got NGram: is my
Got NGram: my
Got NGram: is my list
Got NGram: my list
Got NGram: list
Got NGram: my list of
Got NGram: list of
Got NGram: of
Got NGram: list of words
Got NGram: of words
Got NGram: words

The code is available on GitHub here.


Find closest subway station with ElasticSearch

This article aims to show a concrete example of spatial search in ElasticSearch. This feature allows searching records by their geographical coordinates: closest points, points within a circle or any polygon, etc. Shay Banon's blog article is very helpful and helped me develop this concrete, full example.

In this article, we will store in ElasticSearch all the subway stations of Paris and search for the closest ones to the Eiffel Tower (or any coordinate).

Getting the data

The Parisian transportation company has made the list of station coordinates freely available (thanks to OpenData). You can retrieve the file here.

Creation of the database

We are going to store records with two fields:

  • Name, of type string
  • Location, of type Geo Point which stores latitude and longitude.

Setting location as a geo point will allow us to perform distance calculation operations on it. We create the type station on the geo_metro index using the following mapping:

curl -XPUT http://localhost:9200/geo_metro -d '
{
    "mappings": {
        "station": {
            "properties": {
                "name": {
                    "type": "string"
                },
                "location": {
                    "type": "geo_point"
                }
            }
        }
    }
}
'

Feeding the database

We now need to parse the CSV file and insert its content into the database. The best solution would be a script that reads the file line by line and executes bulk insertion requests using an ElasticSearch client.

Here, to avoid installing a client, I simply made a Python script (here) that generates a list of cURL requests, which I save in a .sh file (here):

python create_requests.py stations.csv > insert.sh
bash insert.sh

Fortunately, the coordinates in the CSV file already have the format we need (decimal degrees). If that is not the case, you should look at the excellent Geographic Coordinate Conversion article on Wikipedia.

Searching the closest station

Now that we have our data stored in ElasticSearch with the correct mapping, we are able to execute searches. Our request will return the list of the five closest stations to a geographical point.
In this example, I will use the Eiffel Tower coordinates that I found in its Wikipedia article.

The request is the following:

curl -XGET 'http://localhost:9200/geo_metro/station/_search?size=5&pretty=true' -d '
{
    "sort" : [
        {
            "_geo_distance" : {
                "location" : [48.8583, 2.2945],
                "order" : "asc",
                "unit" : "km"
            }
        }
    ],
    "query" : {
        "match_all" : {}
    }
}
'

It successfully returns:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 634,
    "max_score" : null,
    "hits" : [ {
      "_index" : "geo_metro",
      "_type" : "station",
      "_id" : "91jtmucvThaNq83Y2K4rww",
      "_score" : null, "_source" : {"name": "Champ de Mars-Tour Eiffel", "location": {"lat": "2.28948345865043", "lon": "48.855203725918"}},
      "sort" : [ 0.655364333719714 ]
    }, {
      "_index" : "geo_metro",
      "_type" : "station",
      "_id" : "H-Q9HFVcRqiqWVtk9OCvbQ",
      "_score" : null, "_source" : {"name": "I\u00e9na", "location": {"lat": "2.29379995911415", "lon": "48.8644728971468"}},
      "sort" : [ 0.6902478960716185 ]
    }, {
      "_index" : "geo_metro",
      "_type" : "station",
      "_id" : "PS0GCpC4TDmgjrhTRego4g",
      "_score" : null, "_source" : {"name": "Bir-Hakeim Grenelle", "location": {"lat": "2.28878285580131", "lon": "48.8543331583289"}},
      "sort" : [ 0.7735556213326537 ]
    }, {
      "_index" : "geo_metro",
      "_type" : "station",
      "_id" : "jd4ct8APS4WSWzRdvnsQbg",
      "_score" : null, "_source" : {"name": "Dupleix", "location": {"lat": "2.29276958714394", "lon": "48.8508056365633"}},
      "sort" : [ 0.8546099046905625 ]
    }, {
      "_index" : "geo_metro",
      "_type" : "station",
      "_id" : "Ol9SpVIcRkKYw2bj99Sb7g",
      "_score" : null, "_source" : {"name": "Pont de l alma", "location": {"lat": "2.30129356979222", "lon": "48.8624292670432"}},
      "sort" : [ 0.8838145038672007 ]
    } ]
  }
}

Note that the sort value is the distance (in kilometers, the unit we requested) between the Eiffel Tower and the subway station.

Conclusion

The objective of this article was to show how simple it is to perform requests based on geographical points, and how to use them. You can create far more complicated queries by changing match_all to a real query, using filters, facets, etc., and you benefit from the scalability of ElasticSearch, which allows you to store a huge number of points!


Exact search with ElasticSearch

ElasticSearch is an extremely powerful distributed database that can perform a lot of complex queries. This article aims to show how to perform an exact lookup (like WHERE field_name=field_value in SQL).

Creation of the test database

We will use a very simple database that stores the age of a user. Here is an example object:

{
  "name" : "username",
  "age"  : 25
}

First, we create the db:

curl -XPUT http://localhost:9200/user_age

Then, we put some data in it:

curl -XPOST http://localhost:9200/user_age/user/ -d '{
   "name" : "user1",
   "user_age"  : 13
}'

curl -XPOST http://localhost:9200/user_age/user/ -d '{
   "name" : "user 2",
   "age"  : 20
}'

curl -XPOST http://localhost:9200/user_age/user/ -d '{
   "name" : "user 3",
   "age"  : 13
}'

curl -XPOST http://localhost:9200/user_age/user/ -d '{
   "name" : "USER4",
   "age"  : 20
}'

Note that we use POST requests, so we don't have to manually specify the document IDs.

Failing requests

The objective of this database is to retrieve the age of a user from their username. For example, the following request works:

curl -XGET http://localhost:9200/user_age/_search?pretty=true -d '{
    "query" : {
        "term" : {
            "name" : "user1"
        }
    }
}'
# Returns:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "user_age",
      "_type" : "user",
      "_id" : "0lz5QLxlSRWEO7_OdvJcXg",
      "_score" : 0.30685282, "_source" : {
         "name" : "user1",
         "user_age"  : 13
       }
    } ]
  }
}

However, if we test with user 2 it doesn’t return anything:

curl -XGET http://localhost:9200/user_age/_search?pretty=true -d '{
    "query" : {
        "term" : {
            "name" : "user 2"
        }
    }
}'
# Returns:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Searching for USER4 fails as well, returning nothing.
Using a request that shouldn’t match anything (user) returns some results:

curl -XGET http://localhost:9200/user_age/_search?pretty=true -d '{
    "query" : {
        "term" : {
            "name" : "user"
        }
    }
}'
# Returns:
{
  "took" : 29,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.37158427,
    "hits" : [ {
      "_index" : "user_age",
      "_type" : "user",
      "_id" : "PBnEBcfMRpmBBX1xZG9Czw",
      "_score" : 0.37158427, "_source" : {
   "name" : "user 2",
   "age"  : 20
}
    }, {
      "_index" : "user_age",
      "_type" : "user",
      "_id" : "-ok3LJepR1C3H7W_DuaArQ",
      "_score" : 0.37158427, "_source" : {
   "name" : "user 3",
   "age"  : 13
}
    } ]
  }
}

The issue comes from the fact that when we inserted our records into the database, the name field was processed by the default analyzer (the Standard Analyzer). Among other things, it performs the following operations:

  • Tokenization: “user 2” is indexed as the two terms “user” and “2”
  • Lowercasing: “USER4” is indexed as “user4”

As we are doing a term query, the input is not analyzed, which explains for example why USER4 doesn't match but user4 does.

We can change it to a text query (match query if your ElasticSearch is at least 0.19.9) but it doesn’t perform an exact search, so it would continue to return wrong results for “user” query.


curl -XGET http://localhost:9200/user_age/_search?pretty=true -d '{
    "query" : {
        "text" : {
            "name" : "user"
        }
    }
}'
# Returns:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.37158427,
    "hits" : [ {
      "_index" : "user_age",
      "_type" : "user",
      "_id" : "PBnEBcfMRpmBBX1xZG9Czw",
      "_score" : 0.37158427, "_source" : {
   "name" : "user 2",
   "age"  : 20
}
    }, {
      "_index" : "user_age",
      "_type" : "user",
      "_id" : "-ok3LJepR1C3H7W_DuaArQ",
      "_score" : 0.37158427, "_source" : {
   "name" : "user 3",
   "age"  : 13
}
    } ]
  }
}

Solution: use mappings

This default behavior is good for document indexing and retrieval, but is not adapted to our problem. We need to change it, using a custom mapping.

ElasticSearch is a schema-free database, which makes it very flexible. However, mappings allow more powerful operations.
They let you configure different things:

  • Server settings (number of shards, replicas, etc.)
  • Analyzers: declare custom analyzers and filters by combining and configuring existing ones (NGram generation, stemming, etc.)
  • Mappings: describe the fields of the records and various options, including which analyzer to use (see Core Types)

In our case, what we want is to remove the default analyzer, to keep the usernames unanalyzed.

First thing to do is to clear our index:

curl -XDELETE http://localhost:9200/user_age

Then, we can declare our mapping:

curl -XPUT http://localhost:9200/user_age -d '
{
	"mappings": {
		"user": {
			"properties": {
				"name": {
					"index": "not_analyzed",
					"type": "string"
				},
				"age": {
					"type": "integer"
				}
		}
	}
}
}
'

Then, we can put back our records and re-run the requests. Now, “user” returns nothing, and searching for the exact names, such as “user 2”, returns the exact match.

To achieve this, we removed analysis at both insertion time and request time:

  • “index”: “not_analyzed” makes the search engine index “user 2” as a single term, instead of [“user”, “2”]
  • Using a term query means the request is not analyzed either, so we search for “user 2” and not for “user” and “2”.

Note: this example is very simple, and could have been done very easily using the username as the record’s ID (in this case, it’s actually not even needed to add the name field):

curl -XPUT http://localhost:9200/user_age/user/user%202 -d '{
   "name" : "User 2",
   "user_age"  : 13
}'

It could have been retrieved like this:

curl -XGET http://localhost:9200/user_age/user/user%202

{"_index":"user_age","_type":"user","_id":"user 2","_version":1,"exists":true, "_source" : {
   "name" : "User 2",
   "user_age"  : 13
  }
}

However, if you want to do it on a non-unique field, or on several fields, this solution is not applicable anymore.


Create new Wikipedia Miner’s webservices

Wikipedia Miner is a great toolkit to perform different operations on Wikipedia, including search (retrieval of articles) and algorithm implementation (suggestion of pages, categorization). It can help a lot with natural language processing and web mining problems. It provides a clear API for manipulating the data, with good performance.

The toolkit can be used through a Java API or through webservices. The latter has several advantages, including:

  • A loosely-coupled interface to access data & algorithms
  • The server centralizes the DB and its access, so only the server requires a good hardware configuration.
  • Easy access to the service from different programming languages / software from different computers with no effort

Wikipedia Miner comes with ten useful services (listed here), but creating new services is very useful to add new functionalities.

As a prerequisite for this article, you need to have a working installation of Wikipedia Miner (see WikiMiner's install guide).

In this article we will create a very simple webservice that takes a list of Wikipedia article IDs and returns their ontology type (if it exists: whether the article is a person, a company, etc.).

Creating an empty service

Creating a new service involves two main parts:

  • The service itself (Java code)
  • The Tomcat configuration
Basic service class

We first need to create a class that will handle the HTTP requests. To do so, let's create an OntologyService class inheriting from org.wikipedia.miner.service.WMService,
which we will put in the package org.wikipedia.miner.service.
A class extending WMService needs at least to:

  • Call the super constructor with a description of the service
  • Define (override) the method buildWrappedResponse(HttpServletRequest)

We obtain the following code:

package org.wikipedia.miner.service;

import javax.servlet.http.HttpServletRequest;

import org.xjsf.Service;

public class OntologyService extends WMService {
	// Service description strings
	private static String groupName     = "core";
	private static String shortDescr    = "Returns the ontology of articles";
	private static String detailsMarkup = "<p>The service takes a list of article's ID and returns a map of ID=>Ontology. Ontology can be an empty string</p>";
	private static Boolean supportsDirectResponse = true;

	public OntologyService()
	{
		super(groupName, shortDescr, detailsMarkup, supportsDirectResponse);
	}

	@Override
	public Message buildWrappedResponse(HttpServletRequest request) throws Exception
        {
		// returns empty response
		return new Service.Message(request);
	}

}
Registering our service

Now that we have a very basic class handling requests, we need to ask Tomcat to redirect requests to it. To do so, we will edit the file web.xml located in the configs directory of the project.
We will configure two things:

  • The servlet (our service), with its name and its class
  • The mapping that will redirect HTTP requests to our class

We add to web.xml the following entries:

    <servlet>
      <servlet-name>ontology</servlet-name>
      <description></description>
      <servlet-class>
        org.wikipedia.miner.service.OntologyService
      </servlet-class>
      <load-on-startup>1</load-on-startup>
    </servlet>

    <servlet-mapping>
      <servlet-name>ontology</servlet-name>
      <url-pattern>/services/ontology</url-pattern>
    </servlet-mapping>

The mapping will redirect requests for /services/ontology to our class. Note that this path is case sensitive.

Let’s test!

Now our service is ready to be used. However, we still need to deploy the new WAR file so the modifications are taken into account.

ant deploy && sudo /etc/init.d/tomcat6 restart

You are now able to access http://127.0.0.1:8080/wikipediaminer/services/ontology?responseFormat=JSON using a web browser or curl, for instance. You should get the following response:

    {
      "service": "/services/ontology",
      "request": {
        "responseFormat": "JSON"
      }
    }

If that is not the case, the issue may come from /etc/tomcat6/Catalina/localhost/wikipediaminer.xml. You can also delete /var/lib/tomcat6/webapps/wikipediaminer/ to make sure that the deployment from ant gets a fresh copy of the WAR file.

Parameters handling

We are going to upgrade our OntologyService class to handle the parameters of the request: in our case, a list of integers.
The parameters can be of different types, listed here. You can take a look at the other Wikipedia Miner services for examples of their usage.

To do so, we override the init(ServletConfig) method to create the parameter objects and bind them using addGlobalParameter. Then, we can use them in the request handler to access the parameter values. We obtain the following code:

    package org.wikipedia.miner.service;

    import javax.servlet.ServletConfig;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServletRequest;

    import org.xjsf.Service;
    import org.xjsf.UtilityMessages.ParameterMissingMessage;
    import org.xjsf.param.IntListParameter;

    public class OntologyService extends WMService {

    	// Service Description
    	private static String groupName     = "core";
    	private static String shortDescr    = "Returns the ontology of articles";
    	private static String detailsMarkup = "<p>The service takes a list of article's ID and returns a map of ID=>Ontology. Ontology can be an empty string</p>";
    	private static Boolean supportsDirectResponse = true;

    	// Parameters
    	IntListParameter prmIdList;

    	public OntologyService()
    	{
    		super(groupName, shortDescr, detailsMarkup, supportsDirectResponse);
    	}

    	@Override
    	public void init(ServletConfig config) throws ServletException
    	{
    		super.init(config);
    		prmIdList = new IntListParameter("ids", "List of page ids whose ontology must be found", null);
    		addGlobalParameter(prmIdList);
    	}

    	@Override
    	public Message buildWrappedResponse(HttpServletRequest request) throws Exception
    	{
    		Integer[] ids = prmIdList.getValue(request);
    		if (ids == null) // no ids field in the request
                return new ParameterMissingMessage(request);
    		// do things with ids and return it
    		return new Service.Message(request);
    	}
    }

Retrieving the ontology

Here is a basic (and incomplete) function to retrieve the ontology from a Wikipedia article. We can add it to the Article class:


protected String infobox_ontology;
	/**
	 * Returns the ontology of the article, or an empty string
	 * @author glemoine
	 * @return
	 */
	public String getInfoBoxTitle()
    {
		if (this.infobox_ontology == null)
			this.infobox_ontology = this.processInfoBoxTitle();

		return this.infobox_ontology;
    }

	protected String processInfoBoxTitle()
    {
        String content = this.getMarkup().substring(0, 800);
        String[] lines = content.split("\\n");

        for (String line : lines)
        {
        	line = line.replace("|", " ");
            String[] splitted = line.split(" ");
            if (splitted[0].endsWith("Infobox"))
                return splitted[splitted.length - 1].toLowerCase();
        }

        return ""; // no ontology found
    }

Returning a response

In this last step, we are looking to:

  • Make the calculation (calling the ontology method)
  • Return the values, creating our custom answer class
The result class

We currently return a default Service.Message object. We need to create a new class extending it that maps what we want to return: in our case, only one field, a map associating an ID (integer) to an ontology (string).
We add as an inner class to OntologyService the following:

    	public static class OntologyReturnMessage extends Service.Message
    	{
    		@Expose
    		@Attribute
    		private Map<Integer, String> ontologies;

    		private OntologyReturnMessage(HttpServletRequest request, Map<Integer, String> ontologies)
    		{
    			super(request);

    			this.ontologies = ontologies;
    		}
    	}
Putting things together

The final step is to call the ontology method for each ID, create the response object and fill it. We obtain the final code:

    package org.wikipedia.miner.service;

    import java.util.HashMap;
    import java.util.Map;

    import javax.servlet.ServletConfig;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServletRequest;

    import org.simpleframework.xml.Attribute;
    import org.wikipedia.miner.model.Article;
    import org.wikipedia.miner.model.Page;
    import org.wikipedia.miner.model.Page.PageType;
    import org.wikipedia.miner.model.Wikipedia;
    import org.xjsf.Service;
    import org.xjsf.UtilityMessages.ParameterMissingMessage;
    import org.xjsf.param.IntListParameter;

    import com.google.gson.annotations.Expose;

    public class OntologyService extends WMService {

    	// Service Description
    	private static String groupName     = "core";
    	private static String shortDescr    = "Returns the ontology of articles";
    	private static String detailsMarkup = "<p>The service takes a list of article's ID and returns a map of ID=>Ontology. Ontology can be an empty string</p>";
    	private static Boolean supportsDirectResponse = true;

    	// Parameters
    	IntListParameter prmIdList;

    	public OntologyService()
    	{
    		super(groupName, shortDescr, detailsMarkup, supportsDirectResponse);
    	}

    	@Override
    	public void init(ServletConfig config) throws ServletException
    	{
    		super.init(config);
    		prmIdList = new IntListParameter("ids", "List of page ids whose ontology must be found", null);
    		addGlobalParameter(prmIdList);
    	}

    	@Override
    	public Service.Message buildWrappedResponse(HttpServletRequest request) throws Exception
    	{
    		Integer[] ids = prmIdList.getValue(request);
    		if (ids == null)
    			return new ParameterMissingMessage(request);

    		Map<Integer, String> ontologies = this.computeOntologies(ids, getWikipedia(request));

    		return new OntologyReturnMessage(request, ontologies);
    	}

    	protected Map<Integer, String> computeOntologies(Integer[] ids, Wikipedia wikipedia)
    	{
    		Map<Integer, String> ontologies = new HashMap<Integer, String>(ids.length);

    		for (Integer id : ids)
    		{
    			String onto = null;
    			Page p = wikipedia.getPageById(id);
    			if (p.getType() == PageType.article)
    			{
    				Article art = (Article) p;
    				onto = art.getInfoBoxTitle();
    			}
    			else
    				onto = "";

    			ontologies.put(id, onto);
    		}

    		return ontologies;
    	}

    	public static class OntologyReturnMessage extends Service.Message
    	{
    		@Expose
    		@Attribute
    		private Map<Integer, String> ontologies;

    		private OntologyReturnMessage(HttpServletRequest request, Map<Integer, String> ontologies)
    		{
    			super(request);

    			this.ontologies = ontologies;
    		}
    	}

    }

Now we can re-deploy our service and we obtain the following results:

curl -XGET 'http://playground5:8080/wikipediaminer/services/ontology?responseFormat=JSON&ids=7089,7218'
{
  "ontologies": {
    "7089": "food",
    "7218": "food"
  },
  "service": "/services/ontology",
  "request": {
    "ids": "7089,7218",
    "responseFormat": "JSON"
  }
}

Creating a webservice with Wikipedia Miner is simple and allows you to perform complex operations on this huge database.


Follow a region of an image frame by frame

My objective is to detect a moving object with an immobile camera, and then to track it with a moving camera. I will present how I implemented the second part.
My first algorithm detects the moving object and draws a square around it.

In this example, the object is a ball. It is quite straightforward to detect, but it could be something more complex like a cat. Therefore, I needed an algorithm that could handle any kind of object.

Once I’ve detected a moving object, I can go to the second step: tracking it.

Tracking the object

We detected an object in frame N. We have to find it again in the next frames: N + 1, N + 2, etc., as long as the object stays in the field of view.

We cannot use the previous algorithm. The technique to detect the moving object is a video surveillance technique, that only works with a static camera.

It turns out that the camshift algorithm is the one we need. It is a modified meanshift algorithm that can handle scaling of the object. Therefore, it still works if the altitude of the camera changes. It can find where the region is in the next frame.

The process to use camshift is the following:

  • Initialize the tracking with the previously detected object in frame N
  • Call the tracker with the new frame. It will detect the object and update the position of the last detected zone. Do it again with each new frame.

We use the algorithm with the second frame, and display with an ellipse the result:

The ellipse should be a circle, but what matters is that the center of the shape is (almost) the center of the ball. A few parameters can be configured in the camshift algorithm, and therefore we can optimize it.

The algorithm works in real time:

Issues

The result is convincing; however, camshift has some drawbacks:

  • The algorithm uses the hue of the image, so it doesn’t work when the object to track is close to white or black.
  • It requires that the object did not move too much.

For example, in this image the object or the camera moved too much. The difference in position is too high, so the algorithm failed to detect it.

Some code

I use OpenCV's cvCamShift function, but it requires more code to work. Therefore, I decided to use a wrapper. I found more or less the same code in a lot of different places, but I used Billy Lamberta's C wrapper. I used camshifting.[ch] to build my own program.

#include <assert.h>
#include <stdlib.h>

#include <opencv/cv.h>      /* adjust the include paths to your OpenCV installation */
#include <opencv/highgui.h>

#include "camshifting.h"    /* Billy Lamberta's camshift wrapper */

// returns a CvRect containing the object to track
// here the value is hardcoded, but we should put here the function
// that detects the moving object
static CvRect *get_object_rect(void)
{
  CvRect* object_rect =  malloc(sizeof (CvRect));
  *object_rect = cvRect(235, 40, 50, 50);
  return object_rect;
}

int main (void)
{
  enum { nbImages = 6 }; /* compile-time constant, so the arrays below can be initialized */
  const char *files[nbImages] = {"1.jpeg", "2.jpeg", "3.jpeg", "4.jpeg", "5.jpeg", "6.jpeg"};
  IplImage *in[nbImages];

  for (int i = 0; i < nbImages; ++i)
    in[i] = cvLoadImage(files[i], CV_LOAD_IMAGE_COLOR);

  cvNamedWindow("example", CV_WINDOW_AUTOSIZE);
  assert(in[0]);

  CvRect* object_rect = get_object_rect();
  /*  // Use this to check that the rectangle is correct
  cvRectangle(in[0],
              cvPoint(object_rect->x, object_rect->y),
              cvPoint(object_rect->x + object_rect->width, object_rect->y + object_rect->height),
              cvScalar(255, 0, 0, 1), 1, 8, 0);
  */
  cvShowImage("example", in[0]); // Display initial image
  cvWaitKey(0);

  TrackedObj* tracked_obj = create_tracked_object(in[0], object_rect);
  CvBox2D object_box; //area to draw around the object                                                                                                         

  IplImage* image;
  //object detection with camshift
  for (int i = 1; i < nbImages; ++i)
  {
    image = in[i];
    assert(image);
    //track the object in the new frame
    object_box = camshift_track_face(image, tracked_obj);

    //outline object ellipse
    cvEllipseBox(image, object_box, CV_RGB(0,0,255), 3, CV_AA, 0);
    cvShowImage("example", image);
    cvWaitKey(0);
  }

  //free memory
  destroy_tracked_object(tracked_obj);
  for (int i = 0; i < nbImages; ++i)
    cvReleaseImage(&in[i]);
  free(object_rect);

  return 0;
}

Estimate the area of lakes from Google Maps

I am presenting an algorithm to estimate the surface area of lakes from Google Maps images. The objective is to detect the lakes, and compute their area.

The input image

I am using the Map mode on Google Maps. My aim is to focus on the area computation algorithm rather than on preprocessing. That is why I use this mode rather than the satellite view, which is far more complicated to process. I even modified the image a bit with graphics editing software to remove the names superimposed on the lakes.

The image we will use is a view of three lakes in Bolivia: Salar de Uyuni, Salar de Coipasa and Poopo Lake.

Lake segmentation

The first step is the lake segmentation: we want to detect where the lakes are.
Lakes in Google Maps are shown with an almost uniform color. We can assume that a pixel belongs to a lake if it matches the following rules:

  • abs(Red_Component – 167) <= 5
  • abs(Green_Component – 190) <= 5
  • abs(Blue_Component – 221) <= 5

where abs is the absolute value.
We can build a thresholded image: if the pixel matches these rules, set it to 1, otherwise 0.
We obtain the following image:

Lake identification & measure

The next step is to identify each lake, i.e. to get information about each shape we have detected. We do a labeling process using connected components. Our objective is to extract the number of pixels for each lake.
The code looks like this (it used to be C, but I added a few C++ structures):

// defined below; declared here so label() can call it
static int labeling_sub(char **seen, IplImage *img, int x, int y);

std::vector<int> *label(IplImage *img)
{
  // We want to know if a pixel has been visited or not
  char **seen = (char **) malloc(img->width * sizeof (char *));
  for (int i = 0; i < img->width; ++i)
    seen[i] = (char *) calloc(img->height, 1);

  std::vector<int> *comp_size_cnt = new std::vector<int>(); // nb of pixels per lake

  for (int y = 0; y < img->height; ++y)
    for (int x = 0; x < img->width; ++x)
    {
      if (seen[x][y]) // already marked point
        continue;
      if (ACCESS_PIXEL(img, x, y)) // non-zero in the thresholded image: this pixel belongs to a lake
        comp_size_cnt->push_back(labeling_sub(seen, img, x, y));
    }

  for (int i = 0; i < img->width; ++i)
    free(seen[i]);
  free(seen);
  return comp_size_cnt;
}

static int labeling_sub(char **seen, IplImage *img, int x, int y)
{
  int count = 0; // number of pixels in the lake
  std::stack<std::pair<int, int> > stack; // pixels to visit
  stack.push(std::make_pair(x, y)); // first pixel to visit

  while (!stack.empty())
  {
    // getting the pixel to visit and pop it from the stack
    const std::pair<int, int> point = stack.top();
    stack.pop();
    x = point.first;
    y = point.second;
    if (seen[x][y]) // already processed pixel
      continue;
    seen[x][y] = 1; // mark it visited
    count += 1;

    // Visit the neighbors (8-connectivity, done in a rather crude way)
    for (int i = x - 1; i <= x + 1; ++i)
      for (int j = y - 1; j <= y + 1; ++j)
      {
	    if ((i >= 0) && (i < img->width) // bound checking
            && (j >= 0) && (j < img->height)
            && !seen[i][j] // pixel hasn't been visited yet
	        && ACCESS_PIXEL(img, i, j) // pixel belongs to the lake
           )
	    stack.push(std::make_pair(i, j)); // we want to visit (i, j)
      }
  }
  return count;
}

I also added a bit of code to these functions to display the result we get, which looks like this:

Area estimation

The final step is to calculate the lake areas. The technique is quite straightforward: we know the total number of pixels for the entire image and for each lake. Therefore, we can determine the percentage of the map covered by each lake. Using the map's scale, we can also compute the total area covered by the map. Finally, we can combine the ratio and the total area to get an estimate of the lake's area.

I processed the scale information manually: I determined the scale’s width (in pixels), and read the distance in relation to this scale.

The code is the following:

void process_area(int width, int height, // image size
                  const std::vector<int> &size, // number of pixel per lake
                  int scale_pixel, int scale_km) // scale information
{
  float km_per_pixel = (float)scale_km / scale_pixel;
  float real_width = width * km_per_pixel;
  float real_height = height * km_per_pixel;
  float total_area = real_width * real_height;
  int total_pixel = width * height;

  for (int i = 0; i < size.size(); ++i) // for each lake...
  {
    float ratio = (float) size[i] / (float) (total_pixel);
    std::cout << "Lake " << i << ": " << ratio * total_area <<  std::endl;
  }
}

Results

I summarized the areas in this table:

Lake's name          Computed (km²)   Real (km²)   Error (%)
Salar de Coipasa     2300             2200         4.3
Poopo Lake           2567             2530         1.5
Salar de Uyuni       11121            10582        5.1

A high level of accuracy is hard to obtain. The map is a small image that represents a very large region, so we can't be more precise. The borders of the lakes can also make a big difference, as shown in the following close-up: it is complicated to know where the lake ends and the shore starts.

The algorithm inspiration

The inspiration comes from the French Wikipedia article about the Monte Carlo method, a set of stochastic methods used to solve deterministic problems.
As an illustration, it shows a way to estimate a lake's area using a cannon.
The idea is to randomly shoot X cannonballs into a square of known area; N is the number of cannonballs that end up in the lake.


You can estimate the area using the following formula:

Area_{Lake} = \frac{N}{X} \times Area_{Ground}
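
For example (with made-up numbers): if we shoot X = 1000 cannonballs over a square of area 15 000 km² and N = 150 of them land in the lake, the estimate is:

Area_{Lake} = \frac{150}{1000} \times 15000 = 2250 km²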

In this article, we effectively send one cannonball at each pixel, which removes the stochastic aspect.