Technical Blog

Category : ElasticSearch

Find closest subway station with ElasticSearch

This article aims to show a concrete example of spatial search in ElasticSearch. This feature allows you to search records by their geographical coordinates: closest points, points within a circle or an arbitrary polygon, etc. Shay Banon’s blog article on the subject is very helpful and helped me develop this complete, concrete example.

In this article, we will store all the Paris subway stations in ElasticSearch and search for the ones closest to the Eiffel Tower (or any other coordinate).

Getting the data

The Parisian transportation company has made the list of station coordinates freely available (thanks to OpenData). You can retrieve the file here.

Creation of the database

We are going to store records with two fields:

  • Name, of type string
  • Location, of type geo_point, which stores a latitude and a longitude

Setting location as a geo point allows us to perform distance calculations on it. We create the type station in the geo_metro index using the following mapping:

curl -XPUT http://localhost:9200/geo_metro -d '
{
    "mappings": {
        "station": {
            "properties": {
                "name": {
                    "type": "string"
                },
                "location": {
                    "type": "geo_point"
                }
            }
        }
    }
}
'

Feeding the database

We now need to parse the CSV file and insert its rows into the database. The best solution would be a script that reads the file line by line and executes bulk insertion requests through an ElasticSearch client.
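As a sketch of that approach, here is a minimal Python function that turns parsed CSV rows into the newline-delimited payload expected by ElasticSearch's _bulk endpoint (the `(name, lat, lon)` row layout is an assumption for illustration, not the actual column order of the RATP file):

```python
import json

def build_bulk_payload(rows, index="geo_metro", doc_type="station"):
    """Build the NDJSON body for ElasticSearch's _bulk endpoint.

    Each document is preceded by an action line, and the whole
    payload ends with a newline, as the bulk API requires.
    """
    lines = []
    for name, lat, lon in rows:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps({"name": name,
                                 "location": {"lat": float(lat), "lon": float(lon)}}))
    return "\n".join(lines) + "\n"
```

The resulting file could then be sent in one request with `curl -XPOST http://localhost:9200/_bulk --data-binary @bulk.ndjson`, which is far faster than one curl call per station.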

Here, to avoid installing any client, I simply wrote a Python script (here) that generates a list of curl requests, which I save into a .sh file (here):

python create_requests.py stations.csv > insert.sh
bash insert.sh

Fortunately, the coordinates in the CSV file are already in the format we need (decimal degrees). If that is not the case for your data, have a look at the excellent Geographic coordinate conversion article on Wikipedia.
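For reference, converting degrees/minutes/seconds to decimal degrees is a one-liner; here is a small sketch (the function name is mine, not from any library):

```python
def dms_to_decimal(degrees, minutes, seconds, direction="N"):
    """Convert degrees/minutes/seconds to decimal degrees.

    South and West coordinates become negative.
    """
    decimal = abs(degrees) + minutes / 60.0 + seconds / 3600.0
    return -decimal if direction in ("S", "W") else decimal

# Eiffel Tower, per Wikipedia: 48° 51' 29.6" N, 2° 17' 40.2" E
print(dms_to_decimal(48, 51, 29.6))       # ≈ 48.8582
print(dms_to_decimal(2, 17, 40.2, "E"))   # ≈ 2.2945
```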

Searching the closest station

Now that our data is stored in ElasticSearch with the correct mapping, we can run searches. Our request will return the five stations closest to a given geographical point.
In this example, I will use the Eiffel Tower coordinates found in its Wikipedia article.

The request is the following:

curl -XGET 'http://localhost:9200/geo_metro/station/_search?size=5&pretty=true' -d '
{
    "sort" : [
        {
            "_geo_distance" : {
                "location" : [48.8583, 2.2945],
                "order" : "asc",
                "unit" : "km"
            }
        }
    ],
    "query" : {
        "match_all" : {}
    }
}
'

It successfully returns:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 634,
    "max_score" : null,
    "hits" : [ {
      "_index" : "geo_metro",
      "_type" : "station",
      "_id" : "91jtmucvThaNq83Y2K4rww",
      "_score" : null, "_source" : {"name": "Champ de Mars-Tour Eiffel", "location": {"lat": "2.28948345865043", "lon": "48.855203725918"}},
      "sort" : [ 0.655364333719714 ]
    }, {
      "_index" : "geo_metro",
      "_type" : "station",
      "_id" : "H-Q9HFVcRqiqWVtk9OCvbQ",
      "_score" : null, "_source" : {"name": "I\u00e9na", "location": {"lat": "2.29379995911415", "lon": "48.8644728971468"}},
      "sort" : [ 0.6902478960716185 ]
    }, {
      "_index" : "geo_metro",
      "_type" : "station",
      "_id" : "PS0GCpC4TDmgjrhTRego4g",
      "_score" : null, "_source" : {"name": "Bir-Hakeim Grenelle", "location": {"lat": "2.28878285580131", "lon": "48.8543331583289"}},
      "sort" : [ 0.7735556213326537 ]
    }, {
      "_index" : "geo_metro",
      "_type" : "station",
      "_id" : "jd4ct8APS4WSWzRdvnsQbg",
      "_score" : null, "_source" : {"name": "Dupleix", "location": {"lat": "2.29276958714394", "lon": "48.8508056365633"}},
      "sort" : [ 0.8546099046905625 ]
    }, {
      "_index" : "geo_metro",
      "_type" : "station",
      "_id" : "Ol9SpVIcRkKYw2bj99Sb7g",
      "_score" : null, "_source" : {"name": "Pont de l alma", "location": {"lat": "2.30129356979222", "lon": "48.8624292670432"}},
      "sort" : [ 0.8838145038672007 ]
    } ]
  }
}

Note that the sort value is the distance, in kilometers (the unit we requested), between the Eiffel Tower and each subway station.
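We can sanity-check that sort value with the haversine formula. Keep in mind that ElasticSearch reads a bare coordinate array as [lon, lat]; plugging in the coordinates exactly as they appear in the query and in the first hit's _source reproduces the reported 0.655 km (assuming the usual 6371 km Earth radius):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points, in km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# The query array [48.8583, 2.2945] is interpreted as lon=48.8583, lat=2.2945;
# the first hit stores lat=2.28948345865043, lon=48.855203725918.
d = haversine_km(2.2945, 48.8583, 2.28948345865043, 48.855203725918)
print(round(d, 3))  # ≈ 0.655, matching the first hit's sort value
```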

Conclusion

The objective of this article was to show how simple it is to run requests based on geographical points, and how to use them. You can build far more complex queries by replacing match_all with a concrete query, adding filters, facets, etc., all with the scalability of ElasticSearch, which lets you store a huge number of points!

Exact search with ElasticSearch

ElasticSearch is an extremely powerful distributed database that can perform a lot of complex queries. This article shows how to perform an exact lookup (the equivalent of WHERE field_name = 'field_value' in SQL).

Creation of the test database

We will use a very simple database that stores the age of a user. Here is an example document:

{
  "name" : "username",
  "age"  : 25
}

First, we create the db:

curl -XPUT http://localhost:9200/user_age

Then, we put some data in it:

curl -XPOST http://localhost:9200/user_age/user/ -d '{
   "name" : "user1",
   "age"  : 13
}'

curl -XPOST http://localhost:9200/user_age/user/ -d '{
   "name" : "user 2",
   "age"  : 20
}'

curl -XPOST http://localhost:9200/user_age/user/ -d '{
   "name" : "user 3",
   "age"  : 13
}'

curl -XPOST http://localhost:9200/user_age/user/ -d '{
   "name" : "USER4",
   "age"  : 20
}'

Note that we use POST requests so that we don’t have to specify a document ID manually; ElasticSearch generates one for us.

Failing requests

The objective of this database is to retrieve a user’s age from their username. For example, the following request works:

curl -XGET http://localhost:9200/user_age/_search?pretty=true -d '{
    "query" : {
        "term" : {
            "name" : "user1"
        }
    }
}'
# Returns:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "user_age",
      "_type" : "user",
      "_id" : "0lz5QLxlSRWEO7_OdvJcXg",
      "_score" : 0.30685282, "_source" : {
         "name" : "user1",
         "age"  : 13
       }
    } ]
  }
}

However, if we test with user 2 it doesn’t return anything:

curl -XGET http://localhost:9200/user_age/_search?pretty=true -d '{
    "query" : {
        "term" : {
            "name" : "user 2"
        }
    }
}'
# Returns:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Searching for USER4 fails as well, returning nothing, while a request that shouldn’t match anything (user) does return results:

curl -XGET http://localhost:9200/user_age/_search?pretty=true -d '{
    "query" : {
        "term" : {
            "name" : "user"
        }
    }
}'
# Returns:
{
  "took" : 29,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.37158427,
    "hits" : [ {
      "_index" : "user_age",
      "_type" : "user",
      "_id" : "PBnEBcfMRpmBBX1xZG9Czw",
      "_score" : 0.37158427, "_source" : {
   "name" : "user 2",
   "age"  : 20
}
    }, {
      "_index" : "user_age",
      "_type" : "user",
      "_id" : "-ok3LJepR1C3H7W_DuaArQ",
      "_score" : 0.37158427, "_source" : {
   "name" : "user 3",
   "age"  : 13
}
    } ]
  }
}

The issue comes from the fact that when we inserted our records, the name field was processed by the default analyzer (the Standard Analyzer). Among other things, it performs the following operations:

  • Tokenization: "user 2" becomes the tokens ["user", "2"]
  • Lowercasing: "USER4" becomes "user4"
  • Stop word removal

As we are doing a term query, the search input is not analyzed, which explains, for example, why USER4 doesn’t match but user4 does: only the lowercased token is in the index.
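To make this concrete, here is a rough Python simulation of what happens at index time and at term-query time (a simplification: the real analyzer's tokenization rules are more sophisticated, and this stop word list is just a small illustrative subset):

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "to"}  # small subset, for illustration

def standard_analyze(text):
    """Rough simulation of the Standard Analyzer: split on
    non-alphanumeric characters, lowercase, drop stop words."""
    tokens = [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]
    return [t for t in tokens if t not in STOP_WORDS]

def term_match(indexed_name, term):
    """A term query compares its input verbatim against indexed tokens."""
    return term in standard_analyze(indexed_name)

print(standard_analyze("user 2"))      # ['user', '2']
print(standard_analyze("USER4"))       # ['user4']
print(term_match("user1", "user1"))    # True: a single lowercase token
print(term_match("user 2", "user 2"))  # False: only 'user' and '2' are indexed
print(term_match("USER4", "USER4"))    # False: indexed as 'user4'
print(term_match("user 2", "user"))    # True: explains the unwanted matches
```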

We can switch to a text query (renamed match query as of ElasticSearch 0.19.9), but it doesn’t perform an exact search either, so it still returns unwanted results for the “user” query:

curl -XGET http://localhost:9200/user_age/_search?pretty=true -d '{
    "query" : {
        "text" : {
            "name" : "user"
        }
    }
}'
# Returns:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.37158427,
    "hits" : [ {
      "_index" : "user_age",
      "_type" : "user",
      "_id" : "PBnEBcfMRpmBBX1xZG9Czw",
      "_score" : 0.37158427, "_source" : {
   "name" : "user 2",
   "age"  : 20
}
    }, {
      "_index" : "user_age",
      "_type" : "user",
      "_id" : "-ok3LJepR1C3H7W_DuaArQ",
      "_score" : 0.37158427, "_source" : {
   "name" : "user 3",
   "age"  : 13
}
    } ]
  }
}

Solution: use mappings

This default behavior is good for documents indexing and retrieval, but is not adapted to our problem. We need to change it, using a custom mapping.

ElasticSearch is a schema-free database, which makes it very flexible. However, defining mappings lets you perform more powerful operations.
A mapping definition can configure several things:

  • Server settings (number of shards, replicas, etc.)
  • Analyzers: declare custom analyzers and filters by combining and configuring existing ones (n-gram generation, stemming, etc.)
  • Mappings: describe the fields of the records and their options, including which analyzer to use (see Core Types)

In our case, we want to bypass the default analyzer so that the usernames are kept unanalyzed.

First thing to do is to clear our index:

curl -XDELETE http://localhost:9200/user_age

Then, we can declare our mapping:

curl -XPUT http://localhost:9200/user_age -d '
{
    "mappings": {
        "user": {
            "properties": {
                "name": {
                    "index": "not_analyzed",
                    "type": "string"
                },
                "age": {
                    "type": "integer"
                }
            }
        }
    }
}
'

Then we can reinsert our records and re-run the requests. Now, searching for “user” returns nothing, and each exact name, including “user 2”, returns its exact match.

To achieve this, we removed analysis at both insertion time and query time:

  • "index": "not_analyzed" makes the search engine keep "user 2" as a single term, instead of the tokens ["user", "2"]
  • Using term queries means the search input is taken verbatim, rather than being split into "user" and "2"
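Continuing the in-memory simulation from earlier, the fix amounts to indexing the whole string as a single term and comparing the query input verbatim (again, a sketch of the behavior, not ElasticSearch's actual implementation):

```python
def index_not_analyzed(text):
    """With "index": "not_analyzed", the whole value is a single term."""
    return [text]

def term_match(indexed_terms, term):
    """A term query compares its input verbatim against indexed terms."""
    return term in indexed_terms

names = ["user1", "user 2", "user 3", "USER4"]
index = {name: index_not_analyzed(name) for name in names}

# Exact names now match...
print(term_match(index["user 2"], "user 2"))  # True
print(term_match(index["USER4"], "USER4"))    # True
# ...and partial tokens no longer do:
print(any(term_match(terms, "user") for terms in index.values()))  # False
```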

Note: this example is very simple, and could also have been handled by using the username as the document ID (in which case the name field isn’t even needed):

curl -XPUT http://localhost:9200/user_age/user/user%202 -d '{
   "name" : "User 2",
   "age"  : 13
}'

It could have been retrieved like this:

curl -XGET http://localhost:9200/user_age/user/user%202

{"_index":"user_age","_type":"user","_id":"user 2","_version":1,"exists":true, "_source" : {
   "name" : "User 2",
   "age"  : 13
  }
}

However, if you want exact lookups on a non-unique field, or on several fields, this approach no longer applies.