Technical Blog

Category : Wikipedia Miner

Introduction

Extracting keywords from texts and HTML pages is a common task that opens the door to many potential applications. These include classification (what topic does this page cover?), recommender systems (identifying what a user likes in order to recommend the most relevant content), search engines (what is this page about?), document clustering (how can I group different texts together?) and much more.

Most of these applications are based on only one language, usually English. However, it would be better to be able to process documents in any language. Consider, for example, a recommender system with a user who speaks French and English. In his history, he gave positive ratings to a few pages containing the keyword “Airplane”, so for the next recommendations, we would boost this keyword. With a language-independent approach, we would also be able to boost pages containing “Avion”, the French term for airplane. And if the user gave positive ratings to pages in English containing “Airplane” and to pages in French containing “Avion”, we could easily merge both into the same keyword to build a language-independent user profile, usable for accurate recommendations in both French and English.
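
To make that last point concrete, here is a minimal sketch in Java. The Keyword class simply mirrors the title/english/weight structure returned by the server (shown in the example response below) and is hypothetical, not part of any real API:

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: build a language-independent user profile by keying
// on the "english" field of each keyword extracted from positively rated pages.
// The Keyword class mirrors the server's response format and is hypothetical.
class Keyword
{
	String title;   // e.g. "Avion"
	String english; // e.g. "Airplane"
	double weight;  // e.g. 0.674
}

class ProfileBuilder
{
	static Map<String, Double> buildProfile(Iterable<Keyword> ratedPageKeywords)
	{
		Map<String, Double> profile = new HashMap<String, Double>();
		for (Keyword kw : ratedPageKeywords)
		{
			// "Airplane" and "Avion" share english = "Airplane", so both
			// accumulate into the same language-independent profile entry.
			Double current = profile.get(kw.english);
			profile.put(kw.english, (current == null ? 0.0 : current) + kw.weight);
		}
		return profile;
	}
}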

This article shows one way to achieve good results using a simple strategy. Of course, better results could be achieved using more complex algorithms.

How it works

This article shows how we implemented this in the Wikipedia Miner. The Wikipedia Miner is an open source keyword extraction server that relies on Wikipedia: all returned keywords are titles from the online encyclopedia. The server gives very good results, and you can find more about it in the Wikipedia Miner section of the blog.

Wikipedia Miner itself doesn’t handle all languages at once: it requires one database per language. Once started, however, it can process requests in any of the built languages, i.e., one instance can process queries in many different languages (as long as you have enough memory).

The language independence shown here is actually an “all into English” technique. The idea is that Wikipedia contains one very precious resource that you may not have noticed: interlanguage links. The box in the bottom left of each page links most pages between hundreds of different languages, which makes a great free translation database. We will modify the database building script to store, for each non-English keyword, the associated English page title, so that we return both. Using the english field, we lose the language dependence.

Here is an example. I took a text about airplanes from a French website and received these keywords:

[{
    "title": "Airbus",
    "english": "Airbus",
    "weight": 0.714
}, {
    "title": "Avion",
    "english": "Airplane",
    "weight": 0.674
}, {
    "title": "Aérodynamique",
    "english": "Aerodynamics",
    "weight": 0.412
}, {
    "title": "Low cost",
    "english": "No frills",
    "weight": 0.161
}]

Implementation

There are two ways we can design the addition of the “english” field:

  1. Adding the associated English page title at database building time
  2. Adding a translation layer (inside the server, or at a higher level, in the code that sends requests to the server).

Option 1 is the easier to implement, especially because the server keeps doing all the work. Yet, option 2 can be interesting when the databases are already built and you don’t want to rebuild them.

Building the database

The database builder already contains code to generate the translation database (translations.csv), but its data is very incomplete because it doesn’t use the interlanguage links. To make it work well, we had to develop our own script that builds this file while respecting its structure.

To do so, we need to download the langlinks.sql file from Wikipedia Dumps. For example, for French (FR code), you can find it at this address: https://dumps.wikimedia.org/frwiki/latest/.

Once downloaded and unzipped, it is straightforward to parse and to output lines of this format:

 articleid,m{'en,'englishtitle}

For example:

 15,m{'en,'Austria}

We made sure to output only translations into English, to reduce the database size.
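
As an illustration, here is a minimal sketch of such a script in Java. It assumes the langlinks rows appear as tuples like (15,'en','Austria') inside the SQL INSERT statements, which matches the example above; a real script should handle SQL escaping edge cases more carefully:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LangLinksToTranslations
{
	// Matches tuples like (15,'en','Austria') in the INSERT statements,
	// keeping only English links to reduce the database size.
	private static final Pattern TUPLE =
			Pattern.compile("\\((\\d+),'en','((?:[^'\\\\]|\\\\.)*)'\\)");

	public static void main(String[] args) throws IOException
	{
		// args[0]: path to langlinks.sql, args[1]: output translations.csv
		BufferedReader in = new BufferedReader(new FileReader(args[0]));
		PrintWriter out = new PrintWriter(new FileWriter(args[1]));

		String line;
		while ((line = in.readLine()) != null)
		{
			Matcher m = TUPLE.matcher(line);
			while (m.find())
			{
				String title = m.group(2).replace("\\'", "'");
				out.println(m.group(1) + ",m{'en,'" + title + "}");
			}
		}

		out.close();
		in.close();
	}
}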

You will also need to edit the following two files to remove all references to translation:

  • src/main/java/org/wikipedia/miner/extraction/LabelSensesStep.java
  • src/main/java/org/wikipedia/miner/extraction/DumpExtractor.java

On the Wikipedia Miner side

Once you’ve built the database using a custom translations.csv file, you only need to integrate the English translation into the returned results. To do so, when building the result, use the Topic.getTranslation("en") method on the returned topics (keywords).
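
As an illustration, a minimal sketch of this integration is below; Topic.getTranslation("en") is the framework call mentioned above, while detectedTopics, result and addKeyword are purely illustrative names:

// Illustrative fragment: attach the "english" field to each returned keyword.
// Only getTranslation("en") comes from the framework; the rest is hypothetical.
for (Topic topic : detectedTopics)
{
	String english = topic.getTranslation("en");
	if (english == null)
		english = topic.getTitle(); // already English, or no interlanguage link

	result.addKeyword(topic.getTitle(), english, topic.getWeight());
}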

Conclusion

We’ve seen here that we can easily modify the Wikipedia Miner to integrate the associated English term into the results. This makes it very easy to extract semantic information from pages in different languages and to handle them without language distinction, which dramatically eases the internationalization of systems.

Create a standalone program

We have seen in a previous article that Wikipedia Miner is a great framework for building algorithms on top of Wikipedia, and that it exposes its features as webservices through Tomcat.

However, if you don’t want to use Tomcat, we’re going to see how you can create a standalone program that uses all the power of the framework.

We assume in this article that you already have a working Wikipedia Miner setup (databases + configuration files).

Create a basic class

Our example will be very simple, so that we don’t focus on implementing new functionality, but rather on creating the program and compiling it.

Here, we will write a program that reads page ids from the command line and displays their associated titles.

The different steps are the following:

  • Create a WikipediaConfiguration from wikipedia.xml.
  • Create a Wikipedia object that loads the DBs from the WikipediaConfiguration.
  • Call Wikipedia methods to compute the algorithms.

We obtain the following code:

package fr.gauth.wikiminer;

import java.io.File;
import java.util.Scanner;

import org.wikipedia.miner.model.Page;
import org.wikipedia.miner.model.Wikipedia;
import org.wikipedia.miner.util.WikipediaConfiguration;

public class IdToTitlePrompt
{
	protected static WikipediaConfiguration getConfiguration(String args[])
	{
		if (args.length != 1)
		{
			System.out.println("Please specify path to wikipedia configuration file");
			System.exit(1);
		}

		File confFile = new File(args[0]);
		if (!confFile.canRead())
		{
			System.out.println("'" + args[0] + "' cannot be read");
			System.exit(1);
		}

		WikipediaConfiguration conf = null;
		try
		{
			conf = new WikipediaConfiguration(confFile);

			if (conf.getDataDirectory() == null
					|| !conf.getDataDirectory().isDirectory())
			{
				System.out.println("'" + args[0]
						+ "' does not specify a valid data directory");
				System.exit(1);
			}

		} catch (Exception e)
		{
			e.printStackTrace();
			System.exit(2);
		}
		return conf;
	}

	public static void main(String args[]) throws Exception
	{
		WikipediaConfiguration conf = getConfiguration(args);

		// If the 2nd argument is set to true, the preparation of the DB is
		// threaded, which allows the code to run immediately rather than
		// waiting for the DB to be cached.
		Wikipedia wikipedia = new Wikipedia(conf, true);

		Scanner sc = new Scanner(System.in);
		while (sc.hasNextInt())
		{
			int id = sc.nextInt();
			Page page = wikipedia.getPageById(id);
			System.out.println(page.getTitle());
		}
		sc.close();
		wikipedia.close();
	}
}

Compile and execute it

To compile and run it, we will update the original Ant build.xml so it can create a standalone executable jar. To do so, we follow these steps:

  • Add a new custom target, here called assembly (a copy of the package target)
  • Join the dependencies (/lib) to the jar
  • Set the main class, so the jar is self-runnable

We obtain the following entry:

    <target name="assembly" depends="build" description="creates an executable jar with its dependencies">
		<echo>Creating the runnable standalone jar file</echo>
    	<mkdir dir="${build.dir}/jar"/>

    	<jar destfile="${build.dir}/jar/${jar.mainModule}" >
    		<fileset dir="${build.dir}/classes"/>
			<zipgroupfileset dir="lib" includes="*.jar"/>
			<manifest>
		  		<attribute name="Main-Class" value="fr.gauth.wikiminer.IdToTitlePrompt"/>
			</manifest>
    	</jar>
    </target>

Finally, we can run it using the following command:

ant assembly && java -jar ../build/jar/wikipedia-miner.jar ../configs/en.xml

In the shell, we can type 9232 and it will successfully display “Eiffel Tower”.

Create new Wikipedia Miner webservices

Wikipedia Miner is a great toolkit for performing different operations on Wikipedia, including search (retrieval of articles) and algorithm implementation (suggestion of pages, categorization). It can help a lot with natural language processing and web mining problems. It provides a clear API for manipulating the data, with good performance.

The toolkit can be used through a Java API or through webservices. The latter has several advantages, including:

  • A loosely-coupled interface to access data & algorithms
  • The server centralizes the DB and its access, so only the server requires a good hardware configuration.
  • Easy access to the service from different programming languages and software, on different computers, with no effort

Wikipedia Miner comes with ten useful services (listed here), but creating new services is a great way to add new functionality.

As a prerequisite for this article, you need a working installation of Wikipedia Miner (see WikiMiner’s install guide).

In this article we will create a very simple webservice that takes a list of Wikipedia article IDs and returns their ontology (if it exists): whether the article is a person, a company, etc.

Creating an empty service

Creating a new service involves two main parts:

  • The service itself (Java code)
  • The Tomcat configuration

Basic service class

We first need to create a class that will handle the HTTP requests. To do so, let’s create an OntologyService class inheriting from org.wikipedia.miner.service.WMService, which we will put in the package org.wikipedia.miner.service.
A class extending WMService needs at least to:

  • Call the super constructor with a description of the service
  • Define (override) the method buildWrappedResponse(HttpServletRequest)

We obtain the following code:

package org.wikipedia.miner.service;

import javax.servlet.http.HttpServletRequest;

import org.xjsf.Service;

public class OntologyService extends WMService {
	// Service description strings
	private static String groupName     = "core";
	private static String shortDescr    = "Returns the ontology of articles";
	private static String detailsMarkup = "<p>The service takes a list of article's ID and returns a map of ID=>Ontology. Ontology can be an empty string</p>";
	private static Boolean supportsDirectResponse = true;

	public OntologyService()
	{
		super(groupName, shortDescr, detailsMarkup, supportsDirectResponse);
	}

	@Override
	public Message buildWrappedResponse(HttpServletRequest request) throws Exception
	{
		// returns empty response
		return new Service.Message(request);
	}

}

Registering our service

Now that we have a very basic class handling requests, we need to ask Tomcat to redirect requests to it. To do so, we will edit the file web.xml located in the configs directory of the project.
We will configure two things:

  • The servlet (our service) with its name and its class
  • The mapping that will redirect HTTP requests to our class

We add to web.xml the following entries:

    <servlet>
      <servlet-name>ontology</servlet-name>
      <description></description>
      <servlet-class>
        org.wikipedia.miner.service.OntologyService
      </servlet-class>
      <load-on-startup>1</load-on-startup>
    </servlet>

    <servlet-mapping>
      <servlet-name>ontology</servlet-name>
      <url-pattern>/services/ontology</url-pattern>
    </servlet-mapping>

The mapping will redirect requests for /services/ontology to our class. Note that this path is case sensitive.

Let’s test!

Now our service is ready to be used. However, we still need to deploy our new war file so the modifications are taken into account.

ant deploy && sudo /etc/init.d/tomcat6 restart

You are now able to access http://127.0.0.1:8080/wikipediaminer/services/ontology?responseFormat=JSON (note the lowercase path, matching the servlet mapping) using a web browser or curl, for instance.
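
For example, with curl (the same style of invocation as at the end of this article; host and port are those of a default local Tomcat):

curl -XGET 'http://127.0.0.1:8080/wikipediaminer/services/ontology?responseFormat=JSON'

You should get the following response: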

    {
      "service": "/services/ontology",
      "request": {
        "responseFormat": "JSON"
      }
    }

If it is not the case, the issue may come from /etc/tomcat6/Catalina/localhost/wikipediaminer.xml. You can also delete /var/lib/tomcat6/webapps/wikipediaminer/ to make sure that, when you deploy from ant, a fresh copy of the war file is used, as in the sketch below.
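
A minimal sequence for a clean redeploy could look like this (tomcat6 paths as above; adjust to your installation):

sudo /etc/init.d/tomcat6 stop
sudo rm -rf /var/lib/tomcat6/webapps/wikipediaminer/
ant deploy
sudo /etc/init.d/tomcat6 start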

Parameters handling

We are going to upgrade our OntologyService class to handle the parameters of the request; in our case, a list of integers.
The parameters can be of different types (listed here). You can take a look at the other Wikipedia Miner services for examples of their usage.

To do so, we override the init(ServletConfig) method to create the parameter objects and bind them using addGlobalParameter. Then, we can use them in the request handler to access the parameters. We obtain the following code:

    package org.wikipedia.miner.service;

    import javax.servlet.ServletConfig;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServletRequest;

    import org.xjsf.Service;
    import org.xjsf.UtilityMessages.ParameterMissingMessage;
    import org.xjsf.param.IntListParameter;

    public class OntologyService extends WMService {

    	// Service Description
    	private static String groupName     = "core";
    	private static String shortDescr    = "Returns the ontology of articles";
    	private static String detailsMarkup = "<p>The service takes a list of article's ID and returns a map of ID=>Ontology. Ontology can be an empty string</p>";
    	private static Boolean supportsDirectResponse = true;

    	// Parameters
    	IntListParameter prmIdList;

    	public OntologyService()
    	{
    		super(groupName, shortDescr, detailsMarkup, supportsDirectResponse);
    	}

    	@Override
    	public void init(ServletConfig config) throws ServletException
    	{
    		super.init(config);
    		prmIdList = new IntListParameter("ids", "List of page ids whose ontology must be found", null);
    		addGlobalParameter(prmIdList);
    	}

    	@Override
    	public Message buildWrappedResponse(HttpServletRequest request) throws Exception
    	{
    		Integer[] ids = prmIdList.getValue(request);
    		if (ids == null) // no ids field in the request
                return new ParameterMissingMessage(request);
    		// do things with ids and return it
    		return new Service.Message(request);
    	}
    }

Retrieving the ontology

Here is a basic and intentionally incomplete function we created to retrieve the ontology of a Wikipedia article. We can add it to the Article class:


protected String infobox_ontology;

/**
 * Returns the ontology of the article, or an empty string if none is found.
 * @author glemoine
 * @return the infobox title, lowercased, or an empty string
 */
public String getInfoBoxTitle()
{
	if (this.infobox_ontology == null)
		this.infobox_ontology = this.processInfoBoxTitle();

	return this.infobox_ontology;
}

protected String processInfoBoxTitle()
{
	// Only examine the beginning of the markup, where the infobox is declared
	String markup = this.getMarkup();
	String content = markup.substring(0, Math.min(800, markup.length()));
	String[] lines = content.split("\\n");

	for (String line : lines)
	{
		line = line.replace("|", " ");
		String[] splitted = line.split(" ");
		if (splitted[0].endsWith("Infobox"))
			return splitted[splitted.length - 1].toLowerCase();
	}

	return ""; // no ontology found
}

Returning a response

In this last step, we are looking to:

  • Make the calculation (calling the ontology method)
  • Return the values, creating our custom answer class

The result class

We currently return a default Service.Message object. We need to create a new class extending it that maps what we want to return; in our case, only one field: a map associating an id (integer) to an ontology (string).
We add the following as an inner class to OntologyService:

    	public static class OntologyReturnMessage extends Service.Message
    	{
    		@Expose
    		@Attribute
    		private Map<Integer, String> ontologies;

    		private OntologyReturnMessage(HttpServletRequest request, Map<Integer, String> ontologies)
    		{
    			super(request);

    			this.ontologies = ontologies;
    		}
    	}

Putting things together

The final step is to call the ontology method for each id, create the response object and fill it. We obtain the final code:

    package org.wikipedia.miner.service;

    import java.util.HashMap;
    import java.util.Map;

    import javax.servlet.ServletConfig;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServletRequest;

    import org.simpleframework.xml.Attribute;
    import org.wikipedia.miner.model.Article;
    import org.wikipedia.miner.model.Page;
    import org.wikipedia.miner.model.Page.PageType;
    import org.wikipedia.miner.model.Wikipedia;
    import org.xjsf.Service;
    import org.xjsf.UtilityMessages.ParameterMissingMessage;
    import org.xjsf.param.IntListParameter;

    import com.google.gson.annotations.Expose;

    public class OntologyService extends WMService {

    	// Service Description
    	private static String groupName     = "core";
    	private static String shortDescr    = "Returns the ontology of articles";
    	private static String detailsMarkup = "<p>The service takes a list of article's ID and returns a map of ID=>Ontology. Ontology can be an empty string</p>";
    	private static Boolean supportsDirectResponse = true;

    	// Parameters
    	IntListParameter prmIdList;

    	public OntologyService()
    	{
    		super(groupName, shortDescr, detailsMarkup, supportsDirectResponse);
    	}

    	@Override
    	public void init(ServletConfig config) throws ServletException
    	{
    		super.init(config);
    		prmIdList = new IntListParameter("ids", "List of page ids whose ontology must be found", null);
    		addGlobalParameter(prmIdList);
    	}

    	@Override
    	public Service.Message buildWrappedResponse(HttpServletRequest request) throws Exception
    	{
    		Integer[] ids = prmIdList.getValue(request);
    		if (ids == null)
    			return new ParameterMissingMessage(request);

    		Map<Integer, String> ontologies = this.computeOntologies(ids, getWikipedia(request));

    		return new OntologyReturnMessage(request, ontologies);
    	}

    	protected Map<Integer, String> computeOntologies(Integer[] ids, Wikipedia wikipedia)
    	{
    		Map<Integer, String> ontologies = new HashMap<Integer, String>(ids.length);

    		for (Integer id : ids)
    		{
    			String onto = null;
    			Page p = wikipedia.getPageById(id);
    			if (p.getType() == PageType.article)
    			{
    				Article art = (Article) p;
    				onto = art.getInfoBoxTitle();
    			}
    			else
    				onto = "";

    			ontologies.put(id, onto);
    		}

    		return ontologies;
    	}

    	public static class OntologyReturnMessage extends Service.Message
    	{
    		@Expose
    		@Attribute
    		private Map<Integer, String> ontologies;

    		private OntologyReturnMessage(HttpServletRequest request, Map<Integer, String> ontologies)
    		{
    			super(request);

    			this.ontologies = ontologies;
    		}
    	}

    }

Now we can redeploy our service, and we obtain the following results:

curl -XGET 'http://playground5:8080/wikipediaminer/services/ontology?responseFormat=JSON&ids=7089,7218'
{
  "ontologies": {
    "7089": "food",
    "7218": "food"
  },
  "service": "/services/ontology",
  "request": {
    "ids": "7089,7218",
    "responseFormat": "JSON"
  }
}

Creating a webservice with Wikipedia Miner is simple and allows you to perform complex operations on this huge database.