Technical Blog

Creating new web services for Wikipedia Miner

Wikipedia Miner is a great toolkit for working with Wikipedia data. It covers search (retrieving articles) as well as ready-made algorithms such as page suggestion and categorization, which makes it very helpful for natural language processing and web mining problems. It provides a clear API for manipulating the data, with good performance.

The toolkit can be used through its Java API or through web services. Web services have several advantages, including:

  • A loosely coupled interface to the data and algorithms
  • The server centralizes the database and access to it, so only the server needs a good hardware configuration.
  • Easy access to the service from other programming languages, other software, and other computers

Wikipedia Miner comes with ten useful services (listed here), but you can create new services to add functionality of your own.

As a prerequisite for this article, you need a working installation of Wikipedia Miner (see WikiMiner’s install guide).

In this article we will create a very simple web service that takes a list of Wikipedia article IDs and returns the ontology of each one, if it exists (whether the article is about a person, a company, etc.).

Creating an empty service

Creating a new service involves two main parts:

  • The service itself (Java code)
  • The Tomcat’s configuration

Basic service class

First we need a class that will handle the HTTP request. Let’s create an OntologyService class that inherits from org.wikipedia.miner.service.WMService and put it in the package org.wikipedia.miner.service.
A class extending WMService needs at least to:

  • Call the super constructor with a description of the service
  • Define (override) the method buildWrappedResponse(HttpServletRequest)

We obtain the following code:

package org.wikipedia.miner.service;

import javax.servlet.http.HttpServletRequest;

import org.xjsf.Service;

public class OntologyService extends WMService {
	// Service description strings
	private static String groupName     = "core";
	private static String shortDescr    = "Returns the ontology of articles";
	private static String detailsMarkup = "<p>The service takes a list of article's ID and returns a map of ID=>Ontology. Ontology can be an empty string</p>";
	private static Boolean supportsDirectResponse = true;

	public OntologyService()
	{
		super(groupName, shortDescr, detailsMarkup, supportsDirectResponse);
	}

	@Override
	public Service.Message buildWrappedResponse(HttpServletRequest request) throws Exception
	{
		// returns empty response
		return new Service.Message(request);
	}

}

Registering our service

Now that we have a very basic class handling requests, we need to tell Tomcat to route requests to it. To do so, we will edit the file web.xml located in the configs directory of the project.
We will configure two things:

  • The servlet (our service), with its name and its fully qualified class name
  • The mapping that routes HTTP requests to our class

We add to web.xml the following entries:

    <servlet>
      <servlet-name>ontology</servlet-name>
      <description></description>
      <servlet-class>
        org.wikipedia.miner.service.OntologyService
      </servlet-class>
      <load-on-startup>1</load-on-startup>
    </servlet>

    <servlet-mapping>
      <servlet-name>ontology</servlet-name>
      <url-pattern>/services/ontology</url-pattern>
    </servlet-mapping>

The mapping routes requests for services/ontology to our class. Note that this path is case sensitive.

Let’s test!

Now our service is ready to be used. However, we still need to deploy the new war file so that the changes are taken into account.

ant deploy && sudo /etc/init.d/tomcat6 restart

You can now access http://127.0.0.1:8080/wikipediaminer/services/ontology?responseFormat=JSON with a web browser or curl, for instance. The path is lowercase because it must match the url-pattern exactly. You should get the following response:

    {
      "service": "/services/ontology",
      "request": {
        "responseFormat": "JSON"
      }
    }

If that is not the case, check /etc/tomcat6/Catalina/localhost/wikipediaminer.xml, which may be the source of the problem. You can also delete /var/lib/tomcat6/webapps/wikipediaminer/ to make sure that deploying with ant installs a fresh copy of the war file.

Parameters handling

We are now going to upgrade our OntologyService class to handle the parameters of the request: in our case, a list of integers.
Parameters can be of the different types listed here. You can take a look at the other Wikipedia Miner services for examples of their usage.

To do so, we override the init(ServletConfig) method to create the parameter objects and register them with addGlobalParameter. The request handler can then use these objects to read the parameter values. We obtain the following code:

    package org.wikipedia.miner.service;

    import javax.servlet.ServletConfig;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServletRequest;

    import org.xjsf.Service;
    import org.xjsf.UtilityMessages.ParameterMissingMessage;
    import org.xjsf.param.IntListParameter;

    public class OntologyService extends WMService {

    	// Service Description
    	private static String groupName     = "core";
    	private static String shortDescr    = "Returns the ontology of articles";
    	private static String detailsMarkup = "<p>The service takes a list of article's ID and returns a map of ID=>Ontology. Ontology can be an empty string</p>";
    	private static Boolean supportsDirectResponse = true;

    	// Parameters
    	IntListParameter prmIdList;

    	public OntologyService()
    	{
    		super(groupName, shortDescr, detailsMarkup, supportsDirectResponse);
    	}

    	@Override
    	public void init(ServletConfig config) throws ServletException
    	{
    		super.init(config);
    		prmIdList = new IntListParameter("ids", "List of page ids whose ontology must be found", null);
    		addGlobalParameter(prmIdList);
    	}

    	@Override
    	public Message buildWrappedResponse(HttpServletRequest request) throws Exception
    	{
    		Integer[] ids = prmIdList.getValue(request);
    		if (ids == null) // no "ids" field in the request
    			return new ParameterMissingMessage(request);
    		// do things with ids and return it
    		return new Service.Message(request);
    	}
    }
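
To make the null check above more concrete, here is a minimal standalone sketch of how a comma-separated id list such as ids=7089,7218 could be turned into an Integer[]. It is only an illustration: the class IdListParsingSketch and the method parseIds are hypothetical names, and this is not the actual IntListParameter code from xjsf, whose handling of empty or malformed input is not covered in this article.

```java
import java.util.Arrays;

class IdListParsingSketch {

    /**
     * Parses a raw parameter value such as "7089,7218".
     * Returns null when the parameter is absent or blank, which is the case
     * the service answers with a ParameterMissingMessage.
     * Assumption: malformed numbers simply throw NumberFormatException.
     */
    static Integer[] parseIds(String raw) {
        if (raw == null || raw.trim().isEmpty())
            return null;

        String[] parts = raw.split(",");
        Integer[] ids = new Integer[parts.length];
        for (int i = 0; i < parts.length; i++)
            ids[i] = Integer.valueOf(parts[i].trim());
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseIds("7089,7218"))); // prints [7089, 7218]
        System.out.println(Arrays.toString(parseIds(null)));        // prints null
    }
}
```

Returning null rather than an empty array keeps "parameter missing" distinguishable from "parameter present", which is what the check in buildWrappedResponse relies on.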

Retrieving the ontology

Here is a basic and incomplete method that retrieves the ontology of a Wikipedia article by reading the name of its infobox template. We can add it to the Article class:


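	// Cached ontology; stays null until it is first computed
	protected String infobox_ontology;

	/**
	 * Returns the ontology of the article, or an empty string if none is found
	 * @author glemoine
	 * @return the lower-cased infobox type, or ""
	 */
	public String getInfoBoxTitle()
	{
		if (this.infobox_ontology == null)
			this.infobox_ontology = this.processInfoBoxTitle();

		return this.infobox_ontology;
	}

	protected String processInfoBoxTitle()
	{
		String markup = this.getMarkup();
		if (markup == null)
			return "";

		// Only scan the first 800 characters; shorter markup is used whole
		String content = markup.substring(0, Math.min(800, markup.length()));
		String[] lines = content.split("\\n");

		for (String line : lines)
		{
			line = line.replace("|", " ");
			String[] splitted = line.split(" ");
			// split() returns an empty array for a line made only of spaces
			if (splitted.length > 0 && splitted[0].endsWith("Infobox"))
				return splitted[splitted.length - 1].toLowerCase();
		}

		return ""; // no ontology found
	}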
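
To try the parsing logic without a Wikipedia Miner database, the same idea can be sketched as a standalone static method that works on a plain markup string. The class InfoboxSketch, the method infoboxType, and the sample markup are made up for this illustration. The sketch also guards against markup shorter than 800 characters and against lines that contain only a "|".

```java
class InfoboxSketch {

    /** Returns the lower-cased last word of the first "Infobox" line, or "" if none. */
    static String infoboxType(String markup) {
        if (markup == null)
            return "";

        // Only scan the beginning of the article, as in the method above
        String head = markup.substring(0, Math.min(800, markup.length()));

        for (String line : head.split("\\n")) {
            String[] tokens = line.replace("|", " ").split(" ");
            // split() yields an empty array for a line made only of spaces
            if (tokens.length > 0 && tokens[0].endsWith("Infobox"))
                return tokens[tokens.length - 1].toLowerCase();
        }
        return "";
    }

    public static void main(String[] args) {
        String sample = "{{Infobox food\n| name = Cheddar\n}}\nCheddar is a cheese.";
        System.out.println(infoboxType(sample)); // prints food
    }
}
```

Only the last word of the line is kept, so a line such as "{{Infobox football club" would give "club". This is one of the reasons the method is described as incomplete.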

Returning a response

In this last step, we are looking to:

  • Make the calculation (calling the ontology method)
  • Return the values, creating our custom answer class

The result class

We currently return a default Service.Message object. We need a new class that extends it and holds what we want to return. In our case there is only one field: a map associating an id (integer) with an ontology (string).
We add the following inner class to OntologyService:

    	public static class OntologyReturnMessage extends Service.Message
    	{
    		@Expose
    		@Attribute
    		private Map<Integer, String> ontologies;

    		private OntologyReturnMessage(HttpServletRequest request, Map<Integer, String> ontologies)
    		{
    			super(request);

    			this.ontologies = ontologies;
    		}
    	}

Putting things together

The final step is to call the ontology method for each id, then create the response object and fill it with the results. We obtain the final code:

    package org.wikipedia.miner.service;

    import java.util.HashMap;
    import java.util.Map;

    import javax.servlet.ServletConfig;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServletRequest;

    import org.simpleframework.xml.Attribute;
    import org.wikipedia.miner.model.Article;
    import org.wikipedia.miner.model.Page;
    import org.wikipedia.miner.model.Page.PageType;
    import org.wikipedia.miner.model.Wikipedia;
    import org.xjsf.Service;
    import org.xjsf.UtilityMessages.ParameterMissingMessage;
    import org.xjsf.param.IntListParameter;

    import com.google.gson.annotations.Expose;

    public class OntologyService extends WMService {

    	// Service Description
    	private static String groupName     = "core";
    	private static String shortDescr    = "Returns the ontology of articles";
    	private static String detailsMarkup = "<p>The service takes a list of article's ID and returns a map of ID=>Ontology. Ontology can be an empty string</p>";
    	private static Boolean supportsDirectResponse = true;

    	// Parameters
    	IntListParameter prmIdList;

    	public OntologyService()
    	{
    		super(groupName, shortDescr, detailsMarkup, supportsDirectResponse);
    	}

    	@Override
    	public void init(ServletConfig config) throws ServletException
    	{
    		super.init(config);
    		prmIdList = new IntListParameter("ids", "List of page ids whose ontology must be found", null);
    		addGlobalParameter(prmIdList);
    	}

    	@Override
    	public Service.Message buildWrappedResponse(HttpServletRequest request) throws Exception
    	{
    		Integer[] ids = prmIdList.getValue(request);
    		if (ids == null)
    			return new ParameterMissingMessage(request);

    		Map<Integer, String> ontologies = this.computeOntologies(ids, getWikipedia(request));

    		return new OntologyReturnMessage(request, ontologies);
    	}

    	protected Map<Integer, String> computeOntologies(Integer[] ids, Wikipedia wikipedia)
    	{
    		Map<Integer, String> ontologies = new HashMap<Integer, String>(ids.length);

    		for (Integer id : ids)
    		{
    			String onto = "";
    			Page p = wikipedia.getPageById(id);
    			// Defensive null check: unknown ids and non-article pages keep the empty ontology
    			if (p != null && p.getType() == PageType.article)
    			{
    				Article art = (Article) p;
    				onto = art.getInfoBoxTitle();
    			}

    			ontologies.put(id, onto);
    		}

    		return ontologies;
    	}

    	public static class OntologyReturnMessage extends Service.Message
    	{
    		@Expose
    		@Attribute
    		private Map<Integer, String> ontologies;

    		private OntologyReturnMessage(HttpServletRequest request, Map<Integer, String> ontologies)
    		{
    			super(request);

    			this.ontologies = ontologies;
    		}
    	}

    }
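
The loop in computeOntologies can also be exercised without Tomcat or the Wikipedia database. The sketch below is one possible way to do that, assuming Java 9 or later: the lookup is passed in as a function so a fake one can be plugged in. The names OntologyMapSketch and lookup are hypothetical and are not part of Wikipedia Miner.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class OntologyMapSketch {

    /**
     * Builds the id => ontology map. The lookup function stands in for
     * "fetch the page and read its infobox"; a null result becomes "".
     */
    static Map<Integer, String> computeOntologies(Integer[] ids, Function<Integer, String> lookup) {
        Map<Integer, String> ontologies = new LinkedHashMap<>();
        for (Integer id : ids) {
            String onto = lookup.apply(id);
            ontologies.put(id, onto == null ? "" : onto);
        }
        return ontologies;
    }

    public static void main(String[] args) {
        // Fake data for the illustration: only id 7089 has a known ontology
        Map<Integer, String> fake = Map.of(7089, "food");
        System.out.println(computeOntologies(new Integer[] {7089, 42}, fake::get));
        // prints {7089=food, 42=}
    }
}
```

Every requested id ends up in the map, with an empty string when nothing is found, which matches what the service description promises.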

Now we can redeploy our service and query it. We obtain the following result:

curl -XGET 'http://playground5:8080/wikipediaminer/services/ontology?responseFormat=JSON&ids=7089,7218'
{
  "ontologies": {
    "7089": "food",
    "7218": "food"
  },
  "service": "/services/ontology",
  "request": {
    "ids": "7089,7218",
    "responseFormat": "JSON"
  }
}
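
The same request can be sent from Java instead of curl. Below is a minimal client sketch, assuming Java 11 or later for java.net.http. The class OntologyClientSketch and its methods are names made up for this example, and the base URL is the local one used earlier in this article; adapt it to your own server.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.StringJoiner;

class OntologyClientSketch {

    /** Builds the service URL; baseUrl is the root of the deployed webapp. */
    static URI buildUri(String baseUrl, int... ids) {
        StringJoiner joined = new StringJoiner(",");
        for (int id : ids)
            joined.add(Integer.toString(id));
        return URI.create(baseUrl + "/services/ontology?responseFormat=JSON&ids=" + joined);
    }

    /** Sends a GET request and returns the raw JSON body as a string. */
    static String fetch(URI uri) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Pass the webapp root as first argument to actually send the request;
        // without it, only the URL is printed.
        String base = args.length > 0 ? args[0] : "http://127.0.0.1:8080/wikipediaminer";
        URI uri = buildUri(base, 7089, 7218);
        System.out.println(uri);
        if (args.length > 0)
            System.out.println(fetch(uri));
    }
}
```

The body is returned as plain text here; parsing it is left to whatever JSON library the client application already uses.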

Creating a web service with Wikipedia Miner is simple, and it makes it possible to run complex operations on this huge database.
