Technical Blog

Category : Scala

A NGram Generator in Scala

Using NGrams is very common and helpful in Natural Language Processing problems. Given the sentence: “This is a sentence”, the ngrams of order 2 are: “This is”, “is a” and “a sentence”. There are very useful, for example, to classify text.

In this article, I’ll show a class that generates this list.

Generators rather than lists

Given a text, the list of NGrams can be very big. For instance, if we have N tokens, we obtain:

  • N of size 1 (all the individual tokens)
  • N – 1 of size 2
  • N – 2 of size 3

Despite its linearity, the size can get very large. For instance, if we have 10 000 tokens and N in the range [1, 8] there are around 80 000 different ngrams.

Therefore, I use a generator rather than building the complete list of NGrams. This way, I save a lot of memory and can also save computation time if the caller does not iterator on the complete sequence.

The code

The NGramsGenerator class is the following:

package fr.gauth

import scala.collection.JavaConversions._
import scala.collection.mutable.{Queue => MutableQueue}
import java.util.StringTokenizer

object NGramsGenerator
{
  // each ngram is represented as a List of String
  def generate(text : String, minSize : Int, maxSize : Int) =
  {
    assert(maxSize >= minSize && minSize > 0 && maxSize > 0)

    // iterate for each token on the available ngrams
    for (ngramList <- this.generate_sub(text, minSize, maxSize);
         ngram <- ngramList)
      yield ngram.toList
  }

  // generate a list of ngrams
  def generate_sub(text : String, minSize : Int, maxSize : Int) =
  {
    val nGramsBuffer = new NGramsBuffer(minSize, maxSize)
    for (t <- this.getTokenizerIterator(text))
      yield
       {
         nGramsBuffer.addToken(t.asInstanceOf[String])
         nGramsBuffer.getNGrams
       }
  }

  // Can be overloaded to use an other tokenizer
  def getTokenizerIterator(text : String) =
    new StringTokenizer(text)

  // A utility class that stores a list of fixed size queues, each queue is a current ngram
  class NGramsBuffer(val minSize : Int, val maxSize : Int)
  {
    val queues = minSize.to(maxSize).foldLeft(List[SlidingFixedSizeQueue[String]]()) {
      (list, n) =>
        new SlidingFixedSizeQueue[String](n) :: list
    }

    def addToken(token : String) = queues.foreach(q => q.enqueue(token))

    // return only buffers that are full, otherwise not enough tokens have been
    // see yet to fill in the NGram
    def getNGrams() = queues.filter(q => q.size == q.maxSize)
  }

  // Utility class to store the last maxSize encountered tokens
  class SlidingFixedSizeQueue[A](val maxSize : Int) extends scala.collection.mutable.Queue[A]
  {
    override def enqueue(elems : A*) =
    {
      elems.foreach(super.enqueue(_))

      while (this.size > this.maxSize)
        this.dequeue()
    }
  }
}

Both the tokenizer and the NGram builder are generators.
The tokenizer can be different from an operation to an other. For example, we can decide to keep or remove the ponctuation, use an arithmetic operations tokenizer (so 1+3=4 is tokenized as “1″, “+”, “3″, “=”, “4″).
In this code, you can change it directly or subclass and override it.

It can be used this way:

package fr.gauth

object NGramDemo {
  def main(args: Array[String]): Unit =
  {
    NGramsGenerator.generate("here is my list of words", 1, 3)
      .foreach(ngram => println("Got NGram: " + ngram.mkString(" ")))
  }
}

It will output all the NGrams where N is in the range 1, 3:

Got NGram: here
Got NGram: here is
Got NGram: is
Got NGram: here is my
Got NGram: is my
Got NGram: my
Got NGram: is my list
Got NGram: my list
Got NGram: list
Got NGram: my list of
Got NGram: list of
Got NGram: of
Got NGram: list of words
Got NGram: of words
Got NGram: words

The code is available on GitHub here.