OpenNLP: Training your own custom POS tagger.  

An end-to-end example in Java of using your own dataset to train a custom POS tagger.


Shameless plug: We are a data annotation platform that makes it super easy for you to build ML datasets. Just upload data, invite your team, and build datasets quickly. Check us out.

OpenNLP is a powerful Java NLP library from Apache. It provides various NLP tools, one of which is a Parts-Of-Speech (POS) tagger. POS taggers are usually used to find the grammatical structure of text: you take a tagged dataset where each word (as part of a phrase) is tagged with a label, build an NLP model from this dataset, and then use the model on new text to generate a tag for each word.

Pre-trained models

Basically, the model learns the information and structure in the training data and can use that to label unseen text. OpenNLP comes with a few pre-trained models, such as English models trained on structured English text for detecting nouns, verbs, etc.
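
For reference, here is a minimal sketch of using one of those pre-trained models; it assumes you have downloaded the English maxent POS model (commonly named "en-pos-maxent.bin") from the OpenNLP models page, and the method name is just for illustration.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public void tagWithPretrainedModel() throws Exception {
    // Assumed local path to the downloaded pre-trained English model.
    try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
        POSModel model = new POSModel(modelIn);
        POSTaggerME tagger = new POSTaggerME(model);

        String[] tokens = {"The", "quick", "brown", "fox", "jumps"};
        String[] tags = tagger.tag(tokens); // one Penn Treebank tag per token

        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + "/" + tags[i]);
        }
    }
}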

Training your own models

But if the text in your domain or use case doesn't follow the strict rules of English, then the pre-trained models may not work well for you. In such cases, you can build your own training data and train a custom model just for your use case.

We will show how the POS tagger can be used to learn entities in e-commerce search queries (similar to NER).

Get the dataset used below here.

Training data format

Training data is passed as a text file where each line is one data item. Each word in the line should be labeled in the format "word_LABEL"; the word and the label name are separated by an underscore '_'.

Here is a sample of the input training file:

anki_Brand overdrive_Brand
just_ModelName dance_ModelName 2018_ModelName
aoc_Brand 27"_ScreenSize monitor_Category
horizon_ModelName zero_ModelName dawn_ModelName
cm_Unknown 700_Unknown modem_Category
computer_Category
bt_Category transmitter_Category
120hz_Unknown led_Unknown tv_Category
vizio_Brand 4k_Unknown tv_Category
battlefront_ModelName 2_ModelName ps4_Unknown
                  

Note: Each word needs a label/tag. Here, for words we do not care about, we use the label 'Unknown'.
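
Internally, OpenNLP's WordTagSampleStream turns each such line into a POSSample (the token sequence plus the tag sequence). Here is a minimal sketch of that parsing, calling POSSample.parse directly on one line of the file:

import opennlp.tools.postag.POSSample;

// One line of the training file in word_LABEL format.
String line = "aoc_Brand 27\"_ScreenSize monitor_Category";

// Roughly what WordTagSampleStream does for every line it reads.
POSSample sample = POSSample.parse(line);

System.out.println(java.util.Arrays.toString(sample.getSentence())); // [aoc, 27", monitor]
System.out.println(java.util.Arrays.toString(sample.getTags()));     // [Brand, ScreenSize, Category]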

Build training dataset

Depending upon your domain, you can build such a dataset either automatically or manually. Building such a dataset manually can be really painful; tools like the Dataturks POS tagger can help make the process much easier.

Train model

The important class here is POSModel, which holds the actual model. We use the class POSTaggerME to do the model building. Below is the code to build a model from a training data file:


// Imports used by the training code below.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerFactory;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.postag.WordTagSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public POSModel train(String filepath) {
    TrainingParameters parameters = TrainingParameters.defaultParams();
    parameters.put(TrainingParameters.ITERATIONS_PARAM, "100");

    try (InputStream dataIn = new FileInputStream(filepath)) {
        // Read the training file line by line.
        ObjectStream<String> lineStream = new PlainTextByLineStream(new InputStreamFactory() {
            @Override
            public InputStream createInputStream() throws IOException {
                return dataIn;
            }
        }, StandardCharsets.UTF_8);

        // Parse each "word_LABEL word_LABEL ..." line into a POSSample.
        ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

        // Train a maxent POS model from the samples.
        return POSTaggerME.train("en", sampleStream, parameters, new POSTaggerFactory());
    }
    catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
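
As a usage sketch, training then boils down to a single call (the file name below is a placeholder for your own training file). If your OpenNLP version ships MarkableFileInputStreamFactory, it can also replace the anonymous InputStreamFactory above.

// Hypothetical path to the word_LABEL training file shown earlier.
POSModel model = train("ecommerce-queries-train.txt");

// Optional: with a recent OpenNLP, the anonymous InputStreamFactory can be replaced by
// ObjectStream<String> lineStream = new PlainTextByLineStream(
//         new MarkableFileInputStreamFactory(new File(filepath)), StandardCharsets.UTF_8);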
                  

Save model to file

We can write the model out to a file and later read it back in as well:


public void writeToFile(POSModel model, String modelOutpath) {
    try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutpath))) {
        model.serialize(modelOut);
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}

public POSModel getModel(String modelPath) {
    // Deserialize a previously saved model from disk.
    try (InputStream modelIn = new FileInputStream(modelPath)) {
        return new POSModel(modelIn);
    }
    catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
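
A quick round-trip sketch tying these together (the file paths are placeholders):

// Train, persist, then reload the model; the paths are hypothetical.
POSModel trained = train("ecommerce-queries-train.txt");
writeToFile(trained, "ecommerce-pos-model.bin");
POSModel reloaded = getModel("ecommerce-pos-model.bin");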

Use the model to do tagging

Finally, we can see how the model can be used to tag unseen queries:


public void doTagging(POSModel model, String input) {
    input = input.trim();
    POSTaggerME tagger = new POSTaggerME(model);
    // Get the top-k most likely tag sequences for the whitespace-tokenized input.
    Sequence[] sequences = tagger.topKSequences(input.split(" "));
    for (Sequence s : sequences) {
        // Each sequence holds one candidate tag per input token.
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(input.split(" ")) +" =>" + tags);
    }
}
                  

Here is sample output from our model:


String[] tests = new String[] {"apple watch", "samsung mobile phones", " lcd 52 inch tv"};
for (String item : tests) {
  doTagging(model, item);
}

Output


[apple, watch] =>[Brand, Category]
[apple, watch] =>[Category, Category]
[apple, watch] =>[ModelName, Category]
[samsung, mobile, phones] =>[Brand, Category, Category]
[samsung, mobile, phones] =>[Brand, ModelName, ModelName]
[samsung, mobile, phones] =>[Brand, ModelName, Category]
[lcd, 52, inch, tv] =>[ModelName, ModelName, ScreenSize, Category]
[lcd, 52, inch, tv] =>[Category, Unknown, ScreenSize, Category]
[lcd, 52, inch, tv] =>[Category, Unknown, Unknown, Category]
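
topKSequences() is useful when you want to inspect alternative tag sequences as above. If you only need the single best sequence per query, POSTaggerME.tag() is simpler; a minimal sketch:

POSTaggerME tagger = new POSTaggerME(model);
String[] tokens = "samsung mobile phones".split(" ");
String[] tags = tagger.tag(tokens);   // best tag for each token
double[] probs = tagger.probs();      // confidence scores for the last tagged sentence
System.out.println(Arrays.asList(tokens) + " =>" + Arrays.asList(tags));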
