Shameless plugin: We are a data annotation platform to make it super easy for you to build ML datasets. Just upload data, invite your team and build datasets super quick. Check us out.
OpenNLP is a powerful Java NLP library from Apache. It provides various NLP tools, one of which is a part-of-speech (POS) tagger. POS taggers are usually used to find the grammatical structure of text: you take a tagged dataset in which each word (part of a phrase) is tagged with a label, build an NLP model from this dataset, and then use the model to generate a tag for each word in new text.
Basically, the model learns the information and structure in the training data and can use that to label unseen text. OpenNLP comes with a few pre-trained models, such as English models trained on well-structured English text to detect nouns, verbs, etc.
But if the text in your domain or use case doesn't follow the strict rules of English then the pre-trained model may not work well for you. In such cases, you can choose to build your own training data and train a custom model just for your use case.
We will show how we can use the POS tagger to learn entities in queries from e-commerce search (similar to NER).
Get the dataset used below here.
Training data is passed as a text file in which each line is one data item. Each word in the line should be labeled in the format "word_LABEL": the word and the label name are separated by an underscore '_'.
Here is a sample of the input training file:
anki_Brand overdrive_Brand
just_ModelName dance_ModelName 2018_ModelName
aoc_Brand 27"_ScreenSize monitor_Category
horizon_ModelName zero_ModelName dawn_ModelName
cm_Unknown 700_Unknown modem_Category
computer_Category bt_Category transmitter_Category
120hz_Unknown led_Unknown tv_Category
vizio_Brand 4k_Unknown tv_Category
battlefront_ModelName 2_ModelName ps4_Unknown
Note: Each word needs a label/tag. Here, for words we do not care about we are using the label 'Unknown'.
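Before training, it can help to check that every token in a line really has the word_LABEL form. The sketch below is a hypothetical helper, not part of OpenNLP; it assumes the last underscore in a token separates the word from its label, so that words containing underscores can still be labeled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of OpenNLP) for checking word_LABEL lines. */
final class TaggedLineCheck {

    /** Splits one training line into {word, label} pairs; throws if a token has no label. */
    static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            // Assumption: the last underscore separates the word from its label.
            int split = token.lastIndexOf('_');
            if (split <= 0 || split == token.length() - 1) {
                throw new IllegalArgumentException("Token without word_LABEL form: " + token);
            }
            pairs.add(new String[] {token.substring(0, split), token.substring(split + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // One line taken from the sample training file.
        for (String[] pair : parse("aoc_Brand 27\"_ScreenSize monitor_Category")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Running a check like this over the whole file catches unlabeled words before the trainer rejects them.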
Depending on your domain, you can build such a dataset either automatically or manually. Building such a dataset manually can be really painful; tools like the Dataturks POS tagger can make the process much easier.
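As a rough illustration of the automatic route, one option is to label query words from a lookup table of known brands and categories and fall back to 'Unknown' for everything else. The class name, the lookup entries, and the lowercasing step below are all assumptions made for this sketch; a real lookup would come from your own catalog.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: labels query words from a lookup table. */
final class DictionaryLabeler {

    /** Builds one training line; words missing from the lookup get the label Unknown. */
    static String toTrainingLine(String query, Map<String, String> labelsByWord) {
        StringJoiner line = new StringJoiner(" ");
        for (String word : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            line.add(word + "_" + labelsByWord.getOrDefault(word, "Unknown"));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Illustrative entries only.
        Map<String, String> lookup = new LinkedHashMap<>();
        lookup.put("vizio", "Brand");
        lookup.put("tv", "Category");
        // prints: vizio_Brand 4k_Unknown tv_Category
        System.out.println(toTrainingLine("Vizio 4k TV", lookup));
    }
}
```

Lines produced this way usually still need a manual review pass, since a plain lookup cannot tell apart words that take different labels in different queries.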
The important class here is POSModel, which holds the actual model. We use the class POSTaggerME to build the model. Below is the code to build a model from a training data file:
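import java.io.File;
import java.nio.charset.StandardCharsets;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerFactory;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.postag.WordTagSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public POSModel train(String filepath) {
    TrainingParameters parameters = TrainingParameters.defaultParams();
    parameters.put(TrainingParameters.ITERATIONS_PARAM, "100");
    // The factory reopens the file whenever the stream is reset, unlike a single shared InputStream.
    try (ObjectStream<String> lineStream = new PlainTextByLineStream(
            new MarkableFileInputStreamFactory(new File(filepath)), StandardCharsets.UTF_8);
         // Parses each "word_LABEL ..." line into a POSSample.
         ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream)) {
        return POSTaggerME.train("en", sampleStream, parameters, new POSTaggerFactory());
    } catch (Exception e) {
        e.printStackTrace();
        return null; // training failed
    }
}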
We can write the model to a file and later read it back from that file:
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public void writeToFile(POSModel model, String modelOutpath) {
    // try-with-resources flushes and closes the stream even if serialization fails.
    try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutpath))) {
        model.serialize(modelOut);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public POSModel getModel(String modelPath) {
    try (InputStream modelIn = new FileInputStream(modelPath)) {
        return new POSModel(modelIn);
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null; // loading failed
}
Finally, we can see how the model can be used to tag unseen queries:
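import java.util.Arrays;
import java.util.List;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.util.Sequence;

public void doTagging(POSModel model, String input) {
    // Trim first so a leading space does not produce an empty token.
    String[] tokens = input.trim().split(" ");
    POSTaggerME tagger = new POSTaggerME(model);
    // topKSequences returns several candidate tag sequences for the tokens.
    Sequence[] sequences = tagger.topKSequences(tokens);
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(tokens) + " =>" + tags);
    }
}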
Here is the sample output using our model
String[] tests = new String[] {"apple watch", "samsung mobile phones", " lcd 52 inch tv"};
for (String item : tests) {
    doTagging(model, item);
}
Output (each query yields the candidate tag sequences returned by topKSequences):
[apple, watch] =>[Brand, Category]
[apple, watch] =>[Category, Category]
[apple, watch] =>[ModelName, Category]
[samsung, mobile, phones] =>[Brand, Category, Category]
[samsung, mobile, phones] =>[Brand, ModelName, ModelName]
[samsung, mobile, phones] =>[Brand, ModelName, Category]
[lcd, 52, inch, tv] =>[ModelName, ModelName, ScreenSize, Category]
[lcd, 52, inch, tv] =>[Category, Unknown, ScreenSize, Category]
[lcd, 52, inch, tv] =>[Category, Unknown, Unknown, Category]
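To turn one tagged sequence into entities, similar to NER output, consecutive words that share a tag can be merged. The sketch below is a hypothetical post-processing helper, not part of OpenNLP. It makes two simplifying assumptions: adjacent words with the same tag belong to one entity, and words tagged 'Unknown' are dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: merges consecutive words that share a tag into one entity. */
final class EntityGrouper {

    /** Returns entries such as "Category=mobile phones", skipping words tagged Unknown. */
    static List<String> group(String[] tokens, List<String> tags) {
        if (tokens.length != tags.size()) {
            throw new IllegalArgumentException("tokens and tags differ in length");
        }
        List<String> entities = new ArrayList<>();
        int start = 0; // index where the current run of identical tags begins
        for (int i = 1; i <= tokens.length; i++) {
            boolean runEnds = i == tokens.length || !tags.get(i).equals(tags.get(start));
            if (runEnds) {
                if (!"Unknown".equals(tags.get(start))) {
                    String words = String.join(" ", Arrays.copyOfRange(tokens, start, i));
                    entities.add(tags.get(start) + "=" + words);
                }
                start = i;
            }
        }
        return entities;
    }

    public static void main(String[] args) {
        String[] tokens = {"samsung", "mobile", "phones"};
        List<String> tags = Arrays.asList("Brand", "Category", "Category");
        // prints: [Brand=samsung, Category=mobile phones]
        System.out.println(group(tokens, tags));
    }
}
```

In doTagging, the inputs would be the token array and the result of getOutcomes() for whichever sequence you choose to trust.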