Stanford CoreNLP: Training your own custom NER tagger.  

An end-to-end example, in Java, of using your own dataset to train a custom NER tagger.


Shameless plug: We are a data annotation platform that makes it super easy for you to build ML datasets. Just upload data, invite your team and build datasets super quick. Check us out.

Stanford CoreNLP is one of the most battle-tested NLP libraries out there, and it is widely used as a reference point for NLP performance. Among its many features, the library supports named entity recognition (NER), which tags important entities in a piece of text, such as the name of a person, a place, etc.

The CoreNLP NER tagger implements a CRF (conditional random field), which is one of the most effective approaches to the NER problem in NLP. The algorithm is trained on a tagged dataset, and the output is a learned model.

Pre-trained models

Basically, the model learns the information and structure in the training data and can use that to label unseen text. CoreNLP comes with a few pre-trained models, such as English models trained on well-structured English text for detecting names, places, etc.

Train your own models

But if the text in your domain or use case doesn't overlap with the domain the pre-trained models were built for, then the pre-trained models may not work well for you. In such cases, you can build your own training data and train a custom model just for your use case.

We will show how we can use the NER tagger to learn entities in queries from e-commerce search.

Get the dataset used below here.

Training data format

Training data is passed as a text file where each line is one word-label pair in the format "word\tLABEL": the word and the label name are separated by a tab '\t'. To add a sentence (here, a search query), we break it down into words and add one line per word to the training file. To mark the end of one sentence and the start of the next, we add an empty line to the training file.

Here is a sample of the input training file:

hp	Brand
spectre	ModelName
x360	ModelName

home	Category
theater	Category
system	0

horizon	ModelName
zero	ModelName
dawn	ModelName
ps4	0

hoverboard	Category

                  

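Note: Each word needs a label/tag. Here, for words we do not care about, we are using the label zero '0'. This is the digit zero, not the letter 'O' that CoreNLP uses as its default background label, so the model learns '0' as just another class.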
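
As a rough illustration of the format described above, here is a minimal sketch that turns already-labeled queries into training-file text. The class TrainingFileFormatter and its methods are hypothetical helpers written for this example; they are not part of CoreNLP, and the sketch assumes the words are already tokenized and use '0' for words we do not care about.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of CoreNLP): builds "word<TAB>label" lines.
class TrainingFileFormatter {

    // Label used in this article for words we do not care about (digit zero).
    static final String OTHER = "0";

    // Formats one query: one "word\tlabel" line per token.
    // A null or empty label falls back to OTHER.
    static List<String> formatQuery(String[] tokens, String[] labels) {
        if (tokens.length != labels.length) {
            throw new IllegalArgumentException("Each word needs exactly one label");
        }
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            String label = (labels[i] == null || labels[i].isEmpty()) ? OTHER : labels[i];
            lines.add(tokens[i] + "\t" + label);
        }
        return lines;
    }

    // Joins several formatted queries, ending each one with an empty line.
    static String formatAll(List<List<String>> queries) {
        StringBuilder sb = new StringBuilder();
        for (List<String> query : queries) {
            for (String line : query) {
                sb.append(line).append('\n');
            }
            sb.append('\n'); // empty line separates this query from the next
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<String>> queries = new ArrayList<>();
        queries.add(formatQuery(new String[] {"hp", "spectre", "x360"},
                                new String[] {"Brand", "ModelName", "ModelName"}));
        queries.add(formatQuery(new String[] {"home", "theater", "system"},
                                new String[] {"Category", "Category", null}));
        System.out.print(formatAll(queries));
    }
}
```

The returned string can be written to the training file as-is, for example with java.nio.file.Files.write.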

Build training dataset

Depending on your domain, you can build such a dataset either automatically or manually. Building it manually can be really painful; tools like the Dataturks NER tagger can make the process much easier.
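
One possible way to build a dataset automatically, sketched below, is to label words by looking them up in a dictionary of known brands, categories and so on. This is only an assumption about what "automatically" could look like: the class DictionaryLabeler and its dictionary entries are made up for illustration, the lookup is a simple case-insensitive exact match, and the result would normally still need manual review.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of CoreNLP): labels words via dictionary lookup.
class DictionaryLabeler {

    // Maps a lower-cased word to its label, e.g. "hp" -> "Brand".
    private final Map<String, String> dictionary = new HashMap<>();

    void add(String word, String label) {
        dictionary.put(word.toLowerCase(Locale.ROOT), label);
    }

    // Returns training-file text for one query: one "word\tlabel" line per
    // whitespace-separated word, followed by an empty line.
    // Words missing from the dictionary get the label "0".
    String label(String query) {
        StringBuilder sb = new StringBuilder();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for an empty query
            }
            String label = dictionary.getOrDefault(token.toLowerCase(Locale.ROOT), "0");
            sb.append(token).append('\t').append(label).append('\n');
        }
        sb.append('\n'); // empty line marks the end of the query
        return sb.toString();
    }

    public static void main(String[] args) {
        DictionaryLabeler labeler = new DictionaryLabeler();
        labeler.add("hp", "Brand");
        labeler.add("laptop", "Category");
        System.out.print(labeler.label("HP gaming laptop"));
    }
}
```

A dictionary like this cannot resolve ambiguous words (for example "apple" as a brand versus a fruit), which is one reason a manual pass over the generated labels is usually worthwhile.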

Train model

The important class here is CRFClassifier, which holds the actual model. Below is the code that builds a model from a training data file and writes the model out to a file.


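import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.sequences.SeqClassifierFlags;
import edu.stanford.nlp.util.StringUtils;

public void trainAndWrite(String modelOutPath, String prop, String trainingFilepath) {
   // Load the training configuration from the properties file.
   Properties props = StringUtils.propFileToProperties(prop);
   props.setProperty("serializeTo", modelOutPath);

   // If a training file is passed in, use it; otherwise use the one from the properties file.
   if (trainingFilepath != null) {
       props.setProperty("trainFile", trainingFilepath);
   }

   // Train a CRF using the training file and features named in the properties.
   SeqClassifierFlags flags = new SeqClassifierFlags(props);
   CRFClassifier<CoreLabel> crf = new CRFClassifier<>(flags);
   crf.train();

   // Write the trained model to disk.
   crf.serializeClassifier(modelOutPath);
}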
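
Before training, it can help to check that the training file really follows the "word\tLABEL" format, since a stray space instead of a tab is easy to miss. The sketch below is a hypothetical pre-check, not something CoreNLP provides: the class TrainingFileCheck is made up for this example and only verifies that every non-empty line has exactly two non-empty, tab-separated columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-check (not part of CoreNLP) for the training file format.
class TrainingFileCheck {

    // Returns one message per malformed line, using 1-based line numbers.
    // Blank lines are treated as separators and skipped.
    static List<String> findProblems(List<String> lines) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.trim().isEmpty()) {
                continue; // separator between queries
            }
            String[] cols = line.split("\t", -1); // -1 keeps trailing empty columns
            boolean ok = cols.length == 2
                    && !cols[0].trim().isEmpty()
                    && !cols[1].trim().isEmpty();
            if (!ok) {
                problems.add("line " + (i + 1) + ": expected word<TAB>label but got \"" + line + "\"");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hp\tBrand", "", "hoverboard Category");
        for (String problem : findProblems(lines)) {
            System.out.println(problem);
        }
    }
}
```

In practice the lines could come from java.nio.file.Files.readAllLines on the training file path, and training would only start when the returned list is empty.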
                  

Properties file

CoreNLP uses a properties file to define the parameters for building a custom model, for example which features to generate for learning. Below is an example properties file:


# location of the training file
trainFile = ./standford_train.txt
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = ner-model.ser.gz

# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1

# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1

# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the next 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
#wordShape=chris2useLC
wordShape=none
#useBoundarySequences=true
#useNeighborNGrams=true
#useTaggySequences=true
#printFeatures=true
#saveFeatureIndexToDisk = true
#useObservedSequencesOnly = true
#useWordPairs = true
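
If you would rather not ship a separate properties file, the same settings can be assembled in code with a plain java.util.Properties object. The sketch below is a hypothetical helper (the class NerProps is made up for this example) that sets only a subset of the keys from the file above; the resulting object could then be passed to new SeqClassifierFlags(props), as trainAndWrite does.

```java
import java.util.Properties;

// Hypothetical helper: builds a subset of the properties file above in code.
class NerProps {

    static Properties build(String trainFile, String modelOutPath) {
        Properties props = new Properties();
        props.setProperty("trainFile", trainFile);
        props.setProperty("serializeTo", modelOutPath);
        // The word is in column 0 and the label is in column 1.
        props.setProperty("map", "word=0,answer=1");
        props.setProperty("maxLeft", "1");
        // Feature switches copied from the example properties file.
        props.setProperty("useClassFeature", "true");
        props.setProperty("useWord", "true");
        props.setProperty("useNGrams", "true");
        props.setProperty("noMidNGrams", "true");
        props.setProperty("maxNGramLeng", "6");
        props.setProperty("usePrev", "true");
        props.setProperty("useNext", "true");
        props.setProperty("wordShape", "none");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("./standford_train.txt", "ner-model.ser.gz");
        props.list(System.out); // print the settings for inspection
    }
}
```

Keeping the settings in a file is still convenient when you want to try different feature combinations without recompiling.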

Read model from the file

Since we have saved the model to a file, we can now load that model (or distribute it for others to use):


// Loads a serialized model; errors surface as unchecked exceptions.
public CRFClassifier<CoreLabel> getModel(String modelPath) {
    return CRFClassifier.getClassifierNoExceptions(modelPath);
}

Use model to do tagging

Finally, we can see how the model can be used to tag unseen queries:


public void doTagging(CRFClassifier<CoreLabel> model, String input) {
  input = input.trim();
  // classifyToString returns the input with each word tagged as word/LABEL.
  System.out.println(input + "=>" + model.classifyToString(input));
}

Here is a sample run, assuming model was loaded with getModel as shown above:


String[] tests = new String[] {"apple watch", "samsung mobile phones", " lcd 52 inch tv"};
for (String item : tests) {
  doTagging(model, item);
}

Output


apple watch=>apple/Brand watch/Category
samsung mobile phones=>samsung/Brand mobile/Category phones/Category
lcd 52 inch tv=>lcd/ModelName 52/ModelName inch/0 tv/Category
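
To use these results in an application, the word/LABEL string usually has to be turned back into structured data. Here is a minimal sketch, assuming the output format shown above: the class SlashTagParser is a hypothetical helper (not part of CoreNLP) that splits each item at its last '/', joins consecutive words with the same label into one entity, and drops words labeled '0'. Joining by label alone is a simplification, since two adjacent but distinct entities of the same type would be merged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of CoreNLP): parses "word/LABEL word/LABEL" output.
class SlashTagParser {

    // Splits the tagged string into [word, label] pairs.
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/'); // the label follows the last slash
            if (slash <= 0) {
                continue; // skip items without a word/label separator
            }
            pairs.add(new String[] {item.substring(0, slash), item.substring(slash + 1)});
        }
        return pairs;
    }

    // Joins consecutive words sharing a label into "Label=words", skipping "0".
    static List<String> entities(String tagged) {
        List<String> result = new ArrayList<>();
        String currentLabel = null;
        StringBuilder words = new StringBuilder();
        for (String[] pair : parse(tagged)) {
            if (!pair[1].equals(currentLabel)) {
                flush(result, currentLabel, words);
                currentLabel = pair[1];
                words.setLength(0);
            }
            if (words.length() > 0) {
                words.append(' ');
            }
            words.append(pair[0]);
        }
        flush(result, currentLabel, words);
        return result;
    }

    // Adds the entity collected so far, unless it is empty or labeled "0".
    private static void flush(List<String> result, String label, StringBuilder words) {
        if (label != null && !label.equals("0") && words.length() > 0) {
            result.add(label + "=" + words);
        }
    }

    public static void main(String[] args) {
        // Prints: [ModelName=lcd 52, Category=tv]
        System.out.println(entities("lcd/ModelName 52/ModelName inch/0 tv/Category"));
    }
}
```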
