Following up on Part of Speech Tagging with NLTK – Ngram Taggers, I test how adding an Affix Tagger and a Regexp Tagger to the SequentialBackoffTagger chain affects accuracy.
NLTK Affix Tagger
The AffixTagger learns prefix and suffix patterns to determine the part of speech tag for a word. I tried inserting the affix tagger into every possible position of the ubt_tagger to see which position increased accuracy the most. As you'll see in the results, the aubt_tagger had the highest accuracy.
    import nltk

    # backoff_tagger and train_sents come from the earlier Ngram Taggers post.
    ubta_tagger = backoff_tagger(train_sents, [
        nltk.tag.UnigramTagger, nltk.tag.BigramTagger,
        nltk.tag.TrigramTagger, nltk.tag.AffixTagger])
    ubat_tagger = backoff_tagger(train_sents, [
        nltk.tag.UnigramTagger, nltk.tag.BigramTagger,
        nltk.tag.AffixTagger, nltk.tag.TrigramTagger])
    uabt_tagger = backoff_tagger(train_sents, [
        nltk.tag.UnigramTagger, nltk.tag.AffixTagger,
        nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
    aubt_tagger = backoff_tagger(train_sents, [
        nltk.tag.AffixTagger, nltk.tag.UnigramTagger,
        nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
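To make "learns suffix patterns" concrete, here is a rough pure-Python sketch of the idea. It is not NLTK's implementation: the function names, the toy training data, and the three-character suffix with a two-character minimum stem are assumptions chosen for illustration (as far as I know they mirror the AffixTagger defaults).

```python
from collections import Counter, defaultdict

def train_suffix_model(tagged_sents, suffix_length=3, min_stem_length=2):
    """Map each word-final suffix to the tag it most often carries."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            # Skip words too short to leave a stem before the suffix.
            if len(word) < suffix_length + min_stem_length:
                continue
            counts[word[-suffix_length:]][tag] += 1
    return {suffix: tags.most_common(1)[0][0]
            for suffix, tags in counts.items()}

def tag_by_suffix(model, word, suffix_length=3, min_stem_length=2):
    """Return the learned tag for the word's suffix, or None if undecided."""
    if len(word) < suffix_length + min_stem_length:
        return None
    return model.get(word[-suffix_length:])

# Toy training data, invented for illustration only.
toy_sents = [
    [("walking", "VBG"), ("quickly", "RB")],
    [("running", "VBG"), ("slowly", "RB")],
    [("kindness", "NN"), ("sing", "VB")],
]
model = train_suffix_model(toy_sents)
print(tag_by_suffix(model, "jumping"))  # VBG
print(tag_by_suffix(model, "zebra"))    # None
```

Returning None for an unseen suffix is what lets a tagger like this sit in a backoff chain: the next tagger in line gets a chance to answer.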
NLTK Regexp Tagger
The RegexpTagger allows you to define your own word patterns for determining the part of speech tag. Some of the patterns defined below were taken from chapter 3 of the NLTK book; others I added myself. Since I had already determined that the aubt_tagger was the most accurate, I only tested the regexp tagger at the beginning and end of the POS tagger chain.
    word_patterns = [
        (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
        (r'.*ould$', 'MD'),
        (r'.*ing$', 'VBG'),
        (r'.*ed$', 'VBD'),
        (r'.*ness$', 'NN'),
        (r'.*ment$', 'NN'),
        (r'.*ful$', 'JJ'),
        (r'.*ious$', 'JJ'),
        (r'.*ble$', 'JJ'),
        (r'.*ic$', 'JJ'),
        (r'.*ive$', 'JJ'),
        (r'.*est$', 'JJ'),
        (r'^a$', 'PREP'),
    ]

    # Regexp tagger is consulted first and backs off to aubt_tagger.
    aubtr_tagger = nltk.tag.RegexpTagger(word_patterns, backoff=aubt_tagger)

    # Regexp tagger is passed in as the chain's initial backoff.
    raubt_tagger = backoff_tagger(train_sents, [
        nltk.tag.AffixTagger, nltk.tag.UnigramTagger,
        nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
        backoff=nltk.tag.RegexpTagger(word_patterns))
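Where the regexp tagger sits in the chain matters, because a chain uses the first tagger that returns a tag. The sketch below illustrates this in plain Python; it is not NLTK code. The known_words lookup is a hypothetical stand-in for a trained tagger, and the helper names are made up for the example.

```python
import re

# A few patterns in the same (regex, tag) shape; the first match wins.
patterns = [
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
]

def regexp_tag(word):
    """Return the tag of the first pattern matching the word, else None."""
    for regex, tag in patterns:
        if re.match(regex, word):
            return tag
    return None

# Hypothetical stand-in for a trained tagger: a lookup of known words.
known_words = {'red': 'JJ', 'king': 'NN'}

def lookup_tag(word):
    """Return the known tag for the word, or None if it was never seen."""
    return known_words.get(word)

def tag_with_chain(word, chain):
    """Try each tagger in order; the first non-None answer is used."""
    for tagger in chain:
        tag = tagger(word)
        if tag is not None:
            return tag
    return None

regexp_first = [regexp_tag, lookup_tag]  # patterns override learned tags
regexp_last = [lookup_tag, regexp_tag]   # patterns only fill the gaps

print(tag_with_chain('red', regexp_first))    # VBD
print(tag_with_chain('red', regexp_last))     # JJ
print(tag_with_chain('jumped', regexp_last))  # VBD
```

With the patterns up front, a broad rule like `.*ed$` overrides what the trained tagger knows about "red". As a last resort, the same rule only applies to words nothing else could tag.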
NLTK Affix and Regexp Tagging Accuracy
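Assuming accuracy is measured the way NLTK's taggers report it, as the fraction of test tokens that receive the gold tag, the computation can be sketched as follows. The function name, the toy data, and the word-by-word tagger are assumptions for illustration; a real tagger works on whole sentences so it can use context.

```python
def tagging_accuracy(tag_word, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        for word, gold_tag in sent:
            total += 1
            if tag_word(word) == gold_tag:
                correct += 1
    return correct / total if total else 0.0

# Toy gold data and a toy tagger, both invented for illustration.
gold = [
    [('the', 'DT'), ('dog', 'NN')],
    [('barked', 'VBD'), ('loudly', 'RB')],
]
toy_tagger = {'the': 'DT', 'dog': 'NN', 'barked': 'VBD'}.get
print(tagging_accuracy(toy_tagger, gold))  # 0.75
```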
Conclusion
In my tests, the aubt_tagger provided the largest gain over the ubt_tagger, and the raubt_tagger added a slight gain on top of that. In Part of Speech Tagging with NLTK – Brill Tagger I discuss the results of using the BrillTagger to push the accuracy even higher.