Part of Speech Tagging with NLTK Part 2 – Regexp and Affix Taggers
Following up on Part of Speech Tagging with NLTK - Ngram Taggers, I test the accuracy of adding an Affix Tagger and a Regexp Tagger to the SequentialBackoffTagger chain.
NLTK Affix Tagger
The AffixTagger learns prefix and suffix patterns to determine the part of speech tag for a word. I tried inserting the affix tagger into every possible position of the ubt_tagger chain to see which position increased accuracy the most. As you'll see in the results, the aubt_tagger had the highest accuracy.
ubta_tagger = backoff_tagger(train_sents,
    [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.AffixTagger])
ubat_tagger = backoff_tagger(train_sents,
    [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.AffixTagger, nltk.tag.TrigramTagger])
uabt_tagger = backoff_tagger(train_sents,
    [nltk.tag.UnigramTagger, nltk.tag.AffixTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
aubt_tagger = backoff_tagger(train_sents,
    [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
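To make the idea of learning suffix patterns concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: the function names, the toy training data, and the suffix_len and min_stem_len parameters are all assumptions made for illustration. It records the most frequent tag for each word-final suffix and returns None for words it cannot handle, which is the point where a backoff chain would hand the word to the next tagger.

```python
from collections import Counter, defaultdict


def train_suffix_model(tagged_sents, suffix_len=3, min_stem_len=2):
    """Map each word-final suffix to the tag it most often carries."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            # Skip words too short to leave a stem in front of the suffix.
            if len(word) < suffix_len + min_stem_len:
                continue
            counts[word[-suffix_len:]][tag] += 1
    return {suffix: tags.most_common(1)[0][0] for suffix, tags in counts.items()}


def tag_by_suffix(words, model, suffix_len=3, min_stem_len=2):
    """Tag each word from its suffix; None means 'defer to the next tagger'."""
    tagged = []
    for word in words:
        if len(word) < suffix_len + min_stem_len:
            tagged.append((word, None))
        else:
            tagged.append((word, model.get(word[-suffix_len:])))
    return tagged


# Toy training data, invented for illustration only.
toy_sents = [
    [("running", "VBG"), ("quickly", "RB")],
    [("jumping", "VBG"), ("slowly", "RB")],
    [("the", "DT"), ("singing", "VBG")],
]
model = train_suffix_model(toy_sents)
result = tag_by_suffix(["walking", "badly", "cat"], model)
print(result)
```

In this toy run, "walking" picks up a tag from the "ing" suffix seen in training, while "badly" (unseen suffix) and "cat" (too short) come back untagged.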
NLTK Regexp Tagger
The RegexpTagger allows you to define your own word patterns for determining the part of speech tag. Some of the patterns defined below were taken from chapter 3 of the NLTK book; others I added myself. Since I had already determined that the aubt_tagger was the most accurate, I only tested the regexp tagger at the beginning and end of the POS tagger chain.
word_patterns = [
(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
(r'.*ould$', 'MD'),
(r'.*ing$', 'VBG'),
(r'.*ed$', 'VBD'),
(r'.*ness$', 'NN'),
(r'.*ment$', 'NN'),
(r'.*ful$', 'JJ'),
(r'.*ious$', 'JJ'),
(r'.*ble$', 'JJ'),
(r'.*ic$', 'JJ'),
(r'.*ive$', 'JJ'),
(r'.*est$', 'JJ'),
(r'^a$', 'PREP'),
]
aubtr_tagger = nltk.tag.RegexpTagger(word_patterns, backoff=aubt_tagger)
raubt_tagger = backoff_tagger(train_sents,
    [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
    backoff=nltk.tag.RegexpTagger(word_patterns))
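As a rough illustration of how pattern-based tagging behaves, the sketch below applies a handful of the patterns with Python's re module, taking the first pattern that matches each word. It is a simplified stand-in rather than the RegexpTagger itself, and the regexp_tag function and sample words are made up for the example. The number pattern here escapes the decimal point, so a token such as 3x14 is not mistaken for a number.

```python
import re

# A subset of the word patterns, checked in order; the first match wins.
patterns = [
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
    (r'.*ould$', 'MD'),
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*ness$', 'NN'),
]
compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]


def regexp_tag(words):
    """Tag each word with the first matching pattern's tag, or None."""
    tagged = []
    for word in words:
        tag = None
        for regex, candidate in compiled:
            if regex.match(word):
                tag = candidate
                break
        tagged.append((word, tag))
    return tagged


demo = regexp_tag(["3.14", "would", "bedding", "kindness", "cat", "3x14"])
print(demo)
```

Words that match no pattern, such as "cat", come back with None, which is where a backoff tagger would take over in a real chain.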
NLTK Affix and Regexp Tagging Accuracy
Conclusion
As you can see, the aubt_tagger provided the most gain over the ubt_tagger, and the raubt_tagger had a slight gain on top of that. In Part of Speech Tagging with NLTK - Brill Tagger I discuss the results of using the BrillTagger to push the accuracy even higher.




