Part of Speech Tagging with NLTK – Part 2
Following up on Part of Speech Tagging with NLTK – Part 1, I test the accuracy of adding an AffixTagger and a RegexpTagger to my SequentialBackoffTagger chain.
Affix Tagging
The AffixTagger learns prefix and suffix patterns to determine the part of speech tag for word. I tried inserting the AffixTagger into every possible position of the ubt_tagger to see which method increased accuracy the most. As you’ll see in the results, the aubt_tagger had the highest accuracy.
ubta_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.AffixTagger]) ubat_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.AffixTagger, nltk.tag.TrigramTagger]) uabt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.AffixTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger]) aubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
Regexp Tagging
The RegexpTagger allows you to define your own word patterns for determining the part of speech tag. Some of the patterns defined below were taken from chapter 3 of the NLTK book, others I added myself. Since I had already determined that the aubt_tagger was the most accurate, I only tested the RegexpTagger at the beginning and end of the chain.
word_patterns = [
(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
(r'.*ould$', 'MD'),
(r'.*ing$', 'VBG'),
(r'.*ed$', 'VBD'),
(r'.*ness$', 'NN'),
(r'.*ment$', 'NN'),
(r'.*ful$', 'JJ'),
(r'.*ious$', 'JJ'),
(r'.*ble$', 'JJ'),
(r'.*ic$', 'JJ'),
(r'.*ive$', 'JJ'),
(r'.*ic$', 'JJ'),
(r'.*est$', 'JJ'),
(r'^a$', 'PREP'),
]
aubtr_tagger = nltk.tag.RegexpTagger(word_patterns, backoff=aubt_tagger)
raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
backoff=nltk.tag.RegexpTagger(word_patterns))
Affix and Regexp Tagging Accuracy
Conclusion
As you can see, the aubt_tagger provided the most gain over the ubt_tagger, and the raubt_tagger had a slight gain on top of that. In Part 3 I’ll discuss the results of using the BrillTagger to push the accuracy even higher.
Related Reading:










