Part of Speech Tagging with NLTK Part 2 – Regexp and Affix Taggers

Following up on Part of Speech Tagging with NLTK – Ngram Taggers, I test the accuracy of adding an Affix Tagger and a Regexp Tagger to the SequentialBackoffTagger chain.

NLTK Affix Tagger

The AffixTagger learns prefix and suffix patterns to determine the part of speech tag for word. I tried inserting the affix tagger into every possible position of the ubt_tagger to see which method increased accuracy the most. As you’ll see in the results, the aubt_tagger had the highest accuracy.

ubta_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.AffixTagger])
ubat_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.AffixTagger, nltk.tag.TrigramTagger])
uabt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.AffixTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
aubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])

NLTK Regexp Tagger

The RegexpTagger allows you to define your own word patterns for determining the part of speech tag. Some of the patterns defined below were taken from chapter 3 of the NLTK book, others I added myself. Since I had already determined that the aubt_tagger was the most accurate, I only tested the regexp tagger at the beginning and end of the pos tagger chain.

word_patterns = [
	(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
	(r'.*ould$', 'MD'),
	(r'.*ing$', 'VBG'),
	(r'.*ed$', 'VBD'),
	(r'.*ness$', 'NN'),
	(r'.*ment$', 'NN'),
	(r'.*ful$', 'JJ'),
	(r'.*ious$', 'JJ'),
	(r'.*ble$', 'JJ'),
	(r'.*ic$', 'JJ'),
	(r'.*ive$', 'JJ'),
	(r'.*ic$', 'JJ'),
	(r'.*est$', 'JJ'),
	(r'^a$', 'PREP'),
]

aubtr_tagger = nltk.tag.RegexpTagger(word_patterns, backoff=aubt_tagger)
raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
    backoff=nltk.tag.RegexpTagger(word_patterns))

NLTK Affix and Regexp Tagging Accuracy

Conclusion

As you can see, the aubt_tagger provided the most gain over the ubt_tagger, and the raubt_tagger had a slight gain on top of that. In Part of Speech Tagging with NLTK – Brill Tagger I discuss the results of using the BrillTagger to push the accuracy even higher.