Updated accuracies for POS taggers

Last fall, I published on my personal blog a series of accuracy scores for the CLTK’s POS tagging models. Dr. Giuseppe G. A. Celano, a member of Perseus’s treebank team, recently pointed out to me that the training set I used to make the taggers contained duplicates. This skewed the accuracy scores, though not the taggers themselves.

The Perseus files, kept here for Greek and here for Latin, contain a single XML file for each tagged work plus one compilation file (agdt-1.7.xml for Greek, ldt-1.5.xml for Latin) that brings all the individual files together. In training and testing the taggers, I had combined all XML files in the directory, the works plus their compilation, when I should have included only the compilation files.
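As a rough illustration of the fix, here is a minimal sketch of selecting only the compilation files from a directory listing. This is not the code used to build the CLTK taggers; the function name select_training_files and the per-work file names are hypothetical, and only the two compilation file names come from the Perseus data described above.

```python
from pathlib import PurePath

# Compilation files named above; each already contains every individual work.
COMPILATION_FILES = {"agdt-1.7.xml", "ldt-1.5.xml"}

def select_training_files(paths):
    """Keep only the compilation files, dropping the per-work XML files."""
    return [p for p in paths if PurePath(p).name in COMPILATION_FILES]

# The per-work file names here are made up for illustration.
listing = ["greek/work_a.xml", "greek/work_b.xml", "greek/agdt-1.7.xml"]
print(select_training_files(listing))
```

Filtering on an explicit list of names, rather than globbing every XML file, is what keeps each sentence from entering the training set twice.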

I am happy to report that the CLTK taggers have been retrained and that they behave nearly identically to the earlier versions. This is because the duplicated entries gave the model no new information about a particular word’s POS tag or those of its neighbors.

The testing corpus I had used, however, was considerably skewed: in effect, the model was being tested on material it had already seen. With cross-validation run again (Greek and Latin notebooks), the CLTK’s POS taggers’ average accuracies (mean over 10 iterations) are as follows.
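To make the skew concrete, here is a minimal, self-contained sketch of cross-validation with a toy unigram tagger. It is not the code from the notebooks, which may split the data differently. The function names, the fold scheme, and the choice to count unseen words as wrong are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training sentences."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def accuracy(model, tagged_sents):
    """Fraction of tokens tagged correctly; unseen words count as wrong."""
    total = correct = 0
    for sent in tagged_sents:
        for word, tag in sent:
            total += 1
            correct += model.get(word) == tag
    return correct / total if total else 0.0

def cross_validate(tagged_sents, k=10, seed=0):
    """Mean accuracy over k folds, each fold held out of training once.

    Assumes the corpus has at least k sentences.
    """
    sents = list(tagged_sents)
    random.Random(seed).shuffle(sents)
    scores = []
    for i in range(k):
        test = sents[i::k]
        train = [s for j, s in enumerate(sents) if j % k != i]
        scores.append(accuracy(train_unigram(train), test))
    return sum(scores) / k
```

If the corpus passed to cross_validate contains every sentence twice, most held-out sentences also have a copy in the training folds, so the mean accuracy rises even though the tagger has learned nothing new. That is the effect the duplicated Perseus files had on my earlier scores.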

Greek accuracy

Tagger	Accuracy
unigram	0.815227
bigram	0.255973
trigram	0.195761
1, 2, 3-gram backoff	0.818673
tnt	0.828682

Latin accuracy

Tagger	Accuracy
unigram	0.679296
bigram	0.102156
trigram	0.075027
1, 2, 3-gram backoff	0.686440
tnt	0.700846

The TnT algorithm returns the best results, at 83% accuracy for Greek and 70% for Latin. These are not bad scores, it seems to me, but they are something of a letdown from the scores in the high 90s that I claimed earlier!

To get the latest taggers, update your greek_models_cltk and latin_models_cltk files (directions here).

I give sincere thanks to Giuseppe and also apologize to anyone who was misled by my mistake.