Last fall, I published on my personal blog a series of accuracy scores for the CLTK’s POS tagging models. It was recently pointed out to me by Dr. Giuseppe G. A. Celano, a member of Perseus’s treebank team, that the training set I used to build the taggers contained duplicate data. This skewed the reported accuracy scores, though not the taggers themselves.
The Perseus files, kept here for Greek and here for Latin, contain a single XML file for each tagged work, plus one compilation file (agdt-1.7.xml for Greek, ldt-1.5.xml for Latin) that gathers all of the individual files together. In training and testing the taggers, I had combined every XML file in the directory, the individual works plus their compilation, when I should have used only the compilation files.
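The fix amounts to filtering the file list before training rather than globbing the whole directory. A minimal sketch (the helper name and directory layout here are hypothetical, not the actual training scripts):

```python
from pathlib import Path

def training_files(treebank_dir, compilation_name):
    """Return only the compilation file, not every XML in the directory.

    Globbing '*.xml' picks up both the per-work files and the
    compilation that already contains them, so every sentence would
    be counted twice.
    """
    all_xml = sorted(Path(treebank_dir).glob("*.xml"))
    return [p for p in all_xml if p.name == compilation_name]

# Hypothetical layout mirroring the Perseus Greek treebank directory:
#   greek/work1.xml, greek/work2.xml, greek/agdt-1.7.xml
```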
I am happy to report that the CLTK taggers have been retrained and that their behavior is near-identical to what it was. This is because the duplicate files did not give the model any new information about a particular word’s POS tag or those of its neighbors.
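A toy illustration of why duplicated training data leaves a count-based model unchanged (this uses a simple most-frequent-tag unigram model, not the CLTK’s actual taggers): doubling every sentence doubles every count, so the most likely tag for each word stays the same.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

corpus = [[("arma", "N"), ("virumque", "N"), ("cano", "V")]]
model_once = train_unigram(corpus)
model_twice = train_unigram(corpus * 2)  # every sentence duplicated
assert model_once == model_twice  # doubled counts, same argmax
```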
The testing corpus I had used, however, was considerably skewed. Essentially, the model was being tested on material it had already seen. With cross-validation run again (Greek and Latin notebooks), the CLTK’s POS taggers’ averages (mean over 10 iterations) are as follows.
| Language | Tagger | Accuracy |
|----------|--------|----------|
| Greek | 1, 2, 3-gram backoff | 0.818673 |
| Latin | 1, 2, 3-gram backoff | 0.686440 |
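The cross-validation itself can be sketched as follows. A toy unigram tagger stands in for the real models, and the fold logic is my assumption here, not necessarily the notebooks’ exact splits; the point is that with disjoint held-out folds, no test sentence appears in its own training set.

```python
import random
from collections import Counter, defaultdict
from statistics import mean

def train(tagged_sents):
    """Toy stand-in for tagger training: word -> most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def accuracy(model, tagged_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    pairs = [(w, t) for sent in tagged_sents for w, t in sent]
    return sum(model.get(w) == t for w, t in pairs) / len(pairs)

def cross_validate(tagged_sents, folds=10, seed=0):
    """Mean accuracy over `folds` disjoint held-out splits."""
    sents = tagged_sents[:]
    random.Random(seed).shuffle(sents)
    scores = []
    for i in range(folds):
        held_out = sents[i::folds]  # every folds-th sentence is test data
        training = [s for j, s in enumerate(sents) if j % folds != i]
        scores.append(accuracy(train(training), held_out))
    return mean(scores)
```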
The TnT algorithm returns the best results, at 83% accuracy for Greek and 70% for Latin. These are not bad scores, it seems to me, but they are something of a letdown from the figures in the high 90s that I claimed earlier!
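For readers curious what “1, 2, 3-gram backoff” means: the tagger first looks for the word in a trigram context (the two previous tags plus the word), then backs off to a bigram context, then to the word alone. The CLTK builds these with NLTK’s tagger classes; the following is my own from-scratch illustration of the idea, not the actual implementation.

```python
from collections import Counter, defaultdict

def train_ngram(tagged_sents, n):
    """Map (previous n-1 tags + word) contexts to their most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        tags = [t for _, t in sent]
        for i, (word, tag) in enumerate(sent):
            context = tuple(tags[max(0, i - n + 1):i]) + (word,)
            counts[context][tag] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_with_backoff(sent, models):
    """Tag a sentence, trying the trigram model, then bigram, then unigram.

    `models` is [trigram, bigram, unigram], each built by train_ngram.
    Unseen words come out tagged None.
    """
    out = []
    for i, word in enumerate(sent):
        tag = None
        for n, model in zip((3, 2, 1), models):
            context = tuple(t for _, t in out[max(0, i - n + 1):i]) + (word,)
            tag = model.get(context)
            if tag is not None:
                break
        out.append((word, tag))
    return out
```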
To get the latest taggers, update your latin_models_cltk files (directions here).
I give sincere thanks to Giuseppe, and I apologize to anyone who was misled by my mistake.