Tokenizing Latin text

Note: The following is re-posted from Patrick’s blog, Disjecta Membra.

One of the first tasks in any text analysis project is tokenization: we take our text as a whole and convert it to a list of smaller units, or tokens. When dealing with Latin (or at least digitized versions of modern editions, like those found in the Perseus Digital Library, the Latin Library, etc.), paragraph- and sentence-level tokenization present little problem. Paragraphs are usually well marked and can be split on newlines (\n). Sentences in modern Latin editions use the same punctuation set as English (i.e., ‘.’, ‘?’, and ‘!’), so most sentence-level tokenization can be done more or less successfully with the built-in tools found in the Natural Language Toolkit (NLTK), e.g. nltk.sent_tokenize. But just as in English, Latin word tokenization raises small, specific issues that are not addressed by NLTK. The classic case in English is the negative contraction. How do we want to handle, for example, “didn’t”: as [“didn’t”], [“did”, “n’t”], or [“did”, “not”]?
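To make the paragraph and sentence steps concrete, here is a minimal sketch that uses only the Python standard library. It is not how NLTK splits sentences; it simply breaks on newlines and then after ‘.’, ‘?’, or ‘!’ followed by whitespace. The function names and the sample text are illustrative assumptions.

```python
import re

def split_paragraphs(text):
    """Split on newlines, dropping empty lines."""
    return [p.strip() for p in text.split("\n") if p.strip()]

def split_sentences(paragraph):
    """Naively split after '.', '?' or '!' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", paragraph) if s]

# Illustrative sample: two paragraphs separated by a newline.
sample = "Gallia est omnis divisa in partes tres. Quid est?\nVeni, vidi, vici!"

for paragraph in split_paragraphs(sample):
    print(split_sentences(paragraph))
```

A rule this simple will break on abbreviations such as “M. Tullius”, which is one reason to prefer a trained sentence tokenizer for real texts.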

There are four important cases in which Latin word tokenization demands special attention: the enclitics “-que”, “-ue/-ve”, and “-ne” and the postpositive use of “-cum” with the personal pronouns (e.g. nobiscum for *cum nobis). The Classical Language Toolkit (CLTK) now takes these cases into consideration when doing Latin word tokenization. Below is a brief how-to on using the CLTK to tokenize your Latin texts by word. [The tutorial assumes the following requirements: Python 3, NLTK 3, and CLTK.]
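To see why these cases need special handling, here is a toy sketch of suffix stripping with an exception list, covering only “-que” and the “-cum” pronoun forms. It is not the CLTK's implementation: the function name, the tiny exception list, and the choice to return “cum” before the pronoun are all assumptions made for illustration.

```python
# Hypothetical, deliberately short list of words that end in "que"
# but are not a word plus the enclitic.
QUE_EXCEPTIONS = {"atque", "neque", "quoque", "denique", "itaque", "usque"}

# Personal pronouns with postpositive -cum.
CUM_FORMS = {"mecum", "tecum", "secum", "nobiscum", "vobiscum"}

def split_token(token):
    """Return the token split into parts, or unchanged in a one-item list."""
    lower = token.lower()
    if lower in CUM_FORMS:
        # Assumed output order: preposition first, then the pronoun.
        return ["cum", token[:-3]]
    if lower.endswith("que") and len(lower) > 3 and lower not in QUE_EXCEPTIONS:
        return [token[:-3], "-que"]
    return [token]

words = "Arma virumque cano atque nobiscum".split()
print([part for word in words for part in split_token(word)])
```

The enclitics “-ne” and “-ve” could be handled the same way, but they need far longer exception lists, since many ordinary words (sine, bene, nomine) end in those letters.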

Tokenizing Latin text with CLTK

We could simply use Python to split our texts into a list of tokens. (And sometimes this will be enough!) So …

>>> text = "Arma virumque cano, Troiae qui primus ab oris"

>>> text.split()

['Arma', 'virumque', 'cano,', 'Troiae', 'qui', 'primus', 'ab', 'oris']

A good start, but the punctuation is not handled well: the comma between cano and Troiae is fused to the token 'cano,' rather than kept as a token of its own. This might be ok, but let’s use NLTK’s tokenizer to separate the punctuation.

>>> import nltk

>>> nltk.word_tokenize(text)

['Arma', 'virumque', 'cano', ',', 'Troiae', 'qui', 'primus', 'ab', 'oris']

Using word_tokenize, the comma becomes a token of its own. Otherwise, the division of words is the same.
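For a simple line like this one, a rough standard-library approximation is a regular expression that matches either a run of word characters or a single punctuation mark. This is only a sketch and not how NLTK tokenizes internally; the pattern is an assumption chosen for this example.

```python
import re

text = "Arma virumque cano, Troiae qui primus ab oris"

# Match a run of word characters, or one non-space, non-word character.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```

The pattern treats every punctuation mark as a separate token, which suits this line of verse but may not suit texts with abbreviations or editorial brackets.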

But for someone working with Latin, that second token is an issue. Do we really want virumque? Or are we looking for virum and the enclitic -que? In many cases, it will be the latter. Let’s use CLTK to handle this.

>>> from cltk.tokenize.word import WordTokenizer

>>> word_tokenizer = WordTokenizer('latin')

>>> word_tokenizer.tokenize(text)

['Arma', 'virum', '-que', 'cano', ',', 'Troiae', 'qui', 'primus', 'ab', 'oris']

Using the CLTK WordTokenizer for Latin, we retain the punctuation and also split virumque into virum and the enclitic -que.