Blog

Sep 27, 2025
Announcing CLTK 2.0: NLP for *all* pre‑modern languages

Using the same API and architecture patterns, I have rewritten almost every line of code. A small breakthrough I found is that generative LLMs are able to perform the core NLP tasks that this project aims for: part-of-speech tagging and dependency parsing. The potentially enormous breakthrough is that even generalist models (e.g., ChatGPT, Llama) can perform these tasks for a multitude, if not the majority, of ancient, classical, and medieval languages.

Features
- Generative models support 105 languages! See here for all of them
- For the backend, one may choose OpenAI/ChatGPT, Mistral, or any model that Ollama runs (e.g., Llama). I...
- Dec 14, 2021
  Presentation at Digital Classicist Berlin
  
  Three of us – Kyle Johnson, Clément Besnier, and Todd Cook – presented to the Digital Classicist Berlin program at the Berlin–Brandenburgischen Akademie der Wissenschaften.
  - Kyle’s slides on the CLTK, generally
  - Clément’s notebook on making a custom Process and examples using Old Norse
  - Todd’s slides on MLOps and large language models
- Sep 22, 2021
  CLTK v. 1.0 and ACL Publication
  
  Last month, the annual ACL conference published the CLTK’s de facto “white paper” (“The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages”).
  
  Some time prior to this, we also officially promoted our version 1.0 to “production”.
- Jul 5, 2020
  Announcing 'alpha' release of CLTK v. 1.0
  
  The core maintainers are pleased to announce the first pre-release of the CLTK’s version 1.0. More information will follow, but at the highest level the guiding principles have been (a) to add a single pre-configured interface and (b) to fully rationalize the software’s architecture so that new languages can be added with a minimum of friction.
  
  Preferably in a new virtual environment, in either Python 3.7 or 3.8, pull the latest “alpha” with:
  
  $ pip install --pre cltk
  
  The docs should be enough to begin stress-testing the new code:
  - Docs: https://alpha.cltk.org
  - Source: https://github.com/cltk/cltkv1 (Note: Not our main repo!)
  ...
- Dec 30, 2018
  On under-resourced languages and the CLTK
  
  The CLTK has as a central goal to provide complete NLP coverage of all pre-modern languages. In practice, this ambitious goal needs to be tempered by the availability of language resources, digital and human. With some frequency, especially around the time of Google Summer of Code applications, we are approached by potential contributors who hope to pitch in by adding NLP tools for a given language. Over the past six years, those of us centrally involved in the project have learned a great deal about what characteristics distinguish a successful project from an unsuccessful one. I’ll describe these characteristics in some detail below,...
- Apr 27, 2018
  Announcing Google Summer of Code Projects for 2018
  
  The Classical Language Toolkit is happy to present our Google Summer of Code projects for 2018:
  - Eleftheria Chatziargyriou (Aristotle University of Thessaloniki): Extending NLP functionality for Germanic Languages.
  - Andrew Deloucas (Harvard University): The Road to CDLI’s Corpora Integration into CLTK: An Undertaking
  - James Gawley (University of Buffalo): Expanding the CLTK with Synonyms, Translations and Word Embeddings
  As we see it, this year’s projects each have the potential to move the CLTK forward in its own way.
  
  Eleftheria’s project takes on one of the CLTK’s primary development paths, extending the core tools to...
- May 11, 2017
  CLTK announces GSoC 2017 students
  
  We’re happy to announce two exceptional students, Charles Pletcher and Natasha Voake, whom the CLTK will mentor for the 2017 Google Summer of Code. (See official GSoC page here.)
  
  By the end of the summer, if not before, we will offer updates on what Charles and Natasha accomplish.
  
  Charles Pletcher
  
  Proposal:
  
  I plan to build a flexible platform for teachers to annotate texts in the CLTK Archive. The purpose of the annotations is twofold: first, to guide the students in their (re)readings of the text; and second, to provide a ready-to-hand...
- Mar 1, 2017
  CLTK in Google Summer of Code 2017
  
  We are thrilled to announce that the CLTK will once again be participating in the Google Summer of Code. The four of us will be acting as mentors to the forthcoming students.
  
  See our organization’s official GSoC page for more information, but here is our call for projects:
  
  See our Project ideas page for a list of GSoC tasks that are suited to three months’ work for a beginning-to-intermediate programmer or language student. What follows is a high-level overview of these projects and a few tips on applying. Most work will...
- Jun 12, 2016
  Guest post: Descartes meet Python. Python meet Descartes.
  
  Note: The following is re-posted from Peter’s website, ithaca.
  
  This summer I’m working on a commentary on Descartes’ Meditationes de prima philosophia, usually known in English as his Meditations on First Philosophy. (Though actually a better translation is the less literal Metaphysical Meditations, which is how it’s usually translated into French.) In addition to providing a text and commentary, I plan to produce a vocabulary. This is a time-consuming and error-prone job, so naturally I want to offload at least some of the grunt work to a computer. (See laziness as virtue for programmers.)
  
  As a start,...
- Mar 27, 2016
  Analyzing Latin clausulae
  
  Latin prose authors since Cicero often employed rhythms at the end of their periods to produce sonorous effects like those found in poetry. These rhythms not only add to the elegance of an author’s style, but they even often help characterize a specific style. The prose rhythm preferences of some Roman authors are so distinct that they are often used in matters of textual emendation and author attribution, and so philologists for the last century have spent considerable time cataloguing the rhythm of prose with the hope that better textual judgements may be made with more data.
  
  The CLTK now...
- Feb 29, 2016
  CLTK participating in Google Summer of Code!
  
  We are thrilled to announce that the CLTK has been accepted to Google Summer of Code 2016. This is a tremendous opportunity for Classics students’ careers, an affirmation of the CLTK’s vision, and a chance for us, as mentors, to give back.
  
  GSoC funds students to work for three months on an approved open source project. These students may be at the undergraduate or graduate level, specializing in any academic field, and hailing from any accredited college or university, from any country. We do not yet know the number of students Google will fund...
- Feb 23, 2016
  New projects for Google Summer of Code
  
  Google Summer of Code is a program which gives students a stipend to do three months of work for an approved open source organization. The CLTK has submitted an application for inclusion into the ranks of approved projects. (See all sorts of info about past projects and participants here.)
  
  While we will not know whether we are accepted for another few days, these projects, on the project ideas page of the CLTK wiki, should serve as good ideas for anyone wishing to make a major, substantive contribution. These can also be found on the CLTK’s Issues...
- Feb 19, 2016
  Two recent CLTK lectures
  
  Note: The following is re-posted from Kyle’s personal website.
  
  I have given two lectures on the CLTK over the past few months and should post them before too much time has gone by.
  
  The first was last November, when I gave a guest lecture introducing NLP and the CLTK to Peter Meineck’s graduate class at NYU. This was a lot of fun, as I had time to dialog with the class and explore some texts (some plays of Aeschylus) together. The Jupyter notebooks I prepared are on GitHub. Here’s the lecture’s slide deck,...
- Aug 2, 2015
  Tokenizing Latin text
  
  Note: The following is re-posted from Patrick’s blog, Disjecta Membra.
  
  One of the first tasks necessary in any text analysis project is tokenization: we take our text as a whole and convert it to a list of smaller units, or tokens. When dealing with Latin (or at least digitized versions of modern editions, like those found in the Perseus Digital Library, the Latin Library, etc.), paragraph- and sentence-level tokenization present little problem. Paragraphs are usually well marked and can be split by newlines (\n). Sentences in modern Latin editions use the same punctuation set as English, i.e.,...
- Aug 2, 2015
  Updated accuracies for POS taggers
  
  Last fall, I published on my personal blog a series of accuracy scores for the CLTK’s POS tagging models. It was recently pointed out to me by Dr. Giuseppe G. A. Celano, a member of Perseus’s treebank team, that the training set I used to make the taggers contained duplicates. This skewed the accuracy scores, though not the taggers themselves.
  
  The Perseus files, kept here for Greek and here for Latin, contain a single XML for each tagged work and one file, agdt-1.7.xml for Greek and ldt-1.5.xml for Latin, for...
- Jul 11, 2015
  New Latin prosody scanner
  
  After six months of development, Bradley Baker and I have completed a Latin prose scanner for the CLTK. The module accepts any macronized Latin text and returns to the user the resulting scansion.
  
  We designed it as a component for my forthcoming undergraduate thesis on the prose rhythms of Demosthenes and Cicero. Since my intention was to scan prose rhythms, the program ignores some features of poetic prosody, such as hiatus. Future versions may include extra features for scanning Latin poetry.
  
  The basic algorithm used in the module is rather straightforward. An input string’s words are first tokenized (via the...
- May 10, 2015
  New Greek and Latin lemmatizer
  
  I’m happy to announce that the CLTK has a new and much improved (faster and more accurate) lemmatizer for the Latin and, now too, Greek languages.
  
  With any lemmatizer, choices must be made about which headword to select in the case that a lemma is ambiguous, that is, could belong to more than one headword. In the event of such ambiguity, the previous lemmatizer took whatever headword it found first. Now, the lemmatizer (or rather the code which created the lemmatizer’s key-value data store) takes into consideration all possibilities and selects the headword which has the most frequent occurrences in...
- Mar 28, 2015
  Welcome to the CLTK homepage!
  
  I am excited to have this little website to bring useful information to users of the CLTK. As the project continues to grow, I hope users can share tutorials, code snippets, etc.
  
  If you are interested in authoring a post, you can send your text (preferably in Markdown) to me by email (kyle@kyle-p-johnson.com) or, better yet, fork this site’s repository, add your post, and make a pull request. This can all be done in-browser on GitHub. While not necessary, to clone this site and run it locally, see directions for using Jekyll on GitHub pages.
  
  To author...
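
In the spirit of sharing code snippets, here is a minimal sketch of the headword-selection rule described in the May 10, 2015 lemmatizer post above: when a form could belong to more than one headword, pick the headword that occurs most often. This is not the CLTK's actual code. The function names (`build_lemma_map`, `lemmatize`), the candidate table, and the frequency counts are all hypothetical toy data made up for illustration.

```python
from collections import Counter


def build_lemma_map(candidates, corpus_lemmas):
    """Build a form -> headword store, resolving ambiguity by frequency.

    candidates: dict mapping an inflected form to its possible headwords
                (hypothetical toy data, not a real CLTK resource).
    corpus_lemmas: iterable of headwords as they occur in some lemmatized
                   corpus, used only to count how common each headword is.
    """
    freq = Counter(corpus_lemmas)
    lemma_map = {}
    for form, headwords in candidates.items():
        # Choose the most frequent candidate; max() keeps the first
        # candidate on a tie, which mirrors the older "first found" rule.
        lemma_map[form] = max(headwords, key=lambda h: freq[h])
    return lemma_map


def lemmatize(tokens, lemma_map):
    """Replace each token with its stored headword, if one is known."""
    return [lemma_map.get(token, token) for token in tokens]


# Toy example: "amor" may be the noun amor or a passive form of amo;
# "bella" may belong to bellum or bellus. Counts are invented.
candidates = {"amor": ["amor", "amo"], "bella": ["bellum", "bellus"]}
corpus_lemmas = ["amor", "amor", "amo", "bellum", "bellum", "bellum", "bellus"]

lemma_map = build_lemma_map(candidates, corpus_lemmas)
result = lemmatize(["amor", "bella", "et"], lemma_map)
print(result)
```

Doing the frequency comparison once, while building the key-value store, keeps lookup at lemmatization time a plain dictionary access, which is consistent with the post's remark that the choice is made by the code that creates the store.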