Blog

  • Presentation at Digital Classicist Berlin

    Three of us – Kyle Johnson, Clément Besnier, and Todd Cook – presented to the Digital Classicist Berlin program at the Berlin-Brandenburgischen Akademie der Wissenschaften.

  • CLTK v. 1.0 and ACL Publication

    Last month, the annual ACL conference published the CLTK’s de facto “white paper” (“The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages”).

    Some time prior to this, we also officially promoted our version 1.0 to “production”.

  • Announcing 'alpha' release of CLTK v. 1.0

    The core maintainers are pleased to announce the first pre-release of the CLTK’s version 1.0. More information will follow, but at the highest level the guiding principles have been (a) to add a single pre-configured interface and (b) to fully rationalize the software’s architecture so that new languages can be added with a minimum of friction.

    Preferably in a new virtual environment, in either Python 3.7 or 3.8, pull the latest “alpha” with:

    $ pip install --pre cltk

    The docs should be enough to begin stress-testing the new code:

    ...
  • On under-resourced languages and the CLTK

    The CLTK has as a central goal to provide complete NLP coverage of all pre-modern languages. In practice, this ambitious goal needs to be tempered by the availability of language resources, digital and human. With some frequency, especially around the time of Google Summer of Code applications, we are approached by potential contributors who hope to pitch in by adding NLP tools for a given language. Over the past six years, those of us centrally involved in the project have learned a great deal about what characteristics distinguish a successful from an unsuccessful project. I’ll describe these characteristics in some detail below,...

  • Announcing Google Summer of Code Projects for 2018

    The Classical Language Toolkit is happy to present our Google Summer of Code projects for 2018:

    • Eleftheria Chatziargyriou (Aristotle University of Thessaloniki): Extending NLP functionality for Germanic Languages.
    • Andrew Deloucas (Harvard University): The Road to CDLI’s Corpora Integration into CLTK: An Undertaking
    • James Gawley (University of Buffalo): Expanding the CLTK with Synonyms, Translations and Word Embeddings

    As we see it, this year’s projects each have the potential to move the CLTK forward in its own way.

    Eleftheria’s project takes on one of the CLTK’s primary development paths, extending the core tools to...

  • CLTK announces GSoC 2017 students

    We’re happy to announce two exceptional students, Charles Pletcher and Natasha Voake, whom the CLTK will mentor for the 2017 Google Summer of Code. (See official GSoC page here.)

    By the end of the summer, if not before, we will offer updates on what Charles and Natasha accomplish.

    Charles Pletcher

    Proposal:

    I plan to build a flexible platform for teachers to annotate texts in the CLTK Archive. The purpose of the annotations is twofold: first, to guide the students in their (re)readings of the text; and second, to provide a ready-to-hand...

  • CLTK in Google Summer of Code 2017

    We are thrilled to announce that the CLTK will once again be participating in the Google Summer of Code. The four of us will be acting as mentors to the forthcoming students.

    See our organization’s official GSoC page for more information, but here is our call for projects:

    See our Project ideas page for a list of GSoC tasks that are suited to three months’ work for a beginning-to-intermediate programmer or language student. What follows is a high-level overview of these projects and a few tips on applying. Most work will...

  • Guest post: Descartes meet Python. Python meet Descartes.

    Note: The following is re-posted from Peter’s website, ithaca.

    This summer I’m working on a commentary on Descartes’ Meditationes de prima philosophia, usually known in English as his Meditations on First Philosophy. (Though actually a better translation is the less literal Metaphysical Meditations, which is how it’s usually translated into French.) In addition to providing a text and commentary, I plan to produce a vocabulary. This is a time-consuming and error-prone job, so naturally I want to offload at least some of the grunt work to a computer. (See laziness as virtue for programmers.)

    As a start,...

  • Analyzing Latin clausulae

    Latin prose authors since Cicero often employed rhythms at the end of their periods to produce sonorous effects like those found in poetry. These rhythms not only add to the elegance of an author’s style, but they even often help characterize a specific style. The prose rhythm preferences of some Roman authors are so distinct that they are often used in matters of textual emendation and author attribution, and so philologists for the last century have spent considerable time cataloguing the rhythm of prose with the hope that better textual judgements may be made with more data.

    The CLTK now...

  • CLTK participating in Google Summer of Code!

    We are thrilled to announce that the CLTK has been accepted to Google Summer of Code 2016. This is a tremendous opportunity for Classics students’ careers, an affirmation of the CLTK’s vision, and a chance for us, as mentors, to give back.

    GSoC funds students to work for three months on an approved open source project. These students may be at the undergraduate or graduate level, specializing in any academic field, and hailing from any accredited college or university, from any country. We do not yet know the number of students Google will fund...

  • New projects for Google Summer of Code

    Google Summer of Code is a program which gives students a stipend to do three months of work for an approved open source organization. The CLTK has submitted an application for inclusion into the ranks of approved projects. (See all sorts of info about past projects and participants here.)

    While we will not know whether we are accepted for another few days, these projects, on the project ideas page of the CLTK wiki, should serve as good ideas for anyone wishing to make a major, substantive contribution. These can also be found on the CLTK’s Issues...

  • Two recent CLTK lectures

    Note: The following is re-posted from Kyle’s personal website.

    I have given two lectures on the CLTK over the past few months and should post them before too much time has gone by.

    The first lecture was last November, when I gave a guest lecture introducing NLP and the CLTK to Peter Meineck’s graduate class at NYU. This was a lot of fun, as I had time to dialog with the class and explore some texts (some plays of Aeschylus) together. The Jupyter notebooks I prepared are on GitHub. Here’s the lecture’s slide deck,...

  • Tokenizing Latin text

    Note: The following is re-posted from Patrick’s blog, Disjecta Membra.

    One of the first tasks necessary in any text analysis project is tokenization: we take our text as a whole and convert it to a list of smaller units, or tokens. When dealing with Latin (or at least digitized versions of modern editions, like those found in the Perseus Digital Library, the Latin Library, etc.), paragraph- and sentence-level tokenization present little problem. Paragraphs are usually well marked and can be split by newlines (\n). Sentences in modern Latin editions use the same punctuation set as English, i.e.,...
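
    As a rough illustration of the paragraph- and sentence-level splitting described above, here is a minimal sketch using only Python's standard library. It is not the CLTK's tokenizer. The function names are hypothetical, and the set of sentence-ending marks (period, question mark, exclamation mark) is an assumption, since the excerpt breaks off before listing them.

```python
import re

def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more newlines.
    return [p.strip() for p in re.split(r"\n+", text) if p.strip()]

def split_sentences(paragraph):
    # Assumption: sentences end in '.', '?' or '!'.
    # Split on the whitespace that follows one of those marks.
    return [s for s in re.split(r"(?<=[.?!])\s+", paragraph) if s]

text = (
    "Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae.\n"
    "Arma virumque cano."
)

paragraphs = split_paragraphs(text)
sentences = [split_sentences(p) for p in paragraphs]
```

    A split this naive will also break on abbreviations that end in a period, so it should be read only as a starting point.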

  • Updated accuracies for POS taggers

    Last fall, I published on my personal blog a series of accuracy scores for the CLTK’s POS tagging models. It was recently pointed out to me by Dr. Giuseppe G. A. Celano, a member of Perseus’s treebank team, that the training set I used to make the taggers contained duplicates. This skewed the accuracy scores, though not the taggers themselves.

    The Perseus files, kept here for Greek and here for Latin, contain a single XML for each tagged work and one file, agdt-1.7.xml for Greek and ldt-1.5.xml for Latin, for...
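
    To make the duplicate problem concrete, here is a small sketch of removing repeated tagged sentences before any accuracy measurement. It does not reproduce the procedure used for the CLTK's taggers. The function name, the data layout (a sentence as a list of word/tag pairs), and the tag labels are all assumptions made for illustration.

```python
def dedupe_sentences(tagged_sentences):
    """Return the tagged sentences with exact duplicates removed, keeping order."""
    seen = set()
    unique = []
    for sentence in tagged_sentences:
        key = tuple(sentence)  # a list is not hashable, a tuple of pairs is
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique

# Hypothetical sample: the first sentence appears twice.
tagged = [
    [("arma", "N"), ("cano", "V")],
    [("gallia", "N"), ("est", "V")],
    [("arma", "N"), ("cano", "V")],
]

unique = dedupe_sentences(tagged)
```

    If the same sentence ends up in both the training and the evaluation data, the tagger is scored on material it has already seen, which is one way duplicates can inflate an accuracy figure.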

  • New Latin prosody scanner

    After six months of development, Bradley Baker and I have completed a Latin prose scanner for the CLTK. The module accepts any macronized Latin text and returns to the user the resulting scansion.

    We designed it as a component for my forthcoming undergraduate thesis on the prose rhythms of Demosthenes and Cicero. Since my intention was to scan prose rhythms, the program ignores some features of poetic prosody, such as hiatus. Future versions may include extra features for scanning Latin poetry.

    The basic algorithm used in the module is rather straightforward. An input string’s words are first tokenized (via the...

  • New Greek and Latin lemmatizer

    I’m happy to announce that the CLTK has a new and much improved (faster and more accurate) lemmatizer for the Latin and, now too, Greek languages.

    With any lemmatizer, choices must be made about which headword to select in the case that a lemma is ambiguous, that is, could belong to more than one headword. In the event of such ambiguity, the previous lemmatizer took whatever headword it found first. Now, the lemmatizer (or rather the code which created the lemmatizer’s key-value data store) takes into consideration all possibilities and selects the headword which has the most frequent occurrences in...
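
    The frequency-based choice described above can be sketched roughly as follows. This is not the CLTK's code. The function name, the data layout, and the counts are invented for illustration, and where the counts come from is left open because the excerpt is cut off at that point.

```python
from collections import Counter

def build_lemma_map(candidates, headword_counts):
    """Map each form to the candidate headword with the highest count."""
    lemma_map = {}
    for form, headwords in candidates.items():
        # max() keeps the first maximal item, so ties fall back to list order.
        lemma_map[form] = max(headwords, key=lambda h: headword_counts.get(h, 0))
    return lemma_map

# Hypothetical data: each form could belong to two headwords.
candidates = {
    "est": ["edo", "sum"],
    "amor": ["amo", "amor"],
}
# Made-up frequency counts, for illustration only.
headword_counts = Counter({"sum": 900, "edo": 12, "amor": 40, "amo": 35})

lemma_map = build_lemma_map(candidates, headword_counts)
```

    Under the old behavior as described, "est" would have resolved to whichever headword came first ("edo" in this toy data); with counts it resolves to "sum".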

  • Welcome to the CLTK homepage!

    I am excited to have this little website to bring useful information to users of the CLTK. As the project continues to grow, I hope users can share tutorials, code snippets, etc.

    If you are interested in authoring a post, you can send your text (preferably in Markdown) to me by email (kyle@kyle-p-johnson.com) or, better yet, fork this site’s repository, add your post, and make a pull request. This can all be done in-browser on GitHub. While not necessary, to clone this site and run it locally, see directions for using Jekyll on GitHub pages.

    To author...
