Blog

  • CLTK in Google Summer of Code 2017

    We are thrilled to announce that the CLTK will once again be participating in the Google Summer of Code. The four of us will be acting as mentors to the forthcoming students.

    See our organization’s official GSoC page for more information, but here is our call for projects:

    See our Project ideas page for a list of GSoC tasks that are suited to three months’ work for a beginning-to-intermediate programmer or language student. What follows is a high-level overview of these projects and a few tips on applying. Most work will...

  • Guest post: Descartes meet Python. Python meet Descartes.

    Note: The following is re-posted from Peter’s website, ithaca.

    This summer I’m working on a commentary on Descartes’ Meditationes de prima philosophia, usually known in English as his Meditations on First Philosophy. (Though actually a better translation is the less literal Metaphysical Meditations, which is how it’s usually translated into French.) In addition to providing a text and commentary, I plan to produce a vocabulary. This is a time-consuming and error-prone job, so naturally I want to offload at least some of the grunt work to a computer. (See laziness as virtue for programmers.)

    As a start,...

  • Analyzing Latin clausulae

    Latin prose authors since Cicero often employed rhythms at the end of their periods to produce sonorous effects like those found in poetry. These rhythms not only add to the elegance of an author’s style, but they even often help characterize a specific style. The prose rhythm preferences of some Roman authors are so distinct that they are often used in matters of textual emendation and author attribution, and so philologists for the last century have spent considerable time cataloguing the rhythm of prose with the hope that better textual judgements may be made with more data.

    The CLTK now...

  • CLTK participating in Google Summer of Code!

    We are thrilled to announce that the CLTK has been accepted to Google Summer of Code 2016. This is a tremendous opportunity for Classics students’ careers, an affirmation of the CLTK’s vision, and a chance for us, as mentors, to give back.

    GSoC funds students to work for three months on an approved open source project. These students may be at the undergraduate or graduate level, specializing in any academic field, and hailing from any accredited college or university, from any country. We do not yet know the number of students Google will fund...

  • New projects for Google Summer of Code

    Google Summer of Code is a program which gives students a stipend to do three months of work for an approved open source organization. The CLTK has submitted an application for inclusion in the ranks of approved projects. (See all sorts of info about past projects and participants here.)

    While we will not know whether we are accepted for another few days, these projects, on the project ideas page of the CLTK wiki, should serve as good ideas for anyone wishing to make a major, substantive contribution. These can also be found on the CLTK’s Issues...

  • Two recent CLTK lectures

    Note: The following is re-posted from Kyle’s personal website.

    I have given two lectures on the CLTK over the past few months and should post them before too much time has gone by.

    The first was last November, when I gave a guest lecture to Peter Meineck’s NYU graduate class, introducing NLP and the CLTK. This was a lot of fun, as I had time to talk with the class and explore some texts (some plays of Aeschylus) together. The Jupyter notebooks I prepared are on GitHub. Here’s the lecture’s slide deck,...

  • Tokenizing Latin text

    Note: The following is re-posted from Patrick’s blog, Disjecta Membra.

    One of the first tasks necessary in any text analysis project is tokenization: we take our text as a whole and convert it to a list of smaller units, or tokens. When dealing with Latin, or at least digitized versions of modern editions like those found in the Perseus Digital Library, the Latin Library, etc., paragraph- and sentence-level tokenization present little problem. Paragraphs are usually well marked and can be split by newlines (\n). Sentences in modern Latin editions use the same punctuation set as English, i.e., ‘.’,...
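
    As a rough illustration of the paragraph- and sentence-level splitting described in this excerpt, here is a minimal Python sketch. It is not the CLTK's tokenizer. The helper names `split_paragraphs` and `split_sentences` are hypothetical, and the sentence-ending set ('.', '?', '!') is an assumption, since the excerpt's own list is cut off.

```python
import re


def split_paragraphs(text):
    """Split a text into paragraphs on newlines, dropping empty lines."""
    return [p.strip() for p in text.split("\n") if p.strip()]


def split_sentences(paragraph):
    """Split a paragraph after sentence-final punctuation.

    Assumption: '.', '?' and '!' end a sentence when followed by whitespace.
    Abbreviations such as praenomina ("M. Tullius") would be split wrongly.
    """
    parts = re.split(r"(?<=[.?!])\s+", paragraph)
    return [s.strip() for s in parts if s.strip()]


text = (
    "Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae.\n"
    "Arma virumque cano."
)

paragraphs = split_paragraphs(text)
sentences = [split_sentences(p) for p in paragraphs]
```

    A naive splitter like this is enough for clean modern editions, though it makes no attempt to handle abbreviations or editorial punctuation.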

  • Updated accuracies for POS taggers

    Last fall, I published on my personal blog a series of accuracy scores for the CLTK’s POS tagging models. It was recently pointed out to me by Dr. Giuseppe G. A. Celano, a member of Perseus’s treebank team, that the training set I used to make the taggers contained duplicates. This skewed the accuracy scores, though not the taggers themselves.

    The Perseus files, kept here for Greek and here for Latin, contain a single XML for each tagged work and one file, agdt-1.7.xml for Greek and ldt-1.5.xml for Latin, for all individual...

  • New Latin prosody scanner

    After six months of development, Bradley Baker and I have completed a Latin prose scanner for the CLTK. The module accepts any macronized Latin text and returns to the user the resulting scansion.

    We designed it as a component for my forthcoming undergraduate thesis on the prose rhythms of Demosthenes and Cicero. Since my intention was to scan prose rhythms, the program ignores some features of poetic prosody, such as hiatus. Future versions may include extra features for scanning Latin poetry.

    The basic algorithm used in the module is rather straightforward. An input string’s words are first tokenized (via the...

  • New Greek and Latin lemmatizer

    I’m happy to announce that the CLTK has a new and much improved (faster and more accurate) lemmatizer for the Latin and, now too, Greek languages.

    With any lemmatizer, choices must be made about which headword to select in the case that a lemma is ambiguous, that is, could belong to more than one headword. In the event of such ambiguity, the previous lemmatizer took whatever headword it found first. Now, the lemmatizer (or rather the code which created the lemmatizer’s key-value data store) takes into consideration all possibilities and selects the headword which has the most frequent occurrences in...
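
    The frequency-based choice described in this excerpt could look roughly like the following Python sketch. It is only an illustration, not the CLTK's actual code. The function name `build_lemma_map`, the candidate lists, and the counts are all hypothetical and invented for the example.

```python
from collections import Counter


def build_lemma_map(candidates, headword_counts):
    """Build a form -> headword store, resolving ambiguity by frequency.

    candidates: dict mapping an inflected form to its possible headwords.
    headword_counts: how often each headword occurs in some reference corpus.
    On a tie, the headword listed first is kept.
    """
    lemma_map = {}
    for form, headwords in candidates.items():
        lemma_map[form] = max(headwords, key=lambda h: headword_counts.get(h, 0))
    return lemma_map


# Hypothetical data: the counts below are made up for illustration only.
candidates = {
    "est": ["edo", "sum"],
    "amor": ["amor", "amo"],
}
headword_counts = Counter({"sum": 500, "edo": 12, "amor": 40, "amo": 35})

lemma_map = build_lemma_map(candidates, headword_counts)
```

    Resolving ambiguity once, when the store is built, keeps lookup itself a plain dictionary access.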

  • Welcome to the CLTK homepage!

    I am excited to have this little website to bring useful information to users of the CLTK. As the project continues to grow, I hope users can share tutorials, code snippets, etc.

    If you are interested in authoring a post, you can send your text (preferably in Markdown) to me by email (kyle@kyle-p-johnson.com) or, better yet, fork this site’s repository, add your post, and make a pull request. This can all be done in-browser on GitHub. Though it is not necessary, you can also clone this site and run it locally; see the directions for using Jekyll on GitHub Pages.

    To author...