A central goal of the CLTK is to provide complete NLP coverage of all pre-modern languages. In practice, this ambitious goal must be tempered by the availability of language resources, digital and human. With some frequency, especially around the time of Google Summer of Code applications, we are approached by potential contributors who hope to pitch in by adding NLP tools for a given language. Over the past six years, those of us centrally involved in the project have learned a great deal about what characteristics distinguish a successful project from an unsuccessful one. I’ll describe these characteristics in some detail below,...
The Classical Language Toolkit is happy to present our Google Summer of Code projects for 2018:
- Eleftheria Chatziargyriou (Aristotle University of Thessaloniki): Extending NLP functionality for Germanic Languages.
- Andrew Deloucas (Harvard University): The Road to CDLI’s Corpora Integration into CLTK: An Undertaking
- James Gawley (University at Buffalo): Expanding the CLTK with Synonyms, Translations and Word Embeddings
As we see it, this year’s projects each have the potential to move the CLTK forward in its own way.
Eleftheria’s project takes on one of the CLTK’s primary development paths, extending the core tools to...
We’re happy to announce two exceptional students, Charles Pletcher and Natasha Voake, whom the CLTK will mentor for the 2017 Google Summer of Code. (See official GSoC page here.)
By the end of the summer, if not before, we will offer updates on what Charles and Natasha accomplish.
I plan to build a flexible platform for teachers to annotate texts in the CLTK Archive. The purpose of the annotations is twofold: first, to guide the students in their (re)readings of the text; and second, to provide a ready-to-hand...
We are thrilled to announce that the CLTK will once again be participating in the Google Summer of Code. The four of us will be acting as mentors to the forthcoming students.
See our organization’s official GSoC page for more information, but here is our call for projects:
See our Project ideas page for a list of GSoC tasks that are suited to three months’ work for a beginning-to-intermediate programmer or language student. What follows is a high-level overview of these projects and a few tips on applying. Most work will...
Note: The following is re-posted from Peter’s website, ithaca.
This summer I’m working on a commentary on Descartes’ Meditationes de prima philosophia, usually known in English as his Meditations on First Philosophy. (Though actually a better translation is the less literal Metaphysical Meditations, which is how it’s usually translated into French.) In addition to providing a text and commentary, I plan to produce a vocabulary. This is a time-consuming and error-prone job, so naturally I want to offload at least some of the grunt work to a computer. (See laziness as virtue for programmers.)
As a start,...
Latin prose authors since Cicero often employed rhythms at the end of their periods to produce sonorous effects like those found in poetry. These rhythms not only add to the elegance of an author’s style, but they even often help characterize a specific style. The prose rhythm preferences of some Roman authors are so distinct that they are often used in matters of textual emendation and author attribution, and so philologists for the last century have spent considerable time cataloguing the rhythm of prose with the hope that better textual judgements may be made with more data.
The CLTK now...
We are thrilled to announce that the CLTK has been accepted to Google Summer of Code 2016. This is a tremendous opportunity for Classics students’ careers, an affirmation of the CLTK’s vision, and a chance for us, as mentors, to give back.
GSoC funds students to work for three months on an approved open source project. These students may be at the undergraduate or graduate level, specializing in any academic field, and hailing from any accredited college or university, from any country. We do not yet know the number of students Google will fund...
Google Summer of Code is a program which gives students a stipend to do three months of work for an approved open source organization. The CLTK has submitted an application for inclusion into the ranks of approved projects. (See all sorts of info about past projects and participants here.)
While we will not know whether we are accepted for another few days, these projects, on the project ideas page of the CLTK wiki, should serve as good ideas for anyone wishing to make a major, substantive contribution. These can also be found on the CLTK’s Issues...
Note: The following is re-posted from Kyle’s personal website.
I have given two lectures on the CLTK over the past few months and should post them before too much time has gone by.
The first lecture was last November, when I gave a guest lecture, to an NYU graduate class of Peter Meineck, on an introduction to NLP and the CLTK. This was a lot of fun, as I had time to dialog with the class and explore some texts (some plays of Aeschylus) together. The Jupyter notebooks I prepared are on GitHub. Here’s the lecture’s slide deck,...
Note: The following is re-posted from Patrick’s blog, Disjecta Membra.
One of the first tasks necessary in any text analysis project is tokenization: we take our text as a whole and convert it to a list of smaller units, or tokens. When dealing with Latin (or at least digitized versions of modern editions, like those found in the Perseus Digital Library, the Latin Library, etc.), paragraph- and sentence-level tokenization present little problem. Paragraphs are usually well marked and can be split by newlines (\n). Sentences in modern Latin editions use the same punctuation set as English, i.e., ‘.’,...
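As a rough illustration of the paragraph- and sentence-level splitting described above, here is a minimal sketch using only Python's standard library. It is not the CLTK's tokenizer: the function names are hypothetical, and the sentence-ending set ('.', '?', '!') is an assumption, since the excerpt's own list is cut off.

```python
import re


def split_paragraphs(text):
    """Split a text into paragraphs on one or more newlines, dropping empty pieces."""
    return [p.strip() for p in re.split(r"\n+", text) if p.strip()]


def split_sentences(paragraph):
    """Naively split a paragraph after '.', '?' or '!' when followed by whitespace.

    The punctuation set is an assumption; abbreviations such as 'C. Iulius'
    would be split incorrectly by this simple rule.
    """
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", paragraph) if s.strip()]


sample = (
    "Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae.\n"
    "Hi omnes lingua inter se differunt."
)

paragraphs = split_paragraphs(sample)
sentences = [split_sentences(p) for p in paragraphs]
```

A rule this simple is only a starting point. Handling abbreviations and editorial punctuation is where a real tokenizer earns its keep.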
Last Fall, I published on my personal blog a series of accuracy scores for the CLTK’s POS tagging models. It was recently pointed out to me by Dr. Giuseppe G. A. Celano, a member of Perseus’s treebank team, that the training set I used to make the taggers contained duplicates. This resulted in a skewing of accuracy scores, though not the taggers themselves.
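To see why duplicates matter, note that a sentence appearing in both the training and evaluation portions lets a tagger score well simply by having seen the answer. Below is a minimal sketch of removing exact duplicates before splitting the data. It assumes each sentence is a list of (word, tag) pairs, and the function name is hypothetical; it is not the procedure actually used to correct the scores.

```python
def dedupe_sentences(tagged_sentences):
    """Return the sentences with exact duplicates removed, keeping first occurrences in order.

    Each sentence is assumed to be a list of (word, tag) tuples.
    """
    seen = set()
    unique = []
    for sentence in tagged_sentences:
        key = tuple(sentence)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


# Toy data with placeholder tags, for illustration only.
corpus = [
    [("arma", "N"), ("virumque", "N"), ("cano", "V")],
    [("gallia", "N"), ("est", "V")],
    [("arma", "N"), ("virumque", "N"), ("cano", "V")],  # exact duplicate of the first
]

unique_corpus = dedupe_sentences(corpus)
```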
After six months of development, Bradley Baker and I have completed a Latin prose scanner for the CLTK. The module accepts any macronized Latin text and returns to the user the resulting scansion.
We designed it as a component for my forthcoming undergraduate thesis on the prose rhythms of Demosthenes and Cicero. Since my intention was to scan prose rhythms, the program ignores some features of poetic prosody, such as hiatus. Future versions may include extra features for scanning Latin poetry.
The basic algorithm used in the module is rather straightforward. An input string’s words are first tokenized (via the...
I’m happy to announce that the CLTK has a new and much improved (faster and more accurate) lemmatizer for the Latin and, now too, Greek languages.
With any lemmatizer, choices must be made about which headword to select in the case that a lemma is ambiguous, that is, could belong to more than one headword. In the event of such ambiguity, the previous lemmatizer took whatever headword it found first. Now, the lemmatizer (or rather the code which created the lemmatizer’s key-value data store) takes into consideration all possibilities and selects the headword which has the most frequent occurrences in...
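The frequency-based choice described above can be sketched roughly as follows. This is only an illustration, not the CLTK's actual code: the function name, the data layout, and the counts are all made up for the example.

```python
from collections import Counter


def build_lemma_table(candidates, headword_counts):
    """Map each inflected form to its most frequent candidate headword.

    candidates: dict of form -> list of possible headwords
    headword_counts: mapping of headword -> occurrence count
    Ties fall back to whichever candidate is listed first.
    """
    table = {}
    for form, headwords in candidates.items():
        table[form] = max(headwords, key=lambda h: headword_counts.get(h, 0))
    return table


# Hypothetical ambiguous forms and invented counts, for illustration only.
candidates = {
    "est": ["edo", "sum"],
    "amor": ["amor", "amo"],
}
headword_counts = Counter({"sum": 1000, "edo": 12, "amor": 80, "amo": 150})

lemma_table = build_lemma_table(candidates, headword_counts)
```

Resolving the ambiguity once, when the key-value store is built, keeps lookup at lemmatization time a single dictionary access.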
I am excited to have this little website to bring useful information to users of the CLTK. As the project continues to grow, I hope users can share tutorials, code snippets, etc.
If you are interested in authoring a post, you can send your text (preferably in Markdown) to me by email (email@example.com) or, better yet, fork this site’s repository, add your post, and make a pull request. This can all be done in-browser on GitHub. While not necessary, to clone this site and run it locally, see directions for using Jekyll on GitHub pages.