We’re happy to announce two exceptional students, Charles Pletcher and Natasha Voake, whom the CLTK will mentor for the 2017 Google Summer of Code. (See official GSoC page here.)
By the end of the summer, if not before, we will offer updates on what Charles and Natasha accomplish.
I plan to build a flexible platform for teachers to annotate texts in the CLTK Archive. The purpose of the annotations is twofold: first, to guide the students in their (re)readings of the text; and second, to provide a ready-to-hand reference to which students and teachers can return.
Teachers will be able to enter annotations either by uploading a list of references and notes or by entering their annotations directly through a WYSIWYG interface built on top of Draft.js. Annotations will be searchable via Elasticsearch, taggable, and, if time allows, versioned.
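As a rough illustration of the upload path described above, the following Python sketch parses a list of references and notes into taggable annotation records and filters them with a naive search. Everything in it is an assumption made for illustration: the CSV layout, the `Annotation` fields, and the function names are hypothetical, and the in-memory `search` only stands in for the planned Elasticsearch index.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One teacher annotation (hypothetical shape, not the project's schema)."""
    reference: str                      # citation the note attaches to, e.g. "Iliad 1.1"
    note: str                           # the annotation text itself
    tags: list = field(default_factory=list)


def parse_upload(csv_text):
    """Parse an uploaded list with assumed columns: reference, note, tags.

    Tags are assumed to be separated by semicolons within the tags column.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    annotations = []
    for row in reader:
        raw_tags = row.get("tags") or ""
        tags = [t.strip() for t in raw_tags.split(";") if t.strip()]
        annotations.append(
            Annotation(row["reference"].strip(), row["note"].strip(), tags)
        )
    return annotations


def search(annotations, term):
    """Naive stand-in for full-text search: substring match on notes, exact match on tags."""
    term = term.lower()
    return [
        a for a in annotations
        if term in a.note.lower() or any(term == t.lower() for t in a.tags)
    ]


sample = (
    "reference,note,tags\n"
    "Iliad 1.1,Invocation of the Muse,epic;proem\n"
    "Iliad 1.2,Note on the wrath of Achilles,theme\n"
)
uploaded = parse_upload(sample)
wrath_notes = search(uploaded, "wrath")
```

A real implementation would index these records in a search engine rather than scanning a list, but the record shape (reference, note, tags) is the part the sketch is meant to show.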
Charles is a second-year PhD student in Classics and Comparative Literature at Columbia. His academic focus centers on Western Hemisphere receptions of tragedy, with particular attention paid to messenger speech. Having previously worked as a software engineer, he maintains a keen interest in digital humanities.
Old and Middle French are hardly studied outside of a limited network of French universities. Implementing NLP functionality for these languages would make it easier to study them and to access the rich literature and culture expressed in them, famous examples of which include the “Chanson de Roland”, Chrétien de Troyes’ Arthurian legends, Marie de France’s “Lais”, and Christine de Pizan’s writings.
This project aims to extend basic CLTK functionality to Old and Middle French texts written between c. 900 and c. 1500 CE by implementing a tokenizer, a stopword list, named entity recognition, a PoS tagger, and a lemmatizer with English translations for as many words as possible. The data for these tools will come from transcribed and digitized texts released under Creative Commons licenses. For example, a number of Old French texts from the BNF’s 19th-century editions have been digitized and made available at gallica.fr. Lemmas will be sourced from Godefroy’s 1901 “Lexique de l’Ancien Français” and the “Dictionnaire Electronique de Chrétien de Troyes”, which has the advantage of English-language definitions.
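To give a sense of what the first two components involve, here is a minimal Python sketch of a tokenizer and stopword filter for Old French. It is not the CLTK implementation: the regular expression, the function names, and the tiny stopword set are all assumptions chosen for illustration, and a real stopword list would be derived from corpus frequencies.

```python
import re

# Hypothetical, deliberately tiny stopword sample; not the CLTK list.
STOPWORDS = {"li", "la", "le", "les", "l", "de", "d", "et", "a", "en", "que", "qui"}

# A token is an elided form ending in an apostrophe (such as "l'"),
# a plain word, or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+['’]|\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into words, elided forms, and punctuation."""
    return TOKEN_RE.findall(text.lower())


def remove_stopwords(tokens):
    """Drop punctuation tokens and any word found in the stopword sample."""
    kept = []
    for token in tokens:
        if not re.match(r"\w", token):
            continue  # punctuation
        if token.rstrip("'’") in STOPWORDS:
            continue  # stopword, with any trailing apostrophe ignored
        kept.append(token)
    return kept


# Opening line of the "Chanson de Roland".
line = "Carles li reis, nostre emperere magnes"
tokens = tokenize(line)
content_words = remove_stopwords(tokens)
```

Treating the elided article as its own token keeps forms like "l'" from being fused to the following noun, which matters later for lemmatization and PoS tagging.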
Natasha is a final-year Linguistics student at Peterhouse, Cambridge, with a particular interest in computational linguistics. She is bilingual in French and English and has previously lived in New York and Paris.