Note: The following is re-posted from Kyle’s personal website.
I have given two lectures on the CLTK over the past few months and should post them before too much time has gone by.
The first lecture was last November, when I gave a guest lecture introducing NLP and the CLTK to Peter Meineck’s graduate class at NYU. This was a lot of fun, as I had time to talk with the class and explore some texts (some plays of Aeschylus) together. The Jupyter notebooks I prepared are on GitHub. Here’s the lecture’s slide deck, which might function as an informal introduction for newcomers.
The second lecture, which I gave last week, was a much briefer, 5-minute “lightning talk” to the San Francisco Python Meetup Group. The subject of this talk was narrow: problems encountered when working with linguistic corpora. The core difficulty is sharing experiments and reproducing results. Too often, folks working in NLP settle for hard-to-get, poorly documented, and unversioned data sets. As I worked with others’ corpora and created my own, I came to see poor data set management as a significant problem, not only for digital Classics but for NLP in general. In a nutshell, the CLTK’s solution to this problem is to leverage the capacities of Git and GitHub, so that data sets can be, among other things, precisely versioned and easily updated by end users.
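To make the versioning idea concrete, here is a rough sketch of what pinning a corpus to an exact Git revision might look like from Python. This is not the CLTK’s own code: the function names, the example URL, and the directory layout are all hypothetical, and it only assumes that the `git` command-line tool is installed.

```python
import subprocess


def clone_commands(repo_url, dest, revision=None):
    """Build the git commands that fetch a corpus, optionally pinned to one revision.

    All names here are illustrative; this is not a CLTK API.
    """
    dest = str(dest)
    commands = [["git", "clone", repo_url, dest]]
    if revision is not None:
        # Detach HEAD at an exact commit or tag so every user reads identical data.
        commands.append(["git", "-C", dest, "checkout", "--detach", revision])
    return commands


def update_commands(dest):
    """Build the command that fast-forwards an unpinned clone to the latest data."""
    return [["git", "-C", str(dest), "pull", "--ff-only"]]


def run(commands):
    """Execute each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical repository URL and tag, shown without contacting the network.
    for command in clone_commands(
        "https://github.com/example/some_corpus.git", "corpora/some_corpus", "v1.0"
    ):
        print(" ".join(command))
```

Passing the result of `clone_commands` to `run` would perform the download. Recording the revision alongside an experiment lets someone else check out the same data later, while `update_commands` covers the case of an unpinned clone that should simply follow the latest version.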