<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://cltk.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://cltk.org/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-01-10T02:58:16+00:00</updated><id>https://cltk.org/feed.xml</id><title type="html">The Classical Language Toolkit</title><subtitle>NLP for pre-modern languages
</subtitle><author><name>CLTK Team</name></author><entry><title type="html">Announcing CLTK 2.0: NLP for *all* pre‑modern languages</title><link href="https://cltk.org/blog/2025/09/27/announcing-cltk-2-0-nlp-for-all-pre-modern-languages.html" rel="alternate" type="text/html" title="Announcing CLTK 2.0: NLP for *all* pre‑modern languages" /><published>2025-09-27T21:00:00+00:00</published><updated>2025-09-27T21:00:00+00:00</updated><id>https://cltk.org/blog/2025/09/27/announcing-cltk-2-0-nlp-for-all-pre-modern-languages</id><content type="html" xml:base="https://cltk.org/blog/2025/09/27/announcing-cltk-2-0-nlp-for-all-pre-modern-languages.html"><![CDATA[<p>Using the same API and architecture patterns, I have rewritten almost every line of code. A small breakthrough I found is that generative LLMs are able to provide the core NLP tasks that this project aims for: part-of-speech tagging and dependency parsing. The potentially enormous breakthrough is that even generalist models (e.g., ChatGPT, Llama) can perform these tasks for a multitude, if not the majority, of ancient, classical, and medieval languages.</p>

<h1 id="features">Features</h1>

<ul>
  <li>Generative models support 105 languages! <a href="https://docs.cltk.org/reference/cltk/languages/pipelines/#cltk.languages.pipelines.MAP_LANGUAGE_CODE_TO_GENERATIVE_PIPELINE">See here for all of them</a></li>
  <li>For the backend, one may choose OpenAI/ChatGPT, Mistral, or any model that Ollama runs (e.g., Llama). I have also added support for Ollama’s (paid) cloud service, though I have not tested it yet.</li>
  <li>The project keeps the Stanza backend (<a href="https://docs.cltk.org/reference/cltk/languages/pipelines/#cltk.languages.pipelines.MAP_LANGUAGE_CODE_TO_STANZA_PIPELINE">see all 11 languages supported with the Stanza backend here</a>).</li>
  <li>The public API remains nearly unchanged from the previous v1.x releases.</li>
  <li>Likewise, the architecture is nearly the same, though certain processes have been dropped and others added in this new v2. See pages 22-25 of our publication: <a href="https://aclanthology.org/2021.acl-demo.3.pdf">https://aclanthology.org/2021.acl-demo.3.pdf</a></li>
  <li>We still map the models’ string labels onto valid UD POS tags, morphological features, and dependency syntax labels. Arguably the resulting data types are easier to use, and it is certainly easier to make corrections when a model emits incorrectly named outputs (<a href="https://docs.cltk.org/reference/cltk/morphosyntax/ud_features/#cltk.morphosyntax.ud_features.normalize_ud_feature_pair">see here for examples</a>).</li>
  <li>The architecture is clean, builds fast, and has extensive logging, all of which means that debugging and writing tested patches will be much faster than before.</li>
</ul>
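<p>The label correction mentioned above can be sketched in miniature. This is an illustrative toy, not the actual <code class="language-plaintext highlighter-rouge">normalize_ud_feature_pair</code> implementation; the tag set and correction table below are hypothetical subsets:</p>

```python
# Toy sketch of coercing model output onto valid UD POS tags.
# VALID_UD_POS and POS_CORRECTIONS are illustrative subsets, not CLTK data.
VALID_UD_POS = {"NOUN", "VERB", "ADJ", "ADP", "DET", "PRON", "AUX", "PART"}

POS_CORRECTIONS = {
    "PREP": "ADP",     # e.g., a model labels prepositions "PREP"
    "ARTICLE": "DET",  # or articles "ARTICLE"
    "V": "VERB",
}

def normalize_pos(raw_tag: str) -> str:
    """Return a valid UD POS tag, correcting known model mislabels."""
    tag = raw_tag.strip().upper()
    if tag in VALID_UD_POS:
        return tag
    if tag in POS_CORRECTIONS:
        return POS_CORRECTIONS[tag]
    raise ValueError(f"Unmappable POS tag: {raw_tag!r}")

print(normalize_pos("prep"))  # -> ADP
```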

<h1 id="installation">Installation</h1>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">pip</span> <span class="n">install</span> <span class="s">"cltk[openai,stanza,ollama]"</span>
</code></pre></div></div>

<p>See more at: <a href="https://docs.cltk.org/quickstart/#install">https://docs.cltk.org/quickstart/#install</a>.</p>

<h1 id="use">Use</h1>

<h2 id="chatgpt">ChatGPT</h2>

<p>Create a file <code class="language-plaintext highlighter-rouge">.env</code> and put your OpenAI API key in it as <code class="language-plaintext highlighter-rouge">OPENAI_API_KEY</code>, i.e.:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OPENAI_API_KEY=YOURSECRETKEYDONTSHARE
</code></pre></div></div>
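<p>A <code class="language-plaintext highlighter-rouge">.env</code> file of this shape is read into the process environment following the common dotenv convention. As a stdlib-only illustration of that convention (this is not the CLTK’s actual loader, which, like libraries such as <code class="language-plaintext highlighter-rouge">python-dotenv</code>, handles more edge cases; <code class="language-plaintext highlighter-rouge">EXAMPLE_API_KEY</code> is a placeholder):</p>

```python
import os
import tempfile

def load_env_file(path: str) -> None:
    """Minimal .env reader: KEY=VALUE lines; blank lines and '#' comments skipped.
    Existing environment variables are not overridden."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Demo with a throwaway file and a placeholder key name
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("# secrets\nEXAMPLE_API_KEY=not-a-real-key\n")
load_env_file(f.name)
print(os.environ["EXAMPLE_API_KEY"])  # -> not-a-real-key
os.unlink(f.name)
```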

<p>Then run:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">cltk</span> <span class="kn">import</span> <span class="n">NLP</span>
<span class="n">nlp</span> <span class="o">=</span> <span class="n">NLP</span><span class="p">(</span><span class="s">"lati1261"</span><span class="p">,</span> <span class="n">backend</span><span class="o">=</span><span class="s">"openai"</span><span class="p">,</span> <span class="n">suppress_banner</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">doc</span> <span class="o">=</span> <span class="n">nlp</span><span class="p">.</span><span class="n">analyze</span><span class="p">(</span><span class="s">"Gallia est omnis divisa in partes tres."</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">doc</span><span class="p">.</span><span class="n">words</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="ollama">Ollama</h2>

<p>Install <a href="https://ollama.com/">Ollama</a>, start the application, and <a href="https://ollama.com/search">find a model</a> to download. In my testing I have used <code class="language-plaintext highlighter-rouge">llama3.1:8b</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">cltk</span> <span class="kn">import</span> <span class="n">NLP</span>
<span class="n">nlp</span> <span class="o">=</span> <span class="n">NLP</span><span class="p">(</span><span class="s">"lati1261"</span><span class="p">,</span> <span class="n">backend</span><span class="o">=</span><span class="s">"ollama"</span><span class="p">,</span> <span class="n">suppress_banner</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">doc</span> <span class="o">=</span> <span class="n">nlp</span><span class="p">.</span><span class="n">analyze</span><span class="p">(</span><span class="s">"Gallia est omnis divisa in partes tres."</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">doc</span><span class="p">.</span><span class="n">words</span><span class="p">)</span>
</code></pre></div></div>

<p>Should anyone try Ollama’s cloud GPU service, add <code class="language-plaintext highlighter-rouge">OLLAMA_CLOUD_API_KEY</code> with your key to the <code class="language-plaintext highlighter-rouge">.env</code> file.</p>

<h2 id="stanza">Stanza</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">cltk</span> <span class="kn">import</span> <span class="n">NLP</span>
<span class="n">nlp</span> <span class="o">=</span> <span class="n">NLP</span><span class="p">(</span><span class="s">"lati1261"</span><span class="p">,</span> <span class="n">backend</span><span class="o">=</span><span class="s">"stanza"</span><span class="p">,</span> <span class="n">suppress_banner</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">doc</span> <span class="o">=</span> <span class="n">nlp</span><span class="p">.</span><span class="n">analyze</span><span class="p">(</span><span class="s">"Gallia est omnis divisa in partes tres."</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">doc</span><span class="p">.</span><span class="n">words</span><span class="p">)</span>
</code></pre></div></div>

<h1 id="weaknesses-and-gaps">Weaknesses and gaps</h1>

<ul>
  <li>I should note that this is closer to “beta” than “production” software. I have decided to promote it to <code class="language-plaintext highlighter-rouge">master</code> anyway, given my limited time and limited number of test users.</li>
  <li>I have not added spaCy integrations yet, but hope to do so. For the past several years, I have had consistent difficulties getting spaCy to install correctly on users’ machines.</li>
  <li>I have no benchmarks for the generative LLMs yet, but hope to create some! From inspecting the generative LLMs’ output, it seems they have been trained on the entirety of the Universal Dependencies treebanks; if so, new benchmarks ought to be created, even if small.</li>
  <li>Calls to ChatGPT are slow! I am using async calls (one per sentence) to speed things up, but this only goes so far. It is also relatively expensive to label large amounts of text, though it is reasonable to expect that costs will go down over time.</li>
  <li>The mapping of non-UD (but otherwise correct) labels is not complete. I have put in a significant number of hours correcting those that come from ChatGPT 5 for Ancient Greek and Latin. I observe that other languages emit a number of these mistakes too, but I do not feel confident mapping them myself. It will take a number of experts if this is ever to be complete.</li>
  <li>There is some valuable code that has not (or has not yet) made its way over, including some wrappers around Collatinus, TLGU, and more. I have not made any firm decisions about these yet.</li>
</ul>
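<p>On the speed point above, the per-sentence concurrency can be sketched with plain <code class="language-plaintext highlighter-rouge">asyncio</code>. The <code class="language-plaintext highlighter-rouge">annotate_sentence</code> coroutine below is a stand-in for a real network call to a model backend, not the CLTK’s actual code:</p>

```python
import asyncio

async def annotate_sentence(sentence: str) -> dict:
    """Stand-in for a model API call; a real version would await an HTTP request."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"sentence": sentence, "tokens": sentence.split()}

async def annotate_text(sentences: list[str]) -> list[dict]:
    # One request per sentence, all in flight at once; asyncio.gather
    # returns results in input order, so downstream alignment is preserved.
    return await asyncio.gather(*(annotate_sentence(s) for s in sentences))

results = asyncio.run(annotate_text(["Gallia est omnis divisa.", "In partes tres."]))
print([r["tokens"][0] for r in results])  # -> ['Gallia', 'In']
```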

<h1 id="legacy-support">Legacy support</h1>

<p>I must insist: version 1.x will not receive any more support. Yes, I should have given better warning.</p>

<p>As we did in the transition from v0 to v1, I have preserved the code, docs, and packages on PyPI. So if you do not want to upgrade to v2, the last 1.x version will always be available with <code class="language-plaintext highlighter-rouge">pip install cltk==1.5.0</code>. Its docs are now at <a href="https://v1.cltk.org">https://v1.cltk.org</a> and the code is at branch <code class="language-plaintext highlighter-rouge">v1</code>. Likewise, the initial generation of the CLTK is at branch <code class="language-plaintext highlighter-rouge">v0</code>, <a href="https://v0.cltk.org/">https://v0.cltk.org/</a>, and <code class="language-plaintext highlighter-rouge">pip install cltk==0.1.121</code>.</p>]]></content><author><name>Kyle P. Johnson</name></author><category term="blog" /><summary type="html"><![CDATA[Using the same API and architecture patterns, I have rewritten almost every line of code. A small breakthrough I found is that generative LLMs are able to provide the core NLP tasks that this project aims for: part-of-speech tagging and dependency parsing. 
The potentially enormous breakthrough is that even generalist models (e.g., ChatGPT, Llama) can perform these tasks for a multitude, if not the majority, of ancient, classical, and medieval languages.]]></summary></entry><entry><title type="html">Presentation at Digital Classicist Berlin</title><link href="https://cltk.org/blog/2021/12/14/presentation-digital-classics-berlin.html" rel="alternate" type="text/html" title="Presentation at Digital Classicist Berlin" /><published>2021-12-14T21:00:00+00:00</published><updated>2021-12-14T21:00:00+00:00</updated><id>https://cltk.org/blog/2021/12/14/presentation-digital-classics-berlin</id><content type="html" xml:base="https://cltk.org/blog/2021/12/14/presentation-digital-classics-berlin.html"><![CDATA[<p>Three of us – Kyle Johnson, Clément Besnier, and Todd Cook – presented to the <a href="https://www.berliner-antike-kolleg.org/transfer/termine/2021-2022_digital_classicist.html">Digital Classicist Berlin</a> program at the Berlin–Brandenburgischen Akademie der Wissenschaften.</p>

<ul>
  <li><a href="http://cltk.org/assets/cltk-pres-digital-classics-berlin-brandenburg.pdf">Kyle’s slides</a> on the CLTK, generally</li>
  <li><a href="https://github.com/clemsciences/cltk-2021-berlin-code/blob/main/cltk_discovery_of_america.ipynb">Clément’s notebook</a> on making a custom Process and examples using Old Norse</li>
  <li><a href="https://docs.google.com/presentation/d/e/2PACX-1vQEK5MC9uS4SEvkaD9viCsqsEDhTEOHwFg8xssZul8u9qSe3qiQ6RQv4Lsx0vWE62gpnpMnsAPeFChS/pub?start=true&amp;loop=false&amp;delayms=15000&amp;slide=id.p">Todd’s slides</a> on MLOps and large language models</li>
</ul>]]></content><author><name>Kyle P. Johnson</name></author><category term="blog" /><summary type="html"><![CDATA[Three of us – Kyle Johnson, Clément Besnier, and Todd Cook – presented to the Digital Classicist Berlin program at the Berlin–Brandenburgischen Akademie der Wissenschaften.]]></summary></entry><entry><title type="html">CLTK v. 1.0 and ACL Publication</title><link href="https://cltk.org/blog/2021/09/22/cltk-v1-acl-paper-published.html" rel="alternate" type="text/html" title="CLTK v. 1.0 and ACL Publication" /><published>2021-09-22T21:00:00+00:00</published><updated>2021-09-22T21:00:00+00:00</updated><id>https://cltk.org/blog/2021/09/22/cltk-v1-acl-paper-published</id><content type="html" xml:base="https://cltk.org/blog/2021/09/22/cltk-v1-acl-paper-published.html"><![CDATA[<p>Last month, the annual ACL conference published the CLTK’s de facto “white paper” (<a href="https://aclanthology.org/2021.acl-demo.3/">“The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages”</a>).</p>

<p>Some time prior to this, we also officially promoted our version 1.0 to “production”.</p>]]></content><author><name>Kyle P. Johnson</name></author><category term="blog" /><summary type="html"><![CDATA[Last month, the annual ACL conference published the CLTK’s de facto “white paper” (“The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages”).]]></summary></entry><entry><title type="html">Announcing ‘alpha’ release of CLTK v. 1.0</title><link href="https://cltk.org/blog/2020/07/05/announcing-alpha-release-v1.html" rel="alternate" type="text/html" title="Announcing ‘alpha’ release of CLTK v. 1.0" /><published>2020-07-05T21:00:00+00:00</published><updated>2020-07-05T21:00:00+00:00</updated><id>https://cltk.org/blog/2020/07/05/announcing-alpha-release-v1</id><content type="html" xml:base="https://cltk.org/blog/2020/07/05/announcing-alpha-release-v1.html"><![CDATA[<p>The core maintainers are pleased to announce the first pre-release of the CLTK’s version 1.0. More information will follow, but at the highest level the guiding principles have been (a) to add a single pre-configured interface and (b) to fully rationalize software’s architecture for adding new languages with a minimum of friction.</p>

<p>Preferably in a new virtual environment, in either Python 3.7 or 3.8, pull the latest “alpha” with:</p>

<blockquote>
  <p>$ pip install --pre cltk</p>
</blockquote>

<p>The docs should be enough to begin stress-testing the new code:</p>

<ul>
  <li>Docs: <a href="https://alpha.cltk.org">https://alpha.cltk.org</a></li>
  <li>Source: <a href="https://github.com/cltk/cltkv1">https://github.com/cltk/cltkv1</a> (Note: Not our main repo!)</li>
</ul>

<p>For now, please open issues in this new development repo. In the coming weeks, we will merge the two Git trees together.</p>]]></content><author><name>Kyle P. Johnson</name></author><category term="blog" /><summary type="html"><![CDATA[The core maintainers are pleased to announce the first pre-release of the CLTK’s version 1.0. More information will follow, but at the highest level the guiding principles have been (a) to add a single pre-configured interface and (b) to fully rationalize software’s architecture for adding new languages with a minimum of friction.]]></summary></entry><entry><title type="html">On under-resourced languages and the CLTK</title><link href="https://cltk.org/blog/2018/12/30/under-resourced-languages-cltk.html" rel="alternate" type="text/html" title="On under-resourced languages and the CLTK" /><published>2018-12-30T21:00:00+00:00</published><updated>2018-12-30T21:00:00+00:00</updated><id>https://cltk.org/blog/2018/12/30/under-resourced-languages-cltk</id><content type="html" xml:base="https://cltk.org/blog/2018/12/30/under-resourced-languages-cltk.html"><![CDATA[<p>The CLTK has as a central goal to provide complete NLP coverage of all pre-modern languages. In practice, this ambitious goal needs to be tempered by availability of language resources, digital and human. With some frequency, especially around the time of Google Summer of Code application, we are approached by potential contributors who hope to pitch in by adding NLP tools for a given language. Over the past six years, those of us centrally involved in the project have learned a great deal about what characteristics distinguish a successful from unsuccessful project. I’ll describe these characteristics in some detail below, but to summarize, a successful “add-a-language” project for the CLTK depends the presence already-available, free digitized data.</p>

<p>Languages which lack such digitized data, I’ll call “under-resourced.” To help explain, allow me to borrow <a href="http://www.elra.info/en/about/what-language-resource/">a definition of a “language resource”</a>:<sup><a href="#myfootnote1">1</a></sup></p>

<blockquote>
  <p>The term <em>Language Resource</em> refers to a set of speech or language data and descriptions in machine readable form, used for building, improving or evaluating natural language … for language studies, electronic publishing, international transactions, subject-area specialists and end users. Examples of Language Resources are written and spoken corpora, computational lexica, terminology databases, speech collection, etc.</p>
</blockquote>

<p>For the CLTK, language resources normally take the form of static data sets, such as text corpora (annotated or plaintext), <a href="https://en.wikipedia.org/wiki/Treebank">treebanks</a>, and lexica (dictionaries of various sorts). Resources also take the form of algorithms, which are usually rules-based and optionally rely upon data sets. For an example of the former sort of algorithm, see <a href="https://github.com/cltk/cltk/blob/9deebf3ff050ab6c12c0c5ceb953bc8ecce21ed0/cltk/stem/latin/stem.py">Luke Hollis’s stemmer for Latin</a>, which is itself a Python-language implementation of rules defined by previous peer-reviewed scholarship (Schinke et al., 1996). For an example of rules-plus-data, <a href="https://github.com/cltk/cltk/blob/9b9cdb42dcc1c707ab3db3ef8214837bb7c262b5/cltk/prosody/latin/Syllabifier.py#L36">Todd Cook’s Latin syllabifier</a> first defines a variety of character types in <a href="https://github.com/cltk/cltk/blob/9b9cdb42dcc1c707ab3db3ef8214837bb7c262b5/cltk/prosody/latin/ScansionConstants.py">ScansionConstants.py</a> and then uses these in <code class="language-plaintext highlighter-rouge">Syllabifier.syllabify()</code>. Examples of well-resourced languages, in both data and algorithm, include Ancient Greek and Latin, for which not coincidentally the CLTK has excellent coverage (look for <a href="https://github.com/cltk?utf8=%E2%9C%93&amp;q=greek&amp;type=&amp;language=">Greek</a> or <a href="https://github.com/cltk?utf8=%E2%9C%93&amp;q=latin&amp;type=&amp;language=">Latin</a> data sets on GitHub, to get an idea).</p>

<p>The CLTK is an NLP project, and it eschews ventures outside of this mission. Simply put, this means that the CLTK is neither a data-annotation nor a (non-technical) user-facing project. In reference to the classic <a href="https://en.wikipedia.org/wiki/Multitier_architecture">three-tier software architecture</a>, the CLTK is exclusively restricted to the middle <em>application tier</em>, relying upon the <em>data storage tier</em> as an upstream dependency and adopted downstream by the <em>presentation tier</em>. The CLTK has a vested interest in the health of the data storage and presentation tiers; however, our project’s core NLP task, to write NLP algorithms by leveraging already available data, is formidable enough. To satisfy the needs of downstream makers of applications, we write what we hope are sensible and well-documented APIs.</p>

<p>To illustrate briefly the challenges of even relatively simple data creation, I am reminded of several students whose native tongues were, respectively, Telugu and Kannada. They approached us with proposals to do NLP in their languages, and when I pointed out they’d need data, each came up with the idea of doing OCR to obtain plaintext corpora. After all, plenty of digitized book images could be found online. However, preliminary experiments demonstrated that OCR for these particular non-Latin scripts was of very low accuracy. Simply making an OCR model would have constituted a summer project in itself (and one outside our bounds, at that). The creation of annotated texts is laborious in the extreme, requiring equal degrees of passion and technical expertise.</p>

<p>Having explained what language resources are and why they are so important to the CLTK, I will next explain the states in which under-resourced (dead) languages exist and how one can decide whether a given language is under-resourced. Languages generally fail to meet the minimum bar of “resourced” due to one or more of the following attributes:</p>
<ol>
  <li>resources are not digitized;</li>
  <li>resources are only available under non-free license;</li>
  <li>resources are available but do not amount to a “critical mass” around which serious NLP tooling may be developed.</li>
</ol>

<p>First, simply enough, if data has not been digitized, then it is not available for computational processing. As mentioned above, book page images are quite a distance from even plaintext files.</p>

<p>Second, non-free data poses a significant problem. My use of “free” here corresponds generally to that expounded by Richard Stallman as something more particular than no-cost and open source:</p>

<blockquote>
  <p>When we call software “free,” we mean that it respects the users’ essential freedoms: the freedom to run it, to study and change it, and to redistribute copies with or without changes. This is a matter of freedom, not price, so think of “free speech,” not “free beer.” (<a href="https://www.gnu.org/philosophy/open-source-misses-the-point.html">“Why Open Source misses the point of Free Software”</a>)</p>
</blockquote>

<p>Because the CLTK is intended to support rigorous quantitative scholarship, its users simply must have the ability to manipulate and redistribute the data used to create results. Contrary to the understanding of many humanists I have met, “open access” and “open source” resources that carry proprietary licenses are unstable foundations for projects like ours, the legacy of which (we hope) will be measured in decades, not years. It is undeniably practical for a scholar to quickly publish an article which uses proprietary data or software; however, such results will likely not be reproducible, and they will be frozen in time and lost to history.</p>

<p>Third, the line between a resourced and an under-resourced language is not always clear-cut. For example, annotations may be available, yet rather small in number (the case with <a href="https://github.com/cltk/tibetan_pos_tdc">Tibetan POS</a>); or some data sets may be very robust for certain tasks (e.g., a lexicon and word lookup) while treebanks are completely lacking (Pali); or, in the absence of any digitized data, certain within-reach algorithms may still be written. For such in-between languages, if a potential contributor first does diligent research on data sets, project mentors will be delighted to discuss algorithmic possibilities.</p>

<p>Is there some limited data creation that could fall within the scope of the CLTK? We have had some success, for example, in generating stopword lists, either with algorithms that minimize manual curation (e.g., running tf-idf on a corpus and removing nouns) or with very narrow scope (e.g., writing out every possible inflection of a definite article). I don’t want to preclude other ideas, so I would say this may be evaluated on a case-by-case basis.</p>
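<p>The first approach can be illustrated with a toy document-frequency heuristic (a simplification of the tf-idf idea, and not the CLTK’s actual stopword code):</p>

```python
from collections import Counter

def stopword_candidates(documents, min_doc_frac=0.8):
    """Return tokens appearing in at least `min_doc_frac` of documents,
    a crude proxy for low-information 'stop' words."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each token once per document
    threshold = min_doc_frac * len(documents)
    return sorted(t for t, n in doc_freq.items() if n >= threshold)

# Tiny pre-tokenized Latin corpus (illustrative)
corpus = [
    ["et", "in", "gallia", "bellum"],
    ["et", "in", "partes", "tres"],
    ["et", "roma", "in", "urbe"],
]
print(stopword_candidates(corpus))  # -> ['et', 'in']
```

<p>A real pipeline would then filter such candidates by hand (e.g., removing nouns), as described above.</p>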

<p>All of the above is intentionally discouraging to those who might otherwise embark on an ill-fated proposal to add new language support to the CLTK. On the flip side, we ought to highlight that there are several shining examples of languages I would consider well-resourced and not currently covered by the CLTK. Those are, in no particular order: Hebrew, Arabic, Sanskrit, Chinese, and Old English.<sup><a href="#myfootnote2">2</a></sup> (There are likely others, too.) To conclude, I encourage those who would like to work with a mentor from our team to first consider and have preliminary answers to the following questions:</p>
<ol>
  <li>What free data sets are available? If there are any non-free or ambiguously licensed language resources, are they so important that we would need to use them?</li>
  <li>What NLP algorithms can I write with this data?</li>
  <li>What free NLP algorithms have already been written? Do I have the skills, approximately, to re-implement them?</li>
  <li>What data am I missing and is it reasonable to create this data within the project? Define very precise scoping and make an effort estimate.</li>
  <li>Do you have the language skills to validate your own research? If not, have you identified someone (say, a professor) who would be able to help? Please remember that the CLTK is exclusively interested in the <em>pre-modern</em> form of a language; so, for example, even though Hindi may be considered a “classical” language, its form as spoken today differs greatly (or so I am told) from how it was written 1,000 years ago.</li>
  <li>How do you rate your skills in Python programming, machine learning, and NLP? We have worked, and do work, with various types of specialists (some stronger in human language, some in computing); knowing your particular background helps us pair you with the right mentor.</li>
</ol>

<p>With these questions answered, even if not perfectly, the core CLTK team will be able to respond with concrete advice, criticism, and recommendations to further develop your proposal.</p>

<p><br /></p>

<p><a name="myfootnote1">1</a>: From the Under-resourced Languages group of the European Language Resources Association (ELRA).</p>

<p><a name="myfootnote2">2</a>: Old and Middle English have good footing, due to the major pre-modern Germanic contributions by Eleftheria Chatziargyriou and Clément Besnier, however relative to the huge amounts of source data, lots of valuable work remains.</p>]]></content><author><name>Kyle P. Johnson</name></author><category term="blog" /><summary type="html"><![CDATA[The CLTK has as a central goal to provide complete NLP coverage of all pre-modern languages. In practice, this ambitious goal needs to be tempered by the availability of language resources, digital and human. With some frequency, especially around the time of Google Summer of Code applications, we are approached by potential contributors who hope to pitch in by adding NLP tools for a given language. Over the past six years, those of us centrally involved in the project have learned a great deal about what characteristics distinguish a successful from an unsuccessful project. I’ll describe these characteristics in some detail below, but to summarize, a successful “add-a-language” project for the CLTK depends on the presence of already-available, free, digitized data.]]></summary></entry><entry><title type="html">Announcing Google Summer of Code Projects for 2018</title><link href="https://cltk.org/blog/2018/04/27/announcing-gsoc-2018.html" rel="alternate" type="text/html" title="Announcing Google Summer of Code Projects for 2018" /><published>2018-04-27T09:00:00+00:00</published><updated>2018-04-27T09:00:00+00:00</updated><id>https://cltk.org/blog/2018/04/27/announcing-gsoc-2018</id><content type="html" xml:base="https://cltk.org/blog/2018/04/27/announcing-gsoc-2018.html"><![CDATA[<p><img src="/assets/GSoC2016Logo.jpg" alt="GSoC banner" align="center" /></p>

<p>The Classical Language Toolkit is happy to present our Google Summer of Code projects for 2018:</p>

<ul>
  <li>Eleftheria Chatziargyriou (Aristotle University of Thessaloniki): Extending NLP functionality for Germanic Languages.</li>
  <li>Andrew Deloucas (Harvard University): The Road to CDLI’s Corpora Integration into CLTK: An Undertaking</li>
  <li>James Gawley (University of Buffalo): Expanding the CLTK with Synonyms, Translations and Word Embeddings</li>
</ul>

<p>As we see it, each of this year’s projects has the potential to move the CLTK forward in its own way.</p>

<p>Eleftheria’s project takes on one of the CLTK’s primary development paths, extending the core tools to a new language and/or language group. Her project will be the first substantial introduction of resources for Germanic languages in the project, a welcome addition.</p>

<p>Andrew’s project moves in a similar direction, but with a different focus. The incorporation of cuneiform languages in the CLTK is a clear desideratum. This project not only needs to extend and develop existing corpora, resources, and tools but needs to rethink how these tools work at a basic level with consideration of encoding and multiple transliteration schemes, among other things. A challenge to be sure, but an opportunity to open up future development for other pictographic/iconographic text systems.</p>

<p>James’s project capitalizes on an emerging priority among CLTK developers and contributors – the use of the project as the basis for cutting-edge research in natural language processing for historical languages. He will develop the CLTK’s ML models and datasets for the discovery of intra-language synonyms and inter-language translations. Another clear benefit of James’s work will be that by formalizing model production for Latin and Greek, he will also provide a path forward for the development of similar resources for other CLTK languages.</p>

<p>Congratulations again to this year’s CLTK Summer of Code participants. To the participants: the organizers and mentors can’t wait to start working with you on taking your projects from proposal to working code to commits to the codebase. To our development community: look forward to upcoming blog posts from each of the participants with more detailed introductions. And to everyone whose project was not selected this year, we want to acknowledge your hard work on the proposals and encourage you to try again next year. The simple fact is that there are more great ideas for NLP work on historical languages than GSoC spots. We look forward to seeing how your work develops going forward and encourage you to submit again next year.</p>]]></content><author><name>Patrick Burns and Kyle P. Johnson</name></author><category term="blog" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">CLTK announces GSoC 2017 students</title><link href="https://cltk.org/blog/2017/05/11/cltk-announces-gsoc-2017-students.html" rel="alternate" type="text/html" title="CLTK announces GSoC 2017 students" /><published>2017-05-11T07:28:00+00:00</published><updated>2017-05-11T07:28:00+00:00</updated><id>https://cltk.org/blog/2017/05/11/cltk-announces-gsoc-2017-students</id><content type="html" xml:base="https://cltk.org/blog/2017/05/11/cltk-announces-gsoc-2017-students.html"><![CDATA[<p><img src="/assets/GSoC2016Logo.jpg" alt="GSoC banner" align="center" /></p>

<p>We’re happy to announce two exceptional students, Charles Pletcher and Natasha Voake, whom the CLTK will mentor for the 2017 Google Summer of Code. (<a href="https://summerofcode.withgoogle.com/organizations/5734549993553920/">See official GSoC page here</a>.)</p>

<p>By the end of the summer, if not before, we will offer updates on what Charles and Natasha accomplish.</p>

<h2 id="charles-pletcher">Charles Pletcher</h2>

<p>Proposal:</p>

<blockquote>
  <p>I plan to build a flexible platform for teachers to annotate texts in the CLTK Archive. The purpose of the annotations is twofold: first, to guide the students in their (re)readings of the text; and second, to provide a ready-to-hand reference to which students and teachers can return.</p>

  <p>Teachers will be able to enter annotations either by uploading a list of references and notes or by entering their annotations directly through a WYSIWYG interface built on top of Draft.js. Annotations will be searchable via Elasticsearch, taggable, and, if time allows, versioned.</p>
</blockquote>

<p>About Charles:</p>

<blockquote>
  <p>Charles is a second-year PhD student in Classics and Comparative Literature at Columbia. His academic focus centers on Western Hemisphere receptions of tragedy, with particular attention paid to messenger speech. Having previously worked as a software engineer, he maintains a keen interest in digital humanities.</p>
</blockquote>

<h2 id="natasha-voake">Natasha Voake</h2>

<p>Proposal:</p>

<blockquote>
  <p>Old and Middle French are hardly studied outside of a limited network of French universities. Implementing NLP functionality for these languages would make it easier to study them and access the rich literature and culture expressed in them, famous examples of which include the “Chanson de Roland”, Chrétien de Troyes’ Arthurian legends, Marie de France’s “Lais”, Christine de Pizan’s writings, etc.</p>

  <p>This project aims to extend basic CLTK functionality to Old and Middle French texts between c.900 and c.1500 CE, by implementing a tokenizer, stopwords, named entity recognition, a PoS tagger, and a lemmatizer with English translations for as many words as possible. The data for the above will be sourced from texts licensed under Creative Commons licenses that have been transcribed and digitized. For example, a number of Old French texts from the BNF’s 19th century editions have been digitized and made available at gallica.fr. Lemmas will be sourced from Godefroy’s 1901 “Lexique de l’Ancien Français” and the “Dictionnaire Electronique de Chrétien de Troyes”, which has the advantage of English-language definitions.</p>
</blockquote>

<p>About Natasha:</p>

<blockquote>
  <p>Natasha is a final-year Linguistics student at Peterhouse, Cambridge with a particular interest in computational linguistics. She is a French-English bilingual and previously lived in New York and Paris, France.</p>
</blockquote>]]></content><author><name>Patrick Burns, Luke Hollis, Marius Jøhndal, Kyle P. Johnson, and Suhaib Khan</name></author><category term="blog" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">CLTK in Google Summer of Code 2017</title><link href="https://cltk.org/blog/2017/03/01/cltk-google-summer-code.html" rel="alternate" type="text/html" title="CLTK in Google Summer of Code 2017" /><published>2017-03-01T21:28:00+00:00</published><updated>2017-03-01T21:28:00+00:00</updated><id>https://cltk.org/blog/2017/03/01/cltk-google-summer-code</id><content type="html" xml:base="https://cltk.org/blog/2017/03/01/cltk-google-summer-code.html"><![CDATA[<p><img src="/assets/GSoC2016Logo.jpg" alt="GSoC banner" align="center" /></p>

<p>We are thrilled to announce that the CLTK will once again be participating in the Google Summer of Code. The four of us will be acting as mentors to the forthcoming students.</p>

<p>See our organization’s <a href="https://summerofcode.withgoogle.com/organizations/5734549993553920/">official GSoC page</a> for more information, but here is our call for projects:</p>

<blockquote>
  <p>See our <a href="https://github.com/cltk/cltk/wiki/Project-ideas#gsoc-projects">Project ideas page</a> for a list of GSoC tasks that are suited to three months’ work for a beginner-to-intermediate programmer or language student. What follows is a high-level overview of these projects and a few tips on applying. Most work will be done in the Python and JavaScript languages, of which a beginner’s or intermediate knowledge is expected.</p>

  <p>If you know a Classical language that is not yet supported well by the CLTK (e.g., Hebrew, Sanskrit, Chinese), you may follow the pattern set by the current Greek and Latin libraries. See the Projects page on the wiki for ideas of what a good application will include. See also <a href="http://legacy.cltk.org">the CLTK docs</a> for what we can already do for a given language.</p>

  <p>The other family of tasks regards the CLTK’s in-development website, the Classical Language Archive (<a href="https://github.com/cltk/cltk_frontend">in Meteor and React</a>) and API (<a href="https://github.com/cltk/cltk_api_v2">in Python’s Flask framework</a>). See our description of the frontend, experiment with the live demo, and explain in your application which functionality you would most like to see added, why, and how you will do so over the course of three months.</p>

  <p>The CLTK’s GSoC mentors may not be able to reply to all enquiries; however, the best way to ensure a good response is to submit, as early as possible, a draft application essay, in which you explain the resources you will use, the order in which you will do your work, and a timeline. For the website project, email <a href="mailto:lukehollis@gmail.com">Luke Hollis</a>; for the add-a-language project, email <a href="mailto:kyle@kyle-p-johnson.com">Kyle P. Johnson, Ph.D.</a>.</p>
</blockquote>]]></content><author><name>Patrick Burns, Luke Hollis, Marius Jøhndal, and Kyle P. Johnson</name></author><category term="blog" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Guest post: Descartes meet Python. Python meet Descartes.</title><link href="https://cltk.org/blog/2016/06/12/guest-post-descartes-meet-python-python-meet-descartes.html" rel="alternate" type="text/html" title="Guest post: Descartes meet Python. Python meet Descartes." /><published>2016-06-12T00:00:00+00:00</published><updated>2016-06-12T00:00:00+00:00</updated><id>https://cltk.org/blog/2016/06/12/guest-post-descartes-meet-python-python-meet-descartes</id><content type="html" xml:base="https://cltk.org/blog/2016/06/12/guest-post-descartes-meet-python-python-meet-descartes.html"><![CDATA[<p><em>Note: The following is re-posted from Peter’s website, <a href="http://ithaca.arpinum.org/2016/06/08/latin-lexicon.html">ithaca</a>.</em></p>

<p>This summer I’m working on <a href="https://bitbucket.org/telemachus/descartes-meditations">a commentary on Descartes’ <em>Meditationes de prima philosophia</em></a>, usually known in English as his <em>Meditations on First Philosophy</em>. (Though actually a better translation is the less literal <em>Metaphysical Meditations</em>, which is how it’s usually translated into French.) In addition to providing a text and commentary, I plan to produce a vocabulary.  This is a time-consuming and error-prone job, so naturally I want to offload at least some of the grunt work to a computer. (See <a href="http://c2.com/cgi/wiki?LazinessImpatienceHubris">laziness as virtue for programmers</a>.)</p>

<p>As a start, I took a look at <a href="http://cltk.org/">The Classical Language Toolkit</a>. CLTK provides a set of natural language processing utilities for ancient texts. Although the Latin tools in CLTK are aimed primarily at classical material, the neo-Latin that Descartes wrote is largely classical in style and vocabulary. The following short Python script <a href="https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en">tokenizes</a> and <a href="https://en.wikipedia.org/wiki/Lemmatisation">lemmatizes</a> Descartes’ first meditation. This gets us well on our way, since the script reduces its input text to a list of unique words, each mapped to its dictionary entry. That is, if Descartes wrote both <em>scīre</em> and <em>sciēns</em> (an infinitive meaning <em>to know</em> and a participle meaning <em>knowing</em>), the script would output only <em>sciō</em>, the shared dictionary entry of those two forms.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">cltk.stem.lemma</span> <span class="kn">import</span> <span class="n">LemmaReplacer</span>
<span class="kn">from</span> <span class="nn">cltk.stem.latin.j_v</span> <span class="kn">import</span> <span class="n">JVReplacer</span>
<span class="kn">from</span> <span class="nn">cltk.tokenize.word</span> <span class="kn">import</span> <span class="n">WordTokenizer</span>

<span class="n">meditatio_prima</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"./meditatio-prima.txt"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">).</span><span class="n">read</span><span class="p">()</span>

<span class="n">jv</span> <span class="o">=</span> <span class="n">JVReplacer</span><span class="p">()</span>
<span class="n">meditatio_prima</span> <span class="o">=</span> <span class="n">jv</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="n">meditatio_prima</span><span class="p">)</span>

<span class="n">t</span> <span class="o">=</span> <span class="n">WordTokenizer</span><span class="p">(</span><span class="s">"latin"</span><span class="p">)</span>
<span class="n">l</span> <span class="o">=</span> <span class="n">LemmaReplacer</span><span class="p">(</span><span class="s">"latin"</span><span class="p">)</span>

<span class="n">words</span> <span class="o">=</span> <span class="n">l</span><span class="p">.</span><span class="n">lemmatize</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">meditatio_prima</span><span class="p">))</span>
<span class="n">words</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">words</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">words</span><span class="p">))</span>
</code></pre></div></div>

<p>It would be even better if we could take the dictionary entries from this first script and feed them to a program that would give us definitions. So that’s what I worked on next. The <a href="http://www.perseus.tufts.edu/hopper/">Perseus Digital Library</a> has open-sourced <a href="https://github.com/PerseusDL/lexica/tree/master/CTS_XML_TEI/perseus/pdllex/lat/ls">the XML version of <em>A Latin Dictionary</em> edited by Lewis and Short</a>. I’d prefer the modern <em>Oxford Latin Dictionary</em>, but Lewis and Short is still an outstanding Latin dictionary. Once again, a relatively short Python script will parse the XML and match the words from Descartes’ text with their Lewis and Short dictionary entries.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sys import stderr
from lxml import etree as ET

# Helper functions
def warn(*args, **kwargs):
    '''Print message to stderr'''
    print(*args, file=stderr, **kwargs)

def inner_text(node):
    '''Return all inner text of an XML node'''
    return ''.join([text for text in node.itertext()])

xml = ET.parse('no-entities-ls.xml')
entries = xml.xpath('//entryFree')
lewis_short = {}
for item in entries:
    lewis_short[item.attrib['key']] = inner_text(item)

# Load the vocabulary words that we're searching for
words_file = open('meditatio-words.txt', 'r')
wanted_words = words_file.read().splitlines()
words_file.close()

# Work through the words we want, trying to match their meanings
for wanted in wanted_words:
    if wanted in lewis_short:
        print('%s' % (lewis_short[wanted]))
    else:
        warn('%s has no entry.' % (wanted))
</code></pre></div></div>

<p>This is worth walking through. After defining two small helper functions, the next block of code parses the Lewis and Short XML file and stores the entries in a Python dictionary. (Since <em>dictionary</em> is now very ambiguous, I’m going to call the data structure a hash from here on in.) The keys of the hash are lookup words, and the values are full entries from Lewis and Short. Since lookup words are just what our first script gave us, we load those into a list and iterate over them. For each word in Descartes, we test whether it’s in the hash. If it is, we print out the entry from Lewis and Short. If it’s not in the hash, we tell the user that no matching entry was found. (The “no item found” messages are printed to <code class="language-plaintext highlighter-rouge">stderr</code> instead of <code class="language-plaintext highlighter-rouge">stdout</code>, so that the user can redirect the two output streams to different files. E.g., <code class="language-plaintext highlighter-rouge">python3 vocabulary-builder.py 1&gt;descartes-vocabulary.txt 2&gt;missing-words.txt</code>)</p>

<p>This is a terrific amount of progress for two short Python scripts. In addition, I extracted the textual data from the XML file and saved that in plain text format. This allows for lots of other potential uses of Lewis and Short without having to deal with the XML. (The plain text is available <a href="https://github.com/telemachus/plaintext-lewis-short">on GitHub, under a CC license that allows for further changes</a>.)</p>

<h2 id="whats-next">What’s next?</h2>

<p>There’s a lot of room for improvement. Here’s a quick list of things I’m working on or would like to see.</p>

<ul>
  <li>Ambiguous words in Lewis and Short have numbers in their XML keys to distinguish them. E.g., there’s a verb <em>adeō</em> and an adverb <em>adeō</em>, and their XML keys are <em>adeo1</em> and <em>adeo2</em> respectively. The CLTK lemmatizer also numbers ambiguous words, but I’m not sure whether the two systems are the same. For what I’m doing here, it would be ideal if they were, but there may be other considerations at play. Still, it would be worth investigating how much the two numerations overlap and whether CLTK could mimic Lewis and Short.</li>
  <li>The vocabulary builder spends most of its time parsing the XML file, which is very large and rather complex. As an alternative, it would be simpler to store Lewis and Short in a simple CSV file. (This would make reading in the data trivial.)</li>
  <li>On the other hand, the way that I extracted the textual data from the XML loses some information. In particular, the details of the section breakdowns in Lewis and Short are not stored as text, but only in the structure and attributes of the XML tags. My method of brute-force extraction didn’t preserve those structural signposts. For my purposes, that’s fine, but other people may prefer to try to get more structure out of the XML.</li>
  <li>Lewis and Short have very odd conventions for macrons. First, they very frequently use the brevis to show where vowels are short. E.g., <em>ăgō</em>. That’s nearly never useful for students. Second, and much worse, they don’t provide macrons in several cases where macrons can be assumed as the norm. E.g., the first principal part of verbs or third declension nouns ending in -<em>tiō</em>. These cases, and many others, never have macrons. So in fact, the entry for that verb actually starts with the form <em>ăgo</em>: an unhelpful brevis and a missing macron. Both would likely confuse contemporary students. That’s enormously frustrating, though luckily there is <a href="https://github.com/Alatius/latin-macronizer">an excellent Latin macronizer available on GitHub that might help with this problem</a>, and it’s easy enough to remove all the brevis marks via simple scripts.</li>
  <li>Finally, as helpful as all of this may be, it’s not what I really need. To build a student glossary, I need to strike a balance: (i) I should provide general and basic meanings for the words I gloss, but (ii) I have to make sure to include any specific meanings necessary for the text at hand. (Student glossaries that only provide the meanings for the text at hand are the worst. They should be illegal.) But what these programs provide is far, far more than that. The result now is the entire dictionary entry for each word. Lewis and Short is a detailed, scholarly dictionary. The entries for core words in the language (e.g., <em>ferō</em> or <em>et</em>) can go on for column after column of small, single-spaced writing. So what I’d need to do is trim these entries down to something more manageable. That’s fine, but it would still require a fair amount of work—though nowhere near as much as doing it all from scratch.</li>
  <li>These command-line scripts are great for me, but for most people they would be a pain. Eventually, it would probably be good to wrap this in a web application or a desktop application. A user could upload the text they wanted to gloss and the application would give them a vocabulary list. (This would be something like <a href="http://bridge.haverford.edu/">Haverford’s terrific Bridge</a>, but for any Latin text. It’s obviously a tall order.)</li>
</ul>
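<p>To show how simple the brevis cleanup from that last list could be: the following sketch (my own illustration, not part of CLTK) decomposes each character into its Unicode pieces, drops any combining breve marks, and recomposes the rest, so that macrons survive untouched.</p>

```python
import unicodedata

COMBINING_BREVE = "\u0306"  # the mark that turns "a" into "ă"

def strip_breves(text):
    '''Remove combining breve marks, keeping macrons and other diacritics.'''
    # NFD splits "ă" into "a" + COMBINING BREVE; macrons become separate marks too
    decomposed = unicodedata.normalize("NFD", text)
    without_breves = "".join(ch for ch in decomposed if ch != COMBINING_BREVE)
    # NFC recomposes the surviving base + mark pairs (e.g. "o" + macron -> "ō")
    return unicodedata.normalize("NFC", without_breves)

print(strip_breves("ăgō"))  # agō
```

<p>Run over the extracted plain-text entries, this would at least remove the misleading breves; supplying the missing macrons is the harder half of the problem, and that is where the macronizer linked above could come in.</p>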

<p>As usual, trying to make a computer do your work for you is fun, helpful, and exhausting. I got to learn some Python, and the tokenizer and lemmatizer results alone will save me hours compared to doing it by hand. At the same time, I’ve only scratched the surface and there’s far more work to do. But I’m excited to continue learning Python and to improve the glossary-maker scripts that I’ve started here.</p>]]></content><author><name>Peter Aronoff</name></author><category term="blog" /><summary type="html"><![CDATA[Note: The following is re-posted from Peter’s website, ithaca.]]></summary></entry><entry><title type="html">Analyzing Latin clausulae</title><link href="https://cltk.org/blog/2016/03/27/analyzing-latin-clausulae.html" rel="alternate" type="text/html" title="Analyzing Latin clausulae" /><published>2016-03-27T21:28:00+00:00</published><updated>2016-03-27T21:28:00+00:00</updated><id>https://cltk.org/blog/2016/03/27/analyzing-latin-clausulae</id><content type="html" xml:base="https://cltk.org/blog/2016/03/27/analyzing-latin-clausulae.html"><![CDATA[<p>Latin prose authors since Cicero often employed rhythms at the end of their periods to produce sonorous effects like those found in poetry. These rhythms not only add to the elegance of an author’s style, but they even often help characterize a specific style. The prose rhythm preferences of some Roman authors are so distinct that they are often used in matters of textual emendation and author attribution, and so philologists for the last century have spent considerable time cataloguing the rhythm of prose with the hope that better textual judgements may be made with more data.</p>

<p>The CLTK now includes a clausulae analysis module that – in conjunction with the existing prosody module – can produce a rhythmic profile of a text.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">os</span>

<span class="kn">from</span> <span class="nn">cltk.prosody.latin.scanner</span> <span class="kn">import</span> <span class="n">Scansion</span>
<span class="kn">from</span> <span class="nn">cltk.prosody.latin.clausulae_analysis</span> <span class="kn">import</span> <span class="n">Clausulae</span>

<span class="c1"># A sample analysis of Cicero's First Catilinarian
# NB: The text must be macronized before scanning
# NB: open() does not expand "~", so expand the home directory explicitly
</span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s">"~/CiceroCat1.txt"</span><span class="p">))</span> <span class="k">as</span> <span class="n">file_open</span><span class="p">:</span>
    <span class="n">text</span> <span class="o">=</span> <span class="n">file_open</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>

<span class="n">scansion</span> <span class="o">=</span> <span class="n">Scansion</span><span class="p">().</span><span class="n">scan_text</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="n">clausulae</span> <span class="o">=</span> <span class="n">Clausulae</span><span class="p">().</span><span class="n">clausulae_analysis</span><span class="p">(</span><span class="n">scansion</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">clausulae</span><span class="p">)</span>
</code></pre></div></div>

<p>This profile comes in the form of a dictionary in which each key is a clausula type and each value is the frequency of that clausula in the text.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
    <span class="s">'molossus + double trochee'</span><span class="p">:</span> <span class="mi">12</span><span class="p">,</span> 
    <span class="s">'cretic + double trochee'</span><span class="p">:</span> <span class="mi">6</span><span class="p">,</span> 
    <span class="s">'choriamb + double trochee'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> 
    <span class="s">'cretic + double spondee'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> 
    <span class="s">'heroic'</span><span class="p">:</span> <span class="mi">9</span><span class="p">,</span> 
    <span class="s">'1st paeon + trochee'</span><span class="p">:</span> <span class="mi">12</span><span class="p">,</span> 
    <span class="s">'double cretic'</span><span class="p">:</span> <span class="mi">9</span><span class="p">,</span> 
    <span class="s">'molossus + cretic'</span><span class="p">:</span> <span class="mi">6</span><span class="p">,</span> 
    <span class="s">'double spondee'</span><span class="p">:</span> <span class="mi">8</span><span class="p">,</span> 
    <span class="s">'molossus + iamb'</span><span class="p">:</span> <span class="mi">11</span><span class="p">,</span> 
    <span class="s">'1st paeon + anapest'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> 
    <span class="s">'dactyl + double trochee'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span> 
    <span class="s">'cretic + trochee'</span><span class="p">:</span> <span class="mi">27</span><span class="p">,</span> 
    <span class="s">'cretic + iamb'</span><span class="p">:</span> <span class="mi">13</span><span class="p">,</span> 
    <span class="s">'substituted cretic + trochee'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> 
    <span class="s">'double trochee'</span><span class="p">:</span> <span class="mi">27</span><span class="p">,</span> 
    <span class="s">'4th paeon + trochee'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span> 
    <span class="s">'4th paeon + cretic'</span><span class="p">:</span> <span class="mi">3</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The specific clausulae that the module searches for are derived from a list provided in the introduction of John Ramsey’s Cambridge commentary on Cicero’s <em>Philippics</em> I-II. See the table below for the relation between the name of a clausula, the ‘type’ which Ramsey assigns to it (note that the heroic clausula, type 6, is my own addition to the list), and its metrical construction. Short syllables are denoted with ‘u’, long with ‘-’, and substitution with parentheses.</p>

<table>
  <thead>
    <tr>
      <th>Type</th>
      <th>Name</th>
      <th>Definition</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Type 1</td>
      <td>Cretic + Trochee</td>
      <td>- u - / - x</td>
    </tr>
    <tr>
      <td>Type 1a</td>
      <td>Fourth Paeon + Trochee</td>
      <td>(uu) u - / - x</td>
    </tr>
    <tr>
      <td>Type 1b</td>
      <td>First Paeon + Trochee</td>
      <td>- u (uu) / - x</td>
    </tr>
    <tr>
      <td>Type 1c</td>
      <td>Substituted Cretic + Trochee</td>
      <td>(uu) u (uu) / - x</td>
    </tr>
    <tr>
      <td>Type 1d</td>
      <td>First Paeon + Anapest</td>
      <td>- u (uu) / u u x</td>
    </tr>
    <tr>
      <td>Type 2</td>
      <td>Double Cretic</td>
      <td>- u - / - u x</td>
    </tr>
    <tr>
      <td>Type 2a</td>
      <td>Fourth Paeon + Cretic</td>
      <td>(uu) u - / - u x</td>
    </tr>
    <tr>
      <td>Type 2b</td>
      <td>Molossus + Cretic</td>
      <td>- (-) - / - u x</td>
    </tr>
    <tr>
      <td>Type 3</td>
      <td>Double Trochee</td>
      <td>- u / - x</td>
    </tr>
    <tr>
      <td>Type 3a</td>
      <td>Molossus + Double Trochee</td>
      <td>- - - / - u / - x</td>
    </tr>
    <tr>
      <td>Type 3b</td>
      <td>Cretic + Double Trochee</td>
      <td>- u - / - u / - x</td>
    </tr>
    <tr>
      <td>Type 3c</td>
      <td>Dactyl + Double Trochee</td>
      <td>- u u / - u / - x</td>
    </tr>
    <tr>
      <td>Type 3d</td>
      <td>Choriamb + Double Trochee</td>
      <td>- u u - / - u / - x</td>
    </tr>
    <tr>
      <td>Type 4</td>
      <td>Cretic + Iamb</td>
      <td>- u - / u x</td>
    </tr>
    <tr>
      <td>Type 4a</td>
      <td>Molossus + Iamb</td>
      <td>- - - / u x</td>
    </tr>
    <tr>
      <td>Type 5</td>
      <td>Double Spondee</td>
      <td>- - / - x</td>
    </tr>
    <tr>
      <td>Type 5a</td>
      <td>Cretic + Double Spondee</td>
      <td>- u - / - - / - x</td>
    </tr>
    <tr>
      <td>Type 6</td>
      <td>Dactyl + Spondee</td>
      <td>- u u / - x</td>
    </tr>
  </tbody>
</table>
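<p>One practical note: the raw counts depend on the length of the text, so comparing the rhythmic profiles of two authors is easier with relative frequencies. A short sketch (my own, not part of the module) that converts the profile dictionary shown above into percentages:</p>

```python
def clausulae_percentages(profile):
    '''Convert raw clausulae counts into percentages of all clausulae found.'''
    total = sum(profile.values())
    if total == 0:
        return {name: 0.0 for name in profile}
    return {name: round(100 * count / total, 2)
            for name, count in profile.items()}

# A few entries from the First Catilinarian profile above
sample = {'cretic + trochee': 27, 'double trochee': 27, 'heroic': 9}
print(clausulae_percentages(sample))
```

<p>Profiles normalized this way can be compared directly across speeches or authors, which is exactly the kind of evidence used in the attribution and emendation arguments mentioned at the start of this post.</p>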

<p><br />
For more information, see <a href="http://legacy.cltk.org/en/latest/latin.html#clausulae-analysis">Clausulae analyses docs</a>.</p>]]></content><author><name>Tyler Kirby</name></author><category term="blog" /><summary type="html"><![CDATA[Latin prose authors since Cicero often employed rhythms at the end of their periods to produce sonorous effects like those found in poetry. These rhythms not only add to the elegance of an author’s style, but they even often help characterize a specific style. The prose rhythm preferences of some Roman authors are so distinct that they are often used in matters of textual emendation and author attribution, and so philologists for the last century have spent considerable time cataloguing the rhythm of prose with the hope that better textual judgements may be made with more data.]]></summary></entry></feed>