Treebanking with Arethusa

Treebanks are text corpora whose syntactic structures have been annotated in a machine-readable format.  They are used by computational linguists and others to conduct syntactic and stylometric studies of texts and corpora.  Syntactic annotation, which can be thought of as a very precise form of sentence diagramming, is normally performed manually by humans, rather than by machines, to ensure the accuracy of this crucial step.  But the process of annotating the syntax of every word in a text can be tedious and time-consuming.  Arethusa is a new tool on the Perseids platform designed to make this process easier.  It is aimed especially at the annotation of Ancient Greek, Latin, and Arabic texts, but many of its core features can be applied to other languages.

Arethusa has a sleek web interface offering easy access to many useful features.  It provides color coding for parts of speech and a graphical representation of the dependency tree.  One can even zoom in and out as one’s tree grows.  For Latin and Ancient Greek texts, Arethusa interfaces with the morphological tools at the Perseus Digital Library to automatically suggest the most likely analyses of words.  On the backend, the treebank is encoded as an XML file.  Once a text has been fully annotated, the XML can be downloaded and used with many different types of analytical tools.

Here is the XML that results from the dependency tree of Olympian 1 displayed at the top of this post.

[Screenshot: treebank XML for the Olympian 1 dependency tree]
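
For readers who want to work with a downloaded file, here is a minimal Python sketch of reading treebank XML and counting the relation labels.  It assumes a Perseus-style layout in which each sentence element holds word elements carrying id, form, lemma, postag, head, and relation attributes; check your own export before relying on those names.  The three-word Latin sample is invented for illustration and is not the Olympian 1 file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample; the element and attribute names are assumptions
# about the exported treebank format, not taken from a real file.
SAMPLE_XML = """<treebank>
  <sentence id="1">
    <word id="1" form="puella" lemma="puella" postag="n-s---fn-" head="3" relation="SBJ"/>
    <word id="2" form="rosam" lemma="rosa" postag="n-s---fa-" head="3" relation="OBJ"/>
    <word id="3" form="amat" lemma="amo" postag="v3spia---" head="0" relation="PRED"/>
  </sentence>
</treebank>"""


def read_words(xml_text):
    """Return one dict per annotated word, in document order."""
    root = ET.fromstring(xml_text)
    words = []
    for sentence in root.iter("sentence"):
        for word in sentence.iter("word"):
            words.append({
                "sentence": sentence.get("id"),
                "id": int(word.get("id")),
                "form": word.get("form"),
                "head": int(word.get("head")),  # 0 marks the sentence root
                "relation": word.get("relation"),
            })
    return words


def relation_counts(words):
    """Count how often each relation label occurs."""
    return Counter(w["relation"] for w in words)


if __name__ == "__main__":
    for label, n in relation_counts(read_words(SAMPLE_XML)).most_common():
        print(label, n)
```

To use it on a real export, read the file's contents into a string and pass that to read_words.  A partially annotated file may contain words with no head yet, which this sketch does not handle.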

Since the platform is still undergoing rapid development, getting started can be daunting.  Here is a guide to uploading your first text and beginning to annotate.

  1. First, prepare your digitized text, making sure it is free from spelling and punctuation errors.
  2. Now go to http://sosol.perseids.org/sosol/.
  3. Click the Sign In link at the top right of the page.
  4. Rather than making you create yet another user id and password, the Perseids/Arethusa platform makes things simple by letting you log in with an account you probably already have, e.g. a Google account.
  5. After logging in you will arrive at the Perseids dashboard screen.  Click New Treebank Annotation to upload your first text for annotation.
  6. There are several options for uploading your text.  
    1. Cut and paste a text you have prepared into the top text box. 
    2. Select a text from an assortment of available texts.
    3. Enter a URI which points to your text on the web.
    4. Upload an existing treebank XML file to edit.
  7. Now select the language and text direction and click Edit to upload your text.
  8. Your text will now show up in the list of texts on the dashboard.  Click the text in the Treebank Annotation column.  When the text loads, click on the first sentence to begin annotating.

Learning how to annotate a text for a treebank requires a great deal of grammatical precision and also a knowledge of the conventions of the format.  Fortunately, there are two video tutorials which will help you get started.

  1. Part one of the video tutorial details how to add morphological annotations:
    https://www.youtube.com/watch?t=1&v=FbRRoVnVuDs
  2. Part two goes through creating the dependency hierarchy and labeling the relations:
    https://www.youtube.com/watch?v=hp-bhasd96g
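
To make the idea of a dependency hierarchy concrete, here is a small Python sketch, separate from the tutorials, that turns head pointers into an indented tree.  It assumes each word records the id of its head, with 0 marking the root, and a relation label; the Latin sentence and its labels are an invented illustration rather than output from Arethusa.

```python
from collections import defaultdict

# Invented annotated sentence: (id, form, head, relation).
# head == 0 means the word hangs directly from the sentence root.
SENTENCE = [
    (1, "puella", 3, "SBJ"),
    (2, "rosam", 3, "OBJ"),
    (3, "amat", 0, "PRED"),
]


def children_by_head(words):
    """Group words under the id of the word they depend on."""
    children = defaultdict(list)
    for word_id, form, head, relation in words:
        children[head].append((word_id, form, relation))
    return children


def render(children, head=0, depth=0):
    """Return the tree as indented lines, one word per line."""
    lines = []
    for word_id, form, relation in children.get(head, []):
        lines.append("  " * depth + f"{form} [{relation}]")
        lines.extend(render(children, word_id, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(render(children_by_head(SENTENCE))))
```

The verb sits at the top and its subject and object are indented beneath it, which is the same shape the graphical tree shows.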

For texts in Ancient Greek, even more precise syntactic annotation is possible by turning on the Smyth Grammar Tag Set.  Smyth is the standard grammar of Ancient Greek in English.  This option is accessible from the advanced menu when initially uploading your text.  A guide is available detailing the three levels of annotation offered when the Smyth Grammar Tag Set is enabled.  Texts can be annotated on the morphological level, the Prague syntactic level, and an advanced syntactic level specific to Ancient Greek, which uses categories from Smyth’s Grammar.  This makes it possible to tag words by Greek case uses such as “locative dative,” etc.

In a subsequent post, I will discuss some of the types of studies which can be conducted using treebanks.