The Linked Fragment: TEI and the Encoding of Text Reuses of Lost Authors

This paper presents a joint project of the Humboldt Chair of Digital Humanities at the University of Leipzig, the Digital Library at Tufts University, and the Harvard Center for Hellenic Studies to produce a new open series of Greek and Latin fragmentary authors. Such authors are lost and their works are preserved only thanks to quotations and text reuses in later texts. The project is undertaking two tasks: (1) the digitization of paper editions of fragmentary works with links to the source texts from which the fragments have been extracted; (2) the production of born-digital editions of fragmentary works. The ultimate goals are the creation of open, linked, machine-actionable texts for the study and advancement of the field of Classical textual fragmentary heritage and the development of a collaborative environment for crowdsourced annotations. These goals are being achieved by implementing the Perseids Platform and by encoding the Fragmenta Historicorum Graecorum, one of the most important and comprehensive collections of fragmentary authors.

It includes a workflow engine that enables documents and data of different types to pass through flexible review and approval processes. The SoSOL application includes user interfaces for editing XML documents, metadata, and annotations. While it does not include a full-featured XML editor, it supports alternative text-based input of XML markup and can enforce XML schema validation rules on the documents being edited.

A key goal behind the initial development of the platform was to enable original undergraduate research in the field of Classics. 12 The workflows related to the encoding of text reuses and lost authors represent core use cases for the current phase of work on the platform. 13 In developing features of the Perseids Platform to support these workflows, we are focusing first and foremost on the data. We expect that techniques for visually representing digital editions will change rapidly with technology. So, while our work includes prototype representations of digital editions suitable for publication on the web, our first priority is to enable scholars to create data about the authors, texts and related commentaries, annotations, links, and translations in a way that encourages and facilitates their preservation and reuse. We have identified the following core requirements to meet this goal:

• The ability to represent the texts themselves, links between them, and annotations and [...]

2.1 Data Formats

2.1.1 Texts

We use TEI encoding to represent the source texts and the textual fragments preserved within them. 14 TEI provides the markup syntax and vocabulary needed to produce XML in which citable passages of text can be unambiguously identified and linked to within their preserving context, a key requirement for the representation of text reuses as discussed above. For example, the following excerpt from Athenaeus' Deipnosophists 3.6 uses the TEI <div> element to demarcate the book and chapter:

<div type="textpart" subtype="book" n="3"> ...
<div type="textpart" subtype="chapter" n="6"> <p> ...

[...] annotations, and the lost texts themselves. As URNs, these identifiers are not web-resolvable on their own. By combining them with a URL prefix and deploying CTS and CITE services to serve the identified resources at those addresses, we have resolvable, stable identifiers for our texts, data objects, and annotations (Smith and Blackwell 2012). One of our key motivations for using CTS URNs is that they give us a robust means of targeting annotations at specific substrings of text within a canonical work. The pointers to the text are specific to the location of these strings within their canonical citation structure, and not to the XML markup of any particular digital edition of the text. Further, the URN syntax degrades gracefully to allow us to reference either the notional work or a very specific edition of that work. Our goal in using this syntax, together with RDF and stand-off markup techniques as discussed below, is to enable the assertions we are making to stand on their own as data, independent of the encoding techniques used to digitize the text.
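The way a CTS URN degrades from a specific edition to the notional work, and becomes resolvable once a URL prefix is added, can be illustrated with a minimal sketch. The parser below handles only the simple case (no passage ranges), the service prefix is hypothetical, and the example URNs, including the version label and the substring reference, are illustrative assumptions rather than identifiers taken from the project.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical base address of a CTS service; not taken from the project.
SERVICE_PREFIX = "http://example.org/cts/"


@dataclass(frozen=True)
class CtsUrn:
    namespace: str
    textgroup: str
    work: Optional[str] = None
    version: Optional[str] = None
    passage: Optional[str] = None
    subreference: Optional[str] = None

    def work_level(self) -> str:
        """URN of the notional work, dropping version, passage, and substring."""
        hierarchy = ".".join(p for p in (self.textgroup, self.work) if p)
        return f"urn:cts:{self.namespace}:{hierarchy}"


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a simple CTS URN into its components (passage ranges not handled)."""
    fields = urn.split(":")
    if len(fields) < 4 or fields[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    hierarchy = fields[3].split(".")
    passage_field = fields[4] if len(fields) > 4 else ""
    passage, _, subreference = passage_field.partition("@")
    return CtsUrn(
        namespace=fields[2],
        textgroup=hierarchy[0],
        work=hierarchy[1] if len(hierarchy) > 1 else None,
        version=hierarchy[2] if len(hierarchy) > 2 else None,
        passage=passage or None,
        subreference=subreference or None,
    )


def resolvable(urn: str, prefix: str = SERVICE_PREFIX) -> str:
    """Turn a URN into a web address by prepending a service prefix."""
    return prefix + urn


# Illustrative URNs: one targets a substring in a specific edition,
# the other targets the same passage in the notional work.
edition_passage = parse_cts_urn(
    "urn:cts:greekLit:tlg0008.tlg001.perseus-grc1:3.6@word[1]"
)
notional = parse_cts_urn("urn:cts:greekLit:tlg0008.tlg001:3.6")
```

Both parsed URNs share the same work-level identifier, which is what lets an annotation made against one edition be related to the canonical work rather than to the markup of a particular file.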

For example, the following set of identifiers might be used to represent a reuse of a lost work of [...]

The RDF data model also gives us more precision than XML with respect to targets of annotations and subjects of assertions. A human reader can usually tell from context whether an assertion concerns the wording of a sentence, the writing of a sentence as an event, the author of a sentence, or another scholar's assertion about the sentence. XML has no standardized semantics, and interpretations of relationships among elements and attributes are often underspecified in XML vocabularies (Renear et al. 2002). In RDF, subject, predicate, and object roles in a statement are explicit at the data structure level. Distinctions among domain entities (such as author vs. authorship event, or morpheme vs. sentence) are encoded as a class identity for the resource.
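A minimal sketch may make the point about explicit roles and class identity concrete. It models RDF statements as plain Python tuples instead of using an RDF library. The only real identifier is the standard rdf:type property; the example.org namespace, the class names, and the predicates are hypothetical and do not come from the project's vocabularies.

```python
from typing import List, NamedTuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace for illustration only


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# A sentence, the event of writing it, and its author are three distinct
# resources, each with its own class identity.
graph: List[Triple] = [
    Triple(EX + "sentence-1", RDF_TYPE, EX + "Sentence"),
    Triple(EX + "writing-1", RDF_TYPE, EX + "AuthorshipEvent"),
    Triple(EX + "author-1", RDF_TYPE, EX + "Author"),
    Triple(EX + "writing-1", EX + "produced", EX + "sentence-1"),
    Triple(EX + "writing-1", EX + "carriedOutBy", EX + "author-1"),
]


def classes_of(triples: List[Triple], resource: str) -> List[str]:
    """Return the classes asserted for a resource via rdf:type."""
    return [t.obj for t in triples
            if t.subject == resource and t.predicate == RDF_TYPE]


def statements_about(triples: List[Triple], resource: str) -> List[Triple]:
    """Return every statement whose subject is the given resource."""
    return [t for t in triples if t.subject == resource]
```

Because the subject of each statement is a distinct resource, an assertion about the authorship event cannot be mistaken for an assertion about the wording of the sentence.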

Annotations
The term "Annotations" covers a potentially wide variety of data types. We can have simple annotations in the form of typed links between data points, such as a link between a textual fragment and a proposed author of that fragment; detailed textual commentaries making an assertion about a text; a complex morphosyntactic analysis of a section of text; or an alignment between editions or translations of a text. The Open Annotation (OA) data model "specifies an interoperable framework for creating associations between related resources, annotations, using a methodology that conforms to the Architecture of the World Wide Web." 18 The OA model enables us to serialize every annotation in its simplest form, as a link between one or more target items being annotated and one or more bodies representing the contents of the annotation. OA also gives us a standard vocabulary for categorizing the motivation for the annotations. URIs are used to specify both the target and the body of the annotation.

We use the OA data model in two ways. It is the primary representation of an annotation in cases where the annotation is created by linking two identifiers (such as a link between a passage in a text and an identifier for a named entity or event). It is also a serialization method for more complex annotations, where the annotation process involves the creation of complex documents as annotation bodies, which we can then reference by their URI identifiers. In the latter case, we use a variety of standard formats for the actual annotation bodies, including:
• The Perseus Ancient Greek and Latin Treebank Schema 19 for morphosyntactic analyses.
• The Alpheios Translation Alignment Schema 20 for text alignments.
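The simplest form of annotation described above, a typed link between targets and bodies, can be sketched as a small JSON-LD document. The key names and context URL follow our reading of the 2013 OA community draft and should be checked against the specification. All URIs and URNs in the example are placeholders, not project identifiers.

```python
import json
from typing import Dict, Iterable, List, Union

# Context URL as published with the 2013 OA community draft (assumption:
# verify against the specification before relying on it).
OA_CONTEXT = "http://www.w3.org/ns/oa-context-20130208.json"

Annotation = Dict[str, Union[str, List[str]]]


def make_annotation(annotation_uri: str,
                    targets: Iterable[str],
                    bodies: Iterable[str],
                    motivation: str) -> Annotation:
    """Build an annotation linking one or more targets to one or more bodies."""
    return {
        "@context": OA_CONTEXT,
        "@id": annotation_uri,
        "@type": "oa:Annotation",
        "motivatedBy": motivation,
        "hasTarget": list(targets),
        "hasBody": list(bodies),
    }


# Placeholder identifiers: a passage in a preserving text is linked to an
# identifier for the lost author to whom the reuse is attributed.
annotation = make_annotation(
    annotation_uri="http://example.org/annotations/1",
    targets=["http://example.org/cts/urn:cts:example:textgroup.work:1.1"],
    bodies=["http://example.org/authors/lost-author-1"],
    motivation="oa:linking",
)

serialized = json.dumps(annotation, indent=2)
```

For a complex annotation such as a treebank or an alignment, the same structure would apply, with the body URI pointing at the separately stored analysis document.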

Collections
We need to be able to organize text reuses into various types of collections of data, including:
• those represented in a given traditional print edition, which may comprise reuses from one or many authors;
• all text reuses attributed to a specific author;
• all text reuses quoted by a specific author;
• all text reuses referencing a specific topic;
• all text reuses attributed to a specific time period.
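These groupings can all be derived from the same records, as the following minimal sketch suggests. The record fields and every identifier in the sample data are hypothetical, and the sketch covers only three of the grouping criteria listed above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class TextReuse:
    identifier: str
    attributed_author: str   # the lost author to whom the reuse is attributed
    quoting_author: str      # the author of the preserving text
    print_edition: str       # the traditional edition that includes the reuse


def collect_by(reuses: Iterable[TextReuse], field: str) -> Dict[str, List[str]]:
    """Group reuse identifiers by the value of one record field."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for reuse in reuses:
        groups[getattr(reuse, field)].append(reuse.identifier)
    return dict(groups)


# Placeholder data for illustration only.
reuses = [
    TextReuse("reuse-1", "lost-author-A", "witness-X", "edition-1"),
    TextReuse("reuse-2", "lost-author-A", "witness-Y", "edition-1"),
    TextReuse("reuse-3", "lost-author-B", "witness-X", "edition-1"),
]

by_attributed_author = collect_by(reuses, "attributed_author")
by_quoting_author = collect_by(reuses, "quoting_author")
by_edition = collect_by(reuses, "print_edition")
```

The same reuse appears in several collections at once, which is why the collections are better treated as views over shared data than as separate copies.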

Provenance
Scholars produce data through a variety of activities, observations, and other events. Some data is created, other data discovered. The provenance of a digital object is an account of its origin and change over time. Documenting and preserving data provenance in a structured, machine-readable format enables us to more precisely track and document shared resources, ultimately improving data quality and encouraging further sharing. Specifically, we intend our retrieval, editorial annotation, and communication tools to create records of the research transactions in which they are used. These records are expressed in RDF vocabularies that are based on abstract provenance models. 21 According to Groth et al. (2006), two key principles for provenance data are that (1) actors must only record propositions that they know to be true, through statements of what they observe; and (2) each statement of provenance must be attributable to a particular actor.

In our use case, we need to be able to (1) reference ancient data that can be identified but that did not literally come into existence as the result of any modern computational interaction (and which may in fact no longer be extant in any preserved source); and (2) identify the role a data item, such as an ancient scholarly assertion, plays as the vehicle for the modern scholarly claims.
A third requirement, which results from the second, is that we need to be able to represent the assertions of the ancient scholars, on which our modern assertions depend, in a format that can be included computationally in a common data set with the modern claims.

[...] like these is to employ complementary data provenance models: one with a transformational view and another with a semiotic view of the same research events.

The W3C's PROV is a specification for expressing provenance records with descriptions of the entities and activities involved in producing and delivering or otherwise influencing a given digital object. 23

Although we could represent this action as an activity involving the scholar's physical interaction with the system, we would lose the significance of the linguistic force behind the scholar's identification of the text.

In reaction to her selection of text, the system computes a URI identifier for that text. This event can be represented as a SAM computation, but in order to represent the fact that the URI came into being as a result of a user-system interaction, it is also appropriate to describe this as a PROV activity that was informed by the scholar's previous action of selecting a string of text.
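The PROV side of this description can be sketched as a handful of statements. The property and class names (prov:Activity, prov:Entity, prov:wasGeneratedBy, prov:wasInformedBy, prov:wasAssociatedWith) belong to the W3C PROV-O vocabulary. The resource identifiers are hypothetical, the statements are held as plain tuples instead of in an RDF store, and the SAM representation of the same event is not modelled here.

```python
from typing import List, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/"  # hypothetical identifiers for illustration

Statement = Tuple[str, str, str]

statements: List[Statement] = [
    # The scholar selects a string of text.
    (EX + "select-text", RDF_TYPE, PROV + "Activity"),
    (EX + "select-text", PROV + "wasAssociatedWith", EX + "scholar"),
    # In reaction, the system computes a URI for the selection.
    (EX + "compute-uri", RDF_TYPE, PROV + "Activity"),
    (EX + "compute-uri", PROV + "wasInformedBy", EX + "select-text"),
    # The URI identifier is the entity generated by that computation.
    (EX + "passage-uri", RDF_TYPE, PROV + "Entity"),
    (EX + "passage-uri", PROV + "wasGeneratedBy", EX + "compute-uri"),
]


def objects(data: List[Statement], subject: str, predicate: str) -> List[str]:
    """Return the objects of all statements with this subject and predicate."""
    return [o for s, p, o in data if s == subject and p == predicate]


def informing_activities(data: List[Statement], entity: str) -> List[str]:
    """Find the activities that informed whatever activity generated an entity."""
    found: List[str] = []
    for activity in objects(data, entity, PROV + "wasGeneratedBy"):
        found.extend(objects(data, activity, PROV + "wasInformedBy"))
    return found


origin = informing_activities(statements, EX + "passage-uri")
```

Following the chain from the URI back through the computation leads to the scholar's selection, so the record preserves the fact that the identifier arose from a user-system interaction.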

Dissemination and Presentation
As discussed above, our primary focus thus far has been on capturing the data about the authors, texts and related commentaries, annotations, links, and translations in a way that is accurate and also encourages and facilitates its preservation and reuse. Visual representation of the data is one type of reuse, and the data format selections have been made with the need to support disseminations for online presentation in mind. The JSON-LD syntax recommended by OA allows us to easily build a dynamic display interface in JavaScript which navigates the JSON-LD data object and retrieves the datasets identified as the targets and bodies of the annotations at their addressable URIs, as served by supporting CTS and CITE services. Our prototype interface 25 provides a demonstration of one possible approach to a digital representation of text reuse data.
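The navigation step can be sketched independently of any display code. The prototype interface itself is written in JavaScript; the Python sketch below only illustrates the logic of walking a parsed JSON-LD annotation and listing the URIs a client would need to retrieve. The key names assume the OA JSON-LD context, and the sample document uses placeholder URIs.

```python
import json
from typing import Any, Dict, List


def as_uri_list(value: Any) -> List[str]:
    """Normalize a JSON-LD value (string, object with @id, or list) to URIs."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    uris: List[str] = []
    for item in items:
        if isinstance(item, str):
            uris.append(item)
        elif isinstance(item, dict) and "@id" in item:
            uris.append(item["@id"])
    return uris


def resources_to_fetch(annotation: Dict[str, Any]) -> Dict[str, List[str]]:
    """List the target and body URIs that a display client would retrieve."""
    return {
        "targets": as_uri_list(annotation.get("hasTarget")),
        "bodies": as_uri_list(annotation.get("hasBody")),
    }


# Placeholder document: one target given as a string, bodies given as a
# mixed list of a string and an object with an @id.
document = json.loads("""
{
  "@id": "http://example.org/annotations/1",
  "@type": "oa:Annotation",
  "hasTarget": "http://example.org/cts/urn:cts:example:textgroup.work:1.1",
  "hasBody": [
    "http://example.org/bodies/commentary-1",
    {"@id": "http://example.org/bodies/alignment-1"}
  ]
}
""")

to_fetch = resources_to_fetch(document)
```

A client would then request each listed URI from the relevant CTS or CITE service and render the retrieved text alongside the annotation body.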
Similarly, although we hope eventually to be able to represent the rich provenance metadata discussed above in the visual representation of our digital editions, we are above all concerned with capturing the data in a way that ensures they can be preserved and serve as the basis for further research.

The Leipzig Humboldt Chair and Perseus Digital Library partnership entails a synergetic effort not only to enrich the Perseus Catalog 29 but also to provide new testing material for the continuous development of the Perseids Platform, where the first EpiDoc version of the FHG will be placed for further annotation. This undertaking is proving to be invaluable not only for the growth of partner resources but also for the development of the EpiDoc schema itself. 30

The project team is also working on the text of the so-called Marmor Parium (IG 12.5.444), a fragmentary inscription from the island of Paros that preserves a Hellenistic chronicle from the reign of Cecrops (1581/1580 BCE) to the archonship of Euctemon (299/298 BCE). 31 The epigraphical text was edited in the first volume of the FHG because of its historiographical value (Müller 1878-85, 1:533-90). The author of the text is unknown, but the content reflects his choices: it consists of a list of historical events based mainly on Athenian history. In this respect, this evidence is a perfect example of a fragmentary author whose work is preserved not through quotations in later texts, but in a fragmented original form. The text of the inscription is also being encoded in EpiDoc. An important part of the project is the identification of named entities mentioned in the inscription (such as names of kings and magistrates, personal names, and place names). The Pleiades gazetteer has been referenced for the place names. 32 The identification of individuals will make use of and feed into the Standards for Networking Ancient Prosopographies project (SNAP). 33 The team is also producing a visualization of the chronology preserved by the Marmor Parium with the open source tool TimelineJS, which allows the comparison of the text not only with other ancient chronologies but also with different chronological interpretations of the content of the inscription made by modern scholars (Berti and Stoyanova 2014). 34

The EpiDoc Guidelines and Encoding Process
The digital text of the FHG was obtained by feeding the volume scans available at the Internet Archive 35 to an Optical Character Recognition (OCR) engine, which transformed the printed text into digital form. The error-laden Greek output was corrected semi-automatically and stored in text files. Any remaining errors are manually rectified during the encoding process. The editorial team of LOFTS creates one XML file per fragmentary author. Every file contains a <teiHeader> with specific information about the author, volume, or book in question. As for the <text>, Müller's layout is not kept, since the Greek text is separated from its Latin translation. The structure within <text> reflects the structure of each volume, using <div type="textpart"> with different @subtype values whenever needed. Müller's Latin translations of the fragments are encoded in a separate XML file to facilitate text alignment.

For example, fragment 6 of Ion of Chios will be encoded as shown in figures 2, 3, and 4.

While retaining the same number (6), the Greek is encoded in a <cit> and the Latin translation in a <p>. The <cit> is broken down into a <bibl>, which contains the source <author> (Plutarch) and work, and a <quote>, which contains the Greek text. Whenever there is a note, the <note> element is also placed within <cit>. Any information that does not strictly pertain to the fragment is encoded in a <p>. This initial editorial stage also involves replacing special characters, such as the ae diphthong, with their Unicode entities (see the Latin translation, where ae is displayed as &#230;) in order to avoid font and potential display issues and confusion between graphically similar but semantically different characters (e.g., Latin capital C and Greek capital lunate sigma). Numbers, including ordinals, are also tagged. Footnotes are given arabic numbers (Müller uses asterisks) for clarity.
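The structure just described can be illustrated with a minimal sketch that reads a <cit> and returns its parts. The sample below is not the project's encoding of the fragment. The enclosing <div> with subtype "fragment" and the use of <title> for the work are assumptions, and the title, Greek text, and note are placeholders standing in for content not reproduced here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Union

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Placeholder sample: only the element structure follows the description
# in the text; wrapper, <title>, and all text content are assumptions.
SAMPLE = """\
<div xmlns="http://www.tei-c.org/ns/1.0" type="textpart" subtype="fragment" n="6">
  <cit>
    <bibl><author>Plutarch</author> <title>PLACEHOLDER TITLE</title></bibl>
    <quote>PLACEHOLDER GREEK TEXT</quote>
    <note>PLACEHOLDER NOTE</note>
  </cit>
</div>
"""


def read_fragment(xml_text: str) -> Dict[str, Union[Optional[str], bool]]:
    """Extract fragment number, source author, and quoted text from a <cit>."""
    div = ET.fromstring(xml_text)
    cit = div.find("tei:cit", NS)
    if cit is None:
        raise ValueError("no <cit> element found in fragment")
    return {
        "n": div.get("n"),
        "source_author": cit.findtext("tei:bibl/tei:author", namespaces=NS),
        "quote": cit.findtext("tei:quote", namespaces=NS),
        "has_note": cit.find("tei:note", NS) is not None,
    }


fragment = read_fragment(SAMPLE)
```

Keeping the source reference and the quoted text as sibling children of <cit> means that either can be retrieved without parsing the other.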
The FHG does not include a critical apparatus, but sometimes Müller signals variant readings in the text, and these are tagged as <rdg> elements inside an <app> within the main text. Personal names are simply tagged as <persName>. Names often carry additional information, such as patronymics and epithets or clues about an individual's profession. Encoding such specificity is not a priority at stage one but is expected at stage two.

Another problematic feature when working with editions of fragmentary works is how to encode specific editorial decisions concerning the attribution of a quotation or a text reuse to an author whose original works are lost. Müller sometimes adds different marks (parentheses, square brackets, or question marks) to the fragment number in order to signal uncertainty about the attribution of that fragment. 38 In these cases the uncertainty is encoded using the @ana attribute on the <cit> element, with one of the following values: "#dubia", "#incerta", and "#anonyma".
These values are defined in the <classDecl> element in <encodingDesc> in the header. This choice was made because, when the project started, not all relevant elements allowed the @cert attribute. Even if they had, Müller's reasons for marking something as uncertain vary a lot, and he is also inconsistent in his use of these sigla. Because of this inconsistency, the use of the <certainty> element also proved problematic, since it was at times difficult to determine whether the uncertainty concerned the content of the citation, the attribution to an author, or the extent of the quotation. So it was decided to roughly map Müller's three sigla to what seemed the most frequent and probable uncertainty categories across the collection and, rather than have @cert or <certainty> sometimes in <cit>, sometimes in <bibl>, and sometimes in <quote>, to have three different values of the @ana attribute in <cit>.
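One practical consequence of placing the value consistently on <cit> is that uncertain attributions can be retrieved with a single query. The following minimal sketch counts them in a placeholder sample. The three @ana values are those named above, while the sample structure and content are assumptions, and the correspondence between Müller's sigla and the three values is not modelled here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"

# Placeholder sample: four citations, three of them marked as uncertain.
SAMPLE = """\
<body xmlns="http://www.tei-c.org/ns/1.0">
  <cit ana="#dubia"><quote>PLACEHOLDER</quote></cit>
  <cit ana="#incerta"><quote>PLACEHOLDER</quote></cit>
  <cit><quote>PLACEHOLDER</quote></cit>
  <cit ana="#dubia"><quote>PLACEHOLDER</quote></cit>
</body>
"""


def uncertainty_counts(xml_text: str) -> Counter:
    """Count <cit> elements per @ana value, ignoring unmarked citations."""
    root = ET.fromstring(xml_text)
    counts: Counter = Counter()
    for cit in root.iter(TEI + "cit"):
        # @ana may hold several space-separated pointers.
        for pointer in cit.get("ana", "").split():
            counts[pointer.lstrip("#")] += 1
    return counts


counts = uncertainty_counts(SAMPLE)
```

Had the same information been spread across @cert and <certainty> on different elements, the query would need to inspect <cit>, <bibl>, and <quote> separately.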

Stage Two: Crowdsourced Annotation
Once complete, these basic XML files are deposited in the Perseids Platform for further annotation by third parties. While still in testing mode, this second stage encourages annotators to tag any additional information, including, but not limited to, the following:
• Names (person). 39
• Names (place). 40
• Bibliographic references: expansion of bibliographic references and linking to the bibliography file.
• Titles within Greek and Latin text.
• Stage two encoding of the Latin translations (after initial encoding at stage one).

The editorial board provides the final review for each file. EpiDoc-encoded DFHG files are being progressively added to the DFHG GitHub repository 41 for everyone to download, improve, and share in accordance with our CC BY-SA 4.0 International License.