TeiCoPhiLib: A Library of Components for the Domain of Collaborative Philology

In this article we illustrate a work in progress related to the design of a library of software components devoted to editing, processing, and visualizing TEI-annotated documents in the domain of philological studies, in particular in the subdomain of collaborative philology, which concerns the social activity of scholars focused on shared philological tasks. We discuss the technologies related to XML markup languages and the processing of marked-up documents. We describe the method used to design and implement the TeiCoPhiLib, outlining the design patterns as well as discussing general benefits of the overall architecture. Finally, we present case studies in which some components of our library currently implemented in Java have been used.

In distributed and collaborative environments, the maintenance of links and relations among editable XML documents is a challenging task (Di Iorio et al. 2009; Peroni, Poggi, and Vitali 2013; Schmidt and Colomb 2009). Indeed, this kind of environment must preserve the referential integrity among interrelated textual units and the consistency among interlinked contents (Schmidt 2014; Barabucci et al. 2013; Ronnau, Philipp, and Borghoff 2009). It is worth noting that both texts and related annotations may change asynchronously. Accordingly, the problems with maintenance have an increasing relevance in social editing (Siemens et al. 2012) and in general in collaborative philology. This emerging field concerns the social activity of scholars focused on shared philological tasks (such as scholarly editing and collaborative annotation) through a cyberinfrastructure (Terras and Crane 2010; Crane, Seales, and Terras 2009). In order to face this challenge, our approach exploits software engineering techniques illustrated in section 3.3, which explains the TeiCoPhiLib design patterns.
For this reason, the design of TeiCoPhiLib widely leverages the stand-off approaches provided by the TEI Guidelines, that is, both the reference to plain text offsets and the reference to nodes denoted by the @xml:id unique identifiers (TEI Consortium 2015; Wittern, Ciula, and Conal 2009; Pierazzo 2015). Consequently, the design of the components aims at the separation of concerns through four distinct layers: (1) textual structure; (2) semantics; (3) style; and (4) behavior, in order to ensure modularity, scalability, and flexibility.
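The offset-based form of stand-off reference mentioned above can be illustrated with a minimal Java sketch. The class and field names (OffsetAnnotation, layer, value) are hypothetical and are not taken from the TeiCoPhiLib; the sketch only assumes that an annotation stores character offsets into an unmodified plain text and the name of the layer it belongs to.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is not the TeiCoPhiLib API.
final class OffsetAnnotation {
    final int start;     // inclusive character offset into the plain text
    final int end;       // exclusive character offset
    final String layer;  // one of the four layers, e.g. "semantics" or "style"
    final String value;  // the information attached to the span

    OffsetAnnotation(int start, int end, String layer, String value) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid span " + start + ".." + end);
        }
        this.start = start;
        this.end = end;
        this.layer = layer;
        this.value = value;
    }

    /** Returns the stretch of text this annotation points at; the text itself is never modified. */
    String resolve(String plainText) {
        return plainText.substring(start, end);
    }
}

final class StandoffDemo {
    public static void main(String[] args) {
        String text = "Io nacqui veneziano";
        List<OffsetAnnotation> annotations = new ArrayList<>();
        // The annotation values below are illustrative only.
        annotations.add(new OffsetAnnotation(10, 19, "semantics", "nationality"));
        annotations.add(new OffsetAnnotation(0, 2, "style", "font-variant:small-caps"));
        for (OffsetAnnotation a : annotations) {
            System.out.println(a.layer + " -> \"" + a.resolve(text) + "\" = " + a.value);
        }
    }
}
```

A reference by @xml:id would follow the same idea, with the pair of offsets replaced by the identifier of the target node.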

Background
The main benet of XML, and especially of the TEI Guidelines (TEI Consortium 2015), resides in simplicity, exibility, readability, and customizability, with the assurance of a formal approach for validating the marked-up data.Consequently, XML provides a standard way to dene a set of tags (vocabulary) for specic purposes.Moreover, the cluster of technologies associated with XML 3 allows us to process, query and publish structured documents.
Several frameworks and initiatives have been developed over the years for handling XML, achieving great results and benets for both scholars and developers.Among others, the opensource general-purpose framework Cocoon 4 and the native XML database eXist-db 5 deserve to be mentioned.Specically for TEI-annotated documents, TUSTEP, 6 TEIBoilerplate, 7 TXM, 8 and TAPAS 9 are prominent projects.For all of these initiatives, the transformation from an XML document structure to another format by XSLT can be considered the focal point.

Flexibility and Reusability
A document-oriented approach can be complemented by an application/API-oriented approach for the development of textual analysis tools. We are adopting a top-down design integrated with bottom-up processes (Del Grosso and Boschetti 2013), which allows us to generalize, extend, and refactor the overall architecture as new requirements and common issues emerge from use cases under development. The design is top-down because we are defining both the general abstract framework and the mechanisms that allow us to implement new functionalities according to emerging needs. On the other hand, our library also adopts a bottom-up approach because we apply refactoring strategies to adapt existing components implemented in our previous projects to the general framework, extending the framework.
The library of components is designed by exploiting object-oriented methods and processes such as analysis of requirements, definition of the domain entities, separation of concerns, information hiding, and software reusability and extensibility (Fowler 1996). Extensive use of design patterns (i.e., recurring solutions to common problems within a given context [Gamma et al. 1995]) facilitates the achievement of these goals.
Agile software development and use case-driven modeling (Rosenberg and Stephens 2007) ensure the progressive enhancement of old functionalities and the development of new ones. The aforementioned paradigm is applied in the TeiCoPhiLib library by (1) the implementation of a flexible importing and normalization module in the pre-processing phase, which ensures a coherent abstraction model of the resources; (2) the definition of the functional specification by designing the objects and by declaring the application interfaces; (3) the export of the information encapsulated in the objects into different data formats, in order to enable data integration and data exchange.

Separation of Concerns
First of all, objects that represent the whole document or interrelated documents are initialized by parsing the original TEI document(s) and by creating a new data structure, which decouples the orthogonal information conveyed by the XML elements: (1) textual structure, (2) semantics, (3) style, and (4) behavior. It is important to point out that the new data structure is the result of transformations (by XSLT DOM transformations or SAX event-driven transformations) managed during the parsing process. Thus, the current implementation of the TeiCoPhiLib exposes methods that parse the XML file and create Java objects. The resources are stored and maintained in a native XML database management system (i.e., eXist-db). The APIs and services provided by Lucene, a software library developed and hosted by the Apache Foundation, have been used for indexing the textual data.
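As a rough illustration of this decoupling, the following sketch parses a small TEI fragment with the standard Java DOM parser and files each kind of information into its own map. It is a simplification under stated assumptions: it uses plain DOM parsing instead of the XSLT or SAX transformations described above, and the class LayeredDocument and its fields are hypothetical names, not the library's own.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch; not the TeiCoPhiLib data model.
final class LayeredDocument {
    final Document structure;                                      // layer 1: textual structure
    final Map<Element, String> semantics = new LinkedHashMap<>();  // layer 2: tag name and @type
    final Map<Element, String> style = new LinkedHashMap<>();      // layer 3: @style
    final Map<Element, String> language = new LinkedHashMap<>();   // input for layer 4: @xml:lang

    private LayeredDocument(Document structure) {
        this.structure = structure;
    }

    static LayeredDocument parse(String tei) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(tei.getBytes(StandardCharsets.UTF_8)));
            LayeredDocument doc = new LayeredDocument(dom);
            doc.collect(dom.getDocumentElement());
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("input could not be parsed as XML", e);
        }
    }

    /** Walks the element tree once and records each kind of information separately. */
    private void collect(Element e) {
        String type = e.getAttribute("type"); // empty string when the attribute is absent
        semantics.put(e, type.isEmpty() ? e.getTagName() : e.getTagName() + ":" + type);
        if (e.hasAttribute("style")) style.put(e, e.getAttribute("style"));
        if (e.hasAttribute("xml:lang")) language.put(e, e.getAttribute("xml:lang"));
        NodeList children = e.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) collect((Element) child);
        }
    }
}

final class LayeredDocumentDemo {
    public static void main(String[] args) {
        String tei = "<div type=\"chapter\" n=\"1\" style=\"font-variant:normal\">"
                + "<p xml:lang=\"ita\"><lb n=\"1\"/>Io nacqui veneziano</p>"
                + "<milestone type=\"page\" n=\"2\"/></div>";
        LayeredDocument doc = LayeredDocument.parse(tei);
        System.out.println("semantics: " + doc.semantics.values());
        System.out.println("style:     " + doc.style.values());
        System.out.println("language:  " + doc.language.values());
    }
}
```

The maps are keyed by DOM node, so the semantic, stylistic, and language information stays linked to the structure without being stored inside it.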
For instance, the information conveyed by the TEI snippet used as a running example (a chapter <div> that contains a paragraph with numbered line breaks, followed by a page milestone) is distributed among the appropriate Java objects that handle the four levels described above. The parsing process concerns the following aspects:

1.
Textual structure. The same document originally structured paragraph by paragraph for literary analysis can easily be restructured page by page for layout analysis and for comparison with the original page image. Semantics, style, and behavior are represented by objects separated from (but linked to) the nodes of the DOM tree.

2.
Semantics. At the semantic level, both attributes (such as @type) and tag names (such as <p>) are processed in the same way and linked to the related DOM node.

3.
Style. The style is managed by separate renderers, which point to textual positions affected by stylistic features. For instance, the information extracted from the @style attribute is used to instantiate the Java objects devoted to managing the rendering information.

4.
Behavior. Behaviors are handled by objects that process textual resources according to the current state of the data structure and the rules to manage such a state. For example, a hyphenator performs its tasks according to the language of the textual data (encoded in the original TEI file, e.g., by @xml:lang="it") and the related hyphenation rules (such as the hyphenation rules for the Italian language, managed by the hyphenator bundles).

The object-oriented representation of the document allows data to be processed dynamically, taking account of its physical and logical structure, in an attempt to overcome the multiple hierarchies issue. This means that the data model of the library has a decoupled and abstract structure which can be serialized in any available file format, including all standard TEI-compliant approaches. Consequently, the TeiCoPhiLib document entity keeps structure and logical information in an independent but related aggregate of objects (figure 3). Furthermore, output modules or visitors (figure 2) can traverse and serialize the object representation of the document to a file. In particular, modules for TEI files define the marshalling process, which encompasses the operation to obtain the encoded XML files. In this way the system provides the actual representation of the document.

Journal of the Text Encoding Initiative, Issue 8, 23/09/2015. Selected Papers from the 2013 TEI Conference.
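The traversal by output modules or visitors described above can be sketched as follows. This is only an illustration of the Visitor idea under simplifying assumptions: the node classes carry no attributes, and all names (NodeVisitor, DocNode, ElementNode, TextNode, XmlSerializer) are hypothetical rather than the library's own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names; a sketch of the Visitor idea, not the TeiCoPhiLib API.
interface NodeVisitor {
    void enter(ElementNode element);
    void leave(ElementNode element);
    void text(TextNode text);
}

abstract class DocNode {
    abstract void accept(NodeVisitor visitor);
}

final class TextNode extends DocNode {
    final String value;

    TextNode(String value) {
        this.value = value;
    }

    @Override
    void accept(NodeVisitor visitor) {
        visitor.text(this);
    }
}

final class ElementNode extends DocNode {
    final String name;
    final List<DocNode> children = new ArrayList<>();

    ElementNode(String name, DocNode... kids) {
        this.name = name;
        for (DocNode kid : kids) children.add(kid);
    }

    @Override
    void accept(NodeVisitor visitor) {
        visitor.enter(this);                              // before the children
        for (DocNode child : children) child.accept(visitor);
        visitor.leave(this);                              // after the children
    }
}

/** One possible output module: marshals the object tree back to XML text. */
final class XmlSerializer implements NodeVisitor {
    private final StringBuilder out = new StringBuilder();

    public void enter(ElementNode e) { out.append('<').append(e.name).append('>'); }
    public void leave(ElementNode e) { out.append("</").append(e.name).append('>'); }
    public void text(TextNode t) { out.append(t.value.replace("&", "&amp;").replace("<", "&lt;")); }

    String result() { return out.toString(); }
}

final class VisitorDemo {
    public static void main(String[] args) {
        ElementNode div = new ElementNode("div",
                new ElementNode("p", new TextNode("Io nacqui veneziano")));
        XmlSerializer xml = new XmlSerializer();
        div.accept(xml);
        System.out.println(xml.result());
    }
}
```

Because the traversal lives in the nodes and the output logic lives in the visitor, another output format only needs another NodeVisitor implementation; the node classes stay unchanged.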

Design Patterns
The overall architecture of the library is based on several design patterns, according to the object-oriented paradigm (Gamma et al. 1995; Buschmann, Henney, and Schmidt 2007). Design patterns were introduced in software engineering in order to provide a common solution for a recurring problem in a specific context. A design pattern defines the instantiation policies, the structure, or the behavior for an aggregate of objects that cooperate to provide a complex but recurrent functionality, such as the creation of polymorphic entities, the management of decoupled modules, or the selection at run time of the most suitable algorithm for the current task. The general idea of object-oriented patterns is to encapsulate functionality and data inside an efficient and flexible collection of classes. The current implementation of the prototype exploits the Java programming language technologies.

1.
The Model-View-Controller (MVC) pattern (Burbeck 1992) determines the architecture of the library by separating the internal representation of the data from the rendering and the behavioral purposes.

2.
The Factory pattern allows designers to enhance the object creation procedure by means of special classes (i.e., the factory classes) that guarantee abstract coupling among system modules (i.e., the object relationships do not reference implementing classes). As a matter of fact, the foregoing design makes it possible to programmatically reference abstract objects independently from the run-time instances which actually perform the task. This design pattern lets applications change the implementation of a class in a flexible way.
In our case, for instance, a document object has textual content maintained in a specific DOM data structure transparent to the user. The client agent is able to manipulate and process the document independently of its internal DOM representation. The algorithms and processing keep the state of the data structure coherent by updating the DOM representation in a transparent way.

3.
The Builder pattern is used to initialize and populate the document data structure. Together with the factory pattern, it hides the real type of the objects from the user agents and maintains the state consistency of the interconnected information. In this way, in the library initialization process, it is possible to create different data structures for different aims, in a way that is completely transparent to the user agents. For instance, a Builder oriented to the layout analysis can restructure the information parsed from the TEI input document.

6.
The Observer pattern provides a mechanism for handling dependencies among interrelated objects. This ensures that when a change occurs, the overall state is synchronized and updated. For example, if an edit operation deletes some values in the document, all related structures are notified and updated accordingly.

The library organizes the entities derived from the original document information through the stand-off approach. In this way the document structure is separate from its semantics, style, and behavior. Finally, the data structure is an object-oriented representation of the entities in the real domain of the digital document, but the storage platform/paradigm could actually be relational, hierarchical, semi-structural (XML), or network/graph-structured (Hohpe and Woolf 2004).
The UML diagram on the left side of figure 1 shows the structure of a TEI document DOM representation. Objects, in this case, exactly map the tags of the TEI-encoded file format. The UML diagram on the right side illustrates the composite-components design. This diagram maps the same aforementioned information and structure, but using a flexible and recursive design. Each TEIComposite object reflects a TEI element and each child is recursively a TEIComposite or a textual element.
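The Factory, Builder, and Observer items in the list above can be sketched together in a few lines of Java. The sketch is a hedged illustration, not the library's implementation: every name in it (EditObserver, EditableDocument, Span, InMemoryDocument, DocumentFactory, EditableDocumentBuilder) is hypothetical, and the document is reduced to a plain character sequence with stand-off spans attached to it.

```java
import java.util.ArrayList;
import java.util.List;

// Every name here is hypothetical; none is taken from the TeiCoPhiLib API.

/** Observer role: anything that must stay consistent with the text. */
interface EditObserver {
    void textDeleted(int start, int length);
}

/** The abstract type that client code programs against. */
interface EditableDocument {
    String text();
    void delete(int start, int length);
}

/** A stand-off span whose offsets are adjusted when text is deleted. */
final class Span implements EditObserver {
    private int start;
    private int end;

    Span(int start, int end) { this.start = start; this.end = end; }

    public void textDeleted(int delStart, int length) {
        start = shift(start, delStart, length);
        end = shift(end, delStart, length);
    }

    private static int shift(int pos, int delStart, int length) {
        if (pos <= delStart) return pos;                    // before the deletion: unchanged
        if (pos >= delStart + length) return pos - length;  // after it: moved left
        return delStart;                                    // inside it: collapsed onto the cut
    }

    String resolve(EditableDocument doc) { return doc.text().substring(start, end); }
}

/** Concrete class; only the factory refers to it by name. */
final class InMemoryDocument implements EditableDocument {
    private final StringBuilder content;
    private final List<EditObserver> observers;

    InMemoryDocument(String content, List<EditObserver> observers) {
        this.content = new StringBuilder(content);
        this.observers = new ArrayList<>(observers);
    }

    public String text() { return content.toString(); }

    public void delete(int start, int length) {
        if (start < 0 || length < 0 || start + length > content.length()) {
            throw new IndexOutOfBoundsException("deletion outside the text");
        }
        content.delete(start, start + length);
        for (EditObserver o : observers) o.textDeleted(start, length); // notify dependents
    }
}

/** Factory role: hides which implementation is instantiated. */
final class DocumentFactory {
    static EditableDocument create(String content, List<EditObserver> observers) {
        return new InMemoryDocument(content, observers);
    }
}

/** Builder role: assembles content and dependents step by step, then builds a consistent whole. */
final class EditableDocumentBuilder {
    private final StringBuilder content = new StringBuilder();
    private final List<EditObserver> observers = new ArrayList<>();

    EditableDocumentBuilder append(String text) { content.append(text); return this; }
    EditableDocumentBuilder observedBy(EditObserver o) { observers.add(o); return this; }
    EditableDocument build() { return DocumentFactory.create(content.toString(), observers); }
}

final class PatternsDemo {
    public static void main(String[] args) {
        Span span = new Span(10, 19); // covers "veneziano"
        EditableDocument doc = new EditableDocumentBuilder()
                .append("Io nacqui ").append("veneziano")
                .observedBy(span)
                .build();
        doc.delete(3, 7); // removes "nacqui "
        System.out.println(doc.text());
        System.out.println(span.resolve(doc));
    }
}
```

Client code sees only the EditableDocument interface, so the concrete class can be replaced inside the factory without touching the builder or the callers, and the span keeps pointing at the same word after the edit.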
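The recursive design described above can be sketched as follows. TEIComposite is the name used in the text; the interface TEIComponent, the leaf class TEIText, and all method names are assumptions introduced for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Only the name TEIComposite comes from the article; everything else is hypothetical.

/** Common type for containers and leaves. */
interface TEIComponent {
    String text();       // concatenated character data of the subtree
    int elementCount();  // number of element nodes in the subtree
}

/** Leaf: a run of character data. */
final class TEIText implements TEIComponent {
    private final String value;

    TEIText(String value) { this.value = value; }

    public String text() { return value; }
    public int elementCount() { return 0; }
}

/** Container: mirrors one TEI element; each child is again a composite or a text leaf. */
final class TEIComposite implements TEIComponent {
    private final String elementName;
    private final List<TEIComponent> children = new ArrayList<>();

    TEIComposite(String elementName) { this.elementName = elementName; }

    TEIComposite add(TEIComponent child) {
        children.add(child);
        return this; // allows chained calls
    }

    String elementName() { return elementName; }

    public String text() {
        StringBuilder sb = new StringBuilder();
        for (TEIComponent child : children) sb.append(child.text());
        return sb.toString();
    }

    public int elementCount() {
        int count = 1; // this element
        for (TEIComponent child : children) count += child.elementCount();
        return count;
    }
}

final class CompositeDemo {
    public static void main(String[] args) {
        TEIComposite p = new TEIComposite("p")
                .add(new TEIComposite("lb")).add(new TEIText("Io nacqui "))
                .add(new TEIComposite("lb")).add(new TEIText("veneziano"));
        TEIComposite div = new TEIComposite("div").add(p);
        System.out.println(div.text());
        System.out.println(div.elementCount());
    }
}
```

Callers treat a single text leaf and a whole chapter through the same interface, which is what makes the whole/part structure uniform.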

Case Studies
The case studies illustrated below have been implemented with the components already developed for our library.
4.1 Euporia: Visualization, Editing, and Annotation of Parallel Texts for Didactic Purposes
Euporia is a project aimed at visualizing, editing, and annotating bilingual texts displayed in parallel. The original digital resources are stored and maintained in authoritative digital libraries available online, such as the Biblioteca Italiana and the Perseus Digital Library, or they are downloaded from social proofreading websites, such as WikiSource, and subsequently processed and marked up in TEI. Some examples of Greek and Latin texts potentially alignable or actually aligned with their Italian translations are shown in table 1.
As mentioned in section 3, different subcollections of texts that must be aligned may provide or omit some extratextual information (such as line number or page number) and they may organize texts in different ways (for instance, lines can be grouped or not inside <lg> elements). For this reason, the XSD schema (which is expected to be a subset of the general TEI schemas) is generated a posteriori from the actual text subcollections. This approach can be considered complementary to the TEI Roma approach, a kind of reverse engineering, which also allows us to generate the ODD file. On the basis of these schemas, XSLT transformations are created in order to deal only with relevant information and canonical formats processed by the appropriate Aligner. Currently only the SpeechAligner for dramatic texts has been implemented: correspondences between the Italian translation and the original Greek text are automatically injected with the @corresp attribute (see row B in table 1) and misalignments must be manually corrected.

As shown in figure 6, the feature "status" (attested / partially attested / conjectural) is not only visualized in different colors (a CSS stylesheet is enough for this task), but also available in query masks to filter the results of a query (for instance, "find only attested or partially attested words") and in tables of results. Unlimited stand-off layers of analysis can be added (such as morphological, syntactic, and semantic analysis), at different levels of granularity (for example, at the level of words for morphological analysis and at the level of sentences for semantic analysis). The layers of annotation are added through the web application and stored in the XML database. Moreover, they can be manually encoded and visualized through the web application.

Conclusion
The TeiCoPhiLib is a work in progress focused on the creation of a library of software components aimed at managing a limited subset of TEI tags used in the domain of collaborative philology.
Because of the increasing complexity of annotations and the multiple usages of the same texts in collaborative environments, stand-off annotation and dense mark-up make it challenging to keep annotated documents readable and manageable. While annotation focuses on types of texts (such as poetic, dramatic, and with or without critical apparatus), software development focuses on abstraction of data structures and behaviors related to those texts (such as searching them in parallel, filtering by morphological features, and comparing text and image).
Reusable software components promote the management of stand-off annotation at any level (such as editing, searching, or visualizing), improving the experience of the annotation and use of TEI documents.
The document parsing in the current Java implementation takes place on the server side, where the Java virtual machine runs within the web application environment.
The marshalling and unmarshalling process handles the serialization of the object representation of the TEI document, in order to store and retrieve data on the file system or in native XML databases, such as eXist-db.
Performance measurement tools such as JMeter will help to optimize the performance of the library components.
The main principles of agile software development that we adopt are: (1) individuals and communication are more important than processes and tools; (2) documentation and design must be accessible to everybody all the time; (3) software development starts as soon as possible; (4) changes and refactoring are part of the design and the development process; (5) all lab team members participate in all presentations; (6) software is organized in short releases and divided into short iterations; (7) results are validated by domain expert collaborations and test-driven development (both unit tests and acceptance tests). The continuous integration and release are supported by open-source Integrated Development Environments (IDEs) like Eclipse or NetBeans and by a software configuration management tool such as SVN or Git for versioning and revision control.
Example TEI snippet:

<div type="chapter" n="1" style="font-variant:normal">
  [...]
  <p xml:lang="ita">
    <lb n="1"/>Io nacqui veneziano ai 18 ottobre del 1775, giorno
    <lb n="2"/>dell'evangelista san Luca; e morrò per la grazia di Dio
    <lb n="3"/>italiano quando lo vorrà quella Provvidenza che governa
    <lb n="4"/>misteriosamente il mondo.</p>
  [...]
  <milestone type="page" n="2"/>
  [...]
</div>

4.
The Composite pattern is the core of the data structure (figure 3). The document object is defined as an aggregation of hierarchical entities with the same data type. The hierarchy maps either the DOM structure of the original XML-TEI document or the structure of one of its transformations based on an XSLT input parameter. Thanks to this pattern, an efficient object-oriented structure, sketched through the UML class diagram on the right of figure 3, represents the whole/part relationships among the objects in the data structures.

5.
The Strategy pattern implements different operations in different ways based on the object type or on specified parameters. For example, the building process uses different strategies, which are driven from specific features given through property files. In the previous example, the original TEI page milestone is represented by an element node in the DOM internal structure; conversely, the original TEI element for paragraph can be represented by a milestone. Furthermore, the Strategy pattern is useful for rendering the same data in multiple views in different contexts or processing the same data with different algorithms.
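The selection of a strategy from configuration, as described for the Strategy pattern, might look like the following sketch. It assumes a hypothetical property key "view" with the values "plain" and "numbered"; the interface LineRenderer and the other class names are likewise illustrative and not taken from the library.

```java
import java.util.Properties;

// Hypothetical names and property key; not the TeiCoPhiLib API.

/** Strategy interface: one way of rendering a line of text. */
interface LineRenderer {
    String render(int lineNumber, String line);
}

/** Renders the line as it is. */
final class PlainRenderer implements LineRenderer {
    public String render(int lineNumber, String line) {
        return line;
    }
}

/** Renders the line preceded by its number and a tab. */
final class NumberedRenderer implements LineRenderer {
    public String render(int lineNumber, String line) {
        return lineNumber + "\t" + line;
    }
}

/** Chooses the strategy at run time from configuration properties. */
final class RendererSelector {
    static LineRenderer from(Properties config) {
        String view = config.getProperty("view", "plain"); // "view" is an assumed key
        return "numbered".equals(view) ? new NumberedRenderer() : new PlainRenderer();
    }
}

final class StrategyDemo {
    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("view", "numbered");
        LineRenderer renderer = RendererSelector.from(config);
        String[] lines = { "Io nacqui veneziano ai 18 ottobre del 1775, giorno", "dell'evangelista san Luca;" };
        for (int i = 0; i < lines.length; i++) {
            System.out.println(renderer.render(i + 1, lines[i]));
        }
    }
}
```

The calling code never branches on the view type; switching the property value is enough to render the same lines differently.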

Figure 1. Class diagram of the Observer pattern designed for the TeiCoPhiLib.

Figure 2. Class diagram of the Visitor pattern designed for the TeiCoPhiLib.
Parallel texts are visualized and managed through EuporiaWebApp (figure 4), a server-side Java web application that complies with the JSR 314 specification and is intended for educational purposes. Students, the end users of Euporia, are allowed to query texts, both jointly and independently, through multilingual or monolingual keywords.