Building a Collaborative Editorial Workbench for Legal Texts with Complex Structures

This paper presents the work undertaken by the Capitularia project to integrate a collaborative editorial workbench into the open-source content management system (CMS) WordPress. It introduces the reasons for selecting WordPress as the project’s CMS, the workflows established (including a sophisticated XSL-scripting pipeline), as well as three plug-ins created to integrate certain functionalities. The Cap-X2WP plug-in facilitates XSL transformations of XML files to HTML directly within the WordPress framework. The Cap-PaGer plug-in is used to generate WordPress pages automatically based on the XML files located in specific folders on the server. Their publication status can be administered at a moment’s notice via a special interface added to the general WordPress dashboard. Whereas the aforementioned plug-ins facilitate the daily work of the staff members in the general management and enhancement of the project’s website, the Cap-Coll plug-in eases the specific editorial task of collating texts by including the CollateX algorithms in a WordPress plug-in. The report concludes with a brief perspective on the possibilities for further developments.

Journal of the Text Encoding Initiative, Issue 11, 12/05/2020. Selected Papers from the 2016 TEI Conference.


Introduction
For scholarly projects with long-term funding, in addition to achieving the project's actual objectives, there is often also an expectation that the project will create a general added value from which other projects or even the entire research community can benefit. This might, for example, be achieved by documenting the experience gained, in order to contribute to the establishment of best practices. Especially in digital humanities projects, a more concrete contribution lies in the development and provision of helpful software applications that allow for reuse. In the last few years, a lot of effort has been put into developing specialized infrastructures (such as GAMS, 3 FuD, 4 or TextGrid 5) to ease the administration and publication of digital humanities data, and thus to avoid insular solutions. 6 Still, the choice of infrastructure is not an easy one, since certain difficulties exist:
• a large number of scholars working with TEI lack the (access to) technical expertise (Burghart and Rehbein 2012) and/or the financial support needed for the proper use of at least some of those;
• for many, especially small-scale ventures, infrastructural projects might even be a little oversized, and/or the familiarization might take too long;
• infrastructures depend on regular funding to ensure maintenance and further improvements, and thus ensure their longstanding viability. Expired funding might block necessary adjustments to the changing technologies of the future, calling into question the sustainability of the software. 7

Especially in a long-term scholarly editorial project such as Capitularia, which is supposed to run for sixteen years (2014 to approx. 2029), the choice of the basic technological framework is extremely important, since the resource is supposed to be available for the entire duration of the project and beyond, and certain requirements need to be met as well. Besides more specific demands, one of these requirements was that the website should be up and running immediately after the project started in the spring of 2014. The research community should be allowed to monitor the project's development and start working on the material provided as soon as possible.
These requirements ruled out a time-consuming in-house development from the beginning and suggested instead an approach that had previously been used in some digital humanities projects. Content management systems already contain many functionalities that serve useful purposes in digital humanities projects. Furthermore, established software that is not specifically intended for a particular use and has a wide-ranging community participating in its constant development could also be more sustainable (Stürmer 2015). WordPress was selected as the CMS for Capitularia for the following reasons:
• WordPress provides a framework which is easy to use and maintain even for people with limited technical expertise, but also offers a vast number of possible extensions for those having (access to) programming skills. Compared to Drupal, the learning curve is less steep.
• One of the huge advantages of WordPress is that, owing to its widespread use, a large community participates in its further development and documentation. Therefore, more or less ready-made solutions already exist for numerous problems; when they do not, one can easily develop one's own plug-ins with the help of the good documentation.
• WordPress is PHP-based and uses a MySQL database, both standard technologies.
• WordPress can be regarded as sustainable open source software (Stürmer 2015).
• WordPress allows for a multilingual interface.
• WordPress had already been used in the Bibliotheca legum project, 10 a database of Carolingian secular law texts, and therefore one could build upon experiences. In contrast with Capitularia, the Bibliotheca legum relies solely on the use of already existing plug-ins without any specific in-house developments.
3. About the Project
The Capitularia project is concerned with the hybrid edition of decrees by Frankish rulers. These legal texts are an important source for various aspects of early medieval European history.
Capitularies originated as individual texts from deliberations and assemblies at court, but hardly any original has survived. Mostly they were transmitted in sundry collections compiled by attendants of these assemblies, or based on copies sent to bishops or other office holders, which created a vast variety of different versions of the texts. 11 What most capitularies have in common is that they appear as a list of chapters, with different capitularies often amalgamated with one another. This outward appearance also explains why they are commonly called "capitularies." Most often capitularies mention neither date, nor place, nor the "issuer." Some appear to have been official documents; others might have been private notes, drafts, or extracts. They were rearranged, modified, or extracted by the compilers, sometimes with individual titles, vague titles, or no titles at all. This wide spectrum makes it hard to judge the status of a particular text. The texts also differ significantly in length and number of extant witnesses, ranging from unique up to more than thirty. All in all, there are about three hundred texts in more than three hundred extant manuscripts. The characteristics of the source material raise particular issues that affect the TEI

Logistics
Capitularia is funded by the North Rhine-Westphalian Academy of Sciences, Humanities and the Arts, and is being prepared in close collaboration with the Cologne Center for eHumanities (CCeH), 13 the MGH, and other partners. 14 The digital edition is overseen by a team based in Cologne.
The print edition is being prepared jointly by a group of editors. Since the collaborators are scattered, a central platform for internal communication as well as for the distribution of resources among staff is essential to facilitate successful cooperation. Hence, WordPress not only provides the web publication and thus the outward presentation of the project to the public, but also serves as a means of exchange and a collaborative editorial workbench.

There is an internal workspace in WordPress for the project staff and the editorial team. It allows the participants to access data, resources, manuals, and tools. Recently, GitHub has been introduced for the overall administration of the project's technical developments. 15 To ensure the long-term availability of the resource, the Capitularia project relies on a combination of suitable technical infrastructure and strong institutional ties. The server space is provided by the University of Cologne's Regional Computing Centre (RRZK), with which both the CCeH and the Cologne Data Center for the Humanities (DCH), 16 an institute specifically dedicated to the sustainability of humanities data, maintain close contact. In addition, Capitularia also participates in the web archiving program of the Bavarian State Library (BSB). 17

Workflows
The workow for creating Capitularia web content is as follows (gure 2): as has been mentioned before, the main source for the manuscript pages as well as most index pages (such as lists of manuscripts or capitularies) is Mordek's Bibliotheca capitularium (1995). He provided descriptions of all witnesses bearing capitularies, but died in 2006 before he could provide a new edition of the material. His book was digitized by means of optical character recognition (OCR) and marked up with XML. 18 Further markup was then added to this corpus le to enable the automated creation of TEI-compliant manuscript descriptions that are stored in <msDesc> elements inside the <teiHeader>.

The diplomatic transcriptions of the individual capitularies are mostly based on digital facsimiles. Whenever possible, the originals are also consulted. Preceded by an editorial preface by the Capitularia staff members, the encoded transcriptions form the <body> of the file. The TEI-compliant encoding is carried out in the oXygen XML Editor, 19 which is connected to the server by means of Web-based Distributed Authoring and Versioning (WebDAV). 20 This ensures that all employees have access to the latest versions at all times. GitHub is used for managing the files. In addition to that, older versions are manually saved in a special archive folder.
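The division of labour between header and body described above can be illustrated with a minimal sketch. It is not taken from the Capitularia code base; it only assumes the standard TEI layout, in which <msDesc> sits in the <sourceDesc> of the <teiHeader>, and the function name, the placeholder strings, and the two div types ("preface", "transcription") are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def build_skeleton(title, shelfmark, preface, transcription):
    """Build a minimal TEI file: manuscript description in the header,
    editorial preface and transcription in the body (hypothetical layout)."""
    root = ET.Element(tei("TEI"))

    # Header: the manuscript description lives in fileDesc/sourceDesc/msDesc.
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("p")).text = "Placeholder publication statement"
    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    ms_desc = ET.SubElement(source_desc, tei("msDesc"))
    ms_identifier = ET.SubElement(ms_desc, tei("msIdentifier"))
    ET.SubElement(ms_identifier, tei("idno")).text = shelfmark

    # Body: preface first, then the encoded transcription.
    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))
    preface_div = ET.SubElement(body, tei("div"), {"type": "preface"})
    ET.SubElement(preface_div, tei("p")).text = preface
    transcription_div = ET.SubElement(body, tei("div"), {"type": "transcription"})
    ET.SubElement(transcription_div, tei("p")).text = transcription
    return root


if __name__ == "__main__":
    skeleton = build_skeleton(
        "Example manuscript page",
        "Hypothetical shelfmark 123",
        "Editorial preface goes here.",
        "Transcribed text goes here.",
    )
    print(ET.tostring(skeleton, encoding="unicode"))
```

The project's actual files are considerably richer and are constrained by its own schema; the sketch only shows where the two kinds of content are located.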

The transcriptions are checked against a very strict schema (a project-specific customization in RELAX NG) supplemented by Schematron, 21 and checked manually by the staff members. One person is responsible for transcribing and encoding a particular capitulary of a manuscript. The transcription is then reviewed twice by other staff members before the original transcriber incorporates their annotations or corrections to finalize the transcription. Before publishing the manuscript page on the web, the HTML version of the file is proofread once more.
The core idea of the Cap-X2WP plug-in is to cache the result of a transformation in its WordPress page, and only retransform if either the XML or the XSL file has changed. Each transformation's result is stored in a <div class="xsl-output"> element during this process, which is replaced when a new transformation is triggered. Writing the generated content simultaneously into the WordPress page itself has the additional advantage that the regular full-text search already included in the WordPress core can also be used in this context. General settings can be configured easily via the options interface included in the WordPress dashboard (figure 5).
The Cap-PaGer plug-in is used to generate WordPress pages automatically based on the XML files located in specific folders on the server. Their publication status can be administered via a special interface added to the general WordPress dashboard with a single click. Cap-PaGer was developed to enable the automated generation of the numerous manuscript pages [...] correspond to the actual directory structure on the server. This means that all files located within the mss (manuscripts) directory on the server will be displayed as a list within the manuscript section. A schema can also be adjoined, as well as one or more XSL files used for the transformation.
For example, there are three different transformations associated with the section "manuscripts": first, a transformation to display the comprehensive manuscript description taken from Mordek (1995) as mentioned above; second, a transformation for the main transcription; and finally, a transformation for a footer that attaches some additional notes such as how to cite this particular page, a hyperlink to the XML source file available for download, 26 and the revision history. This modular approach was deliberately chosen to reduce the complexity of the single XSL files and thus to facilitate their maintenance. Despite the complex processing pipeline working in the background, the interface enables the project staff to connect the different parts and determine what will be displayed on the page in an easy and clear way.
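The caching rule described for Cap-X2WP (retransform only when the XML or the XSL file has changed) can be sketched as follows. This is an illustration under stated assumptions, not the plug-in's code: the plug-in itself is written for WordPress, change detection is done here with content hashes although the real implementation may rely on other means such as modification times, and TransformCache, fingerprint, and the transform callback are hypothetical names. The callback stands in for an actual XSLT processor.

```python
import hashlib
from pathlib import Path


def fingerprint(*paths):
    """Hash the contents of all given files into one hex digest."""
    digest = hashlib.sha256()
    for path in paths:
        digest.update(Path(path).read_bytes())
        digest.update(b"\0")  # separator so file boundaries affect the hash
    return digest.hexdigest()


class TransformCache:
    """Keep the last transformation result per page and reuse it
    as long as neither the XML nor the XSL file has changed."""

    def __init__(self, transform):
        self._transform = transform  # callable(xml_path, xsl_path) -> HTML string
        self._entries = {}           # page id -> (fingerprint, wrapped HTML)

    def render(self, page_id, xml_path, xsl_path):
        key = fingerprint(xml_path, xsl_path)
        cached = self._entries.get(page_id)
        if cached is not None and cached[0] == key:
            return cached[1]  # inputs unchanged: reuse the stored output
        html = self._transform(xml_path, xsl_path)
        wrapped = f'<div class="xsl-output">{html}</div>'
        self._entries[page_id] = (key, wrapped)  # replaces any older result
        return wrapped
```

With several stylesheets per section, as in the modular setup above, one such cache entry per stylesheet and page would be a natural extension, so that changing the footer stylesheet does not force the transcription to be transformed again.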
sta members have a synoptic view of all pages belonging to a particular category as well as their publication states. They can easily select publishing, publishing privately, or unpublishing, as well as further options (such as the extraction of metadata), enhancing functionalities originally implemented in the WordPress core ("private" vs. "published"). These synopses enable collaboration among sta as dierent researchers can work on many manuscripts at the same time without getting lost in the process.

Cap-Coll
In order to facilitate the editorial tasks involved with the numerous textual witnesses, collation is supported by alignment tables. This functionality is based on CollateX 27 with the algorithms included in the Capitularia Collation Tool. Before the actual collation takes place, the XML files containing the TEI-compliant transcriptions are preprocessed and normalized by some simple XSL transformations to eliminate surplus information that would otherwise complicate the collation process. Each manuscript can be included in or excluded from the collation with a single click. Dragging and dropping changes its position within the default (alphabetical) order. Various settings can be chosen to customize the automated collation, such as the alignment algorithm applied (Dekker, Needleman-Wunsch, or MEDITE 28 ), the Levenshtein distance 29 score, as well as other options. The configuration can then be saved to replicate the run (figure 8). Some of the options are only available to staff members, while for public display and usage, only those settings that have proven to lead to best results are available. Based on the collation output, the editors investigate the filiation and reconstruct, annotate, and translate the single capitularies for the print edition.
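To make the interplay of normalization, alignment algorithm, and Levenshtein distance more tangible, the following self-contained sketch aligns two witnesses token by token. It does not use CollateX and is not the Capitularia Collation Tool: the regular-expression normalization merely stands in for the project's XSL preprocessing, the scoring values are arbitrary, and the function names and sample readings are invented for illustration.

```python
import re


def normalize(text):
    """Drop markup and punctuation, lowercase, and split into word tokens."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"\w+", without_tags.lower())


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def align(left, right, max_distance=1, gap=-1):
    """Needleman-Wunsch alignment of two token lists; tokens count as a
    match when their Levenshtein distance is at most max_distance."""
    def score(a, b):
        return 1 if levenshtein(a, b) <= max_distance else -1

    rows, cols = len(left) + 1, len(right) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(table[i - 1][j - 1] + score(left[i - 1], right[j - 1]),
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover the alignment.
    pairs = []
    i, j = len(left), len(right)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + score(left[i - 1], right[j - 1]):
            pairs.append((left[i - 1], right[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            pairs.append((left[i - 1], None))
            i -= 1
        else:
            pairs.append((None, right[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    witness_a = normalize("<p>Ut <hi>episcopi</hi> missos recipiant.</p>")
    witness_b = normalize("ut episcobi nostros missos recipiant")
    for token_a, token_b in align(witness_a, witness_b):
        print(f"{token_a or '-':<12}{token_b or '-'}")
```

Each returned pair corresponds to one column of an alignment table, with None marking a gap in one witness. Raising max_distance treats more spelling variants as the same reading, which is the kind of trade-off the configurable distance score is meant to control.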

Conclusion and Further Prospects
In the light of the experience gained so far, WordPress has proven an effective and easy-to-maintain framework for Capitularia. Its simple extensibility and adaptability are especially strong arguments for the adoption of WordPress in digital humanities projects working with TEI files.
One of the main problems of using existing tools was that, for the most part, they were developed to meet the needs of a specific project, and so adaptations to other material are difficult, time-consuming, and resource-intensive. Often infrastructure maintenance is limited by the project's funding. That is at least not the case for the Cap-X2WP and Cap-PaGer plug-ins, since their functionalities are so general that they are of use even beyond the domain of digital humanities, and thus universally applicable to WordPress websites. In its current state, the code is optimized for usage within the Capitularia project, but it can easily be adjusted to others' particular needs and is accessible via GitHub. Making the code available in public repositories allows others to build upon previous work.
The Cap-Coll plug-in has a specific field of application, but still collation is an essential task in