Journal of the Text Encoding Initiative, Issue 4 | March 2013

Logging the Abbot: Reflection-Oriented XSLT Programming for Corpora Conversio...

This article describes an XSLT-based logging framework developed for Abbot, a markup conversion and interoperability tool. Abbot logs structural and textual divergence from an XML source. Logging is a useful component of verification when there are too many text alterations, or too many texts, to verify closely and individually. Abbot’s conversion and logging […]


Curation and Conversion
The growing number of electronic texts available to scholars affords us the opportunity to think about combining heretofore separate collections for analytical purposes. Distinct XML collections sometimes require conversion into a common format such as that developed by the Text Encoding Initiative (TEI) Consortium. TEI is, for many purposes, a satisfactory format for text corpus aggregation, though not always without attendant difficulties. As John Unsworth writes, "The 'I' in TEI sometimes stands for interchange, but it never stands for interoperability…. (I)f there's a single interoperable format … it has to be a common or baseline representation that is technically valid and intellectually acceptable in multiple systems" (Unsworth 2011). While a precise definition of such a format may still be evolving, it is clear that an interoperable text markup format should probably, on the most fundamental level, permit, require, and exclude features. TEI is, in the abstract, able to accommodate each of these conditions, and it therefore represents considerable progress toward interoperability. Yet because TEI permits local customizations it is no longer a representation that is fully shared. To arrive at a condition of interoperability, a reliable and verifiable conversion process is crucial. While it can be relatively easy to verify that no words are inadvertently lost or rendered out of sequence for a small text collection, it becomes progressively more difficult with more texts. Curation and verification routines that rely on individual human scrutiny will not operate at a large scale or in a reasonable amount of time.

In early 2007, the MONK (Metadata Offer New Knowledge) Project began to develop a procedure for batch-converting varying collections of XML-encoded texts into a specialized application of TEI P5 that we called TEI-Analytics (TEI-A). That effort produced a command-line application, Abbot, which works by analyzing the XML schema that describes the document structure to which the target collection should be converted. Developed by the author, Stephen Ramsay, and Martin Mueller, Abbot uses that analysis (an enumeration of allowable elements and their associated attributes) to programmatically generate a very large XSLT stylesheet that is used for the conversion. Abbot, at last, makes it possible to eliminate customizations or other differences between markup systems, either for the short or long term.

While Abbot has already been described in detail elsewhere (see Pytlik Zillig 2009), it might be helpful to briefly explain how it moves from an analysis of the desired output schema to generating a stylesheet which does the conversion. Figure 1 illustrates the Abbot workflow. When the program is launched, a meta-stylesheet reads a schema file for the desired output and a configuration file that details what custom transformations, if any, are needed. The schema contains information about elements and their allowed attributes. For example, given an element <p> in the input, if an element with the same name is specified in the output schema, Abbot will retain the <p> tags and any attributes that are associated with the element in the schema. Abbot assumes that an input element resembles the desired output element, as is often true. But when this assumption isn't true and a user wants to rename elements or perform a complex or conditional mapping of an input element, a custom transformation must be specified in the configuration file as an XSLT template (see fig. 4 below). XML validation reports and Abbot transformation logs help identify the presence of elements and attributes that are not accounted for.
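The default behaviour described above (keep an element when the target schema names it, and keep only the attributes the schema allows for it) can be sketched in a few lines. The sketch is written in Python rather than in Abbot's generated XSLT, and it is not Abbot's code: the `ALLOWED` table, the `convert` function, and the decision to discard an unknown element together with its content while keeping the text that follows it are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for what is read from the target schema:
# element name -> attribute names the schema allows on that element.
ALLOWED = {
    "text": set(),
    "p": {"n", "rend"},
}


def _append_text(parent, text):
    """Attach loose text after whatever content parent already holds."""
    if not text:
        return
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def convert(node):
    """Return a converted copy of node, or None when the element is dropped."""
    if node.tag not in ALLOWED:
        # Not named in the target schema. In this sketch the element and its
        # content are discarded; a custom template would be needed to keep it.
        return None
    out = ET.Element(node.tag)
    for name, value in node.attrib.items():
        if name in ALLOWED[node.tag]:
            out.set(name, value)
    out.text = node.text
    for child in node:
        converted = convert(child)
        if converted is not None:
            out.append(converted)
        # Text following a child is kept even when the child itself is dropped.
        _append_text(out, child.tail)
    return out
```

Given `<text id="t1"><p n="1" style="x">Hello <break n="1" ref="a.tif"/>world</p></text>`, the sketch keeps `<text>` and `<p>`, keeps `@n`, and drops `@id`, `@style`, and the empty `<break/>` without losing the surrounding words.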

All subsequent steps of the Abbot pipeline involve an XSLT processor, which uses the conversion stylesheet to convert one or many input files; the output files are then validated. The conversion also produces logs of all processed elements and any associated changes. Valid files require only celebration. Invalid files require alteration of the configuration file, and the process is repeated. With Abbot, it is normal to process an input collection several times before all files are valid.

A text collection that is valid in structure may still benefit from some additional scrutiny to verify that no words were inadvertently lost or rendered out of sequence; that is, that the conversion was lossless. For MONK and its texts, this scrutiny was undertaken manually and on a selective basis by members of the project team, an approach that worked for a limited project such as this. In the case, though, of the more than 40,000 texts produced by the Text Creation Partnership (a collection more than ten times larger than the MONK corpus), the problem of simultaneously validating markup and verifying textual fidelity becomes clear, and new procedures are needed. While it can be relatively easy to verify losslessness for a small text collection, it becomes progressively more difficult with more texts.

Verification at Scale
Abbot is a meta-program, in the sense that it is "code that writes code." Abbot observes and modifies its own structure and behavior at runtime, and it performs self-adjustments and dynamically calculates the effects of transformations. Of course, calculation of the results of markup transformations can itself pose technical problems due to scale. Abbot's solution to the problem is inspired by distant reading, the sort of reading that one does when there are too many texts to read closely (Moretti 2005). Distant verification, if we may call it that, becomes necessary when there are too many text alterations, or too many texts, to verify closely and individually.

[…] logging every difference. For each XML node, a log entry is made that records any changes to the node, including the node name, the names of child nodes, the attribute names and values, the text nodes that are children of the current node, and counts of each of the above. Abbot stamps the date and time of every change. Moreover, it records the locations of all changes in the file.
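To make the shape of such a log entry concrete, here is a minimal sketch in Python (Abbot itself does this in XSLT). It compares an input node with its output node on the features listed above and records a timestamp and a location. The function names, the dictionary keys, and the use of a path string for the location are assumptions for illustration and are not taken from Abbot.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def describe(node):
    """Summarize the features of one element that the log compares."""
    pieces = [node.text] + [child.tail for child in node]
    return {
        "name": node.tag,
        "children": [child.tag for child in node],
        "attributes": dict(node.attrib),
        # Only text that belongs directly to this node, ignoring whitespace.
        "text": [t for t in pieces if t and t.strip()],
    }


def log_entry(location, before, after):
    """Build one log record comparing an input node with its output node."""
    old, new = describe(before), describe(after)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "location": location,  # hypothetical: a path to the node in the file
        "changed": [key for key in old if old[key] != new[key]],
        "before": old,
        "after": new,
        # Counts of children, attributes, and text nodes, before and after.
        "counts": {
            key: (len(old[key]), len(new[key]))
            for key in ("children", "attributes", "text")
        },
    }
```

Comparing `<text n="1"><p>Hi</p></text>` with `<floatingText n="1"><p>Hi</p></floatingText>` yields an entry whose `changed` list contains only `"name"`, since the children, attributes, and text are untouched.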
These alterations are made as part of the Abbot transformation pipeline and logged in a file that is produced in comma-separated values (CSV) format. While a command-line diff operation could potentially be used to perform the task of comparing XML files to their source texts, Abbot adds this comparison functionality as a first-class operation to the processing pipeline. It is now possible to test and quantify possible outcomes of various conversions. The CSV format makes it trivial to view the consequences of a single conversion, or of all conversions, in a spreadsheet program. Moreover, conversion from CSV to XML (if desired) is trivially easy.
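As an illustration of how little is involved in turning a CSV log into XML, the following Python sketch wraps each row in an `<entry>` element with one child per column. The column names in the usage example (`template`, `node`, `action`) and the `<log>`/`<entry>` element names are hypothetical; the actual columns of an Abbot log are not assumed here, and the sketch requires that column headers be usable as XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_log_to_xml(csv_text):
    """Convert CSV text with a header row into a <log> element tree."""
    root = ET.Element("log")
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = ET.SubElement(root, "entry")
        for column, value in row.items():
            # One child element per column, named after the column header.
            ET.SubElement(entry, column).text = value
    return root
```

For example, `csv_log_to_xml("template,node,action\nt1,break,deleted element\n")` returns a `<log>` element containing a single `<entry>` whose `<node>` child holds the text `break`.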
Every substantive change to the XML structure or to the text content is recorded. Abbot's measurement of nodal difference is not based on simple string comparison, which would report differences such as those between <foo n="1" id="a"/> and <foo id="a" n="1"/>. In this example, the order of the attributes is reversed, but the two nodes are otherwise the same and the change is non-substantive. While XML differencing applications exist, they are not sufficient for the present purpose because they are unable to refer to the specific code responsible for a given change. The same pipeline that alters the XML input nodes and writes the output nodes must be able, as Abbot now is, to log all differences.
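The difference between string comparison and node comparison can be shown with the <foo/> example above. The sketch below is in Python and is only an illustration of the idea, not Abbot's XSLT: it relies on the fact that a parsed element exposes its attributes as an unordered mapping, so attribute order drops out of the comparison. The `same_node` function is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def same_node(a, b):
    """Compare two elements by name, attributes, text, and child names."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib  # dict comparison ignores attribute order
        and (a.text or "") == (b.text or "")
        and [child.tag for child in a] == [child.tag for child in b]
    )


first = '<foo n="1" id="a"/>'
second = '<foo id="a" n="1"/>'

# The serialized strings differ, but the parsed nodes do not.
strings_equal = first == second
nodes_equal = same_node(ET.fromstring(first), ET.fromstring(second))
```

Here `strings_equal` is false while `nodes_equal` is true, which is the distinction between a non-substantive and a substantive change.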
In Abbot, templates are created at runtime based on input that is gathered at runtime. They vary depending on the source texts and on the desired output schema. The richness of the Abbot transformation logs may, because of their length, present problems of scale in their own right. For example, with an input file that contains 20,000 XML elements, Abbot makes a corresponding number of log entries. While the log files are eminently readable in theory, it would be helpful if, in future iterations, the software permitted the user to refine the results in some way. For example, users might want to suppress those entries that record non-substantive alterations.

Figure 2, a detail of a much larger log, illustrates eight nodes as processed and logged by Abbot. Here, eight discrete templates, identified in the left-most column, are responsible for the transformations shown to the right in the corresponding rows. These specific entries show: (1) renaming the root element, (2) adding a change element within <revisionDesc>, (3) deleting an attribute, (4) deleting an element, (5) examining a text node for any differences, (6) deleting <p> elements in certain conditions, (7) adding an @ident attribute to <language>, and (8) changing <text> to <floatingText>.
The following example shows a generic reflective XSLT template, somewhat simplified for brevity:

In a simple example, suppose that we are attempting to convert a text collection that contains many instances of the following unusual customization signifying a page break:

    <break n="1" ref="00000001.tif"/>

Because <break/> is unspecified in the output schema, and because it contains no text node, Abbot will remove this element. While this may sound a bit reckless, it is the job of the Abbot log to report the fact and consequence of this removal. The report that an element called <break/> has been removed signals to the user that it may be desirable to add a custom routine to the configuration file, such as in the following example:

    <transformation type="xslt" activate="yes">
      <desc>convert 'break' to 'pb' and its @ref attribute to @facs attribute</desc>
      <xsl:template match="break | BREAK" priority="1">
        <xsl:element name="pb">
          <xsl:for-each select="@*">
            […]

In this example, the <break/> tag has been converted to <pb/>, @n has been preserved, and @ref has been replaced with @facs. The log entry for the <break/> element, shown here in tabular form, confirms these facts:

Simply put, Abbot's aim is to remove markup differences when aggregation, temporary or not, is desired. Abbot makes it possible, on a large scale of thousands or tens of thousands of documents, to identify, quantify, and rectify such problems using the log that records every changed character in a document conversion effort.
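Returning to the <break/> example, the mapping it describes (rename the element to <pb/>, keep @n, rename @ref to @facs) and the changes a log would need to report can be sketched as follows. This is a Python illustration of the described outcome, not the XSLT template from the configuration file, and the `break_to_pb` function and the tuple format of the change list are assumptions.

```python
import xml.etree.ElementTree as ET


def break_to_pb(node):
    """Convert a <break/> element to <pb/> and list the changes made."""
    pb = ET.Element("pb")
    changes = [("element", node.tag, "pb")]
    for name, value in node.attrib.items():
        # @ref becomes @facs; every other attribute is carried over as-is.
        new_name = "facs" if name == "ref" else name
        if new_name != name:
            changes.append(("attribute", name, new_name))
        pb.set(new_name, value)
    pb.tail = node.tail  # keep any text that followed the page break
    return pb, changes
```

Applied to `<break n="1" ref="00000001.tif"/>`, the function returns a `<pb>` element carrying `n="1"` and `facs="00000001.tif"`, together with two recorded changes: the element rename and the attribute rename.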
Generic XSLT templates such as those described here could be used as a basis for a logging library intended to account for changes in XML documents. With support from the […], Abbot will soon gain an application programming interface and a graphical user interface. The former will help Abbot to work with other tools in complex pipelines, and the latter will improve general usability.

It is a goal of the Abbot project to help keep the "I" in TEI. Anna Gold asserts that a "great challenge of data curation is ensuring that data, once preserved, remains meaningful either within the same research area or ideally across areas or even across domains" (2010). The change-logging extension of Abbot, by making the integrity of texts verifiable across transformations, removes an important obstacle to keeping curated data meaningful. When the happy day arrives, perhaps soon, that we have at our disposal the "million(s of) books" that Gregory Crane (2006) writes about, we will curate them with precision, care, caution, and a complete accounting of alterations.