Transforming Backward

The standard workflow for preparing digital editions for display involves writing XSL to transform handcrafted TEI into either 1) HTML for the web or 2) XSL-FO for conversion into a print-friendly format such as PDF. With either method we implicitly recognize that TEI, even coupled with CSS, is not designed as a presentation technology. Many born-digital documents, however, are encoded in formats that are designed for presentation, such as HTML. Hypothetical future editions of such documents would most likely need to be supplemented by a document description that goes beyond the facilities of HTML to meet the needs of editors. Thus we foresee cases where born-HTML documents could be supplemented and described by TEI in much the same way as TEI currently supplements and describes manuscripts and printed books. In this paper we investigate ways that XHTML documents both with and without RDFa can be “transformed backward” into TEI. In addition to the digital edition use case, we also investigate a process for converting HTML content to TEI-based language corpora.


Introduction
There are many benefits to using XSL Transformations (Clark 1999; Kay 2007) in digital document creation and management. One such benefit is that, for transformations for which both the source and target documents are well-formed XML (including XML application profiles such as TEI, DocBook, MathML, XHTML, KML, etc.),1 the transformation has the potential to be bidirectional. As long as one can determine the proper relationships of elements or groups of elements between the source and target dialects, one can craft rules to match them. The possibility for bidirectional conversion provides affordances in preserving and editing born-digital documents and in linguistic analysis.2

The standard workflow for preparing digital editions for display is unidirectional, involving writing XSLT to transform TEI into either or both of the following:
• HTML and CSS
• XSL-FO (for conversion into a print-friendly format, such as PDF).
With either method we implicitly recognize that TEI, even coupled with CSS, is not, for the most part, designed for direct presentation. In one sense, TEI documents that are transcriptions of non-digital source documents are themselves end products in terms of the transcription process. However, in terms of the contemporary electronic edition, TEI XML is just one of many steps along the path toward generating the final product.3 That is, TEI must be transformed "forward" to another format.4 After this, other support structures, such as application code, must be built around it.

Capitalizing on the bidirectional possibilities for markup transformation, the authors experimented with converting HTML documents, both with and without embedded RDFa,5 to TEI. We call this "transforming backward," in that converting HTML to TEI constitutes the opposite of our community's typical XSLT workflow. While conversion from XSL-FO to TEI would also be transforming backward, we chose to focus on HTML source documents because we believe that conversion from HTML to TEI (whether automatically or manually) is an increasingly important use case.

Many born-digital documents are encoded in formats that are, by design, meant for display, such as HTML.6 Hypothetical future editions of such documents would most likely need to include a richer description of the document than that provided by a display-oriented format alone. That is, HTML lacks the rich vocabulary that TEI provides and thus is not ideal for document description of the sort that editors require. Thus we foresee cases where born-HTML documents could be supplemented and described by TEI in much the same way as TEI currently supplements and describes manuscripts and printed books. This use case is essentially one approach to the process of transcribing born-digital documents originally created for the web.

Transformation of HTML to another vocabulary would also benefit those who use TEI as a means for recording linguistic data. The accessibility and massive quantity of Web pages make them a rich source of data on contemporary language; these could be transcoded by means of XSLT for the purpose of building corpora to be mined or otherwise interpreted. For this reason, we also experiment with converting from HTML to TEI through an intermediate program to tag linguistic aspects of the text.

Methods and Results
We prepared three proof-of-concept transformations: one representing a data-centric HTML+RDFa document, one representing a more narrative HTML document, and one representing an HTML document to be incorporated into a linguistic corpus. We wrote and processed the transformations for all three examples using version 12 of the oXygen XML Editor (SyncRO Soft 2010), with Saxon-EE as the processor (Kay 2011). We used XSLT 2.0 due to its built-in tokenize() function.
We crafted an XSL transformation that included the following template designed to match the HTML and RDFa in the example:

This snippet presents challenges because it contains both prose and verse. At first it looks easy to transform. After all, most of the novel is merely prose, and prose is rather easily transformed from HTML to TEI because of the equivalence between HTML <p> and TEI <p>. The XSLT would merely need to contain the following:10

    <xsl:template match="//html:p">
      <p>
        <xsl:value-of select="normalize-space(.)"/>
      </p>
    </xsl:template>

The verse sections are a bit more difficult. Since HTML does not contain any elements that indicate verse content, they are wrapped in HTML <pre> tags, indicating to the browser only that the text should be presented with whitespace intact. It is up to the human reader to understand this is verse because of the way it is presented. Secondly, as redacted by the creator of the PG version, the <pre> element above contains more than just verse. A metatextual "chorus" comment marks off the chorus of the silly little song from its verses, and there is an aside that is also not part of the song, "(In which the cook and the baby both joined):", but that appears within the <pre> block. Thus, one could say that all verse in this particular version of the text appears in <pre> tags, but not everything that appears in <pre> is verse.11 This makes it difficult to use an XSLT statement to match HTML <pre> to TEI <lg>. The contents of <pre> in the HTML PG Alice are encoded as <pre> only so that the browser will render it with non-breaking spaces intact so that the reader will understand it as not prose. Were this preformatted text represented only as such in TEI, we would argue that the creator of the TEI document had not gone far enough in marking up the verse content of the novel, i.e. had not made use of basic elements such as <l>. Thus, encoding the verse content of the <pre> elements for PG's Alice requires either human coding of TEI or a script that would tokenize the lines of poetry (using the newline marker as the delimiter) and wrap the lines in <l>.
In order to account for the majority of <pre> sections within this document, which contain verse, we crafted the following template:12

    <xsl:template match="//html:pre">
      <!-- Split the preformatted text into lines at each newline. -->
      <xsl:variable name="lines" select="tokenize(., '\n')"/>
      <xsl:for-each select="$lines">
        <!-- Skip blank lines; wrap every other line in a TEI <l>. -->
        <xsl:if test="normalize-space(.) != ''">
          <l><xsl:value-of select="normalize-space(.)"/></l>
        </xsl:if>
      </xsl:for-each>
    </xsl:template>

This template matches the HTML <pre> element in the document. It constructs a sequence, called lines, that contains all the lines of the content within the <pre>. Then, for each of the non-empty lines, it constructs a TEI <l> containing the content of that line (with space normalized). This is not intended to handle either the aside or non-verse uses of <pre> in the document but rather to provide an acceptable first-pass solution at encoding the verse sections as TEI.
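For readers who want to run the prose and verse templates together, the following is a minimal sketch of a complete XSLT 2.0 stylesheet. It rests on assumptions that are not details given in the text: that the source has been converted to XHTML so that its elements are in the XHTML namespace (bound here to the prefix html), and that the output belongs in the TEI namespace. The enclosing <body> and <lg> wrappers are illustrative additions, not part of the templates described above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:html="http://www.w3.org/1999/xhtml"
    xmlns="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="html">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: only paragraphs and preformatted blocks are of interest.
       They are processed in document order and wrapped in a TEI <body>. -->
  <xsl:template match="/">
    <body>
      <xsl:apply-templates select="//html:p | //html:pre"/>
    </body>
  </xsl:template>

  <!-- Prose: XHTML <p> maps directly to TEI <p>. -->
  <xsl:template match="html:p">
    <p><xsl:value-of select="normalize-space(.)"/></p>
  </xsl:template>

  <!-- Verse, first pass: every non-empty line of a <pre> becomes a TEI <l>.
       The <lg> wrapper is an illustrative addition. Non-verse uses of <pre>
       are not detected and will also be emitted as line groups. -->
  <xsl:template match="html:pre">
    <lg>
      <xsl:for-each select="tokenize(., '\n')[normalize-space(.) != '']">
        <l><xsl:value-of select="normalize-space(.)"/></l>
      </xsl:for-each>
    </lg>
  </xsl:template>

</xsl:stylesheet>
```

Because the sketch cannot tell verse from other preformatted content, its output would still need review by a human editor.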
The transformation results in a TEI document whose body consists of the following:

TEI is particularly useful as a tool for linguistic analysis when the linguistic segment category elements,14 which allow researchers to build corpora to suit their needs, are used. We chose to focus here on part-of-speech (POS) tagging, a form of syntactic analysis, in order to provide a proof of concept.15

We produced TEI from the Collins snippet by means of a three-step process:

1. Running it through a POS tagger that was instructed to produce XML output

We instructed the tagger to accept XML input and produce XML output. The Collins fragment we chose was well-formed XML, so we did not have to run HTML Tidy (see note 2) on it before handing it over to the tagger. The model used was left3words-wsj-0-18, based on the Wall Street Journal; this model has a 96.97% accuracy rate ("Stanford POS tagger FAQ", n.d.).
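The last of the three steps, transforming the tagger's XML output into TEI by means of XSLT, might look something like the following sketch. The input element and attribute names (sentence, word, and @pos) are assumptions about the shape of the tagger's output and should be checked against real output before use. Recording the POS tag in @type on <w> is likewise only one possible encoding choice.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0">

  <xsl:output method="xml" indent="no"/>

  <!-- Assumption: tagged sentences may sit anywhere in the input and are
       in no namespace. All of them are gathered into a single TEI <p>. -->
  <xsl:template match="/">
    <body>
      <p>
        <xsl:apply-templates select="//sentence"/>
      </p>
    </body>
  </xsl:template>

  <!-- Each tagged sentence (assumed element name) becomes a TEI <s>. -->
  <xsl:template match="sentence">
    <s>
      <xsl:apply-templates select="word"/>
    </s>
  </xsl:template>

  <!-- Each tagged token (assumed element name) becomes a TEI <w>; the
       assumed @pos attribute is copied to @type. A single space separates
       tokens so that the text remains readable. -->
  <xsl:template match="word">
    <w type="{@pos}">
      <xsl:value-of select="normalize-space(.)"/>
    </w>
    <xsl:if test="position() != last()">
      <xsl:text> </xsl:text>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

The sketch treats punctuation tokens like any other token; a fuller stylesheet might map them to a separate TEI element.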

Discussion
With the example transformations presented herein, we have demonstrated that transforming backward is possible. Under the right circumstances, this technique could prove fruitful for a project that needs to make use of existing HTML content. We anticipate that projects working with a large number of source documents might find that the benefits of automated or semi-automated transformation outweigh the costs of writing what may be quite project-specific XSLT. In addition, a tool such as OxGarage, which uses TEI as a "pivot format" (Oxford 2010) for conversions between multiple formats, could benefit from adding rules for matching RDFa (which can be hosted in multiple languages) in order to produce more nuanced translations.

We would like to stress that a backward transformation will, at times, need to be guided by and supplemented with human intelligence. Furthermore, transforming from HTML to TEI necessitates accounting for correspondences between HTML and TEI elements and attributes. The Alice example demonstrates that HTML authors' element choices can complicate the process: the element <pre> in Alice maintains the lineation of the poetry it contains but is unable to mark the content as poetry. Preformatted text is just one way that content authors get around an impoverished markup scheme, so such hacks and other tag abuse can confer meaning to readers. It would be difficult to account for a significant number of these decisions and impossible to account for all of them.
In addition to accounting for element and attribute correspondence at the level of the markup language, transforming HTML+RDFa to TEI also requires accounting for:
1. All of the RDFa attributes, such as @typeof, @property, @rel, etc.
2. Correspondences between the vocabularies used (FOAF, Dublin Core, etc.) and TEI elements and attributes
3. Differences in the way relationships can be expressed by means of different combinations of HTML elements, RDFa attributes, and vocabulary choices

In addition to more experimentation with a generalized HTML-to-TEI stylesheet, we envision two main areas of further research in this area: developing RDF-aware transformations and expanding work to include HTML Microdata.
Whereas software that understands RDF understands the relationships among elements in an ontology (for example, the relationship between foaf:Person and foaf:name), XSLT does not and is not meant to. For example, the @select attribute of <xsl:value-of select="p/a[@rel='foaf:mbox']"/> does not instruct the XSLT processor to find all the FOAF mbox items in the document. Rather, it instructs the processor to find each a element that is a child of a p element and that has an @rel attribute, the value of which is the string "foaf:mbox". One could simplify the XPath statement to match merely any element that has @rel='foaf:mbox', but this would not handle any differences in output that the stylesheet designer meant to produce based on the identity of the HTML element.19 Furthermore, searching for an RDF predicate by means of its appearance as the value of a particular HTML+RDFa attribute, such as @rel, would handle cases where that predicate appeared in the same attribute for multiple HTML elements, but it would not handle instances of that predicate being expressed in a different attribute. We would like to see transformations that were in some sense aware of the RDF data model or at least able to make inferences about which attributes contain the same predicate. Such an "RDF-aware" transformation would then be able to employ common patterns based on the identity of the predicate.
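As a partial step in that direction, the following sketch looks for one predicate in any of the three attributes that can carry a predicate (@rel, @rev, and @property), on any element. It is an illustration under stated assumptions rather than an RDF-aware transformation: the parameter name predicate, the TEI <list> and <item> output, and the choice to report @href or @content before element text are all hypothetical, and the comparison is still plain string matching on the prefixed name.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical parameter: the prefixed name to look for. This assumes
       that the source document binds the prefix "foaf" to the FOAF
       namespace; a document using another prefix would not match. -->
  <xsl:param name="predicate" select="'foaf:mbox'"/>

  <xsl:template match="/">
    <list>
      <!-- Match any element on which one of the three predicate-bearing
           attributes contains the predicate. Each attribute may hold
           several whitespace-separated values, so each is tokenized. -->
      <xsl:for-each select="//*[(@rel, @rev, @property)/tokenize(normalize-space(.), ' ') = $predicate]">
        <item>
          <!-- Report a linked resource or literal value if present,
               otherwise the text of the element. -->
          <xsl:value-of select="(@href, @content, normalize-space(.))[1]"/>
        </item>
      </xsl:for-each>
    </list>
  </xsl:template>

</xsl:stylesheet>
```

The sketch ignores the fact that @rev reverses the direction of a relationship, and it makes no attempt to resolve prefixes to namespace URIs, which is where a transformation aware of the RDF data model would differ.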
On a less esoteric level, we would like to see development of conversions between HTML Microdata and TEI. Microdata (Hickson 2011) currently competes with RDFa as a method for making assertions about HTML content. At the time of this writing, Microdata is not yet a W3C recommendation. Despite this, the format received a major push when the three top search engines (Google, Bing, and Yahoo!) announced schema.org, a "one stop resource for webmasters looking to add markup to their pages" (Guha 2011), which promotes use of Microdata with the schema.org vocabulary to express information about page content (schema.org 2011).20 It seems likely that Microdata usage, especially Microdata that uses the schema.org vocabulary, will increase. Converting HTML containing Microdata to TEI would be possible using the same techniques discussed above for converting HTML+RDFa to TEI. In fact, the same rules for matching TEI elements and attributes to a known vocabulary could be used because Microdata can express FOAF and other common vocabularies. If the schema.org vocabulary does gain widespread use, it would be worthwhile to devise a crosswalk between it and TEI. While it is debatable whether use of a single schema is a good thing for the web's development, it would drastically simplify the task of converting backward from HTML infused with Microdata to TEI.
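To suggest what one entry in such a crosswalk could look like, the following sketch maps Microdata items typed as schema.org Person to TEI <person> elements. The mapping itself (the name property to <persName>, collected in a <listPerson>) is a hypothetical choice made for illustration, and the sketch assumes that the HTML has been serialized as well-formed XML.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- Collect every Microdata item whose type is schema.org Person. -->
  <xsl:template match="/">
    <listPerson>
      <xsl:apply-templates
          select="//*[@itemscope][@itemtype = 'http://schema.org/Person']"/>
    </listPerson>
  </xsl:template>

  <!-- Hypothetical crosswalk rule: a Person item becomes a TEI <person>,
       and each "name" property inside it becomes a <persName>. -->
  <xsl:template match="*[@itemscope][@itemtype = 'http://schema.org/Person']">
    <person>
      <xsl:for-each select=".//*[@itemprop = 'name']">
        <persName><xsl:value-of select="normalize-space(.)"/></persName>
      </xsl:for-each>
    </person>
  </xsl:template>

</xsl:stylesheet>
```

The descendant search would also pick up name properties that belong to nested items, so a production stylesheet would need to respect item boundaries.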
from a forward transformation. Furthermore, "forward," in this context, is not a synonym for "lossless."

5. RDFa (the "a" stands for "attributes") is "a collection of attributes to express structured data in any markup language" via RDF (Adida et al. 2008). It allows document authors to make RDF (Lassila and Swick 1999; Klyne and Carroll 2004; Hayes 2004) assertions about markup content directly in the host markup language (predominantly XHTML). An alternative to RDFa, for HTML5 only, is HTML Microdata (Hickson 2011). For a set of single-purpose HTML data formats, see Microformats (2011). For a primer to RDF, see Manola and Miller (2004).

6. This is not to say that all visual aspects of a web page should be effected in HTML; many aspects of presentation are the proper place of CSS.

7. It is tempting to say that it checks for the presence of one or more HTML divs with the RDF predicate of foaf:Person, but this is untrue for a couple of reasons. First, as discussed later, to search for all the ways one can express an RDF predicate in RDFa, one would have to look for @rel, @rev, and @property. More importantly, @typeof always has a predicate of rdf:type (Adida et al. 2008, sec. 6.1.1.4). Setting @typeof to foaf:Person does not specify a triple with a predicate of foaf:Person. Rather, it specifies that a blank node should be created with the predicate rdf:type and the object foaf:Person.

8. The TEI results, for the sake of brevity, are only fragments. We expect that projects making use of transforming backward would likely not generate TEI headers from the HTML source.

9. We chose PG's Alice because it is a readily available, permissively licensed HTML version of a narratively interesting text. We should note that this document's DOCTYPE is HTML 4.01 Transitional, so it is not encoded in XML. However, the excerpted portion, provided it were wrapped in a root element, should produce no problem for an XSLT processor.

10. The normalize-space() function is needed here because Widger's Alice contains hard returns at the end of each line.

11. Widger repeats the strategy of using <pre> for verse throughout Alice, but he uses <pre> for other purposes as well. These include the PG boilerplate that appears at the beginning of each of their books and decorative indicators of scene breaks within chapters, i.e.:

12. This template contains tokenization code adapted from a script suggested by an anonymous reviewer. We would like to thank him or her for this contribution.

2. Replacing the character entity references produced by the tagger in step one with their proper characters
3. Transforming the results of step two to TEI by means of XSLT

To produce a tagged document, we invoked the Stanford Log-Linear Part-of-Speech Tagger, version 3.0.3 (Toutanova et al. 2011).