Texts and Documents: New Challenges for TEI Interchange and Lessons from the Shelley-Godwin Archive

Journal of the Text Encoding Initiative, Issue 8, 23/09/2015. Selected Papers from the 2013 TEI Conference.

The introduction in 2011 of additional “document-focused” (as opposed to “text-focused”) elements represents a significant additional commitment to modeling two distinct ontologies for textual data within the standard governed by the Text Encoding Initiative (TEI) Guidelines. A brief review of projects using the new elements suggests that scholars generally treat the “document-focused” and “text-focused” models as distinct and even severable, the tools of separate interpretive communities within literary studies. This paper will describe challenges encountered by members of the development and editorial teams of the Shelley-Godwin Archive (S-GA) in attempting to produce TEI-encoded data (as well as an accompanying reading environment) that supports both document-focused and text-focused approaches through automated conversion. Based on the experience of the S-GA teams, the increase in expressiveness achieved through the addition of document-focused elements to the TEI standard also raises the stakes for “interchange” between and among data modeled according to these parallel approaches.

ability to overlay images of the original printed editions, compare images side by side, search fulltext, and tag text with user annotations" [sic] (New York Public Library 2010). By the time work commenced in late 2011, the encoding of materials from the archive in TEI had been promoted to a more significant component of the project alongside digitization, while other potential features were deferred.

The decision to invest in TEI encoding of the S-GA materials along with a reading environment designed to take advantage of features of these textual data has allowed the project to remain current with developments in the digital humanities and scholarly editing not envisioned in the original proposal. Through work on the text encoding schema and reading environment, the S-GA project has refined its conception of how this type of scholarship can create new knowledge. Rather than a grab-bag of "Web 2.0" functionalities, engagement with text encoding, an interpretive method created by and specific to the digital humanities, forms the backbone of the S-GA project's work with these important materials.

Motivations for the S-GA Encoding Scheme
The start of text encoding work on the S-GA coincided with the addition of the new "document-focused" elements to the TEI in the release of P5 version 2.0.1. These additions were the product of recent efforts by a subgroup of the TEI Manuscript Special Interest Group and have resulted in a considerable expansion of "Representation of Primary Sources" (chapter 11 of the TEI Guidelines).
In the ontology that the new elements are intended to help express, digital text is encoded by describing the relationship of written traces to their physical carriers (TEI Consortium 2011). This encoding approach switches focus from text as communicative act or linguistic content to text as sign on some physical support; for the purposes of this paper this approach will be referred to as "document-focused" encoding. That is, encoders may formalize their understanding of how the text takes form on a surface; they will, for example, identify zones that group text topographically and the lines of text within such zones. Encoders may moreover track an author's actions on the page, identify textual revisions, movements, and deletions, and assert a temporal order for such actions. Describing how a text has been inscribed on a surface is an essential preoccupation of scholars who practice genetic textual criticism. The document-focused encoding strategy is closely but not exclusively identified with the interpretive goals of genetic criticism throughout this discussion. Divergences from a strict genetic editing approach will be described below.
The document-focused approach contrasts with an ontology of text in which encoders describe, through combinations of elements and attributes, what a certain portion of text "is." By surrounding characters with a <p> element, for example, an encoder formally asserts that those characters form a paragraph. This approach can quickly become more and more complex, depending on both the text and the encoder's research agenda. Herein this approach will be referred to as "text-focused" encoding (cf. Rehbein and Gabler 2013).
Given that the majority of materials in the Shelley-Godwin Archive consist of autograph manuscripts, the editorial team quickly adopted several of the new elements proposed by the manuscript working group such as <sourceDoc>, <zone>, and <line> (with their greatly restricted content models) into its TEI customization. This document-focused approach has served the project well. It yields an encoding scheme that targets features of greatest interest to the scholarly editors who make up the initial user community (the principal investigator and collaborators). Also, focus on documents permits rigorous description of often complicated sets of additions, deletions, and emendations. In the case of the manuscript notebooks of Frankenstein, the encoding scheme for S-GA identifies each page as a <surface> containing one or more <zone>s.
As the discussion above suggests, the S-GA encoding scheme borrows concepts and terminology from genetic editing where the motivations align; namely in the representation of the mise-en-page of writing on the manuscripts. However, the S-GA does not seek to produce genetic editions of the works represented in the archive. Digital genetic editions usually emphasize the temporal sequence of authorial revision and make a stronger distinction between what is on the page and what is interpreted from the page. For example, the Digitale Faustedition project (which directly contributed to the expansion of the chapter of the TEI Guidelines on representation of primary sources) made this distinction by using two different encoding models for the same documents. The models correspond to the concepts of "record" and "interpretation" (Befund and Deutung) first introduced in 1971 by Hans Zeller (cited in Brüning, Henzel, and Pravida 2013). The "record" (Befund) consists of information about a primary source, of which the editor can make a detailed diplomatic transcription as the record of what is "found" on the source. For example, some text that has been struck through may only be encoded as being struck through (i.e., not deleted). The "interpretation" (Deutung) records an editor's understanding of a writing act on the page; for example, some struck-through text is interpreted as a deletion. The Digitale Faustedition project sees this interpretation as belonging to a linguistic domain; therefore, it conflates the marking of "deletion" with a more traditional text-focused encoding that includes linguistic and literary structures such as paragraphs and verses.

The S-GA, not being a genetic edition, conflates in one encoding model the transcriptional and editorial work more closely related to the document (roughly corresponding to Befund and Deutung). This conflation is not an accidental failure to conform to one or another ontology of text. The task of developing an encoding scheme to match the goals of the S-GA project pushed the editorial team to consciously borrow tools from both interpretive communities. That is, rather than seeking to produce an established edition or publication, the S-GA project has chosen to embrace the ambiguous nature of this type of digital humanities work (Price 2009) in pursuit of the goal of constructing the S-GA as a work site wherein the encoded text, along with its tailor-made reading environment, "operationalizes" certain aspects of editorial theory. The encoding scheme of the S-GA, drawing on a document-focused approach, is a formal model for operationalizing literary knowledge about how texts are constructed from written documents (Moretti 2013).
Franco Moretti denes "operationalizing" as "the process whereby concepts are transformed into a series of operations … .Operationalizing means building a bridge from concepts to measurement, and then to the world" (103-4).In Moretti's case measurement involves quantication of features of literary texts, but measurement need not imply only quantication.In the case of S-GA, the encoding scheme is a formal model by which information about the mise-en-page of the various writing traces helps develop greater knowledge about the literary work Frankenstein.9 To achieve this goal, the editorial team needed to be able to produce two distinct representations of the S-GA materials so as to provide rigorous, semi-diplomatic transcriptions of the fragile manuscripts for those with an interest in the compositional practices of a signicant group of British Romantic authors, and also to make available clear "reading texts" for those who are primarily interested in the nal state of each manuscript.Thus, what was needed was not only the powerful new formalization of document-focused encoding but also a mechanism to enable movement back and forth between document-focused and text-focused models of the materials in the Archive.The development of the document-focused encoding scheme has been described above.The work of automating the production of usable "reading texts" encoded in text-focused TEI markup from data that is modeled according to a document-focused approach proved much more challenging.Conversion and interchange between these two models poses challenges for encoding practice and workow, for data provenance and maintainability, and for reading environment and presentation.

Encoding Workflow Challenges
The conflict between representing multiple hierarchies of content objects and the affordances of XML is well known, and the TEI Guidelines as well as the professional literature of the text encoding community discuss several possible solutions (TEI Consortium 2011; Renear, Mylonas, and Durand 1993; Roland 2003; Piez 2013). One of these solutions is to designate a primary hierarchy and to represent additional hierarchies with empty milestone elements that can be used by some processing software to construct an alternate representation of the textual object. The approach taken by the S-GA team to produce both document-focused and text-focused TEI data is a version of the milestone-based approach. The document-focused elements form the principal hierarchy while milestone elements are supplied to support automatic conversion to text-focused markup (which will contain elements such as <div>, <p>, <lg>, etc.).

This solution places increased burden on document encoders to maintain "correctness," thus potentially lowering data consistency and quality. For instance, empty element milestones representing the beginning and ending of textual features have no formal linkages as part of the document-focused document tree. Encoders must supply identifiers and pointers to indicate these linkages. Ensuring that these identifiers and pointers pair correctly must be accomplished with some mechanism other than the RELAX NG validation that checks conformance to the rules specified in the TEI schema. In S-GA this is partly addressed by a number of Schematron rules added to our TEI schema customization. These further checks, however, add an additional step within the processing workflow that must be balanced against the need for a simple and efficient encoding workflow. As noted above, managing multiple hierarchies through the use of milestones is not new. The experience of the S-GA team suggests that the new possibilities available through the increased expressiveness of the TEI Guidelines, which include the additional document-focused elements, also increase the scope for projects to produce data that reflect two divergent ontologies, and thus to encounter the difficulties involved in the "workarounds" for multiple hierarchies more frequently.
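The kind of identifier-and-pointer check that grammar-based validation cannot express might look like the following Python sketch. It is a stand-in for illustration only: the project uses Schematron rules, whose content is not reproduced here, and the sample fragments and the function name `unpaired_spans` are assumptions. The `spanTo` attribute and `xml:id` are standard TEI/XML mechanisms.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragments: one correctly paired span, one dangling pointer.
GOOD = """<zone>
  <line>kept text <delSpan spanTo="#d1"/>cancelled text</line>
  <line>more cancelled text<anchor xml:id="d1"/> kept text</line>
</zone>"""

BAD = """<zone>
  <line><addSpan spanTo="#a9"/>inserted text</line>
  <line><anchor xml:id="a1"/></line>
</zone>"""


def unpaired_spans(xml_text):
    """Report every spanTo pointer that does not resolve to a later xml:id."""
    root = ET.fromstring(xml_text)
    # Document-order position of each element, used to check that the
    # end point of a span comes after its start point.
    order = {el: i for i, el in enumerate(root.iter())}
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    problems = []
    for el in root.iter():
        target = el.get("spanTo")
        if target is None:
            continue
        end = by_id.get(target.lstrip("#"))
        if end is None:
            problems.append((el.tag, target, "no element with this xml:id"))
        elif order[end] <= order[el]:
            problems.append((el.tag, target, "target precedes the span start"))
    return problems


if __name__ == "__main__":
    print(unpaired_spans(GOOD))
    print(unpaired_spans(BAD))
```

A check of this sort has to run as a separate pass over the data, which is the extra workflow step the paragraph above describes.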

Maintainability and Provenance Challenges
In addition to posing challenges for maintaining workflows with good quality control while producing data, use of the milestone strategy for multiple hierarchies (ontologies) decreases the reusability of the textual data produced. The project relies on an automatic process to convert the document-focused encoding into a text-focused one. This process consists of a set of XSLT transformations authored by Wendell Piez, who served as a consultant to the S-GA project in 2013. These transformations are structured as a pipeline, progressively remodeling the document-focused TEI data to a more familiar text-focused TEI. Some of the stages involved in this process include, for example, identifying chapter boundaries that span across multiple <surface>s (which for convenience are maintained in separate files), and then combining the content of these surfaces into a single <div type="chapter">. While some transformations can be handled heuristically, others require a "hint" for the processor. To support this automated conversion, the S-GA team needed to go beyond purpose-built milestone elements like <delSpan> and <addSpan> and, in effect, semantically overload the general purpose <milestone> element using attributes.
The value of an attribute on <milestone> indicates which text-focused element is intended to appear in a particular location. This solution is explained in the project's documentation, and the convention used would be (one hopes) evident after cursory examination of the data. Nonetheless, the desire to make available two models of the text forced the S-GA team to add markup to the project's canonical document-focused data. This makes the encoding more specific to the S-GA project and less easily consumable by future users with different goals.
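To make the idea of an overloaded <milestone> concrete, the sketch below assembles a single chapter from two page-level surfaces. It is written in Python rather than XSLT and is not the project's pipeline. The `unit="tei:head"` and `unit="tei:p"` values, the sample surfaces, and the function name `build_chapter` are assumptions made for illustration; the project's actual convention is the one recorded in its own documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical surfaces, standing in for two separately maintained files.
# The second paragraph starts on the first page and ends on the second.
SURFACES = [
    """<surface><zone>
         <milestone unit="tei:head"/>
         <line>Chapter heading</line>
         <milestone unit="tei:p"/>
         <line>paragraph begins on the first page</line>
       </zone></surface>""",
    """<surface><zone>
         <line>and continues on the second page</line>
         <milestone unit="tei:p"/>
         <line>a new paragraph</line>
       </zone></surface>""",
]


def build_chapter(surface_docs):
    """Merge page-level surfaces into one <div type="chapter">."""
    chapter = ET.Element("div", {"type": "chapter"})
    current = None
    for doc in surface_docs:
        surface = ET.fromstring(doc)
        for node in surface.iter():  # document order within each surface
            if node.tag == "milestone":
                unit = node.get("unit", "")
                if unit.startswith("tei:"):
                    # The attribute value names the text-focused element
                    # that should open at this point.
                    current = ET.SubElement(chapter, unit[len("tei:"):])
                    current.text = ""
            elif node.tag == "line":
                if current is None:
                    current = ET.SubElement(chapter, "p")
                    current.text = ""
                text = " ".join("".join(node.itertext()).split())
                current.text = (current.text + " " + text).strip()
    return chapter


if __name__ == "__main__":
    print(ET.tostring(build_chapter(SURFACES), encoding="unicode"))
```

Because the open paragraph is carried across the loop over surfaces, a page break in the source does not force a paragraph break in the reading text, which is the cross-surface case the pipeline has to handle.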

To avoid the conceptual and technical challenges involved in automating the transformation between text-focused and document-focused representations, the two sets of data could each have been created by hand (rather than automatically generated) and maintained separately. Indeed, this is the approach followed by the Digitale Faustedition project, where a distinction between what the project calls "documentary" and "textual" transcription was considered necessary not only as a reaction to encoding problems, but also as a practical application of theoretical distinctions between documentary record and editorial interpretation (Brüning, Henzel, and Pravida 2013).
The Faustedition project team, however, still encountered technical challenges when trying to correlate and align these two transcriptions automatically.Use of collation and natural language processing tools helped with this problem, but eventually more manual intervention was needed.

The S-GA team felt that maintaining two data sets representing different aspects of the textual objects would have led to serious data consistency, provenance, and curation problems. As the example of the Faustedition project shows, separate representations must be kept in sync with project-specific workflows developed for this purpose. In the case of S-GA, documentary transcription is the main focus; the greatly increased cost and time involved in also maintaining a textual transcription would have reduced the size of the corpus that could be encoded and thus the amount of materials from the archive that could be made fully available under the initial phase of the project. These exigencies prompted the project's attempts to automate the generation of text-focused TEI data from the core document-focused data that the project editors were creating.

Presentation Challenges
The display and presentation of document-focused encoding is another technical challenge introduced by the new TEI elements. Rendering a diplomatic transcription is more easily achievable in a coordinate-based system; the S-GA project, therefore, adopted SharedCanvas, a data model developed by Stanford University and a coalition of partners, which allows editors (and potentially future users) to construct views out of linked data annotations. Such annotations, expressed in the Open Annotation vocabulary, relate images, text, and other resources to an abstract "canvas." S-GA is developing and deploying a viewer for SharedCanvas that uses HTML5 technologies to display document-focused TEI elements that are mapped as annotations to a SharedCanvas manifest, a Linked Open Data graph that ties all the SharedCanvas components together. The encoding scheme does not record position coordinates for every zone and line, but the positions of main zones to be painted on a canvas are automatically inferred, and base HTML display rules govern the rendering of text within these zones.
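The relationship between zones and a canvas can be sketched as follows. This is a rough illustration under several assumptions, not the S-GA viewer's logic: the layout rule (a left margin occupying a fixed fraction of the canvas width), the example.org URIs, the function names, and the exact JSON shape are all hypothetical. Only the property names `oa:hasBody` and `oa:hasTarget` and the `#xywh=` fragment syntax come from the Open Annotation and Media Fragments vocabularies.

```python
import json


def infer_zone_boxes(canvas_width, canvas_height, zone_types, margin_fraction=0.25):
    """Assign a rectangle (x, y, w, h) to each main zone type.

    The rule used here is an assumption for illustration: the left margin
    takes a fixed fraction of the width and the main zone takes the rest.
    """
    margin_w = int(canvas_width * margin_fraction)
    boxes = {}
    for zone_type in zone_types:
        if zone_type == "left_margin":
            boxes[zone_type] = (0, 0, margin_w, canvas_height)
        elif zone_type == "main":
            boxes[zone_type] = (margin_w, 0, canvas_width - margin_w, canvas_height)
    return boxes


def zone_annotation(canvas_uri, zone_uri, box):
    """Describe one zone as an annotation targeting a region of the canvas."""
    x, y, w, h = box
    return {
        "@type": "oa:Annotation",
        "oa:hasBody": {"@id": zone_uri},
        "oa:hasTarget": {"@id": f"{canvas_uri}#xywh={x},{y},{w},{h}"},
    }


if __name__ == "__main__":
    boxes = infer_zone_boxes(1000, 1400, ["main", "left_margin"])
    annotation = zone_annotation(
        "http://example.org/canvas/1",      # hypothetical canvas URI
        "http://example.org/tei/page1#z1",  # hypothetical zone URI
        boxes["main"],
    )
    print(json.dumps(annotation, indent=2))
```

Keeping coordinates out of the TEI and computing them at display time means the transcription stays free of layout data, at the cost of a viewer that must know the inference rule.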
SharedCanvas not only provides S-GA with a framework to publish TEI transcriptions, but also enables the Archive to move further toward a sustainable participatory infrastructure. The SharedCanvas model allows for further layers of annotations to be added dynamically to a manifest; the S-GA already makes use of this for appending search result highlights to the linked data graph and for displaying these to the user. Eventually, the project aims to use the same mechanism to enable user comments and annotations. The engagement of students and other scholars will be driven by the possibility of creating annotations in the Open Annotation format, so that any SharedCanvas viewer will be able to render them. It remains a matter for the future development of the project to understand whether annotations can be added dynamically to the source TEI (especially those pertaining to transcription and editorial statements) or whether these secondary annotations, created after the main encoding of documents is complete, should always be managed separately from the source TEI data.

The Need for Interchange Between Document-Focused and Text-Focused Models
There is intellectual power and utility in both document-focused and text-focused approaches to creating digital texts; the scholarly community has gained by the increased expressiveness of maintaining two ontologies within the standard governed by the TEI community. There are also intellectual and practical reasons why it is undesirable to maintain data reflecting these two models as separate or severable representations. The experience of the S-GA editorial team lends support to Peter Robinson's claim that "document, text and work exist in a continuum, and [that] the questions of intention, agency, authority, and meaning exert pressure at every level of reading" (2013, 114). The ability to move along this continuum is an affordance that a digital text should support because it is in the alternation between these two ontological models that editors enact the construction of their particular form of humanistic knowledge.
The motility inherent in this model of digital text projects creates significant pressure to address the problem of "interchange" between and among textual data modeled according to different schemes. Syd Bauman has provided a valuable operational definition of interchange in the context of text encoding. Following Bauman's argument, "interchangeable" data is that for which some human intervention (changing data to suit a new system or modifying a system to process new data) is required but for which this intervention can be accomplished without direct human communication with the data originator, because the data is in some way standard or documented (Bauman 2011). Indeed, what the S-GA team has pursued is a strategy for staying document-focused in terms of data creation while preserving the ability to produce and share text-focused encoded data by specifying the appropriate semantics within the more widely used text-focused ontology of digital text and developing data-transformation pipelines that use those semantics as a guide.
For Bauman it seems that interchange ("blind interchange" as he delineates it) represents a best compromise between interoperability (full equality of semantics across different text models) and the liberty or expressiveness that motivates scholarly text encoding in the first place. This form of interchange is made possible by "a lot of adherence to standards" and extensive documentation of deviation from those standards (see also Flanders 2009). This argument is deployed against skeptics of the value of markup or advocates of other approaches to curation of digital textual data. The case of the two ontologies of text now co-existing within the TEI Guidelines is somewhat different: both are part of the TEI standard. The need for interchange between data modeled according to these different approaches is real and urgent for a project such as the Shelley-Godwin Archive, which will increasingly depend on the ability to flip back and forth between different representations of the data most relevant to different communities seeking to use the Archive as a site for their own knowledge-making.

The debates around if and how TEI is a usable standard for interchanging scholarly information about texts are by now very old. The dilemma faced by the S-GA in attempting to create specialized document-focused data but also to generate and share text-focused data (all of it "TEI data") suggests a complication of and a possible extension to Bauman's conclusions about interchange.
Bauman's argument is still couched in terms of polarity even as it suggests a "common-sense" relaxation of intensity toward the whole question of TEI's suitability as a standard, a kind of lowering of expectations from interoperability to interchange. Geoffrey C. Bowker and Susan Leigh Star suggest that "standardization has been one of the common solutions" to the problem of "how objects can inhabit multiple contexts at once, and have both local and shared meaning" but that the vocabulary of standardization is insufficient "to characterize the heterogeneity and the processual nature of information ecologies" (Bowker and Star 1999, 293). The more nuanced conception of the issues and interactions involved in the types of problems that Bowker and Star develop could be useful to the TEI community.
Following earlier scholars in the domain of social studies of science, Bowker and Star describe a process of balancing "local constraints, received standardized applications, and the re-representation of information" (1999, 292). This could easily be describing the process of developing a TEI project. When these arrangements become "ongoing stable relationship[s] between different social worlds and … shared objects are built across community boundaries," Bowker and Star refer to the results as "boundary objects" (1999, 292). Thus Bauman's description of "interchange" around the standard of the TEI above, and Star's assertion that "boundary objects are a sort of arrangement that allow different groups to work together without consensus" (Star 2010, 602), seem well aligned. In this discussion, the TEI encoding standard is the boundary object: a "set of work arrangements that are at once material and processual … resid[ing] between social worlds (or communities of practice) where [this object] is ill structured" (604). Star observes that multiple groups take advantage of the "interpretive flexibility" of boundary objects and customize them for local purposes. In this sense, the S-GA's use of the TEI's formal mechanisms to produce a custom schema and the workflow-driven introduction of project-specific markup and markup conventions (overloading milestones) discussed earlier are hallmarks of work with boundary objects.

Yet, according to Star, a less-studied dynamic in the use of boundary objects is the way "groups that are cooperating without consensus tack back-and-forth between [local and shared] forms of the object" (605). This is where the experience of S-GA becomes particularly relevant. By seeking to maintain at the level of the data model the kind of flexibility that Peter Robinson sought to achieve through processing and presentation (Robinson 2009), the encoding scheme that S-GA has developed for itself and the automatic conversion processes that operate on it enact the tacking back-and-forth that Star describes. The implications for the wider TEI community reside in the linkage Star articulates between boundary objects and the development (and, it would seem to follow, maintenance) of infrastructures and standards.
According to Star and her collaborators, infrastructures and standards are a "scal[ing] up" from the back-and-forth use of boundary objects (2010, 605). In this sense, the introduction of a document-focused ontology within the TEI alongside the more common text-focused approach is an opportunity as well as a challenge. Increased commitment to one or the other ontological model of text increases the difficulty that other interpretive communities will face in adapting the digital text to their local meanings and practices. The process of developing the S-GA exposed challenges related to workflow and quality control, maintainability, and presentation. Yet, the concept of boundary objects and their extension into infrastructures and standards provides a framework for articulating the value of constructing a digital text object that spans current boundaries in editorial theory and practice. Digital texts that span the interpretive communities of different schools within textual editing and literary scholarship, by applying pressure to notions of interchange, promote circulation within the system of the standard that contributes to its greater health.
scheme for S-GA identies each page as a <surface> containing one or more <zone>s. 2Most pages have a main body of writing as well as a wide left margin, where Shelley's husband Percy wrote annotations and revisions to the developing text.These separate writing areas are encoded as <zone> elements containing writing organized into <line> elements.On top of this basic model, information about authorial hands (who wrote what) is encoded as well as revisions-including deletions, additions, substitutions, and transposed and retraced text.The reading environment created from this encoding is used to publish a semi-diplomatic transcription of the text alongside a facsimile image.

Figure 1. A screenshot of the primary reading interface of the Shelley-Godwin Archive.