Texts as Hypergraphs: An Intuitive Representation of Interpretations of Text

Over the past decades, the question of what text really is has been addressed by a large number of conferences, workshops, articles, and blog posts. If there is one thing that, taken together, those contributions illustrate, it is that our understanding of text is, and has been, constantly in flux and open to many interpretations. Still, there is often a gap between how an editor conceptualizes a source text and how this text is encoded and stored on a computer: using TEI XML, editors are compelled to model their text as a single tree (a hierarchy), whether this structure corresponds with their intellectual understanding or not. Textual features that do not fit naturally into the XML data model require additional layers of code, which hinders processing, querying, and interchange. The Text-As-Graph (TAG) data model and the associated syntax TAGML are developed to express and store textual information as a network. To this end, TAG implements a hypergraph model. In the present contribution, we illustrate the benefits of TAG’s hypergraph for the modeling of features like nonlinearity, discontinuity, and overlap. In contrast to a tree model, a hypergraph accommodates these nonhierarchical structures naturally. Because they are part of the data model and the syntax, a TAGML processor can process the features without having to resort to workarounds or schema-aware tools. This lowers the difficulty of working with digital editions and facilitates querying and interchange.

We present the Text-As-Graph (TAG) hypergraph data model and its associated syntax TAGML, and illustrate how text modeled as a hypergraph will correspond more closely to the editor's "ideal" model. Furthermore, we propose that in addition to facilitating the text-encoding process, using a hypergraph also benefits the processing and querying of the encoded texts.

This contribution builds upon previous publications that introduced the TAG data model (Haentjens Dekker and Birnbaum 2017) and the TAGML syntax, and examined the modeling of partially ordered text (Bleeker et al. 2020). The specific objectives of the present contribution are to focus on the differences between TAGML and XML when modeling nonhierarchical structures, and to demonstrate the gains in terms of text processing. After giving our definition of text (section 1.2) and briefly describing the research field called textual genetic studies (section 1.3) from which we take our use cases, we move on to review previous work on modeling complex and nonhierarchical text structures (section 2.1). We focus on approaches that do not require the use of workarounds or local solutions. Section 2.2 then briefly outlines the relevant features of the TAG data model and the TAGML syntax. In section 3 we illustrate the difference between modeling nonlinear, discontinuous, and overlapping structures as a tree or as a hypergraph, and in section 4 the advantages of a hypergraph for processing. We conclude that, XML's current prevalence notwithstanding, it does pay to question the data models we use for text encoding. We should emphasize here that it is not our intention to merely criticize the XML data model; rather we want to show the value of questioning the prevailing standards so as to find the most suitable way to model what text really is. After all, our main focus should be on finding the best way to examine, express, query, and publish text. This requires an open, inquisitive way of looking, which we hope this contribution will stimulate.

As Sahle points out, text encoders are likely to ignore textual aspects that are not part of the TEI text-encoding model (Sahle 2013). The models we use can, very subtly, encourage us to exclude textual features that are not represented in that particular model (Dillen 2015, p. 69; Haentjens Dekker et al. 2018).

In view of these philosophical and technical factors, the core of the development of the TAG model is based on detailed definitions of text and document. A document, here, is a physical object: a carrier of written text. Written text is a sequence of characters (e.g., letters, digits, spaces, and punctuation, including symbols and music notation) that is inscribed in a document. From the text, a reader derives information which is organized in a network structure. Finally, we propose that text is partially ordered. This means that it is not always possible to determine the order of all characters in the sequence. Instances of partially ordered text are nonlinear, discontinuous, or overlapping structures.

Take for instance the inline revisions that are often present in historical or literary (draft) documents like the one in figure 1. Here, the revision results in words or characters that may be placed in more than one order, meaning that the character sequence is temporarily nonlinear. From the perspective of a human reader, the deleted word and the added word in the text fragment of figure 1 represent two variant readings of the character sequence. The deletion and the addition are located at the same position in the sequence and they are mutually exclusive: a reader follows one or the other, after which the sequence becomes fully ordered again. Note that the characters within each branch are at the same location (index) in the character sequence: they represent two mutually exclusive variant readings of the text.

Definition of Text
1.3 Use Case: Textual Genetic Research
As we have said, the use cases in this contribution come from the field of textual genetic studies. This type of research is concerned with the way literary works originate and develop over time. Draft manuscripts provide a great source of information, as such documents often reflect the author's train of thought: words are crossed out, sentences are added, paragraphs transposed, etc. In other words, draft manuscripts represent traces of the writing process, and by extension the creation and development of a literary work. Since this type of information is at the core of textual genetic research, it needs to be expressed in as much detail as possible. And like most digital editors, textual genetic researchers wish to store and represent the results of their research in such a way that it can be explored by others, for instance in digital research environments or digital scholarly editions.

To adequately represent and study a work's genesis, editors typically (1) express textual variation within one text version, (2) compare textual variation across versions, for instance by collating them, and (3) map the relationships among the various texts and documents related to the genesis of a work. To support any of these activities, a tool needs to be aware of information that is relevant for textual genesis. Ideally, this information is retained throughout the processing of the [...]

2. Background
2.1 Related Work
Scholars have been working on topics like text modeling, text encoding, and markup for decades, and they are well aware of the difficulties of representing complex textual characteristics in an effective way. By effective, we mean with few to no additional workarounds or customized technical solutions.
As Fabio Vitali has argued, one can theoretically use any data model to express any kind of text, no matter how complex, as long as one is willing to use some workarounds, do some extra coding, and hand over certain tasks to other data formats (Vitali 2016). But the use of hand-overs and extra coding typically hinders the processing and analysis of the encoded text. Furthermore, it impedes human readability and makes it harder to exchange or reuse the encoded text. Full-text search provides a good example of why it is important for a processor to recognize partially ordered text and act on it. Consider a simplified TEI XML encoding of the inline revision discussed above:

Example 1. Example of inline revision.

    <text>
      <!--some markup and text -->
      another task <del>soon</del><add>also</add> devolved
      <!--some more markup and text -->
    </text>

Since in the XML data model all text and markup are typically ordered, generic XML tools will process the deletion before the addition. This results in a nonsensical sentence: "another task soon also devolved." Without access to additional information about the nonlinearity implied by the <del> and <add> elements, a search engine will find neither the phrase "another task soon devolved" nor the phrase "another task also devolved." It will, however, find the phrase "another task soon also devolved," even though this phrase never existed in the manuscript. Indeed, as Desmond Schmidt noted, only ten percent of digital editions using inline markup "could find literal expressions that span inline substitutions" (Schmidt 2019, note 3).

Previous surveys of the data models for text can be found, among others, in Piez 2008, Huitfeldt et al. 2010, and Vitali 2016. The present overview is based on Vitali's principle mentioned above: any data model can theoretically express any kind of text feature if it is complemented with workarounds or customized coding, but this is not what we should aim for when investigating the most suitable data model.
Accordingly, a green cell labeled "yes" means that the feature is natively supported in the underlying data model. If a feature is supported only with the help of a hack or workaround, or in the application layer, it is taken as a "no" (represented with a red cell in the table). The following subsections focus on three features: nonlinearity, discontinuity, and overlap.
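Before turning to these features, the full-text search problem of example 1 can be reproduced in a few lines. The following is a minimal sketch in Python, not a description of any particular search engine: it assumes that the standard-library parser xml.etree.ElementTree stands in for a "generic XML tool," and that each text node is split on whitespace before the tokens are joined in document order. The function names indexed_text and finds are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified TEI XML from example 1, with the XML comments left out.
TEI = "<text>another task <del>soon</del><add>also</add> devolved</text>"


def indexed_text(xml_string):
    """Flatten the document the way a schema-unaware indexer might.

    Assumption of this sketch: every text node is split on whitespace and
    the resulting tokens are joined in document order.
    """
    root = ET.fromstring(xml_string)
    tokens = [tok for chunk in root.itertext() for tok in chunk.split()]
    return " ".join(tokens)


def finds(xml_string, phrase):
    """Naive phrase search over the flattened token stream."""
    return phrase in indexed_text(xml_string)


for phrase in ("another task soon devolved",
               "another task also devolved",
               "another task soon also devolved"):
    print(phrase, "->", finds(TEI, phrase))
```

Under these assumptions only the third phrase is found, because nothing in the parsed tree tells the indexer that the <del> and the <add> occupy the same position in the text.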

Nonlinearity
The markup language TexMECS is designed as a linear representation for nonlinear objects, modeled as a GODDAG data structure (Huitfeldt and Sperberg-McQueen 2003). In GODDAG, all children of the markup nodes are typically ordered, but TexMECS provides a notation to mark certain markup nodes as unordered. The GODDAG processor ignores the default linear order of these elements' children, and therefore TexMECS supports the representation of nonlinear structures. No known working implementation of TexMECS, however, is currently available. At first glance, EARMARK (Extremely Annotated RDF Markup) also seems to support the option to represent nonlinearity: with EARMARK, users can express different linear structures using RDF statements about text fragments, and in this way it is possible to describe multiple text orders (Peroni and Vitali 2009, 4.1; Di Iorio 2009). However, multi-orderedness is not the same as partial orderedness: if a text is partially ordered, it means that (part of) the text has no order. Multi-orderedness always implies a certain order. The EARMARK specification as described in Peroni and Vitali 2009 does not natively support partially ordered text, in the sense that EARMARK users cannot mark the branching of the text stream. It is also important to note that EARMARK is a metamarkup language, which means that users encode their texts not in EARMARK but in an RDF serialization.

Discontinuity
Discontinuity, by which we mean the encoding of a single continuous utterance even though it is interrupted by one or more other structures, is also natively supported in TexMECS. The syntax provides a notation to suspend and resume the discontinuous markup elements (Sperberg-McQueen and Huitfeldt 2008), ensuring that the data model GODDAG considers it one single unit, even though it appears fragmented in the serialization. The same holds for the EARMARK approach: in their 2009 paper, Peroni and Vitali mention that EARMARK is able to handle situations where "non-contiguous ranges are contained by a particular markup item" (2009), which can be understood as a discontinuity situation. Evidently, there are several widely used pointer mechanisms in TEI XML to aggregate elements that belong together but are necessarily separated because of the hierarchical structure of the implied data model. For instance, elements can be linked via the @next and @prev attributes or via the <join> element with @target attributes. Still, these TEI XML mechanisms fall short when held against the criterion that the encoding method needs to be natively supported in the data model and independent of any vocabulary-specific applications.

Overlapping structures
Considering the amount of attention given to "solving" the overlap limitation posed by the single ordered hierarchy of XML, it comes as no surprise that this feature is supported by all alternative encoding approaches, some of which have been designed to address only the overlap constraint. Accordingly, overlapping structures are supported by the aforementioned EARMARK and TexMECS, as well as by the Concurrent XML approach, which also implements a GODDAG structure instead of the single hierarchy tree model of XML (Dekhtyar 2005 and Dekhtyar 2003). Another extension of XML designed to allow overlapping structures is the Multi-Colored Trees (MCT) approach of Jagadish et al. 2004. A single-ordered tree is just like an XML tree, but each colored tree defines its own local order of the nodes it contains. In the MCT approach, individual nodes can be part of multiple colored trees; as a result, one node can be part of multiple hierarchies. In addition, there is XCONCUR (Hilbert et al. 2005), an XML implementation of SGML's CONCUR that allows encoders to express concurrent, overlapping markup hierarchies over the same text nodes (hence the name). A non-XML-based data model that allows structures to overlap is the layered markup and annotation language LMNL of Wendell Piez and Jenni Tennison (2002), which permits ranges of markup annotations in a text stream to overlap.

Finally, there exist several standoff approaches to dealing with overlapping structures (not part of the table in figure 2). The Multi-Version Document (MVD) approach, designed by Desmond Schmidt, first separates markup from the text content, and second, breaks down the text content into fragments or ranges (called standoff properties). These fragments are linked to the (set of) witness(es) in which they occur and stored in an MVD. The structure of an MVD is thus similar to that of a variant graph: it is a collection of nodes and edges in which common fragments of text are merged together and only the variant text is made explicit. All text in common between witnesses is recorded only once, and all the differences are stored as separate files. The MVD approach circumvents the challenge of dealing with overlapping hierarchies within one text by separating the multiple layers of revision in a draft manuscript and treating them as individual witnesses. Another recent standoff approach is seen in the Codex project created by Iian Neill, inspired by Schmidt's work, which combines plain text and standoff properties stored in a Neo4J database.

Text-As-Graph
The following paragraphs briey outline the relevant properties of the TAG hypergraph model and the associated markup language, TAGML. A detailed discussion of the features, properties, and constraints of the TAG data model is not within the scope of this paper, but can be found in Haentjens  (appendix A). The TAG denition of text (see section 1.2) has informed-and continues to inform-the design of the TAG data model and the TAG markup stack. 10 When discussing data models for text, it is important to keep in mind that a syntax is not necessarily the same as a data model. A data model can theoretically be serialized in multiple ways, but some serializations are more expressive than others. Like XML, TAG is a data model for data models. Accordingly, TAGML is a metamarkup language that can be used to model text as a hypergraph. The TAG data model and markup stack work together to represent and process textual features in a straightforward manner. The goal is to avoid as much as possible the delegation of responsibilities to the schema, ODD, or other vocabulary-specic applications if they can or should be handled by the model.

As we have said, the underlying data model of TAG is a hypergraph. A hypergraph consists of nodes and edges just like any other graph, but with the important difference that some edges in a hypergraph can join together two or more nodes (in contrast to the one-to-one edges of regular graphs). These are called hyperedges. The regular edges in the TAG hypergraph model are directed; the hyperedges are undirected. Nodes in the hypergraph can be connected with either a hyperedge or a regular edge. The TAG hypergraph consists of five types of nodes:
• One Document node. This node serves as the root of the graph. Via a directed edge, the Document node is connected to zero or more Text nodes, Markup nodes, Branching nodes, or Annotation nodes.
• One or more Text nodes. A Text node contains textual content (UTF-8-encoded), and may be connected to one or more Markup nodes with hyperedges. It is connected to other Text nodes or Branching nodes with directed edges.
• Zero or more Markup nodes. A Markup node is connected to one or more Text nodes, and has zero or more Annotation nodes.
• Zero or more Annotation nodes. The Annotation node is connected to one or more Markup nodes or another Annotation node.
• Zero or more Branching nodes. A Branching node is connected to a Text node or another Branching node with a directed edge. It is used to mark the beginning and end of a nonlinear structure.
This variety of edges, hyperedges, and nodes ensures the flexibility of the hypergraph model. Like XML, TAGML is also a metamarkup language, but it models textual information as a graph (a network) instead of a tree. The edges and hyperedges in the hypergraph are created by the parser, ensuring the compactness of the TAGML syntax.
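The node and edge types listed above can be sketched as a small in-memory structure. This is an illustrative Python sketch only, not the TAG implementation: the class names Node and Hypergraph, their methods, and the choice to store a hyperedge as a mapping from a Markup node to a set of Text nodes are all assumptions made for the example.

```python
from dataclasses import dataclass, field

NODE_KINDS = {"Document", "Text", "Markup", "Annotation", "Branching"}


@dataclass
class Node:
    kind: str        # one of NODE_KINDS
    label: str = ""  # text content, markup name, or annotation value


@dataclass
class Hypergraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)       # directed (source, target) pairs
    hyperedges: dict = field(default_factory=dict)  # Markup index -> set of Text indices

    def add_node(self, kind, label=""):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes.append(Node(kind, label))
        return len(self.nodes) - 1

    def add_edge(self, source, target):
        self.edges.append((source, target))

    def add_markup(self, label, text_indices):
        """Create a Markup node and one undirected hyperedge to its Text nodes."""
        markup = self.add_node("Markup", label)
        self.hyperedges[markup] = set(text_indices)
        return markup

    def text_of(self, markup):
        """Join the content of every Text node attached to a Markup node.

        Assumption: Text nodes were added in reading order, so sorting by
        index restores that order.
        """
        return "".join(self.nodes[i].label for i in sorted(self.hyperedges[markup]))


graph = Hypergraph()
doc = graph.add_node("Document")
t1 = graph.add_node("Text", "difference of opinion, ")
t2 = graph.add_node("Text", "impossible barriers")
graph.add_edge(doc, t1)
graph.add_edge(t1, t2)
deletion = graph.add_markup("del", [t2])
whole = graph.add_markup("text", [t1, t2])
print(graph.text_of(deletion))
print(graph.text_of(whole))
```

The point of the sketch is the division of labor: directed edges carry the reading order of the text, while a single hyperedge ties one Markup node to all the Text nodes it covers, however many there are.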
TAGML may resemble existing markup languages like XML, TexMECS, or LMNL, but TAGML is more expressive. For instance, in XML all annotation values are of type string, but TAGML offers datatyping of annotations. These data types are expressed in UTF-8 and interpreted by the TAGML parser as different data types. Encoders can distinguish between integer, string, or Boolean values (figure 4). Annotations can also be nested (i.e., annotations on annotations) (figure 5).

We define nonlinearity as a characteristic of a character stream with multiple branches, the content of each branch pointing to the same location in the stream. As mentioned above, inline revision offers a good example of nonlinear text. The three figures below show different cases of nonlinearity in the text of a draft manuscript.

Single deletion or addition
In this example (gure 6), the author struck out the words "impossible barriers." This means that there are two variant readings of the text: one including the deletion ("dierence of opinion, impossible barriers, prejudices") and one excluding it ("dierence of opinion, prejudices"). These readings can be described as two simultaneous branches of text, one branch including the deleted characters and one branch without them. The <del> marks the beginning of the forking of the branches. This could be expressed in TEI XML as follows (example 2): Example 2. TAGML of a single deletion.
<text> <!--some text and markup --> difference of opinion, <del>impossible barriers</del> prejudices <!--some more text and markup --> </text> The TAGML notation looks quite similar (gure 7):  As in a regular variant graph, the text in the hypergraph below is read from left to right, starting with the Document root node and following the directed edges. Note the branching nodes that mark the start and end of the nonlinear text.  It is dicult to capture the nature of a revision made currente calamo in TEI XML. It is usually encoded by placing an attribute on the <del>, such as an @rend with the value "immediate", an @seq with the value "0", or an @instant with the value "true" (example 3): Example 3. Example of <del instant='true'> <text><del instant="true">This</del> the idea has grown . . .</text> . Without a schema and an ODD, however, an XML processor would have no way of knowing what the attributes @instant, @seq, or @rend imply. In other words, it would not distinguish a regular deletion from an immediate deletion (gure 11).  In TAGML, it is possible to make this subtle distinction: [text>[del>This<del] The idea has grown . . . <text]. By omitting the ax ? on the del we indicate that the del tag is not optional: there is just one path through the text stream. Compare the visualization of the immediate deletion in gure 12 with that of the regular deletion in gure 9. In the hypergraph of the immediate deletion there is only one branch, whereas in the hypergraph of the regular deletion there are two.
This corresponds to the way we interpreted the source manuscript and encoded the text. It is not necessary to add an annotation on the del element to indicate that it is an immediate deletion, so the information is accessible to any TAGML parser.

Grouped Revision
A grouped revision is similar to a single deletion and a single addition: again, there are two mutually exclusive ways of reading the text: one reading includes the deleted word(s), and one reading includes the addition. We have already presented an example of a grouped revision in the introduction (figure 1); figure 13 represents another case. Here, the two words "so" and "certainly" are mutually exclusive: whether we choose the original reading "so" or the corrected reading "certainly," they are at the same location in the text and at the same distance from the start of the sentence. If scholars interpret the deletion and the addition as belonging together semantically, they can group them together using markup. In TEI XML, this can be indicated with the <subst> element, whose purpose is "solely to group its child elements together, the order in which they are presented is not significant" (TEI P5, chapter 11.3.1.5). The grouped revision example given above (figure 13) can be transcribed in TEI XML with a <subst> element that wraps the <del> and the <add>. The <subst> element functions as an indication of a split in the stream of text, which is very similar to the TAGML mechanism to encode the start of branching. We have already illustrated how the affix ? in TAGML implies that the markup element is optional, and that using this affix splits the text stream into two branches: one branch with the markup element and any associated text, and one branch without. To indicate that the text within two branches is semantically related, the divergence of the text stream can be flagged with <|; the individual branches are separated with a vertical bar | and the converging of the branches is indicated with a |>. The TAGML notation of the example above would thus be as in figure 14.

The XML data model contains no information about the existence of two different paths through the text. When the TEI XML data is parsed by a tool without access to the schema or the ODD file, the only reading is the nonexistent "for being so certainly disagreeable." In TAG, by contrast, the two branches are present on the level of the hypergraph model. Also note that all Text nodes that are directly related on a semantic level are also related in the hypergraph via a directed edge. This means that both "for being so disagreeable" and "for being certainly disagreeable" will be retrieved with a full-text search.
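How branching makes both readings retrievable can be sketched by walking every path through a small directed graph. The Python below is an illustration under stated assumptions, not TAG's own query mechanism: the node names (doc, branch_open, and so on), the successors and labels dictionaries, and the function readings are hypothetical, and Branching nodes are modeled simply as nodes without text content.

```python
def readings(successors, labels, start, end):
    """Enumerate every path from start to end and join the text along each path."""
    results = []

    def walk(node, collected):
        collected = collected + [labels.get(node, "")]
        if node == end:
            # Nodes without text (Document, Branching) contribute nothing.
            results.append(" ".join(part for part in collected if part))
            return
        for following in successors.get(node, []):
            walk(following, collected)

    walk(start, [])
    return results


# Content of the Text nodes only; all node names are hypothetical.
labels = {
    "t1": "for being",
    "t_del": "so",
    "t_add": "certainly",
    "t2": "disagreeable",
}

# Directed edges; the two branches diverge and converge at Branching nodes.
successors = {
    "doc": ["t1"],
    "t1": ["branch_open"],
    "branch_open": ["t_del", "t_add"],
    "t_del": ["branch_close"],
    "t_add": ["branch_close"],
    "branch_close": ["t2"],
}

for reading in readings(successors, labels, "doc", "t2"):
    print(reading)
```

Because "so" and "certainly" sit on separate branches, no path passes through both, so the nonexistent reading never appears among the results.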

Other cases of nonlinearity
In the examples given so far, the branching of the text stream occurs in the written text in the source document. There are also cases in which an editor creates a nonlinear structure that is not in the source document. The TEI XML markup elements <app> and <choice>, for instance, indicate partially ordered information: they are intended to group together "a number of alternative encodings for the same point in a text" (TEI P5, Chapter 3.4). As with <subst>, the order in which the children of a <choice> element are placed has conceptually no influence on the meaning.

In an <app> element, the <rdg> elements and the <lem> element offer alternative readings for the same part of the text, and encoders do not consider the order in which the <rdg> elements are placed within the <app> to be informational. The same applies to the children of a <choice>, where "the <sic> and <corr> elements can appear in either order" (TEI Guidelines, chapter 3.4.1).

Whereas the partial orderedness of both text and markup is noted in the TEI Guidelines, the children of <subst>, <app>, and <choice> are not stored as partially ordered in the underlying data model of XML. Again, any rules for processing <subst>, <app>, and <choice> and their children need to be expressed in an associated schema, which complicates further processing. Generic XML processors that do not know the schema will assume that their children are fully ordered and produce undesired results.

Discontinuity
Discontinuity happens when a text forms semantically a single continuous utterance, but is interrupted by other elements. The example in figure 17 is taken from a question on the TEI mailing list, and presents an interesting case in which a narrator (Marion) cites a letter she has received. Marion intersperses the citation with her own comments on the text of the letter ("wrote Ada" and "I had told them so"). Ideally, the citation is encoded as a single expression regardless of the interruptions, so that queries for every utterance of the narrator would return either the full quotation or the quotations that are split up into more parts, depending on the editor's query.

Figure 17. An example of discontinuity in a running text (Watanna 1916, p.171).
Work is in the public domain.

There are several mechanisms to express discontinuous structures in TEI XML. For example, using the @prev and @next attributes, the example of discontinuity (figure 17) would look as follows:

Example 7. Encoding of figure 17.

    <text>
      <s>
        <q xml:id="1" next="#2">"Dear Marion:</q> (wrote Ada.)
        <q xml:id="2" prev="#1"> We are all very glad..."</q>
      </s>
    </text>

However, each mechanism requires extensive tagging, a schema that documents the specific properties of the <q> elements, such as their @next and @prev attributes, and documentation such as an ODD file that explains what needs to be done with the <q> elements, their attributes, and the attribute values in order to correctly process the encoded text. So while specialized TEI software would be able to process the two <q> elements as part of one and the same structure, ideally a TEI XML file should be compatible with a wider variety of XML-based tools.
Users can encode discontinuity in TAGML in a more compact way that does not require generating unique @xml:id values for the <q> elements. Instead, TAGML users can use the affixes - and + to indicate that a q element is paused and subsequently resumed (figure 18).

Figure 18. A TAGML transcription of discontinuous text.
The visualizations of the respective data models show the difference: in the XML encoding, the sentence contains two separate <q> elements that are not connected on the level of the data model. The TAGML visualization, in contrast, shows that the Text nodes are associated with one and the same q Markup node (figures 19 and 20).

The example of overlap is taken from the quarto typescript of Sheherazade by Raymond Brulez (figures 21 and 22). During the revision of his own typescript, the author decided to cross out two entire paragraphs that also cross document borders. As a consequence, there are two overlapping structures: (1) the deletion of two paragraphs, the second of which (2) runs over document borders.

Figure 21. The quarto typescript of Sheherazade (Brulez 1927), p.3.
Figure 22. The quarto typescript of Sheherazade (Brulez 1927), p.4.

In TEI XML, the overlap example (section 2.1.3) could be encoded using the <delSpan> mechanism (example 8).

Example 8. Encoding section 2.1.3.
We use the axes -and + on the del element to indicate that the deleted text runs over two pages but is part of one and the same deletion.
1. one layer for the document structure containing the pages and the deletions (text > page > del), with the layer identier "D"; 2. one for the book structure with the paragraphs (text > p), with the layer identier "B." A simplication of the TAGML transcription would look as follows (gure 23):   "p" in layer "B," we can model both discontinuity and overlap. The visualization shows that there is just one Markup node labeled "del" in the hypergraph, and that this Markup node is connected to two Text nodes by means of an undirected hyperedge (visualized in green). These Text nodes are in turn associated via undirected hyperedges (in yellow) to two separate Markup nodes labeled "p" for "paragraph." Because the p Markup nodes and the del Markup node are grouped in dierent layers, the fact that they overlap is not a problem. All this information is available at the level of the model and can be parsed and queried without additional information from a schema.

Processing
As we have hinted at more than once, the consequences of working with a data model in which nonlinear structures are idiomatically represented become most clear with processing and querying. As mentioned in section 2.1, a generic XML processor takes the text characters in a TEI XML file as fully ordered. This has, first of all, implications for full-text search (only ten percent of editions are able to retrieve literal expressions that include substitutions). By way of example, let us return to the grouped revision (figure 26). A generic XML processor would process the word "certainly" directly after the word "so." As a consequence, the reading "for being so disagreeable" will not exist for an XML processor, nor will the reading "for being certainly disagreeable." The only reading that would turn up is a nonexistent one. As shown in section 3.1.3, the two distinct readings do coexist in the TAG hypergraph model: the Text nodes "so" and "certainly" are both at the same distance from the root Document node.
This would appear as such in query results. Finally, the direct relationships between the Text nodes are also stored in the hypergraph, by means of a directed edge. A full-text search of the hypergraph would therefore return both readings of the text.

The dierence between processing TEI XML and TAGML also becomes clear with discontinuous structures. Let us return to the example given in section 3.2. We can think of at least two scenarios: one in which a user wants to retrieve the fragmented quotes, and one in which a user wants to retrieve all quotes together. The rst would not pose a problem for TEI XML, but retrieving the disjointed quotations as one (merged) utterance would only be possible with additional, vocabulary-specic coding. Processing the two <q> elements as a single <q> requires a set of XSLT instructions that check the values of the @xml:id and the @next and @prev attributes in order to know which <q> elements should be stitched together. In TAGML, both scenarios would be equally straightforward. The hypergraph can be queried for the q element(s) and their textual content as a whole, or for the q elements that have been suspended and resumed.

Processing discontinuous structures can become quite complex. Consider the following fragment (figure 27). Let us focus on the deleted phrase "brought . . . so near-only a night & a sail." Note that the words "within touch" have been inserted into the phrase. Whether they were added later or at the same time is hard to tell. But they are certainly not part of the deleted text. A simplified TAGML transcription of this text fragment would read as in figure 28. The - and the + affixes on the del tags serve to temporarily suspend and then resume the deletion.
In the underlying hypergraph model, the deleted passages are associated with one and the same Markup node. Consequently, a simple query for all the Markup nodes labeled "del" suffices to retrieve the deleted text "brought . . . so near-only a night & a sail" as one phrase. Similarly, a query for all the del elements that have been suspended and resumed would retrieve the separate fragments of the deletion.
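The two queries can be sketched over a toy version of this fragment. The Python below is a sketch under loud assumptions and not the TAG query interface: Text nodes are a plain list in reading order, a Markup node is a label paired with the indices of its Text nodes, and "suspended and resumed" is approximated as "attached to non-adjacent Text nodes." The names TEXTS, MARKUP, runs, whole_text, and fragments are hypothetical, and the immediate deletions and the addition are left out for brevity.

```python
# Text nodes in reading order (simplified from the fragment discussed above).
TEXTS = [
    "the wonders to which he had looked forward",
    "brought",
    "within touch",
    "so near-only a night & a sail",
    "a dazzling, uneasy disquietude,",
]

# One entry per Markup node: (label, indices of the attached Text nodes).
MARKUP = [("del", {1, 3})]


def runs(indices):
    """Group Text node indices into runs of consecutive positions."""
    groups = []
    for i in sorted(indices):
        if groups and i == groups[-1][-1] + 1:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def whole_text(label):
    """Text of each Markup node with this label, joined as one phrase."""
    return [" ".join(TEXTS[i] for i in sorted(indices))
            for name, indices in MARKUP if name == label]


def fragments(label):
    """Pieces of each Markup node with this label that spans more than one run."""
    result = []
    for name, indices in MARKUP:
        groups = runs(indices)
        if name == label and len(groups) > 1:
            result.append([" ".join(TEXTS[i] for i in group) for group in groups])
    return result


print(whole_text("del"))
print(fragments("del"))
```

Both queries read the same single Markup entry; neither needs identifiers or pointers to work out which pieces of deleted text belong together.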

Now consider a TEI XML transcription of the same fragment, simplied for readability: Example 9. TEI transcription of figure 28.
<text><!--some text and markup --> the wonders to which he had looked forward <del instant="true">th</del> <del instant="true">br</del> <del xml:id="del1">brought</del> within touch <del prev="#del1">so near -only a night &amp; a sail</del><add>with</add> a dazzling, uneasy disquietude, <!--some text and markup --></text>

To process the text of this fragment correctly, one needs to write a rather complicated set of XSLT instructions. At the very least, these instructions need to match the values of the @xml:id and @prev attributes in order to process the first part of the deletion, look for the second part of the deletion, and then concatenate their textual content. At the same time, one has to prevent the second part from being processed twice (first as the second part of the deletion, and a second time together with the regular <del> elements). After some experimenting and consulting several XSLT specialists, we have come to no fewer than three different sets of instructions. 16 And considering the ingenuity and technical expertise of the TEI community, we are quite certain there are even more ways. In short, it can be a challenging and time-consuming process to write and tweak vocabulary-specific and schema-aware tools: a daunting task for any TEI XML user who lacks a certain level of technical expertise.
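To make the bookkeeping concrete, the stitching logic can be sketched outside XSLT as well. The following Python sketch is not one of the three sets of XSLT instructions mentioned above; it is a minimal illustration, using only the standard library, of the same two obligations: following @prev back to the matching @xml:id, and making sure the continuation is not emitted a second time as a deletion of its own. The function name `merged_deletions` is hypothetical, and the sample omits the TEI namespace, as the simplified transcription does.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<text>the wonders to which he had looked forward
<del instant="true">th</del> <del instant="true">br</del>
<del xml:id="del1">brought</del> within touch
<del prev="#del1">so near -only a night &amp; a sail</del><add>with</add>
a dazzling, uneasy disquietude,</text>"""


def merged_deletions(root):
    """Return the text of every deletion, with split deletions stitched together.

    Assumes @prev chains are acyclic and point at @xml:id values
    inside the same document.
    """
    dels = list(root.iter("del"))

    # Map the id of an earlier part to the <del> that continues it.
    continuation = {}
    for d in dels:
        prev = d.get("prev")
        if prev:
            continuation[prev.lstrip("#")] = d

    results = []
    for d in dels:
        if d.get("prev"):
            # A continuation: already emitted with the head of its chain,
            # so skip it here to avoid processing it twice.
            continue
        parts = [d]
        current = d
        while current.get(XML_ID) in continuation:
            current = continuation[current.get(XML_ID)]
            parts.append(current)
        results.append(" ".join("".join(p.itertext()).strip() for p in parts))
    return results


deletions = merged_deletions(ET.fromstring(SAMPLE))
```

Even in this small form, the code has to know which attributes carry the linking information. That knowledge belongs to the TEI vocabulary rather than to XML itself, which is the point made above about vocabulary-specific tooling.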

Conclusion
The process of text encoding is a constant negotiation with the features of the data model in which the text is expressed. Of course, a data model's technological limitations can be expanded with workarounds, additional layers of code, or the use of vocabulary-specific tools, but doing so entails several trade-offs. First, depending upon additional files to explain the tagset hinders the (blind) interchange of TEI XML files, not to mention their interoperability (cf. Bauman 2011).
Second, not many textual editors can boast the required technological skills, or the funding to engage an IT specialist, to carry out complex coding tasks. As a consequence, the threshold of digital editing is raised. What is more, the technical aspects of data models are tightly intertwined with how we conceive of text. It is therefore crucial that we, as the text-encoding community, continue to explore how the limitations of data models influence our editing methods as well as our understanding of texts.

In this contribution, we used the presentation of the TAG data model to offer a higher-level perspective on text modeling. In section 1, we first defined written text as a partially ordered character sequence of letters, digits, spaces, and punctuation, including symbols and music notation. The textual information taken from reading and interpreting a written text can be conceptualized as a network. We illustrated the concept of partially ordered text with examples of nonlinear, discontinuous, and overlapping structures. Section 2.2 explained that while these complex, networked characteristics of text cannot be expressed idiomatically in the existing data models for text, they can be straightforwardly modeled as a hypergraph. In section 3 and section 4, we contrasted TEI XML, as the prevailing data model for text encoding, with TAGML. By visualizing the data models of both TEI XML and TAGML, we illustrated how partially ordered information is stored directly in the hypergraph model, ensuring that TAGML-encoded transcriptions can be queried by any generic TAGML processor.

The scope of the paper was necessarily limited in that it provided only simplified examples of multi-hierarchical content structures, while cultural heritage texts often present much more complicated cases, such as additions within additions, or open variants. 17 Future developments will include a TAGML schema and ontology and further improvements of the TAG query language.
In terms of usability, an editor that provides an autocomplete feature is also no luxury, nor is a workflow that includes version management. Finally, we do recognize that TAGML's setup of a plain-text transcription with several layers of markup (i.e., annotations) pointing to the text nodes does correspond to the concept of a standoff approach. So far, development has focused on inline markup, but future work will explore the potential of standoff markup for TAG. Current work concentrates on further development of validation and autocompletion in the TAGML parser.
While the TAG data model itself is still under active development, we believe that our work and findings so far may be of use to the broader text-encoding community, as they help to broaden the discussion about text modeling.
Naturally, we are aware of the ubiquity of XML for text encoding and the broad functionalities of related X-technologies for modeling and publishing text. We are also aware that designing a new markup language involves a number of nontechnical challenges, such as training and teaching, unfamiliarity, and the (un)willingness of users to adopt new ways of editing. Nevertheless, we see much value in maintaining openness and curiosity toward alternative syntaxes for text encoding and what those may mean for solving long-standing challenges of text representation. 18 Accordingly, we did not set out merely to find fault with the XML data model, but rather to use the TAG model as an occasion to examine some fundamental assumptions about text.

In that respect, it is worth emphasizing that TAG can already be implemented in existing (TEI XML-based) editorial workflows. 19 When exported to TEI XML, overlapping structures in the TAGML document are automatically rendered as milestones using Trojan Horse markup (see Bleeker et al. 2020). Of course, the down-conversion from a hypergraph to a tree model inevitably implies data loss. A TAGML-to-XML export therefore requires a user to reflect on how to render complex textual features in TEI XML. In other words: what XML workarounds need to be implemented to deal with overlapping or nonlinear structures? In view of our argument for more awareness of data models for text, we do not consider this pause for reflection a major disadvantage.