Encoding of Variant Taxonomies in TEI

The inherent exibility of the digital format has favored the rise of editions that enable access to every witness of a particular textual work. These types of editions might have dierent goals and seek to answer dierent research questions, but they usually coincide in drawing attention to the importance of textual variants. To maximize the computational analysis that may be practiced with the variants in dierent witnesses, a complex taxonomy that reects the diversity of cases is required. Many scholars have followed the recommended TEI method for encoding types of variants— that is, through the attributes @cause or @type inside the element <rdg> —while others nd that method insucient. These attributes are not able to enclose the hierarchy intrinsic to complicated taxonomies or the overlap of classes in an ecient way. However, the TEI Guidelines do oer a module that addresses this complex encoding issue: feature structures. The method proposed in this paper does not advocate for a controlled vocabulary to categorize types of variants. What it oers instead is a pliable encoding method that allows the editor to include multiple layers of information in each apparatus tagset. she is for the SNFS-funded project A world of possibilities. Modal pathways on the extra-long period of time: the diachrony of modality in the Latin language which aims at reconstructing the evolution of modal meanings from the prehistory of the Latin language up to the seventh century CE. In this project, Helena supervises the technical aspects of the annotation workow, including the automation of corpus pre-processing, annotation and publication.

The most common issues addressed through the study of variation include the following:
• The analysis of the results of the collation of witnesses is a mandatory step to establish the relationships of filiation among them and, consequently, the construction of the stemma, when possible, or of local stemmata in highly contaminated traditions (Mink 2000).
• Stemmatic analysis sheds light on stages of transmission that are not preserved directly, but that can nonetheless be identified as sources for the extant witnesses; that is, it helps in the description of an archetype and hyparchetypes (Maas 1958).
• The exploration of internal variants may reveal different stages during which the textual materials that form a work were compiled.
• Alongside codicological analysis, variants may contribute information about the time and geographical location in which a particular witness was created.
• Variants present data of interest for the development of the history of writing and the history of a language.
• Variants may provide evidence for the identification of copyists, historical owners, editors, and other persons who may have participated in the transmission of a work or in the alteration of its contents. In addition, variants may provide sociohistorical information through the study of the motivation underlying those alterations.
• The study of variation is a key feature of genetic editions that allows one to analyze the creative process of a particular writer or editor.

Considering all of the above, it goes without saying that any type of scholarly edition that takes into account a multi-witness tradition needs to refer to variation somehow. How these references will be made depends on the theoretical background and the editorial model. Textual scholarship is very rich in its treatment of the relation between variants and the final result of the edited text, but both "materialistic" approaches to the edition of texts (for example, Cerquiglini 1989) and methods that look for a curated version (for example, Chiesa 2002) require the analysis of variants.

The information a scholar might want to retrieve from the analysis of variants is so dependent on the circumstances of the text and on the scholar's interests that a generic taxonomy able to capture the complexity of variation and reflect all the nuances is impractical. Thus, the goal of this paper is to provide an encoding mechanism that would formalize, for all intents and purposes, textual variants.

Studies of textual variation that are conducted through a taxonomy of variants can combine quantitative and descriptive approaches.

A ne categorization of the types of textual variants enriches the critical annotation of any scholarly edition because of the ways in which this information can be embedded in a digital environment. In addition, the analysis of a work after the ordering and categorization of all variants may open a window onto new research questions. For example, a classication regarding content-that is, addition, omission, and mutation-can be systematized according to subdivisions that oer possible explanations for the variation. Applying this kind of analysis to the oeuvre of an author or authors could shed light on the composition process, as well as serve as a source for dierent types of genetic and stylometric studies.

When the object of study is a textual tradition transmitted by acts of copying, a multilayered analysis is especially appropriate. The analysis of linguistic variation between witnesses draws attention to the core, the original language of the text, and to the patina, the linguistic layers left by the copyists (Trovato 2014). In addition, linguistic variants shed light on the distinctive features of every witness, which makes the study of linguistic aspects of the textual tradition as a whole more practicable. Furthermore, linguistic variants may occur as a result of self-dictation by copyists after reading and memorizing the extracts they intend to copy, and thus may reflect features of their idiolects. The study of those features contributes to a better understanding of the history of a witness, since it provides evidence of the agents that participated in its creation.

Additionally, the linguistic data that can be retrieved by the classification of linguistic variants are of interest to historical linguistics in ways that may transcend the critical analysis of a particular textual work. Any linguistic variant must be analyzed first in a synchronic context: the oldest variants coexist with newer variants in the same linguistic community, but in different contexts or registers (Jakobson 1971, 528). Thus, by identifying linguistic variants in a multi-witness tradition, we are discerning the variants that are competing in a specific chronological framework. Historical corpora used for linguistic studies have sometimes relied on scholarly curated texts or on a transcription of only the oldest witness, and in those cases linguistic variation within the tradition may be neutralized. [1]

Depending on the historical period to which a tradition belongs, even graphic variants, often dismissed as nonsignificant variation because they may not be textually significant, may nonetheless provide a rich source of cultural information. A graphemic analysis derived from a taxonomy allows an efficient computation of frequencies that can be filtered according to different codicological or historical aspects encoded in the corpus, such as the quire or folio number. This analysis may clarify certain elements of the genesis of the witnesses, such as the identification of hands or editorial interventions. Similarly, scribal errors may provide information related to the sources, such as the distinctive features of a hyparchetype (for example, an unusual graphic substitution might mean that the affected letters had similar shapes in the model) or linguistic data.

Feature Structures
From the TEI documentation (see the set of attributes specific to elements representing variant readings: TEI Consortium 2016, Appendix B: Attribute Classes, "att.textCritical") we can assume that the recommended way to describe the motivation behind a variant and its categorization is through the attributes @type and @cause. However, the complexity of variation, with overlapping categories and complex hierarchies of variation types, cannot be recorded with a straightforward use of these two attributes. [2]

The TEI oers a module well suited for the encoding of complex taxonomies: the Feature Structures module (TEI Consortium 2016, 18). A feature structure is a group of attribute:value pairs, where the values may be either atomic or nested feature structures. As described by Witt and Stegmann (2009), feature structures are a generic method to organize data with a metarepresentation format that presents numerous advantages, some of which will be discussed in greater detail below. For a more detailed description of feature structures and their rationale see Pose, Lopez, and Rosemary (2014, 9-10).

One argument in support of the use of feature structures is the official recognition of the model in 2012 as an international standard (ISO 24612) (Romary 2015), which bestows a certain stability on the methodology and confirms, at least to some extent, its importance and influence within a user community.

[…] to the study of variation as a source of information for the genesis of the different witnesses.

The two features that primarily describe the variant presented in example 1 are the taxonomy that will be used (in this case, the linguistic one) and a feature called "description" that contains a definition of the phenomenon.

The category of the linguistic taxonomy to which this variant belongs is the phonetic one, and the phenomenon of progressive nasalization is further defined by the features "process", "position", and "constriction". As we can see, the "process" feature is more complex than the others and it requires further decomposition. The selection of features to define the phenomenon was based on their relevance for the work in which this taxonomy is applied. For example, in medieval poetry it is important to know which phonetic phenomena occur at the end of a word creating a consonantal coda (constriction): the metrical analysis of the medieval Galician-Portuguese tradition depends on the number of syllables, and progressive nasalization can alter that count through a literary device known as synalepha. A synalepha is the merging of two syllables into one whenever a word ends in a vowel and the next word also begins with a vowel. This means that progressive nasalization, with the addition of a nasal consonant, would prevent that merging. If one witness presents a regular metric paradigm and in the other the paradigm is broken because of the presence or the absence of this phenomenon, it is the deviant witness that presents a "spurious" variant.
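The effect of a nasal coda on the syllable count can be sketched in a few lines of Python. This is a deliberately naive illustration, not a metrical analyzer: the vowel inventory is an assumption, syllable counts per word are supplied by hand, and the tokens "mi", "min", and "amigo" are illustrative forms rather than readings taken from the corpus.

```python
# Assumed, simplified vowel inventory (orthographic, not phonological).
VOWELS = set("aeiouáéíóúãõ")


def synalepha_applies(word, next_word):
    """True when a word-final vowel meets a word-initial vowel."""
    return word[-1].lower() in VOWELS and next_word[0].lower() in VOWELS


def metrical_syllables(words, syllables_per_word):
    """Sum hand-supplied syllable counts, subtracting one per synalepha."""
    total = sum(syllables_per_word)
    for current, following in zip(words, words[1:]):
        if synalepha_applies(current, following):
            total -= 1
    return total


# Without the nasal consonant the two vowels merge into one syllable.
without_nasal = metrical_syllables(["mi", "amigo"], [1, 3])
# With the nasal coda the merging is blocked and the count stays intact.
with_nasal = metrical_syllables(["min", "amigo"], [1, 3])
print(without_nasal, with_nasal)
```

The two calls differ only in the presence of the final nasal, which is the situation in which a "constriction" feature becomes relevant for metrical analysis.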

One of the advantages of the internal organization of feature structures is that any layer of information may be referenced during the description of the value of other features. This facilitates the creation of long and complex taxonomies on the foundation of a small set of shared features.
For example, in the case of a linguistic variation taxonomy, it could be convenient to define a feature structure that would represent any phonetic phenomenon that implies the addition of a sound. In this manner, we could refer to that structure when defining paragoge (addition at the end […]

In example 2 the attribute @feats refers to the features related to the sound addition, so that the phenomena "paragoge" and "prothesis" can be characterized simply by adding the features that define them more precisely.

Example 2. "Reusing" features to define phenomena that entail the addition of a sound.

    <fs xml:id="sound-addition">
      <f name="taxonomy">
        <fs type="linguistic">
          <f name="category">
            <fs type="phonetic">
              <f name="process">
                <symbol value="addition"/>
              </f>
            </fs>
          </f>
        </fs>
      </f>
    </fs>

In the sample presented in example 4, there are two features in the outer layer: the feature "description", whose value cannot be an empty string, and the feature "taxonomy", which can contain any of the following nested feature structures: "linguistic", "error", "material", "equipollent", or "graphic". Similar declarations are built according to the same model to describe linguistic feature structures and the individual features that define them. When there is no need to go more deeply into the decomposition of a feature structure, and the possible values conform to a limited list, the <symbol> element is used to define this controlled vocabulary. In the case of boolean-type values, as in the feature "constriction" seen in example 1, the <binary> element is declared instead.
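To make the reuse mechanism more tangible, the following Python sketch resolves @feats pointers and merges the shared features with those declared locally. It rests on several assumptions: the shared specification is modeled here as an <f> element in a feature library, the identifier "f-addition" and the "position" values are hypothetical, the wrapper element is arbitrary, and the TEI namespace is omitted. The organization of the actual taxonomy may differ.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical library: one shared feature, two phenomena that reuse it.
LIBRARY = """
<div>
  <fLib>
    <f xml:id="f-addition" name="process"><symbol value="addition"/></f>
  </fLib>
  <fvLib>
    <fs xml:id="paragoge" feats="#f-addition">
      <f name="position"><symbol value="final"/></f>
    </fs>
    <fs xml:id="prothesis" feats="#f-addition">
      <f name="position"><symbol value="initial"/></f>
    </fs>
  </fvLib>
</div>
"""


def build_index(root):
    """Index every element that carries an xml:id."""
    return {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}


def resolve(fs, index):
    """Return {feature name: symbol value}, shared features first."""
    features = {}
    for pointer in fs.get("feats", "").split():
        shared = index[pointer.lstrip("#")]
        features[shared.get("name")] = shared.find("symbol").get("value")
    for f in fs.findall("f"):  # locally declared features refine the entry
        features[f.get("name")] = f.find("symbol").get("value")
    return features


index = build_index(ET.fromstring(LIBRARY))
paragoge = resolve(index["paragoge"], index)
prothesis = resolve(index["prothesis"], index)
print(paragoge, prothesis)
```

Both phenomena inherit the same "process" value and differ only in the feature that is declared locally, which mirrors the economy that the reuse of features is meant to achieve.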

Encoding Textual Variation in TEI
As is also often the case elsewhere in the TEI Guidelines, there are several ways to encode variants, as well as alternative methods for linking the apparatus information to the text (see the Critical Apparatus documentation for more information: TEI Consortium 2016, 12). Of the available methods, the parallel segmentation method (TEI Consortium 2016, 12.2.3) seems to be a popular encoding technique for multi-witness editions, in terms of both the specific tools that have been created for this method and the number of projects that apply it. [4] The discussion below explores the integration of a variant taxonomy into an edition that follows this method by inserting an <app> element for each variation unit, that is, in every locus in the text where at least two concurrent readings exist (Macé, De Vos, and Geuten 2012, 113).

Taxonomies can be formalized as complex modules of structured information, and in the interest of maintaining legibility for human editors, an efficient way to incorporate analytic information into an edition involves the use of stand-off annotation methods (Bański 2010). Stand-off refers to annotation that is not inserted in line. It usually entails developing the annotation of a primary document in a different file or files from the one that contains the primary textual data. The process of relating the primary document to its annotation involves linking between specific locations of the primary source and the information that describes them, whether through byte offsets, elements, attributes, or other methods (Ide and Romary 2004, 218).
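A minimal sketch of offset-based stand-off annotation, written in Python, may help to visualize the principle. Everything in it is an assumption made for illustration: the primary text is an invented string (not a line from the corpus), the offsets are character offsets rather than byte offsets, and the Annotation record is a hypothetical structure.

```python
from dataclasses import dataclass

# Invented primary text, kept free of any markup.
PRIMARY_TEXT = "sempre a quige mais"


@dataclass(frozen=True)
class Annotation:
    """A stand-off record that points into the primary text."""
    start: int      # offset of the first character (inclusive)
    end: int        # offset after the last character (exclusive)
    category: str   # identifier of an entry in a separate taxonomy


# The annotations live apart from the text they describe.
ANNOTATIONS = [Annotation(start=9, end=14, category="linguistic")]


def annotated_spans(text, annotations):
    """Pair each annotated stretch of text with its category."""
    return [(text[a.start:a.end], a.category) for a in annotations]


spans = annotated_spans(PRIMARY_TEXT, ANNOTATIONS)
print(spans)
```

Because the primary text is never modified, several independent annotation layers can point to the same stretch of characters, at the cost of having to keep the offsets synchronized with the text.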

In simpler traditions, a semantic correspondence through the use of the attribute @ana, as in the examples below, may also be suitable. Each entry of the taxonomy has an ID that is referred to in an @ana attribute in the edition.
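The correspondence between an @ana value and a taxonomy entry can be sketched as a simple lookup. The Python fragment below is not the excerpt from the edition: the taxonomy entries, their descriptions, the sigla A and B, and the identifiers are hypothetical, and the TEI namespace is omitted. Only the forms quige and quis echo readings discussed later in this paper.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical taxonomy entries with free-text descriptions.
TAXONOMY = """
<fvLib>
  <fs xml:id="paragoge">
    <f name="description"><string>Addition of a sound at the end of a word</string></f>
  </fs>
  <fs xml:id="abbreviation">
    <f name="description"><string>Use of an abbreviation</string></f>
  </fs>
</fvLib>
"""

# Hypothetical variation unit in parallel segmentation.
EDITION = """
<l n="1">
  <app>
    <rdg wit="#A" ana="#paragoge">quige</rdg>
    <rdg wit="#B">quis</rdg>
  </app>
</l>
"""


def descriptions_by_id(taxonomy_root):
    """Build a lookup table from taxonomy IDs to their descriptions."""
    table = {}
    for fs in taxonomy_root.iter("fs"):
        fs_id = fs.get(XML_ID)
        desc = fs.find("f[@name='description']/string")
        if fs_id and desc is not None:
            table[fs_id] = (desc.text or "").strip()
    return table


def describe_readings(edition_root, table):
    """List (witness, reading, description) for every classified reading."""
    rows = []
    for rdg in edition_root.iter("rdg"):
        reading = "".join(rdg.itertext()).strip()
        for pointer in rdg.get("ana", "").split():
            rows.append((rdg.get("wit"), reading, table.get(pointer.lstrip("#"), "")))
    return rows


table = descriptions_by_id(ET.fromstring(TAXONOMY))
rows = describe_readings(ET.fromstring(EDITION), table)
print(rows)
```

Since @ana accepts a whitespace-separated list of pointers, the same loop would also cover a reading that belongs to more than one class.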
Example 5. Excerpt from a multi-witness edition.
The text nodes that are direct children of <l> are common text shared by all witnesses, and an apparatus element, <app>, is introduced wherever there are divergences. If there is more than one variant per <app>, the element <seg> encloses the affected characters in order to avoid ambiguities regarding which part of the token refers to which variant. If there were two or more variants inside the same token, then the <seg> element would contain a @corresp attribute whose value would be the ID of the variant. When there are additional elements that provide the required semantics for the identification of the characters related to the variant, the use of <seg> is avoided (see in example 5 how the variants related to the use of abbreviations are encoded with specific markup which prevents any possible ambiguity). This strategy allows an accurate retrieval of any instance of the phenomena defined in the taxonomy.

One of the functions that the variant taxonomy can fulfill in the publication of the edition is the provision of an accurate description for each textual variant. One way to explore the use of the taxonomy is through enhancing the edition by using visual cues to define the type of variant. […] color. If we click on any of the variants, we retrieve their more specific descriptions, that is, the contents of f[@name eq "description"], as shown in figure 2.

In the same way that we access the different hierarchies of the taxonomy to enrich the edition, we can query the textual variants. For instance, we can create a web form following the classification of the different subcategories. This would allow us to explore the frequencies of these variants in the corpus (figure 3). This type of approach makes it possible to study each variation phenomenon by calculating its distribution according to witness and scribe, by period of composition, and, of course, by analyzing all its occurrences in the corpus (figure 4).

Complex queries can be implemented with a small piece of code such as the one presented in example 6. First, I look in the taxonomy for the feature structure IDs in whose definition there are features that would alter the number of syllables: those containing the element <symbol> with the attribute values "addition" or "repetition" (the $additionPhenomena variable) or those with the "reduction" and "omission" values (the $reductionPhenomena variable). These include both linguistic variants and scribal errors. Then I look for all lines of the corpus that contain <rdg> elements with at least one reference to each of these two groupings of variant typologies. Such queries retrieve occurrences like the one presented in example 5, an authentic example from my corpus in which a linguistic phenomenon might have motivated the scribe to correct the contents of the line in order to regularize the metric pattern. In this line ("Yet, I always loved her more") there is a linguistic phenomenon that creates an extra syllable, the palatalization of the past stem of an irregular verb, which requires a paragogic vowel for pronunciation (quige versus quis). The witness that does not contain this phenomenon presents extra textual content, the monosyllabic word muj, which has an emphatic sense and is therefore omissible without changing the denotative meaning of the line. A plausible hypothesis for explaining this variation is that "quige" is an innovation that motivated a conscious omission of "muj" in order to maintain the correct number of syllables per verse. These nonaccidental omissions or additions are effectively retrieved when the description of linguistic phenomena includes a feature that mentions the addition or reduction of phonemes (as seen in the sample presented in example 4), which enables us to construct queries that look for the co-occurrence of those types of variation with variants related to the textual content.
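The two-step logic of such a query (collect the relevant taxonomy IDs, then find the lines that reference both groups) can be paraphrased in Python. This is a sketch under stated assumptions and not a transcription of example 6: the taxonomy entries, identifiers, sigla, and line contents are hypothetical, the TEI namespace is omitted, and the co-occurrence test works per line rather than per variation unit.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

ADDITION_VALUES = {"addition", "repetition"}
REDUCTION_VALUES = {"reduction", "omission"}

# Hypothetical taxonomy: each entry names the process it involves.
TAXONOMY = """
<fvLib>
  <fs xml:id="paragoge"><f name="process"><symbol value="addition"/></f></fs>
  <fs xml:id="word-omission"><f name="process"><symbol value="omission"/></f></fs>
  <fs xml:id="abbreviation"><f name="process"><symbol value="substitution"/></f></fs>
</fvLib>
"""

# Hypothetical corpus of two lines; only the readings matter here.
CORPUS = """
<lg>
  <l n="1">
    <app><rdg wit="#A" ana="#paragoge">quige</rdg><rdg wit="#B">quis</rdg></app>
    <app><rdg wit="#A" ana="#word-omission"/><rdg wit="#B">muj</rdg></app>
  </l>
  <l n="2">
    <app><rdg wit="#A" ana="#abbreviation">q</rdg><rdg wit="#B">que</rdg></app>
  </l>
</lg>
"""


def ids_with_values(taxonomy_root, values):
    """IDs of taxonomy entries containing a <symbol> with one of the values."""
    return {
        fs.get(XML_ID)
        for fs in taxonomy_root.iter("fs")
        if fs.get(XML_ID)
        and any(s.get("value") in values for s in fs.iter("symbol"))
    }


def pointers_of(element):
    """All taxonomy IDs referenced by @ana on the <rdg> descendants."""
    return {
        p.lstrip("#")
        for rdg in element.iter("rdg")
        for p in rdg.get("ana", "").split()
    }


def lines_with_both(corpus_root, group_a, group_b):
    """Numbers of the lines that reference at least one ID of each group."""
    return [
        line.get("n")
        for line in corpus_root.iter("l")
        if pointers_of(line) & group_a and pointers_of(line) & group_b
    ]


def frequencies(corpus_root):
    """Count classified readings per (witness, taxonomy ID) pair."""
    return Counter(
        (rdg.get("wit"), p.lstrip("#"))
        for rdg in corpus_root.iter("rdg")
        for p in rdg.get("ana", "").split()
    )


taxonomy = ET.fromstring(TAXONOMY)
corpus = ET.fromstring(CORPUS)
addition_ids = ids_with_values(taxonomy, ADDITION_VALUES)
reduction_ids = ids_with_values(taxonomy, REDUCTION_VALUES)
hits = lines_with_both(corpus, addition_ids, reduction_ids)
counts = frequencies(corpus)
print(hits, counts)
```

Because the query starts from the features and not from a fixed list of phenomena, any entry later added to the taxonomy with an "addition" or "omission" value would be picked up without changing the code.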

Conclusions
Variation is a complex and multifaceted issue. For that reason, a hierarchical model based on the accumulation and nesting of layers of information and able to represent any concepts that depend on categories, subcategories, and even the overlapping of categories is necessary for representing all of these nuances.

The examples presented in this paper were modeled based on a specific project and its research questions, [5] but the intention was to present a more general method through a particular application. Nevertheless, the tradition used for exemplification is quite homogeneous and the maximum number of witnesses for the same piece of text is three. This means that a semantic correspondence presented through the use of an attribute in the edition whose value points to the taxonomy might not be suitable for more complex traditions. However, alternative stand-off methods should overcome those limitations. This will be one of the focus points in the future development of a more solid editorial model whose defining feature will be its aptness for descriptive and quantitative analyses of textual variants. Following the distinction made by Jannidis and Flanders (2013), future work will entail the transformation of an egoistic modeling, designed for a specific research question, into an altruistic one.

In spite of its limitations, the core of the methodology presented here might be of interest for other projects. The creation of a variant taxonomy encoded using the feature structures model is a flexible method which brings multiple advantages for textual scholarship. On the one hand, a granular definition of variation phenomena whose information can be embedded later into the edition entails a descriptive model that helps the user browse through the witnesses' readings. On the other hand, it enables quantitative analyses with greater precision and efficiency.

[2] For examples of different variant classifications, see Colwell and Tune (1964) and Italia, Vitali, and Di Iorio (2015).

BIBLIOGRAPHY
[3] I use XSLT to process the feature structure declaration in order to create all required Schematron rules that will constrain the feature library accordingly. I am currently working on creating a more