The CLiGS Textbox: Building and Using Collections of Literary Texts in Romance Languages Encoded in TEI XML

The CLiGS textbox is published by the Computational Literary Genre Stylistics (CLiGS) group. The textbox is the group’s publication channel for several collections of literary texts. We describe the rationale for the manner in which the collections of literary texts included in the textbox have been compiled, annotated, and published. Furthermore, we suggest several ways in which the text collections can be used for research in literary studies. We aim to document some of the work of the CLiGS group, to showcase the unique TEI XML-based collections of French, Spanish, Spanish-American, and Portuguese novels and French drama we make available, and to encourage reuse of these text collections by others. We argue that agreement on common formats and procedures for text preparation, encoding, and publication fosters the accessibility, analysis, and reuse potential of literary text collections.

The Corpus of Spanish Short Stories from 1880-1940 contains 20 texts written by 8 Spanish authors (Bazán, Blasco Ibáñez, Clarín, Galdós, Miró, Pereda, Unamuno, and Valle). In total, this collection contains 302 short stories, which represent 811,000 tokens. It was originally a subset of a larger corpus of Spanish prose that was finally split into two corpora (novels and short stories). One possible use is to compare the style of the same author in novels and in short stories.

The Collection de romans français du dix-neuvième siècle has been compiled to enable contrastive analyses regarding major subgenres of the novel across several decades. With this in mind, 36 French novels first published in the 1860s, 1870s, or 1880s and belonging to the subgenres of adventure novel, crime fiction, Bildungsroman, and fantastic novel have been selected, with each genre and decade covered by similar amounts of text. This collection contains a total of 4.3 million words.

In a similar manner, the Collection de nouvelles françaises du dix-neuvième siècle contains 28 texts published in the decades from the 1830s to the 1890s that can be classified as either fantastic or realistic novellas. With its wider chronological span, this collection is suitable not only for contrastive analyses of fantastic and realistic types of novellas, but also for investigations into the diachronic development of the genre. As novellas are much shorter than novels, this collection contains a total of slightly less than 500,000 words.

The Collection de pièces de théâtre français du dix-septième siècle contains 100 dramatic works first performed between 1640 and 1670 and classified as being either comedies, tragedies, or tragicomedies. For better comparability, all plays selected are written in verse. This collection is a subset of the Théâtre classique collection edited by Paul Fièvre (2007-2018) and is suitable for contrastive analyses of dramatic subgenres, particularly for investigations into the particular position of tragicomedy with regard to comedy and tragedy. This collection contains about 1.3 million words.

the person responsible for entering a value or the degree of certainty of a value), which is not as easily achieved in tabular formats. Second, XML provides us with convenient mechanisms to preserve structural information contained in the source files (like divisions into front, body, and back matter, chapters and paragraphs, or acts and scenes) or inline typographical information (like italics or bold type), which might be of interest for text analysis.

The data schema for the textbox follows the TEI Guidelines (for an introduction, see Burnard 2014), thus connecting the resources of the textbox to an established infrastructure and de facto standard as well as to a large community of users.
The schema includes elements and attributes from the TEI modules core, header, textstructure, analysis, drama, namesdates, and linking, and is as restrictive as possible. It was decided to include basic elements for the encoding of literary texts in prose and verse, such as paragraphs and verse lines, but to avoid more specialized block-level elements like lists and tables or inline elements like <foreign> or <emph> in order to keep things simple.

All the elements of the CLiGS encoding scheme conform to TEI version P5. A few attributes have been added to the schema, though, using a project-specific namespace. Some of these are attributes that result from the XML output of linguistic annotation and that could not be mapped to TEI.
In addition, a new attribute is used to express the degree of importance of assignments of metadata categories (see section 3.2 below for details).

When compared to other TEI customizations, the CLiGS schema is closely related to the DTA-Basisformat (DTABf). The DTABf is a subset of the TEI that serves as the basis for the annotation of full texts in the German Text Archive (see Haaf, Geyken, and Wiegand 2014-

Apart from the reference versions of the texts in TEI XML with structural annotation and embedded document-level metadata, we also provide derived versions for various usage scenarios.
We provide some sample application scenarios using these different formats in section 5 below.

In addition to the reference format, each collection is made available in a simple plain text format automatically derived from the TEI version, containing only the text included in the body of the narrative texts and plays (in particular, excluding prefaces and other paratext as well as notes) and with external metadata provided in tabular format. This format is especially suitable for direct use with the stylo package for R (Eder, Kestemont, and Rybicki 2016) and other tools operating on the surface level of texts.
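The derivation of plain text from TEI described above can be sketched as follows. This is a minimal illustration and not the script used for the textbox: it assumes that the running text sits in <p> and <l> elements inside <body>, that notes are encoded as <note>, and that everything outside <body> (front and back matter) is to be ignored. The function and constant names are hypothetical.

```python
import xml.etree.ElementTree as ET

T = "{http://www.tei-c.org/ns/1.0}"   # TEI namespace in ElementTree notation
BLOCKS = {T + "p", T + "l"}           # assumed text-bearing units: paragraphs, verse lines
SKIP = {T + "note"}                   # assumed paratext to drop inside the body


def _collect(elem, parts):
    """Gather the text of elem and its descendants, leaving out skipped elements."""
    if elem.tag in SKIP:
        return
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        _collect(child, parts)
        if child.tail:  # text following a child still belongs to the parent
            parts.append(child.tail)


def body_text(tei_xml):
    """Return the running text of <body>, one paragraph or verse line per line."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{T}text/{T}body")
    lines = []

    def walk(elem):
        if elem.tag in SKIP:
            return
        if elem.tag in BLOCKS:
            parts = []
            _collect(elem, parts)
            line = " ".join("".join(parts).split())  # normalize whitespace
            if line:
                lines.append(line)
            return
        for child in elem:
            walk(child)

    if body is not None:
        walk(body)
    return "\n".join(lines)
```

Because only <p> and <l> are treated as text-bearing, chapter headings, speaker labels, and stage directions are left out in this sketch; whether they should be kept depends on the analysis at hand.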

Moreover, the collections of French, Spanish, Spanish-American, Italian, and Portuguese novels, novellas, and short stories are made available in a version combining basic structural markup (chapter and sentence divisions) with token-level linguistic annotation (including lemma, part-of-speech, morphology, and basic semantic annotation using FreeLing and WordNet). We decided to use the tagger of the NLP package FreeLing (see Padró and Stanilovsky 2012) for the linguistic annotations because its tagset is quite fine-grained and it comprises WordNet-based sense annotation and disambiguation (on WordNet in general, see Miller 1995 and Fellbaum 1998). We created a workflow to integrate the results of the annotation process into the TEI files. The result is a TEI file with the same header as the reference file, but with a different text body. Chapter divisions are preserved to allow for chapter-level analyses. Inside each division, the text is broken up into sentences and words, carrying the results of the linguistic annotation process as attributes. The attributes that could be mapped to TEI were kept in the TEI namespace. The remaining attributes were encoded in the CLiGS namespace (see example 1).
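A chapter-level analysis on the annotated format could start from a sketch like the following. It assumes, for illustration only, that chapters are top-level <div> elements inside <body>, that tokens are <w> elements, and that the lemma is stored in a TEI @lemma attribute; the actual attribute inventory of the CLiGS files is not reproduced here, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

T = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace in ElementTree notation


def lemma_counts_by_chapter(tei_xml, lemma_attr="lemma"):
    """Return one Counter of lemma frequencies per top-level <div> in <body>."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{T}body")
    chapters = body.findall(f"{T}div") if body is not None else []
    return [
        # tokens without the attribute (e.g. punctuation) are skipped
        Counter(w.get(lemma_attr) for w in div.iter(f"{T}w") if w.get(lemma_attr))
        for div in chapters
    ]
```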

The linguistically annotated format is suitable for direct import into the TXM desktop environment, a tool for text analysis that is capable of performing complex queries on such annotations (see Heiden 2010 and the website of the Textométrie project).

Finally, the collection of French plays is available not only in TEI, but also in the "Zwischenformat" developed by the DLINA group. This format represents an abstraction from the full TEI XML in that it maintains the plays' structural division into acts and scenes but replaces the speaker text with statistics regarding the number and length of the speech acts of each speaker in each scene, which notably makes possible the efficient calculation of network characteristics of a play.

Quality Control
Joining many texts from various sources into one collection may lead to a group of texts that is heterogeneous in vocabulary, spelling, and text quality in general. It also entails the risk of carrying over errors from the sources that may remain undetected but might influence the results of text analyses. General kinds of mistakes like structural and orthographic errors may be introduced by an OCR process. But there might also be other, more source-specific kinds of errors. The texts are checked for completeness (so that, for example, they contain all the chapters they should) and for conformance of the TEI encoding to the custom TEI schema. Additionally, as a simple way to check the quality of the texts that go into the textbox on the orthographic and the character level, a dedicated spellcheck routine was implemented in Python using the "pyenchant" package (see Henny and Schöch 2016).

To account for named entities, foreign words, and other special cases that are not covered by the standard spellchecker but should count as legitimate words in the texts, the spellchecking script was combined with several lists of exception words. The remaining errors are counted for each text and for the collection as a whole. Subsequently, they are stored in an error list. Such a list may simply provide information about the reliability of the texts in terms of errors on and below the word level, or it might be the basis for correcting frequently recurring errors. To give an example, there were around 8,000 different errors in the collection of French nineteenth-century novels when the spellcheck was applied for the first time. After taking into account named entities, foreign words, and some dialectal and colloquial words, most of the remaining errors did not occur more than once. The quality of the texts was better than initially thought and fixing the few recurring errors was feasible.
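The logic of such a spellcheck with exception lists can be sketched as follows. This is an illustration under stated assumptions, not the routine documented in Henny and Schöch 2016: the dictionary lookup is passed in as a function so that the sketch runs without any installed dictionary, the tokenization is a simple letter-sequence pattern, and the names `error_list`, `is_known`, and `exceptions` are hypothetical.

```python
import re
from collections import Counter

WORD = re.compile(r"[^\W\d_]+")  # runs of letters, including accented ones


def error_list(text, is_known, exceptions=frozenset()):
    """Count tokens that are neither known to the spellchecker nor listed as exceptions.

    is_known: function taking a word and returning True if the dictionary accepts it.
    exceptions: lowercase words (names, foreign or dialectal words) to accept anyway.
    """
    errors = Counter()
    for token in WORD.findall(text):
        if token.lower() in exceptions or is_known(token):
            continue
        errors[token] += 1
    return errors
```

With pyenchant and a suitable dictionary installed, the `check` method of an `enchant.Dict` object (for example `enchant.Dict("fr_FR").check`, assuming that dictionary tag is available on the system) could be passed as `is_known`. Calling `errors.most_common()` then yields the recurring errors first, which are the candidates for correction.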

Types and Implementation of Metadata
In this section we provide an overview of the kinds of metadata provided for the text collections, explain how the metadata is implemented in the TEI header section, and argue for the usefulness of this approach for further processing of the texts. When the types of metadata are presented in section 3.1, XPath expressions are given to illustrate how they are implemented, while the overall implementation strategy is explained in section 3.2.

Types of Metadata
Following the classification of the NISO (2004), the collections provide two different kinds of metadata: descriptive and administrative. The descriptive metadata document information about four main areas:

1. Authorship: A reference to the author is kept in three different ways: as the full name (//

Since these collections have been established to study the novel's subgenres, the most important level of this hierarchy is the subgenre. Here, it is possible to assign multiple different values to a given novel in order to account for cases where novels can usefully be described as hybrids of several different subgenres. These multiple values are also structured in a way that requires one subgenre value to be designated as the main subgenre, with the other values designating secondary subgenres. This structure offers a more nuanced and realistic representation of the relations between works and genres, but it also allows working with a single subgenre concept per text (see section 3.2 for implementation details). Together with this information, we also provide some additional descriptive information like form (prose or verse: //textClass/keywords/term[@type='text.form']) and the primary publication format (normally, as a monograph: //textClass/keywords/term[@type='text.publication.type']).

4. Content of the text: Finally, regarding the Spanish and Spanish-American novel collections, we collect different metadata about the content and meaning of the text: an optional summary to give an overview of the text (//profileDesc/abstract); the narrative perspective, an aspect that has a great impact on the frequency of the pronouns and verbs used in the text (//textClass/keywords/term[@type='text.narration.narrator']); the gender of the protagonist (//textClass/keywords/term[@type='text.characters.protagonist.gender']); and the kind of place where the novel's action primarily takes place (city or rural; //textClass/keywords/term[@type='text.setting.settlement.type']). These last two metadata items are part of many definitions of subgenres of the novel, so it could be particularly useful for this information to be explicitly available.
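The XPath expressions given in this list can be evaluated with standard XML tooling. The following sketch uses the limited XPath support of Python's standard library and assumes that the elements are in the TEI namespace, so the paths are written with a `tei:` prefix; the function name `keyword_values` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}  # prefix used in the paths below


def keyword_values(tei_xml, term_type):
    """Return the values of all <term> elements with the given @type in <textClass>."""
    root = ET.fromstring(tei_xml)
    path = f".//tei:textClass/tei:keywords/tei:term[@type='{term_type}']"
    return [term.text.strip() for term in root.findall(path, NS) if term.text]
```

A call such as `keyword_values(xml_string, "text.narration.narrator")` returns a list because a type may occur more than once, as is the case for subgenre assignments.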

Beyond descriptive metadata, the <teiHeader> also contains administrative metadata that help manage the collection and document the internal process of the creation of the documents.
1. The name of the editor of the TEI document (identified with an identifier that is used in other places in the document where the editor would like to make his or her responsibility for some information explicit; //titleStmt/principal), the legal status of the text (//publicationStmt/availability), a log of major changes, and the date when the document was created (both in //revisionDesc).

2. Together with this information, an essential metadata item is documented: the text identifier (//publicationStmt/idno[@type='cligs']). These identifiers are built from two letters that summarize the name of the collection (for example "rd" for Romans français du dix-neuvième siècle, or "ne" for Novela española) and a number. The TEI file names consist only of this identifier and the file extension corresponding to the format. Although this makes it harder for a human user to know which text each identifier refers to, we have found it extremely useful to have a simple way of identifying the text and of using its features and metadata. This allows us to write scripts that select, copy, or modify the files and prepare subcorpora made from a selection of texts in a collection and to use specific parts of the texts for particular experiments. Automatically renaming the files using a specific set of metadata is also possible, of course.
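How identifier-based file names support scripted selection can be sketched as follows. The identifier pattern (two lowercase letters followed by digits), the sample identifiers, and all function names are assumptions made for illustration; the metadata are represented as a plain dictionary mapping each identifier to its metadata fields.

```python
import re
from pathlib import Path

ID_PATTERN = re.compile(r"^([a-z]{2})(\d+)$")  # assumed shape: collection letters + number


def parse_id(identifier):
    """Split an identifier into its collection prefix and its number."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a valid identifier: {identifier!r}")
    return match.group(1), int(match.group(2))


def select_ids(metadata, **criteria):
    """Return the sorted identifiers whose metadata match all given field values."""
    return sorted(
        ident
        for ident, fields in metadata.items()
        if all(fields.get(key) == value for key, value in criteria.items())
    )


def paths_for(folder, identifiers, extension):
    """Build file paths from identifiers; the file name is identifier plus extension."""
    return [Path(folder) / f"{ident}.{extension}" for ident in identifiers]
```

A subcorpus is then a call such as `paths_for("tei", select_ids(metadata, subgenre="historical"), "xml")`, where the folder name and the metadata field are again placeholders.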

Implementation: The TEI Header with Keywords
With the TEI Header, the Text Encoding Initiative provides a sophisticated mechanism for recording metadata of textual resources (see Burnard 2014). The texts in the CLiGS collections use many of the TEI header's standard elements and attributes to record the information described in the previous section, especially the administrative metadata and the general descriptive metadata about the author and the work in the title statement and the source description. In the CLiGS project, further metadata are collected as a basis for the main application scenario: the classification of texts according to various factors such as author gender, author nationality, genre, subgenre, narrative perspective, and gender of the protagonist. The same kind of information is used to evaluate the results of text clusters and networks derived from textual similarities. From this perspective, specific metadata (about the author, genre, narrative strategies, and text content) contribute to the text classification. For example, if a text is written by an Argentine author, we can expect it to have different linguistic characteristics than a text written by an author from Spain, while a text with a first-person narrator will show a different usage of personal pronouns than a text narrated in the third person.

The TEI offers dedicated elements for some of these aspects, but not all of them. For the text collections at hand, it was important to keep the classification-related metadata in one place in order to facilitate queries for text analysis which can access the metadata item(s) that are relevant to the research question (e.g., authorship attribution, detection of author nationality, or genre classification). While this approach does not correspond to the common strategy followed, for example, in scholarly digital editions, where the focus is on the representation of the text, we believe that it is appropriate in a digital text collection created primarily for the purpose of text analysis. We decided to use the <textClass> element contained in the profile description to hold the classification-related metadata. Inside <textClass>, the <keywords> element is used, which, according to the TEI Guidelines, "contains a list of keywords or phrases identifying the topic or nature of a text." In this case, the identification of topics is not the primary concern. Instead, the nature of the text is described by a set of controlled keywords. Each keyword is contained in a <term> element and the type of keyword is specified further in the @type attribute. The types of keywords are organized hierarchically, which is reflected in the structure of the attribute value: the different levels are separated by a dot. The main levels are author- vs. text-related keywords, followed by sublevels (for example, text.genre and text.narration), a second layer of sublevels (for example, text.genre.subgenre and text.narration.narrator), and so on. The value of the keyword is given as the content of the <term> element. Example 2 shows the encoding of metadata from the Collection of 19th Century Spanish-American Novels (1880-1916):

Example 2. Detailed descriptive metadata using the <keywords> element. The example describes La novela de la sangre (1903) by Carlos Octavio Bunge.

The main language and form of the text are indicated, as well as information about the narrative perspective. In the example above, there are four terms referring to the genre of the text. In this case, the supergenre is "narrative" and the genre "novel".

The subgenre is not limited to a single value. Instead, two assignments are made: "historical" and "sentimental". Within the attribute @cligs:importance, numbers are given to express the importance of the assignment, a higher number meaning that an assignment is relatively more important. This attribute has three different possible values: "1", "2", or "3". If the text belongs to a single subgenre, the value is "3". If the text belongs to different subgenres, as in this example, the value can only be "2" or "1". In these cases, only one subgenre term may have a value of "2", and all the others need to be "1". If there are only equally ranking subgenre assignments, they all have the value "1". With this system, it is possible to describe texts as a mixture of genres that can be ranked. In the example, the novel is primarily a historical novel, but can also be considered a sentimental novel. The @resp attribute serves to indicate who is responsible for the subgenre assignment and the @cert attribute is used to express how confident the editor is in the information provided.
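The rules for the importance values can be restated as a small consistency check. The sketch below works on plain Python values rather than on the XML attributes themselves, and the function names are hypothetical; it encodes only the constraints stated in this paragraph.

```python
def valid_importance(values):
    """Check a text's list of importance values (integers) against the stated rules."""
    if not values or any(value not in (1, 2, 3) for value in values):
        return False
    if len(values) == 1:
        return values[0] == 3          # a single subgenre must carry the value 3
    # several subgenres: no 3, at most one 2, all others 1
    return 3 not in values and values.count(2) <= 1


def main_subgenre(assignments):
    """Return the main subgenre from (subgenre, importance) pairs, or None if all rank equally."""
    ranked = [name for name, importance in assignments if importance >= 2]
    return ranked[0] if len(ranked) == 1 else None
```

For the novel discussed here, `main_subgenre([("historical", 2), ("sentimental", 1)])` yields "historical", which is how a single subgenre label per text can be obtained when an analysis requires one.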

The types of keywords as well as their values are controlled in a taxonomy stored in a separate TEI file (keywords.xml), linked to from the @scheme attribute. The taxonomy is published together with each collection. The type values of the terms (e.g., "genre.subgenre") correspond to the identifiers of the categories in the taxonomy. The hierarchy of term types indicated in the @type attribute corresponds to the hierarchical organization of categories in the external taxonomy. An excerpt from a keywords file is given in example 3:

Example 3. Some of the information included in the keywords.xml file.

Publication Strategy
The publication strategy for the CLiGS textbox collections relies on two infrastructures which, together, provide us with the flexibility we need and the guarantees for sustainable long-term archiving and access one can rightfully expect.

Authorship Attribution
Among the many possible methods that can be applied to such collections are stylometric analyses for authorship attribution. For this purpose, we use the stylo package for the R statistical environment (Eder, Kestemont, and Rybicki 2016). Here, we used a custom implementation of the Cosine Delta Distance, proposed by Smith and Aldridge (2011) and discussed and tested by Evert et al. (2017). Distances were calculated based on the 5,000 most frequent words as features using the labeled plain text of the Corpus of Spanish Novels from 1880-1940 of the textbox.
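The principle of the Cosine Delta distance can be sketched in a few lines of Python. This is not the custom implementation mentioned above, nor the code of the stylo package: it is a minimal illustration that z-scores the relative frequencies of the most frequent words across the collection and then takes one minus the cosine similarity of two z-score vectors. The choice of the population standard deviation and of corpus-wide raw counts for ranking the most frequent words are assumptions, as are all function names.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def zscore_profiles(texts, n_mfw):
    """texts: dict mapping a text name to its non-empty list of tokens.

    Returns a dict mapping each name to its vector of z-scores over the n_mfw
    most frequent words of the whole collection.
    """
    names = list(texts)
    rel = {
        name: {w: c / len(texts[name]) for w, c in Counter(texts[name]).items()}
        for name in names
    }
    corpus = Counter(tok for tokens in texts.values() for tok in tokens)
    mfw = [word for word, _ in corpus.most_common(n_mfw)]
    profiles = {name: [] for name in names}
    for word in mfw:
        column = [rel[name].get(word, 0.0) for name in names]
        mu, sigma = mean(column), pstdev(column)
        for name, value in zip(names, column):
            # a word with identical frequency everywhere carries no information
            profiles[name].append((value - mu) / sigma if sigma else 0.0)
    return profiles


def cosine_delta(u, v):
    """One minus the cosine similarity of two z-score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0
```

The resulting pairwise distances can be passed to any clustering or nearest-neighbor procedure; texts by the same author are expected to end up closer to each other than to texts by other authors.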

Network Analysis
Another usage scenario supported by some of the text collections in the textbox concerns network analysis. The collection of seventeenth-century French plays is particularly relevant here, as the TEI markup makes the structure of the text (acts and scenes) as well as the interactions between speakers (who speaks how many lines and words in which scene) explicit.

This kind of information can be used in several ways. First of all, a network of interactions can be constructed based on how closely related the different characters in a play are. One way of operationalizing this notion of "relatedness" is to consider how many words a given character speaks to the other characters present in the same scene. A file format derived from the original TEI format, the so-called "Zwischenformat" (see Kampkaspar, Fischer, and Trilcke 2015), which we also make available for the collection of plays, makes this type of analysis particularly easy. Figure 2 shows the weighted network for Jean Racine's tragedy Britannicus (1669). This graph clearly shows, for example, that Britannicus and Néron interact surprisingly little despite being direct opponents. Rather, their conflict, which concerns Junie, also passes via Junie.
Also notable is the fact that Agrippine interacts more intensely with Burrhus, the tutor of Néron, than directly with Néron. This type of analysis becomes more interesting, however, when looking not at individual networks, but at trends and patterns in key network indicators (such as network density or average degree; see, e.g., Newman 2003) across a larger collection of plays.
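The construction of such a weighted network and of the two indicators named above can be sketched as follows. The input is a simplified stand-in for the per-scene speaker statistics (one dictionary per scene, mapping each speaker to the number of words spoken), not the actual Zwischenformat; the edge weight, defined here as the words spoken by both characters in the scenes they share, is one possible operationalization, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_network(scenes):
    """scenes: list of dicts mapping speaker -> words spoken in that scene.

    Returns a dict mapping each unordered speaker pair (as a sorted tuple)
    to the number of words the two speakers utter in their shared scenes.
    """
    weights = defaultdict(int)
    for scene in scenes:
        for a, b in combinations(sorted(scene), 2):
            weights[(a, b)] += scene[a] + scene[b]
    return dict(weights)


def density(n_nodes, n_edges):
    """Share of realized edges among all possible edges of an undirected network."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0


def average_degree(n_nodes, n_edges):
    """Mean number of neighbors per node in an undirected network."""
    return 2 * n_edges / n_nodes if n_nodes else 0.0
```

The number of nodes is the number of distinct speakers across all scenes, and the number of edges is the length of the dictionary returned by `build_network`. Computing both indicators for every play in a collection gives the kind of series in which trends across subgenres or decades can be examined.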

Textometric Analysis
Yet another way to use the text collections is to conduct stylistic analyses with TXM, a tool developed in France starting in 2007 in the framework of an approach called textométrie (see Heiden 2010). TXM is freely available and supports the analysis of large text corpora on lexical and morphological levels, also taking metadata into consideration. Here, the XML format with annotations from FreeLing and WordNet is used instead of the built-in TreeTagger support of TXM (see section 2.2).

With the help of TXM, a corpus can be established by importing single files in one of the possible formats. The metadata may include any type of information about the texts, such as title, author, year and country of publication, or literary subgenre. This information can then be used in the analyses, for example when dividing a corpus into a partition based on certain metadata values (e.g., different subgenres). One type of analysis that TXM supports is called specificities. Similarly to contrastive analyses using a t-test or rank-sum test (see Lijffijt et al. 2016) or measures such as Zeta (see Schöch 2018), this analysis determines forms that are distinctive for one part in comparison to other parts of a text collection, for example which word forms are specific to the subgenre "historical novel" compared to other novelistic subgenres in a text collection partitioned by subgenre.
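The idea behind such a contrastive measure can be illustrated with a hypergeometric model, which asks how likely it is to observe a form at least k times in one part if its occurrences were distributed at random over the whole collection. The text does not describe how TXM computes its scores, so the following sketch is an assumption about the general family of such measures and not a description of TXM; the function name and parameter names are hypothetical.

```python
from math import comb


def overuse_probability(k, f, t, T):
    """P(X >= k) under a hypergeometric model.

    k: occurrences of the form in the part under study
    f: occurrences of the form in the whole collection
    t: number of tokens in the part
    T: number of tokens in the whole collection
    """
    total = comb(T, t)
    upper = min(f, t)
    favorable = sum(comb(f, i) * comb(T - f, t - i) for i in range(k, upper + 1))
    return favorable / total
```

A very small probability indicates that the form is overrepresented in the part; the complementary tail, P(X <= k), can be used in the same way to detect underrepresented forms.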
Figure 4 shows an example of a specificities analysis with TXM where the collection of 24 Spanish-American novels has been partitioned by subgenre. The distinctive features of historical novels were calculated in comparison to the novels of other subgenres, using WordNet semantic classes as features (so-called lexnames, for lexicographer file names). The five most distinctive features are shown in the figure. Nouns denoting natural objects (noun.object), people (noun.people), and body parts (noun.body) have particularly high values for historical novels.
Verbs of grooming, dressing, and bodily care (verb.body) and verbs of being, having, and spatial relations (verb.stative) are underrepresented in historical novels when compared to novels from other subgenres. Interestingly, by far the most distinctive feature (verb.stative) is one that is particularly weak in historical novels.

Postprocessing primarily connects the raw output from MALLET to metadata about each segment (regarding the novel each segment belongs to and hence information such as title, author, year of publication, or subgenre). Based on these data, visualizations can then be generated to show the topics themselves (for instance, as word clouds or treemaps, as illustrated in figure 5) and their distributional patterns in the collection (for instance, using heatmaps for topic distributions over subgenres, bar charts to show top topics for a given novel or author, or line plots to show the evolution over time of one or several topics). For an example of such analyses using a collection of French plays, see Schöch 2017, and for another example using Spanish and Spanish-American novels, see Schöch et al. 2016.

6. Conclusion

In this paper, we hope to have documented how we compiled, annotated, and published the collections of literary texts included in the CLiGS textbox, to have provided a rationale for why we proceeded in this way, and to have shown several ways in which the text collections can be used for research in literary studies. In conclusion, we offer a few thoughts emerging from our activities in collection-building for digital research in Romance studies.

When building and using the textbox collections, agreeing on common formats and procedures of text preparation and encoding has been crucial. For the CLiGS group, it has been helpful to agree as much as possible on a common (and realizable) strategy in order to share and bundle experiences and efforts. Also, existing best practices, such as recommended subsets of the TEI and existing infrastructure components like GitHub and Zenodo, were essential in establishing the setup for the textbox as described in this article. Another advantage of a common strategy is that investigations into a given research question can make use of several of the different collections at a time. The research group has already benefited from this when combining texts from Spain and Spanish America for textual analyses, for example in the study using topic modeling for genre analysis mentioned above (see Schöch et al. 2016). Also, code developed to create, transform, and analyze textual data can be reused across all collections, which has allowed the development of the CLiGS toolbox. This is the tool-oriented counterpart of the textbox: a collection of Python scripts covering various aspects of text curation, collection building, and simple analyses.

The current state of access to literary texts in digital format for researchers in Romance studies appears to be far from ideal. Although many texts are in principle available in an electronic format, there are a number of caveats. Often, no full text is offered, or the quality of the full text is not very good. In many cases, texts are offered as image-based PDF files. In some cases, e-books are only presented in proprietary formats (e.g., Mobipocket or Kindle). Sometimes there are access restrictions, for example based on institutional affiliations or the country of residence, even when the texts are in the public domain. More generally, the landscape is highly fragmented; researchers hoping to build substantial collections need to rely on multiple, heterogeneous sources. Researchers working with literary texts in Romance languages and external to the group have already shown interest in reusing the CLiGS collections. This shows that access to a large number of historical literary texts of a certain language, region, and period which have been prepared so as to be suitable for quantitative text analyses is desirable and cannot yet be taken for granted.

There are at least four major challenges in providing access to research data: standardization, openness, sustainability, and discoverability. An important strategy to help mitigate the adverse effects of the fragmented landscape of available texts is the use of standardized formats for text preparation and encoding, along with maximal openness in terms of technical convenience and of licenses when publishing data and metadata. Using well-supported research data repositories to archive research data should ensure the long-term availability of the data. We have pointed out possible solutions with our text collections, and we hope to encourage others who have prepared electronic versions of literary texts for research (or plan to do so) to share their collections in a similar manner, to make the material available to a wide audience at an interdisciplinary and international scale. It is an even greater challenge, however, to improve the discoverability of text collections in a field such as Romance studies rooted in several continents and numerous countries. Currently, there does not appear to be a way to ensure that smaller text collections like the ones presented here become and remain findable and visible inside and beyond the community