The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources

In this article we describe the DTA “Base Format” (DTABf), a strict subset of the TEI P5 tag set. The purpose of the DTABf is to provide a balance between expressiveness and precision as well as an interoperable annotation scheme for a large variety of text types of historical corpora of printed text from multiple sources. The DTABf has been developed on the basis of a large amount of historical text data in the core corpus of the project Deutsches Textarchiv (DTA) and text collections from 15 cooperating projects with a current total of 210 million tokens. The DTABf is a “living” TEI format which is continuously adjusted when new text candidates for the DTA containing new structural phenomena are encountered. We also focus on other aspects of the DTABf including consistency, interoperability with other TEI dialects, HTML and other presentations of the TEI texts, and conversion into other formats, as well as linguistic analysis. We include some examples of best practices to illustrate how external corpora can be losslessly converted into the DTABf, thus enabling third parties to use the DTABf in their specific projects. The DTABf is comprehensively documented, and several software tools are available for working with it, making it a widely used format for the encoding of historical printed German text.

projects creating corpora must ensure that their resulting annotation scheme is valid against a TEI P5 schema, and also enables easy reuse of both their metadata and text data in other project contexts. In other words, corpus projects are required to provide interoperable TEI data, so that the resulting corpora compiled from different sources can be made exploitable by common methods and tools. Interoperability issues affect different aspects, from metadata exchange through the extraction and analysis of document components up to (at least for historical texts) the creation of a uniform stylesheet in order to present all corpus texts in a similar way.

A main prerequisite for interoperability between corpora is the homogeneity of text structure.
Thus, because of the intentionally high flexibility of the TEI tag set, it is no longer sufficient to base the annotation on the TEI Guidelines in their entirety. Rather, the TEI tag set has to be narrowed down to subsets that are both extensive, considering the structural phenomena they must document, and unambiguous about how similar phenomena may be encoded. Given these requirements, the desirability of creating some common agreed-upon TEI formats for certain editing purposes, and of sharing these formats with the community in order to achieve homogeneously tagged TEI texts across project borders, is evident, and such formats have already been attempted several times (see, e.g., Pytlik Zillig 2009; Unsworth 2011).
Interoperability problems with TEI-encoded documents can occur on several levels: the exchange of metadata, the consistent extraction of document components, and the creation of a uniform stylesheet. It is well known that the flexibility of the TEI may lead to significant structural variation in TEI-conformant headers. This forces computational methods to deal with an enormous number of different cases for semantically consistent information extraction. Within the transcription itself, similar problems may occur with the extraction of structural phenomena in a text collection. For example, letters or quotations can be annotated differently across different document collections. Therefore it can be very difficult or even impossible to formulate a query that retrieves "all letters" across all document collections without knowing all the solutions adopted by all the individual collections. For complex queries the problem becomes even harder. Obviously, a standardized encoding across collections would be very helpful in solving this problem.
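The "all letters" problem can be illustrated with a minimal sketch. The two toy collections below are hypothetical: one marks a letter as `<div type="letter">`, the other as `<floatingText type="letter">`. Neither snippet is taken from a real corpus, and the TEI namespace is omitted for brevity. A query written for the first convention silently misses the second collection.

```python
import xml.etree.ElementTree as ET

# Two toy collections that encode the same phenomenon (a letter) differently.
# Both encodings are hypothetical illustrations, not taken from real corpora.
COLLECTION_A = """<text><body>
  <div type="letter"><p>Dear friend ...</p></div>
  <div type="chapter"><p>Prose text.</p></div>
</body></text>"""

COLLECTION_B = """<text><body>
  <floatingText type="letter"><body><p>Dear Sir ...</p></body></floatingText>
</body></text>"""


def count_matches(xml_source: str, xpath: str) -> int:
    """Count the elements that match an ElementTree XPath expression."""
    root = ET.fromstring(xml_source)
    return len(root.findall(xpath))


# A query written with only collection A's convention in mind.
QUERY_A = ".//div[@type='letter']"

hits_a = count_matches(COLLECTION_A, QUERY_A)  # finds the letter
hits_b = count_matches(COLLECTION_B, QUERY_A)  # misses the letter
print(hits_a, hits_b)
```

Retrieving the letter from the second collection requires a second query (`.//floatingText[@type='letter']`), that is, knowledge of that collection's encoding decisions. A shared base format removes this need.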
The remainder of this paper starts with a short presentation of the project background, the Deutsches Textarchiv (DTA), which will show that a common base format was required to integrate text collections from 15 external corpus projects (section 2). In section 3, we describe the DTABf in more detail. Section 4 focuses on the dissemination of the DTABf; we explain how comprehensive documentation, continuous training courses, and the development of customized software tools interact to create a user community for the DTABf. In section 5, we present examples of good practice that illustrate how different external corpora can be converted into the DTABf, making such corpora interoperable in a wider context, for example as part of the text corpora provided by the large European infrastructure project CLARIN. Section 6 discusses how new structural phenomena encountered in new texts are handled within the DTABf by adding new properties to the DTABf. We conclude with a short summary and some ideas about future prospects for the DTABf.

Project Background
The goal of the project Deutsches Textarchiv (DTA) is to create a corpus of the historical New High German language (1600-1900) which is balanced with regard to date of creation, text type, and thematic scope, and which can therefore serve as the basis of a reference corpus for the New High German language. The DTA corpora contain printed historical works from different genres and text types (fictional texts in prose, poems, dramas, scientific texts of numerous different disciplines, and functional literature such as cookbooks, handbooks, sermons, or travel books).
The DTA acquires texts in two complementary ways: via digitization of new texts for the DTA core corpus of around 1,500 historical works (approximately 150 million tokens) and via the curation of existing historical text collections, digitized in other project contexts, which are integrated into the DTA infrastructure (currently 120 million tokens from 15 projects). In order to cope with this heterogeneous text collection, the DTABf was developed as an annotation scheme that allows for collective processing by software tools including metadata harvesting, retrieval of complex document structures, and presentation of the text data in various formats, including HTML, Text, and ePUB. The DTABf is completely based on the TEI P5 tag set, reducing and further constraining the stock tag set, but not extending it in any way. The goal of the DTABf is to provide solutions for the tagging of all structural phenomena occurring in historical printed texts down to a certain annotation depth while remaining consistent and unambiguous to ensure consistent markup for the DTA corpora as a whole. Thus the DTABf is the backbone of the DTA and guarantees that all DTABf texts are interoperable within the DTA context as well as for reuse in other projects. The DTABf is a "living" TEI format. It is carefully adjusted when new texts containing new structural phenomena are integrated into the DTA corpora.
3. Description of the DTA "Base Format" (DTABf)

3.1 History and Scope
The DTABf emerged from the TEI format used for the annotation of the DWDS corpus, a balanced corpus for twentieth-century German (Geyken 2007). With the beginning of the DTA project, the DWDS format was adapted to the requirements of the encoding of historical texts. The DTABf was applied continuously to all texts digitized during the first phase of the DTA project (2007-2010, approximately 700 texts dating from 1780 to 1900). In this period it was successively adapted to new phenomena which occurred in the respective texts and which had not previously been covered by the DTABf. With the beginning of DTA phase 2 (2010-2014, approximately 600 texts dating from 1600 to 1780), the DTABf was extensively revised on the basis of the annotated historical data resulting from phase 1: the treatment of structural phenomena was reconsidered and consistent solutions were determined. In the course of these efforts the formal description of the DTABf as given in a corresponding ODD was further substantiated and the DTA annotation guidelines were compiled.
the curation project of WG 1 in CLARIN-D. Continuous adjustments of the DTABf remain necessary in order to account for new phenomena, but we take great care to preserve consistency and avoid ambiguity. In addition, since the DTABf is now based upon a large amount of text, we are mostly able to avoid changes which disturb backward compatibility.

As a result, the DTABf forms a TEI customization which is based on a large corpus of historical texts and thus offers solutions for most structural phenomena encountered within historical printed texts.
The DTABf tag set not only offers tagging solutions for text structuring but also provides a specification for the description of metadata in the TEI header. As of May 2014, the DTABf consists of 50 TEI header elements and 75 text elements accompanied by limited element- or class-specific attributes and, where applicable, attribute values. See Appendix 1 and Appendix 2 for more information on the distribution of DTABf elements within the DTA corpus.
The DTABf TEI header is designed to cover extensive metadata information. First, DTABf-conformant metadata records contain bibliographic information about (1) the digital document as published by the DTA, (2) all instances which preceded the current digital edition together with (3) the persons or organizations responsible for those instances, and (4) all licenses relevant for the digital object at hand. Second, descriptions of the physical text source upon which the current edition is based are required, including information about its constitution as well as its physical location (institution, repository, and shelfmark). Finally, general information is given about the content and design of the document (e.g., language, typeface, document type) and the DTA subcorpus it belongs to.

Formal and Semantic Structuring of Text
The tag set for text encoding contains tagging solutions for formal as well as semantic text structures. The former include page breaks, lists, tables, and figures, as well as physical layout information like forme work and different types of highlighting. The latter include chapters or text sections with titles, paragraphs, notes, opening or closing text parts, special text types such as poems, letters, and indices, and inline phenomena such as proper nouns or citations.
Furthermore, documented editorial interventions are possible (e.g., the correction of printing errors, the expansion of abbreviations, normalizations, and editorial comments).

Linguistic Tagging
Linguistic information (tokenization, lemmatization of historical forms, and part-of-speech (POS) tagging) is acquired automatically by various tools and applied to the DTA texts via the standoff method. We decided not to adopt an inline encoding for linguistic annotations for two reasons. First, the integration of token-based linguistic analyses in the texts leads to an enormous increase in the number of tags, which hinders manual editing of the transcriptions. The second reason is that postprocessing TEI texts (including postprocessing of linguistic annotations) often requires a conversion of the TEI text into a version of the text in which the reading order has been reestablished (serialization). We provide such a solution for texts encoded in the DTABf schema (DTA-Tokwrap) and we prefer to provide users with linguistic annotations for our texts by converting the TEI texts into the Text Corpus Format (TCF), the standoff format used within the CLARIN project.
The DTABf now consists of five components:
In order to ensure homogeneous text annotation over the entire DTA corpus, the tag set has to remain unambiguous; that is, for each phenomenon there should be only one possible method of encoding.
Thus, we made use of the possibilities of the ODD source format to restrict annotations down to the attribute value level. First, from all possible TEI modules only a subset needed for our purposes was chosen. Second, from each of the included TEI modules, only a subset of available elements needed to encode the DTA corpus texts was selected. Likewise, attribute classes or single attributes within certain classes were eliminated from the schema if they turned out to be unnecessary for our purposes. And finally, if applicable, each attribute (at the class or element level) was provided with a fixed selection of permitted values. In cases where value lists could not be restricted (e.g., @n on <lg> containing the number of a stanza), we set fixed data types for the respective attribute values wherever possible. There are only a few cases remaining where the restriction of fixed attribute values would not be reasonable (e.g., @quantity on <gap> specifies the amount of text left out in the transcription for whatever reason and thus can be filled with any numeral).
With the restrictions provided by the DTABf schema the flexibility of the TEI P5 tag set is reduced in favor of unambiguous though still fully TEI P5 compliant solutions.
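The effect of restricting attributes down to the value level can be sketched in a few lines. The rule tables below are illustrative assumptions built from the examples named in this article (a closed value list for @place on <note>, numeric data types for @n on <lg> and @quantity on <gap>). They are not the DTABf ODD, which is far more extensive and is expressed as a schema rather than as Python. The TEI namespace is omitted for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative rule tables (assumptions for this sketch, not the DTABf schema).
CLOSED_VALUE_LISTS = {("note", "place"): {"foot", "end", "left", "right"}}
NUMERIC_ATTRIBUTES = {("lg", "n"), ("gap", "quantity")}


def check_attributes(xml_source):
    """Return (element, attribute, value) triples that break a rule."""
    problems = []
    for element in ET.fromstring(xml_source).iter():
        for name, value in element.attrib.items():
            key = (element.tag, name)
            allowed = CLOSED_VALUE_LISTS.get(key)
            if allowed is not None and value not in allowed:
                # The attribute has a closed value list and the value is not on it.
                problems.append((element.tag, name, value))
            elif key in NUMERIC_ATTRIBUTES and not re.fullmatch(r"[0-9]+", value):
                # The attribute has a numeric data type and the value is not a numeral.
                problems.append((element.tag, name, value))
    return problems


SAMPLE = (
    "<text>"
    '<lg n="1"/><lg n="first"/>'
    '<note place="foot"/><note place="bottom"/>'
    '<gap quantity="3"/>'
    "</text>"
)
print(check_attributes(SAMPLE))
```

In this sketch `n="first"` and `place="bottom"` are reported, while the remaining attributes pass. In practice such checks are performed by validating against the schema generated from the ODD.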

As stated above, the DTABf provides not only a vocabulary for text annotation but also a specification for the TEI header in order to allow for consistent metadata recording. Although those two vocabularies (the <text> tag set and the <teiHeader> tag set) are mutually exclusive within the DTABf to a large extent, the underlying TEI P5 schema allows for quite a number of the elements to be valid in both the <text> and the <teiHeader> areas.

Example 1: Tagging of Notes and Remarks
The <note> element may have different attribute-value pairs depending on where it is used: Within the <text> area notes may be marginal notes (<note place="right|left">), footnotes (<note place="foot">), endnotes (<note place="end">), or editorial remarks of the person working on the digital edition of the text (<note type="editorial">).
Within the <teiHeader> of a document, however, other kinds of notes are relevant, e.g., remarks about responsibilities for certain instances of the digital document (<note type="remarkResponsibility">), about the digital document as a whole (<note type="remarkDocument">), or about the constitution of its physical source (<note type="remarkSource">).
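The context dependence described in Example 1 can be sketched as a small check that looks at the area in which a <note> occurs. The value sets below are taken from the examples in this section. Since the header values are introduced with "e.g.", the header set is presumably incomplete, and the function names are hypothetical. The TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Values named in Example 1; the header list is illustrative, not exhaustive.
TEXT_PLACES = {"left", "right", "foot", "end"}
TEXT_TYPES = {"editorial"}
HEADER_TYPES = {"remarkResponsibility", "remarkDocument", "remarkSource"}


def note_is_valid(note, area):
    """Decide whether a <note> fits the area ('teiHeader' or 'text') it occurs in."""
    if area == "teiHeader":
        return note.get("type") in HEADER_TYPES
    return note.get("place") in TEXT_PLACES or note.get("type") in TEXT_TYPES


def invalid_notes(xml_source):
    """Return (area, attributes) pairs for notes that do not fit their area."""
    root = ET.fromstring(xml_source)
    bad = []
    for area in ("teiHeader", "text"):
        top = root.find(area)  # direct child of the document root
        if top is None:
            continue
        for note in top.iter("note"):
            if not note_is_valid(note, area):
                bad.append((area, dict(note.attrib)))
    return bad


SAMPLE = """<TEI>
  <teiHeader><fileDesc>
    <note type="remarkDocument">A remark on the document.</note>
    <note place="foot">Misplaced footnote.</note>
  </fileDesc></teiHeader>
  <text><body><p>Text<note place="foot">A footnote.</note>
    <note type="remarkSource">Misplaced header remark.</note></p></body></text>
</TEI>"""
print(invalid_notes(SAMPLE))
```

The sample contains one misplaced note in each area. A plain TEI P5 schema would accept both, which is the gap that the DTABf restrictions are meant to close.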

Example 2: DTABf <teiHeader> elements within <text>
There is quite a significant number of DTABf <teiHeader> elements which the DTABf would not allow in the <text> area but which are allowed within <text> according to the TEI Guidelines. Examples are <biblFull>, <msDesc> and their descendants, <respStmt> and <resp> or the children of <persName> (e.g., <addName>, <nameLink>, or <genName>).
With the ODD vocabulary in itself we cannot change these constraints for elements while remaining fully TEI-compliant. Therefore, to solve this problem, until recently we provided two separate schemas: one representing the DTABf in its entirety, the other excluding the DTABf metadata tag set.

Example 4: Schematron rules for @facs values
For example, the DTABf constraints for the @facs attribute of the element <pb> cannot be expressed within ODD using regular datatypes, but can be described by Schematron rules: The value of @facs is a string starting with "#f", followed by a four-digit number; the @facs value of the first <pb> element in a document should be "#f0001"; the following @facs values should increase successively by 1.
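The three @facs constraints can be restated procedurally. The following is a minimal sketch in Python, not the project's Schematron rule set. The function name and error messages are hypothetical, and the TEI namespace is omitted for brevity.

```python
import re
import xml.etree.ElementTree as ET

# "#f" followed by exactly four digits, as described above.
FACS_PATTERN = re.compile(r"#f([0-9]{4})")


def check_facs(xml_source):
    """Check the @facs values of all <pb> elements in document order."""
    errors = []
    previous = 0  # so that the first expected value is #f0001
    for pb in ET.fromstring(xml_source).iter("pb"):
        facs = pb.get("facs", "")
        match = FACS_PATTERN.fullmatch(facs)
        if match is None:
            errors.append(f"malformed @facs: {facs!r}")
            previous += 1  # assume the page break still counts as one page
            continue
        number = int(match.group(1))
        if number != previous + 1:
            errors.append(f"expected #f{previous + 1:04d}, found {facs}")
        previous = number
    return errors


GOOD = '<text><pb facs="#f0001"/><p/><pb facs="#f0002"/><pb facs="#f0003"/></text>'
BAD = '<text><pb facs="#f0002"/><pb facs="#f0003"/><pb facs="f0004"/><pb facs="#f0006"/></text>'
print(check_facs(GOOD))
print(check_facs(BAD))
```

The design choice here is to compare each value with its predecessor rather than with an absolute counter, so that a single wrong value is reported once instead of invalidating every following page break.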

Annotation Levels
With the growth of the DTABf it gets increasingly difficult and time-consuming to apply the whole range of possible DTABf annotations to each DTA corpus text. Therefore, there must be a way to communicate the degree of conformity of a given TEI text with the other texts in the DTA corpus. For such scenarios, the TEI Guidelines propose dividing the set of elements used within a certain TEI format into different levels according to the necessity of their usage. We introduced several levels of annotation, each containing a set of elements which have to be used consistently where applicable in order to achieve conformity with the respective level. The first three annotation levels are based on one another.
Level 1 (required) covers the minimal amount of text structuring which is required to be applied to a text in order to achieve DTABf conformity. Elements used at this level include <div>, <head>, <p>, <lg>, <figure>, <pb>, and <cb>.
The high granularity of the DTABf tag set and its coherent application to the DTA text corpora enable us to create stylesheets for the transformation of the TEI/XML documents into many other formats, including HTML, plain text with basic structural information, or ePUB. These stylesheets in turn are able to deal with the whole range of tagging scenarios the DTABf allows. In fact, the possibility of uniform presentations of all DTABf texts was one specific goal for the design of the DTABf, and is continually considered when making adjustments to it.
In most cases semantic annotations can be represented on the presentation level as long as they are unambiguous, consistent, and well-documented. For instance, division titles, list items, speakers, and stage directions in a drama can without any difficulty be presented in such a way that users of the respective reading versions are able to recognize these textual and structural phenomena at a glance (e.g., in our case titles are presented as bold text of larger size; stage directions are printed in italics; list items are outdented; tables are rendered as tables with dividing rules between rows and cells). Also, the specifications of the DTABf down to the attribute value level support the presentation of contents in underspecified elements. For example, the values "foot", "end", "left", and "right" within the @place attribute of the element <note> define the position of a note in the source text. In addition, since the presentation of DTA texts is meant to come as close to the original layout as possible, the DTABf also includes tagging solutions for pure layout information (such as centered text, italics, changes of fonts, and boldface), which additionally support the presentation.
However, sometimes the semantics of structures interfere with the presentation, which should ideally approximate the layout of the source text. An example of this semantic interference is found in the tagging of title pages. The TEI definition of the <titlePage> element is quite limited in that it restricts the usage of this element to complete pages. There are cases, though, that are not met by this definition where the <titlePage> element would still be reasonable. For example, usually the heading on the first page of a newspaper edition contains bibliographic information about the edition, but does not span an entire page. We therefore considered refraining from using the <titlePage> element in newspapers and instead using <docTitle> and analogous elements for the tagging of title information in newspapers. This solution would be fully TEI-conformant. However, it would lead to different encodings of semantically similar structures (in both cases we want to encode title information of a document as given in the source text; the layout is only of secondary interest here). Furthermore, the presentation of information on title pages within the DTA corpora is based on the possibility to define the <titlePage> element as block element and hence to homogeneously render all text occurring within <titlePage> as centered, etc. So, if we left out <titlePage> in newspapers, we would lose the ability to present title information in this text type (newspapers) similarly to title information in all other text types of the DTA corpus. We therefore decided to use <titlePage> for newspapers as well, but introduced the attribute-value pair @type="heading" to the DTABf to differentiate the area of title information in newspapers from title pages in the narrower sense.
The fixed vocabulary of the DTABf permits several coherent presentations of DTABf texts to users who can set the parameters of their preferred text presentation themselves.
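How a fixed vocabulary supports uniform presentation can be sketched with a toy renderer. This is an illustration in Python, not the DTA stylesheets. The CSS class names are hypothetical, every element is rendered as a `<div>` for simplicity, and the TEI namespace is omitted. The point is that @place on <note> maps directly to a presentation class, and that <titlePage> is rendered as the same block whether or not it carries @type="heading".

```python
import xml.etree.ElementTree as ET
from html import escape


def css_class(element):
    """Derive a (hypothetical) CSS class from the element and its attributes."""
    if element.tag == "note":
        # The closed value list for @place makes this mapping predictable.
        return "note-" + element.get("place", "inline")
    if element.tag == "titlePage":
        # Full title pages and newspaper headings share one block rendering.
        return "titlepage"
    return element.tag


def render(element):
    """Render an element tree as nested HTML <div> elements."""
    parts = [f'<div class="{escape(css_class(element))}">', escape(element.text or "")]
    for child in element:
        parts.append(render(child))
        parts.append(escape(child.tail or ""))
    parts.append("</div>")
    return "".join(parts)


book = ET.fromstring("<titlePage><docTitle>Faust</docTitle></titlePage>")
paper = ET.fromstring('<titlePage type="heading"><docTitle>Zeitung</docTitle></titlePage>')
print(render(book))
print(render(paper))
```

Had newspaper headings been encoded with a bare <docTitle> instead, the renderer would need a separate rule to center them, which is the loss of uniformity described above.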

We did not make use of the TEI standard stylesheets for two reasons. The first reason is pragmatic. The TEI stylesheet library is a very large project on its own, consisting of complex and deeply structured XSL files which try to cover most of the TEI tag set (even though some elements like <cb> are completely ignored, at least by the TEI to HTML conversion stylesheets). In addition, the standard stylesheets are too generic for an adequate visual representation of historical texts in the DTA context where we attempt to achieve a presentation as close as possible to the original print sources. We distinguish some elements with regard to their presentation depending on the context of their appearance. As an example, the <p> element describes a paragraph in prose text, but within <sp> (speech in a performance text) the same element denotes a speaker's utterance (which may not be represented as a common paragraph in the original printing). Our own stylesheet library is complemented by an extensive test suite. The stylesheets are available at https://github.com/haoess/dta-tools.

Conversion into other Common Formats
The DTABf TEI header can be converted automatically into other common metadata formats.
Currently, the DTA provides a Dublin Core and a Component Metadata Infrastructure (CMDI) version of all metadata records which can be harvested via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). In this way, DTA metadata can be reused by other platforms and Online Public Access Catalogues (OPACs). In addition, the CMDI metadata records for the DTA corpus texts can be interpreted by the Virtual Language Observatory (VLO) of CLARIN and thus become visible together with other language resources within the CLARIN infrastructure.
The DTABf documentation is thematically subdivided into different sections (formal/semantic annotation, metadata annotation, special encoding of certain text types such as journals and newspapers). Not only does it explain formal customizations of the TEI tag set as realized within the ODD and schema as well as the Schematron rule set, but it also specifies transcription guidelines as well as rules which could not be formalized without changing the content models of the TEI tag set and thus going beyond the DTABf schema.
The prose documentation is supplemented by tables containing all DTABf elements, attributes, and attribute values for text encoding on the one hand and for metadata annotation on the other.

The DTABf documentation provides a description of the work completed so far, but more importantly it serves as a guideline for users working with the DTABf for the TEI-compliant structuring of historical texts.
The existence of comprehensive documentation is a necessary prerequisite for the usability of the DTABf and thus for its acceptance by a larger user community. In addition, the DTA offers workshops and tutorials where users learn to apply the DTABf in a consistent way.
Furthermore, work on text editions according to the DTABf is supported by the DTA oXygen
All DTABf-compliant texts added to the DTA corpus are integrated into the quality assurance platform DTAQ where their transcriptions and annotations may be proofread.

Digitization by Way of Transcription: DTABf-born Texts
Homogeneity within a corpus does not only concern its tagging but also crucially depends

Interchange of TEI Documents
The history of the DTABf coincides with the history of documents which are exchanged between collaborating projects and the DTA.

An example of such a collaboration may be found in the project Johann Friedrich Blumenbach-online. This project is working on a digital edition of Blumenbach's printed works in German and Latin as well as his handwritten texts. All texts are prepared in a TEI P5 format.
The DTA integrates the digitized full texts of German monographs and selected journal articles of Blumenbach into its platforms. This process is not unidirectional but involves several processing steps in the course of which the digital documents are exchanged between and enriched by the two projects.
The TEI P5 formats used by both projects (DTABf and Blumenbach's TEI P5 format) agree semantically, so documents can be converted automatically between the two formats in the course of the exchanges described above.

This workow is a good example of interoperability and interchange of TEI documents, while it also shows that manual eorts remain necessary to create TEI formats compatible with one another. Often the DTABf already oers solutions which may also be applied to newly encountered phenomena.

For example, the DTABf provides solutions for annotating quotations and corresponding bibliographic references as well as for concatenating discontinuous text passages. The example below shows discontinuous citations where the quotation is presented inline, whereas the bibliographic citation occurs beforehand within a marginal note. Though it was new to us, we were able to handle this scenario without any extensions to the existing DTABf.

Example 5: Newspapers
The DTA core corpus consists of texts from various disciplines, text types, and genres to allow for insights about the New High German language as it was used in different contexts and discourses at different points of time in its history.
However, an important text type, which because of its enormous extent has not been included in the DTA corpora, is newspapers. We are currently extending the DTA corpus by adding historical newspapers in the course of different project partnerships (see Haaf and Schulz 2014). Newspaper texts from external projects are converted into the DTABf. We adjusted the DTABf to allow for the tagging of structures which are significant for or even limited to the text type "newspaper". Most importantly, we added division types for text passages which are typical for newspapers like political news (@type="jPoliticalNews"), weather reports (@type="jWeatherReports"), financial news (@type="jFinancialNews"), the feuilleton (@type="jFeuilleton"), and articles within those named categories (@type="jArticle"). Specifics of the DTABf for newspapers are covered by separate documentation.
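Because the newspaper division types are fixed values, a single query can survey the structure of any converted newspaper issue. The sketch below counts divisions by type. The set contains only the values named above, the sample issue is invented, the function name is hypothetical, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Division types named in this section (the DTABf may define further ones).
NEWSPAPER_DIV_TYPES = {
    "jPoliticalNews", "jWeatherReports", "jFinancialNews", "jFeuilleton", "jArticle",
}


def newspaper_divisions(xml_source):
    """Count <div> elements per newspaper-specific @type value."""
    root = ET.fromstring(xml_source)
    return Counter(
        div.get("type")
        for div in root.iter("div")
        if div.get("type") in NEWSPAPER_DIV_TYPES
    )


# An invented issue: two political articles and one weather report.
ISSUE = """<text><body>
  <div type="jPoliticalNews">
    <div type="jArticle"><p>First report.</p></div>
    <div type="jArticle"><p>Second report.</p></div>
  </div>
  <div type="jWeatherReports"><p>Rain.</p></div>
</body></text>"""

counts = newspaper_divisions(ISSUE)
print(counts["jArticle"], counts["jPoliticalNews"], counts["jWeatherReports"])
```

The same function works unchanged on newspaper texts from any partner project once they have been converted into the DTABf.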

Changes to the DTABf are carried out only if they are consistent with the existing tag set and do not introduce ambiguities to the format. The changes mainly concern attributes or values and only rarely TEI elements or modules.

Conclusion and Further Prospects
In this paper we described the DTABf as a "living" TEI format for the annotation of historical written texts for the creation of large reference corpora. The DTA corpus base is still growing either through digitization carried out by the DTA team or through the addition of text collections originating from external cooperating projects. Therefore the DTABf is not static but is constantly checked and adjusted to new structural phenomena.

Future work will involve the adaptation of the DTABf to manuscripts. Currently, the DTA corpora almost exclusively contain printed works; only a couple of manuscripts have been integrated so far for evaluation purposes. However, some important text types usually exist in handwritten form rather than as printed documents (e.g., letters, diaries, and financial records). In order to improve the balance of the DTA corpus, it would therefore be interesting to integrate manuscripts from some of these widespread text types into our collection. Our tests with manuscripts showed that most of the structural phenomena which occur in manuscripts are similar to those in printed texts and hence can already be treated within the DTABf. However, there are some additional characteristics of handwritten texts which might be usefully encoded (e.g., ad hoc additions, deletions, or insertions of the writer, or the change of hands within one document). These additional phenomena will necessitate adaptations to the DTABf. Furthermore, they are likely to affect the presentation of the TEI transcriptions: we might not be able to imitate the manuscript facsimile on the presentation level to the same extent as we do for printed text sources. In addition, some aspects of the metadata needed for manuscripts differ from the metadata needed for print documents. The adaptations to the DTABf and its corresponding tools and services necessary for the integration of manuscripts into the DTA will be performed in a document-based way through the integration of further manuscripts into the DTA corpus.

Other adaptations of the DTABf will be necessary in order to integrate texts obtained via Optical Character Recognition (OCR). OCR software only recognizes basic (text) zones which eventually have to be mapped to semantically meaningful structures. Semi-automatic subsequent structuring of OCR texts is possible, but becomes increasingly complex and error-prone with greater sophistication, detail, and granularity in the target markup. Therefore, based on experiences with automatic post-structuring of OCR texts in the course of the DFG-funded project Die Grenzboten, we are planning to create an additional basic structuring level for OCR texts within the DTABf, onto which semi-automatic text structuring can be implemented, and upon which further manual post-structuring can be performed.