Creating Lexical Resources in TEI P5: A Schema for Multi-purpose Digital Dictionaries

Although most of the relevant dictionary productions of the recent past have relied on digital data and methods, there is little consensus on formats and standards. The Institute for Corpus Linguistics and Text Technology (ICLTT) of the Austrian Academy of Sciences has been conducting a number of varied lexicographic projects, both digitising print dictionaries and working on the creation of genuinely digital lexicographic data. This data was designed to serve varying purposes: machine-readability was only one. A second goal was interoperability with digital NLP tools. To achieve this end, a uniform encoding system applicable across all the projects was developed. The paper describes the constraints imposed on the content models of the various elements of the TEI dictionary module and provides arguments in favour of TEI P5 as an encoding system suited not only to representing digitised print dictionaries but also to NLP purposes.

and Samuel Johnson finished his Dictionary of the English Language in 1755. 2 The first large-scale Chinese dictionary from this time period, the Kangxi zidian, dates from 1716 (Wilkinson 2000, 64).
The latest step in this long history is the transition towards digital methods. Today, digital technology is not only used to produce print dictionaries; rather, many dictionaries exist solely in digital form. Information and communication technology has become pervasive in all stages of the modern dictionary creation process: both data acquisition and representation of lexical knowledge rely heavily on this technology. Furthermore, dictionary makers have shifted from traditional methods such as introspection and interviews of competent speakers towards more empirical methods based on lexicographic research using increasingly sophisticated digital resources such as corpora (large digital text collections that reflect real-world language usage).

The ICLTT's Dictionaries
The Institute for Corpus Linguistics and Text Technology (ICLTT) of the Austrian Academy of Sciences has been conducting a number of lexicographic projects, including both digitizing print dictionaries and creating born-digital lexicographic data. The lexicographic data produced in these projects are designed to serve a variety of purposes for both linguistic research and lexicography. To ensure that NLP tools available at the institute would work with all the data, a uniform encoding system for all projects was needed. The integration of digital corpus data with the lexicographic infrastructure has been an important goal and plays a central role in all these efforts.
The ICLTT as an institution has grown out of several projects. One of the best known results of these projects is probably the Austrian Academy Corpus (AAC), a digital collection of German language texts stemming from the 19th and 20th centuries. The digital texts contained in the AAC were collected with a literary, a socio-historic and a lexicographic perspective in mind, but in spite of the literary and historical focus in setting up the corpus, it is increasingly used by linguists (Moerth 2002).

Print Dictionaries
The main motive behind setting up the corpus was the institute's involvement in a longstanding text-lexicographic project which produced two dictionaries designed to ease access to one of Austria's most important works of twentieth-century literature, Karl Kraus' magazine Die Fackel. The first volume was a dictionary of idioms and idiomatic expressions; the second one a comprehensive listing and documentation of insults and invective terms.
In recent years, the institute has shifted from addressing the needs of literary scholars by focusing on particular works of literature to catering to the needs of linguists by devoting resources to smaller and more diverse projects. The ICLTT has also contributed to the production of the largest German-Russian dictionary ever produced (Dobrovolsky 2008-2010), which was published as a cooperative project of the Austrian and the Russian Academies of Sciences. ... texts to as many types of written language as possible. Currently, efforts are being made to make this data TEI P5 compliant.

Born-digital Dictionaries
Dictionaries are increasingly created in and for the digital world. Apart from digitizing paper dictionaries, the ICLTT has also started to create new digital lexical resources, some of which build on the department's digital text collections. These include dictionaries for doing variational linguistics on German as written and spoken in Austria, Early Modern German, and Arabic; a GUI tool for converting German Wiktionary data to TEI P5; 3 and a comprehensive Dictionary of Modern Persian Single Word Verbs to be used as the basis for a morphological analyzer. The variation among these projects has been brought about to a certain degree by the ICLTT's role as Austria's CLARIN and DARIAH coordinator.

Data Formats
In choosing a uniform encoding system for all ICLTT data, the department's staff surveyed data formats in use. Although most of the relevant dictionary productions of the recent past have relied on digital data and methods, there is little consensus on standards. A great number of divergent formats have coexisted: MULTILEX and GENELEX (GENEric LEXicon) are systems that are associated with the Expert Advisory Group on Language Engineering Standards (EAGLES). 4 Other formats used in digital dictionary projects are OLIF (Open Lexicon Interchange Format), 5 MILE (Multilingual ISLE Lexical Entry), 6 LIFT (Lexicon Interchange Format), 7 OWL (Web Ontology Language) 8 and DICT (Dictionary Server Protocol), 9 the latter being an important dictionary delivery format (Faith 1997).
Another standard considered was ISO 1951 ("Presentation/representation of entries in dictionaries - requirements, recommendations and information"). Although this standard focuses on encoding the presentation of lexicographical data in dictionaries for human use in what is called LEXml (Lexicographical Markup Language), it seems that after a few years of existence only a few publishing houses (such as Langenscheidt, Munich) have been using this format for their dictionary production line.
Last but not least, when looking for an encoding standard for machine-readable dictionaries, ISO 24613:2008 ("Language resource management - Lexical markup framework (LMF)"), the ISO standard for natural language processing (NLP) and machine-readable dictionaries (MRD), must be considered. Recently, there have been discussions about the possibility of creating a TEI serialization of LMF (Romary 2010).
In modeling lexicographic data, it has become common practice to conceptualize the underlying structures as tree-like constructs, which makes XML an ideal syntax for expressing the data. Another option, from software engineering, is UML (Unified Modeling Language), 10 which in turn can easily be serialized into an XML vocabulary. This approach was taken by the authors of LMF.
For our projects, the final "short list" contained ISO 1951, LMF and the TEI dictionary module. ISO 1951 was ruled out from the very beginning, among other reasons for lack of support in the community. LMF in turn has gained more support in the dictionary-producing community. Given the still small amount of available data using LMF and ongoing discussions, the decision was made to move towards TEI and keep an eye on the LMF specification as it develops. 11

TEI Dictionary Module
The TEI dictionary module appears to be the de facto encoding standard for dictionaries digitized from print sources. As such, "TEI for dictionaries" has a longstanding tradition. Interestingly, the most recent versions of the TEI Guidelines contain a passage that indicates that the authors had in mind a much wider range of dictionaries:

... The elements described here may also be useful in the encoding of computational lexica and similar resources intended for use by language-processing software; they may also be used to provide a rich encoding for word lists, lexica, glossaries, etc. included within other documents. (TEI Consortium P5 2012, 247)

This passage reflects a considerable conceptual extension of the initial purpose of the module. 12 However, the idea of extending the scope of the TEI dictionary module for use by language-processing software is not at all as far-fetched as it may seem at first glance. Interest in the issue was evident from the large audience at the workshop "Tightening the Representation of Lexical Data: A TEI Perspective," held at the 2011 Annual Conference and Members' Meeting of the TEI Consortium (Würzburg, Germany). Indeed, the TEI's ability to adapt to many types of dictionaries makes it an ideal candidate for such an endeavor.
A fundamental problem we came up against when we started to model our dictionary data was the lack of available examples against which we could compare our data. It would have been beneficial if more projects had made at least samples of their data publicly accessible. 13 Many of the examples which can be found on the TEI website are repetitive and are by no means exhaustive. 14 However, getting hold of examples in other encoding languages is not easy either: ISO 1951 seems to be used by a single publishing house and LMF has not won much ground in the field, though there are some data available for the latter. 15

ICLTT's TEI Schema
The following sections outline selected features of the ICLTT's customization of the TEI P5 dictionary module. The system has been used successfully for lexicographic data encoding at the department, where it is meant to be a multi-purpose system targeting both human users and software applications. The following four requirements featured strongly in our decision in favor of TEI encoding:
• Acquaintance with the overall TEI system: as the department has been working with TEI on text encoding projects, a number of colleagues are conversant with TEI and have used it from the very beginning of our dictionary projects;
• Intuitiveness of the TEI system: the concise and yet expressive set of elements is definitely more easily readable to human lexicographers working on the XML source than for instance the LMF serialization proposed in ISO 24613:2008;
• Consistency with other language resources contained in the same collection: the intention was to keep the encoding system of the dictionary resources in line with other textual data to be integrated with these lexicographic resources;
• Adaptability to the needs of dictionaries to be used in natural language processing (NLP).

In order to make the TEI dictionary module usable for NLP purposes, it has been necessary to tighten the many combinatorial options of TEI P5, that is, to constrain the content models of various elements.

Representing Lemmas
In TEI, dictionaries are a specific type of text and are therefore encoded with <text> elements, which may contain optional <front> and <back> matter; the dictionary entries themselves are placed in a <body> element. Individual entries may be seen as the core of all lexicographic encoding; the structure of dictionary entries can display a great variety of different forms. 16 This also accounts for the fact that the P5 version of the Guidelines (250) offers three elements to encode this type of microtext: <entry>, <entryFree>, and <superEntry>.
The <superEntry> element can be used to group entries together and is not used in our schema. As the name implies, <entryFree> contains a single entry with a comparatively large number of acceptable elements that may be arranged in many different ways. In TEI P5, <entryFree> can contain 30 different elements from the dictionary module alone. 17 The great flexibility of this element makes it suitable for digitizing print dictionaries, but in creating strictly defined dictionary structures to be used by software, this flexibility is of lesser value.
Simple dictionary entries invariably start with a lemma. Optionally, entries contain an indication of the word class of the lemma and one or more <sense> elements. A typical entry has a structure like this: ...

In many cases, it is difficult for lexicographers to decide whether to integrate lexical items into one single entry or rather to make two or more entries. Lexical homonymy in TEI dictionaries is often encoded using the <hom> element, as in the following abridged example: ...

As a basic principle, we have attempted to keep hierarchies in our encoding system as flat as possible. This is why the <hom> element has been excluded from the set of possible ...

The same encoding pattern is applied to grammatical homonyms and polyfunctional items, that is, homographs that are semantically related but have different word classes. However, encoding homonyms in separate <entry> elements can be problematic, especially when lexical items belong to different word classes and need to be distinguished (consider an example from English: "talk" as a verb versus as a noun). For us, the deciding factor was whether the word class difference manifests itself in the semantic description, the <sense> block in TEI nomenclature. Whenever different part-of-speech labels would need to be assigned to <sense> elements (such as with all grammatical homonyms), the lexical items were encoded in separate <entry> elements rather than in one.
Polyfunctionality is a very common phenomenon and has posed problems in almost all our projects. Our approach, as detailed above, has pros and cons. However, our main argument in favor of splitting entries (putting each homonym into a separate <entry>) is that it makes access to the particular lexical items more straightforward. Working along these lines, part-of-speech labels only appear on the top-most level of the entry together with the lemma, not within <sense> elements. If necessary, the relation between entries could be made explicit by <re> (related entry) elements or some system of links.
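The constraints discussed in this section (an entry starts with its lemma, <hom> is not used, and part-of-speech labels stay out of <sense>) lend themselves to a mechanical check. The following Python sketch illustrates the idea using only the standard library. The function names, the sample entries, and the use of <gram type="pos"> are assumptions made for illustration; they are not taken from the ICLTT schema, where such rules would more naturally be expressed in the schema itself.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample data: one split entry and one entry that still uses <hom>.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <entry xml:id="talk_v">
    <form type="lemma"><orth>talk</orth></form>
    <gramGrp><gram type="pos">verb</gram></gramGrp>
    <sense><def>to speak</def></sense>
  </entry>
  <entry xml:id="talk_mixed">
    <form type="lemma"><orth>talk</orth></form>
    <hom>
      <gramGrp><gram type="pos">noun</gram></gramGrp>
      <sense><def>a conversation</def></sense>
    </hom>
  </entry>
</body>"""


def check_entry(entry):
    """Return a list of violations of the (assumed) flat-entry rules."""
    problems = []
    children = list(entry)
    first = children[0] if children else None
    if first is None or first.tag != TEI + "form" or first.get("type") != "lemma":
        problems.append("entry must start with <form type='lemma'>")
    if entry.find(".//" + TEI + "hom") is not None:
        problems.append("<hom> is not allowed; split homonyms into separate entries")
    for sense in entry.iter(TEI + "sense"):
        if sense.find(".//" + TEI + "gramGrp") is not None:
            problems.append("part-of-speech information must not appear inside <sense>")
    return problems


def check_dictionary(xml_text):
    """Map each entry's xml:id to its list of violations."""
    root = ET.fromstring(xml_text)
    return {e.get(XML_ID): check_entry(e) for e in root.iter(TEI + "entry")}


if __name__ == "__main__":
    for entry_id, problems in check_dictionary(SAMPLE).items():
        print(entry_id, problems or "ok")
```

A check of this kind is no substitute for schema validation, but it shows how little structure software has to anticipate once the content model of <entry> is tightened.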
It is obvious that the decision of whether to split entries also depends on what one plans to do with a particular set of data. For some of our projects, we have plans to enrich lexical data using corpora: looking for new, hitherto unregistered word forms, doing statistics on word forms, etc.

Encoding Word Class Information
A fixed component of all single-word dictionary entries is a block containing word-class information. In early experiments, we encoded this information within the <form> element representing the lemma. While TEI allows word-class information to appear in various locations within an <entry> element, the motivation behind putting it within <form> was that it seemed to be more consistent to say that the lemma, rather than the entry, belongs to a particular word class. In addition, putting the <gramGrp> element in the lemma's <form> element meant that <gramGrp> elements containing part-of-speech information appeared exclusively inside <form> elements, yielding an additional simplification of the schema.
Over time, we have come back to a more canonical TEI encoding, abandoning this rather atypical practice. This change of attitude was, among other things, motivated by experiments in converting our data into an LMF-conformant XML serialization: in LMF, @part-of-speech is defined as an attribute of the element <LexicalEntry>. 18

Practical experience has also led us to change usage of elements inside the <gramGrp> element. Initially, word-class information was encoded using the <gramGrp> element, which can contain a number of other elements such as <case>, <gen>, <mood>, <pos>, and <tns>. For example:

    ...
    <gramGrp>
      <pos>noun</pos>
    </gramGrp>
    ...

We now only allow the <gram> element within <gramGrp>, using attributes to distinguish various word-class categories. The above example can be rewritten to its <gram> equivalent like this: ...

Choice of appropriate terminology is important when labeling lemmas with word classes. Scholars working on digital resources have long needed to maintain consistency both within a project and with the terminology agreed upon by the community at large. Nowadays, this also involves interoperability with other digital resources, especially by referring to publicly accessible frameworks (concept repositories) to make the linguistic terminology explicit.
In the field of linguistics, two such frameworks play an increasingly important role: the so-called GOLD Standard, the General Ontology for Linguistic Description (Farrar and Langendoen 2003), and ISOcat, the ISO TC37/SC4 Data Category Registry (Kemps-Snijders et al. 2009). The most important feature of the web-based ISOcat registry is that it provides persistent identifiers (PIDs) for all the concepts registered in the database, allowing for explicit reference to terms used.
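The move from typed elements such as <pos> to a generic <gram> element, described earlier in this section, is simple enough to automate. The Python sketch below shows one possible conversion. It assumes that the name of the typed element is carried over as the value of a @type attribute on <gram>; this follows the general TEI convention for <gram>, but the function name to_gram and the exact attribute values are illustrative assumptions, not the ICLTT's documented practice.

```python
import xml.etree.ElementTree as ET

# Typed grammatical elements named in the text; each becomes <gram type="...">.
TYPED_ELEMENTS = {"pos", "case", "gen", "mood", "tns"}


def to_gram(gram_grp):
    """Rewrite typed children of a <gramGrp> as <gram> elements, in place."""
    for child in gram_grp:
        if child.tag in TYPED_ELEMENTS:
            child.set("type", child.tag)  # keep the old element name as the type
            child.tag = "gram"
    return gram_grp


if __name__ == "__main__":
    grp = ET.fromstring("<gramGrp><pos>noun</pos><gen>masculine</gen></gramGrp>")
    print(ET.tostring(to_gram(grp), encoding="unicode"))
```

Namespaces are left out to keep the sketch short; real TEI data would carry the TEI namespace on every element name.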

Morphosyntactic Information
Dictionary entries often contain further grammatical forms of the headword. In traditional lexicography, particular word forms are usually given in order to point the user to irregularities in inflectional paradigms. In a digital dictionary, which does not have any spatial limitations, it is not uncommon to have more comprehensive lists of word forms.

<gramGrp> vs. Feature Structures
The ICLTT has experimented with entries giving only inflectional irregularities and also those giving complete paradigms; in either case, each word form is encoded with a <form> element. Whatever the intended use of these word forms, a system is needed to ...

In search of a more generic approach, we resorted to a system combining feature structures 19 and ISOcat-grounded values. Instead of using the <gramGrp> element as a child of <form>, the @ana (analytic) attribute is added to the <form> element.

The labels used to construct the pointers in the @ana attribute are human-readable abbreviations. In this part of the system, we have attempted to proceed in line with the ISO TC37/SC4-related MAF (Morphosyntactic Annotation Framework) draft specification, in particular Chapter 8 on morpho-syntactic content (ISO 24611 2008, 21). The components of the value of the @ana attribute are resolved in a feature structure library:

    ... </fLib>
This method of annotating morphosyntactic phenomena is not only extremely concise (the information is only referenced through links); it also allows for the assignment of multiple interpretations of the content of the <orth> element. The attribute @ana can contain an open number of so-called data.pointers, each separated by whitespace:

    ...
    <form type="inflected" ana="#v_pres_ind_pl_p1 #v_pres_ind_pl_p3">
      <orth>gehen</orth>
    </form>
    ...
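How software might follow such pointers can be sketched in a few lines of Python. The feature library below is invented for illustration: the identifiers of the individual features (pos.v, tns.pres, and so on), the use of <symbol> for values, and the function names are assumptions, and the TEI namespace is omitted for brevity. Only the two pointer names are taken from the example above.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical feature library; not the ICLTT's actual <fLib>.
LIBRARY = """<div>
  <fLib>
    <f xml:id="pos.v" name="pos"><symbol value="verb"/></f>
    <f xml:id="tns.pres" name="tense"><symbol value="present"/></f>
    <f xml:id="mood.ind" name="mood"><symbol value="indicative"/></f>
    <f xml:id="num.pl" name="number"><symbol value="plural"/></f>
    <f xml:id="pers.1" name="person"><symbol value="first"/></f>
    <f xml:id="pers.3" name="person"><symbol value="third"/></f>
  </fLib>
  <fvLib>
    <fs xml:id="v_pres_ind_pl_p1" feats="#pos.v #tns.pres #mood.ind #num.pl #pers.1"/>
    <fs xml:id="v_pres_ind_pl_p3" feats="#pos.v #tns.pres #mood.ind #num.pl #pers.3"/>
  </fvLib>
</div>"""


def load_library(xml_text):
    """Index features and feature structures by their xml:id."""
    root = ET.fromstring(xml_text)
    features = {f.get(XML_ID): (f.get("name"), f.find("symbol").get("value"))
                for f in root.iter("f")}
    structures = {fs.get(XML_ID): [p.lstrip("#") for p in fs.get("feats").split()]
                  for fs in root.iter("fs")}
    return features, structures


def resolve_ana(ana, features, structures):
    """Turn the value of @ana into one feature dictionary per pointer."""
    readings = []
    for pointer in ana.split():
        fs_id = pointer.lstrip("#")
        readings.append(dict(features[f_id] for f_id in structures[fs_id]))
    return readings


if __name__ == "__main__":
    features, structures = load_library(LIBRARY)
    for reading in resolve_ana("#v_pres_ind_pl_p1 #v_pres_ind_pl_p3", features, structures):
        print(reading)
```

Because every pointer resolves to its own feature set, a form such as gehen with two pointers yields two readings, which is the multiple-interpretation behavior described above.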

A Particular Case: Encoding Roots of Semitic Words
Any general-purpose system such as the TEI is bound to have conceptual gaps. A particular problem of our projects involving Semitic languages was how to deal with what in Semitic studies is commonly referred to as a root. In Semitic morphology, word forms are constructed on top of two, three, or four consonants. These consonants, which function as abstract linguistic units, form what is commonly called "the root", i.e. the semantic skeleton of all morphologically derived forms. The scholars working with and on the described encoding system were very reluctant to use the TEI element <form> for this particular purpose, as this would have meant stretching the semantics of the element too much. Roots are neither word forms nor stems. In order to avoid "tag abuse", we first ...

Identifying Linguistic Varieties and Writing Systems
When encoding digital texts, linguistic varieties are usually identified using so-called language codes, of which there are several systems. An older (yet very versatile) system is Verbix Language Codes, which makes use of the old SIL codes. 20 LS-2010 (Linguasphere language codes) is a rather recent system which was published in 2000 and updated in 2010. It contains over 32,000 codes. The most widely used standard is ISO 639.
All these systems are incomplete and, if still being maintained, continue to evolve. A downside to all of them is the lack of support coming from the many scholarly disciplines involved in their use. In addition to the high (and ever changing) number of linguistic varieties on our globe, one additional aspect has to be taken into consideration: many linguists need codes for historic linguistic varieties as well as for living varieties.
In TEI encoding, it has become common practice to make use of the global 21 attribute @xml:lang, incorporated into the TEI from the World Wide Web Consortium's XML Specification. TEI prescribes this attribute to identify both linguistic varieties and writing systems. In this hybrid approach, the value of the attribute should be constructed in accordance with Best Current Practice 47 (BCP 47) 22 which in turn refers to and aggregates a number of ISO standards (ISO 639-1, ISO 639-2, ISO 15924, ISO 3166). 23 BCP 47 defines an extensible system that is sufficiently expressive to identify most standard linguistic varieties. Language tags are assembled from a sequence of components (which are also called subtags), each separated by a hyphen. All subtags except for the first one are optional and have to be arranged in a particular order. The first subtag indicates the linguistic variety and is the shortest available ISO 639 code (a two-letter ISO 639-1 code where one exists); in the simplest case, the second one is a region code, either an ISO 3166-1 country code or a UN M.49 numeric area code. For example, es-MX stands for Spanish as spoken in Mexico, es-419 for Spanish as spoken in Latin America. In addition, the ISO 639-3 three-letter language codes and ISO 15924 script codes are used. One can specify, for instance, that the language being used in a particular encoded element is the Cantonese variety (yue) of Chinese (zh) as spoken in Hong Kong (HK) and written in Latin characters (Latn); these subtags have to be arranged in the proper order: zh-yue-Latn-HK.
While identifiers for standard linguistic varieties are adequate for many text encoding projects, some of our projects in variational linguistics, especially dialectology, need to provide locational granularity beyond what the region subtag can express. To solve this problem, ICLTT staff make use of private use subtags (which, according to BCP 47, must be introduced with an x singleton). They help to indicate particular geographical locations and writing systems that cannot be identified by one of the standards referenced by BCP 47. Consider the following representation of the lemma for the Egyptian Arabic word for "book":

    ...
    <form type="lemma">
      <orth xml:lang="ar-arz-x-cairo-vicav">kitāb</orth>
    </form>
    ...

In constructing these labels, ISO standards have been applied wherever possible. The value of the BCP 47 language tag (that is, the value of the @xml:lang attribute) starts with the shortest available ISO 639 code: ar stands for Arabic. This is followed by an extended language subtag. ISO 639-3 provides 30 identifiers for what in the specification is called individual languages, which all belong to the macrolanguage Arabic. 24 The three-letter subtag arz translates into Egyptian Arabic. 25 Unfortunately, this is not precise enough for purposes of dialectology, as the dialects spoken in Egypt are subdivided into a great number of quite divergent dialects, which our system has to accommodate (with private use subtags, as explained above). The schema we are using constructs these subtags from two components: location and writing system. The first component (location) does not require further explanation, whereas the second component (writing system) in this example is vicav, which stands for Viennese Corpus of Arabic Varieties (transcription), a hybrid system for transcription that attempts to represent the most common current usage in the community. While this system of constructing language labels has served our purposes very well, for documentary purposes it is still recommended to specify the exact meaning of the toponym (the first component of our private use subtag) in the <teiHeader> of the dictionary. 26 We hope that future standards for language tags will allow for geo-spatial references with much finer granularity.
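To show how tags built along these lines can be taken apart again by software, the following Python sketch splits a tag into language, extended language, script, region, and private use subtags. It is a deliberately simplified reading of BCP 47: variants, extensions, and grandfathered tags are not handled, and no subtag is checked against the registry. The function name and the keys of the returned dictionary are assumptions made for illustration.

```python
import re


def parse_language_tag(tag):
    """Split a BCP 47-style tag into named parts (simplified sketch)."""
    subtags = tag.split("-")
    parts = {"language": subtags[0].lower(), "extlang": None,
             "script": None, "region": None, "private_use": []}
    rest = subtags[1:]
    lowered = [s.lower() for s in rest]
    if "x" in lowered:
        # Everything after the x singleton is private use.
        cut = lowered.index("x")
        parts["private_use"] = rest[cut + 1:]
        rest = rest[:cut]
    for subtag in rest:
        if (re.fullmatch(r"[A-Za-z]{3}", subtag)
                and not (parts["extlang"] or parts["script"] or parts["region"])):
            parts["extlang"] = subtag.lower()
        elif (re.fullmatch(r"[A-Za-z]{4}", subtag)
                and not (parts["script"] or parts["region"])):
            parts["script"] = subtag.title()
        elif re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtag) and not parts["region"]:
            parts["region"] = subtag.upper()
        else:
            raise ValueError(f"unsupported or misplaced subtag: {subtag!r}")
    return parts


if __name__ == "__main__":
    for example in ("es-MX", "es-419", "ar-arz-x-cairo-vicav"):
        print(example, parse_language_tag(example))
```

Under the convention described above, the two private use subtags of ar-arz-x-cairo-vicav would then be read as location and writing system, in that order.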
The following example is taken from a Modern Persian dictionary entry; the English translation of the lemma is 'to go, to walk': ...

The two letters fa identify the language (Modern Persian, ISO 639-1), and Arab indicates the writing system (ISO 15924). 27 The private use subtag indicates the system used to transcribe the Arabic characters. In this particular case, modDMG is a modified version of the system of the Deutsche Morgenländische Gesellschaft. The system and the applied modifications are documented in the dictionary's <teiHeader>.

Etymologies
The encoding of etymologies is straightforward in TEI. As in canonical TEI, our schema allows the <etym> element as a child of <entry>. <etym> in turn contains one or more <lang> elements. To make the information inside the <lang> element explicit, a @sameAs attribute is added whose value points to feature structures referring to an ISO 639-2 value.

Adding Semantics
So far, we have discussed phenomena pertaining to orthography and morphology, but we have not yet touched on equivalents or translations of the lemmas. All information of this kind is placed in one or more <sense> elements. In monolingual dictionaries, equivalents of the lemma are encoded as <def> elements. Definition in this particular sense implies synonym or paraphrase. When working on bi- and multilingual data, translations are encoded as <cit> elements, and the content proper is placed in <quote> elements within these. 28 Translations in more than one language are encoded by means of several <cit> elements.

Grammatical Valency
The appropriate encoding of grammatical phenomena often called valency or government is still not entirely resolved in the TEI Guidelines. The Guidelines provide only two examples for the <colloc> element; both are encoded with a @type attribute that has the value prep (for preposition). One is an entry for French médire de, which in English translates as "to speak ill of". The second example is an entry with Chinese shuō "to speak" as lemma, followed here by the resultative particle dào, which can be rendered in this context as of or about. The solution we had in mind was something that would reach beyond what, to a majority of linguists, would be acceptable as a collocate. For this reason, we decided to consider other encoding options.
A uniform system for specifying a lexical item's main complements (arguments in linguistic nomenclature) was needed. Note that this part of our encoding system is still in its infancy. However, it is important to mention that this kind of information is invariably marked up within the <sense> element. Our current encoding is illustrated by the following excerpt: ...

In our customization, the <gram> element is used to list selected arguments relevant to the material of a specific project. None of the projects aims at the exhaustive coverage of arguments. We have also been thinking about making use of feature structures, as in the following example:

    ...
    <fs type="syntacticBehaviour">
      <f name="coreArguments" feats="#optSubj #oblPrepObj"/>
    </fs>
    ...

The above structure will appear very familiar to readers conversant with LMF (Lexical Markup Framework). With a generic solution designed along these lines, a precise expression of valency or government is achievable. It would also be feasible to differentiate between mandatory and optional arguments.

Dictionary Examples
As explained above, all ICLTT dictionary projects are tightly interlinked with corpus-building activities. For this reason, the encoding of examples in dictionary entries requires particular attention. The relation between dictionary and corpus has to be seen as bidirectional: on the one hand, lexicographic data are designed to be used in the analysis of corpora, yet on the other hand, corpora are used to enhance and refine dictionaries.
One important requirement was identified at the outset of our work: dictionary examples must be reusable in different entries of a dictionary. As we did not want to duplicate data in the dictionary, the natural choice was to work with <ptr> elements to reference examples.
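The effect of referencing examples rather than copying them can be sketched as follows. The Python code resolves <ptr> elements inside entries to example records stored once elsewhere in the document. The layout is an assumption made for illustration: examples are taken to be <cit type="example"> elements with an xml:id, stored alongside the entries, and the sample sentence, the identifiers, and the function name are invented. The TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample: two entries share one example record via <ptr>.
DICTIONARY = """<body>
  <entry xml:id="walk_v">
    <form type="lemma"><orth>walk</orth></form>
    <sense><ptr type="example" target="#ex_001"/></sense>
  </entry>
  <entry xml:id="home_n">
    <form type="lemma"><orth>home</orth></form>
    <sense><ptr type="example" target="#ex_001"/></sense>
  </entry>
  <cit type="example" xml:id="ex_001"><quote>She walks home.</quote></cit>
</body>"""


def examples_by_entry(xml_text):
    """Map each entry's xml:id to the example texts its <ptr> elements reference."""
    root = ET.fromstring(xml_text)
    quotes = {cit.get(XML_ID): cit.findtext("quote")
              for cit in root.iter("cit") if cit.get("type") == "example"}
    result = {}
    for entry in root.iter("entry"):
        targets = [ptr.get("target").lstrip("#") for ptr in entry.iter("ptr")]
        result[entry.get(XML_ID)] = [quotes[t] for t in targets]
    return result


if __name__ == "__main__":
    for entry_id, examples in examples_by_entry(DICTIONARY).items():
        print(entry_id, examples)
```

Since the example text exists only once, a correction to it becomes visible in every entry that points to it.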
In TEI P5, dictionary examples are encoded as <cit> elements with @type attributes. Except for the value of the @type attribute, they look exactly like translations. The following example is taken from an isiZulu-English glossary: ...

Metadata at the Level of the Dictionary Entry
Recording production metadata has been a recurring issue in many of the ICLTT's encoding projects, and the lexicographic work is no exception. It is common knowledge that the TEI provides very efficient mechanisms to make statements about all kinds of responsibility in the <teiHeader> element. However, problems arise when such statements are needed on a more granular level than the whole TEI document. 29 In parts of our lexicographic work, we need to make responsibility statements not only about the whole dictionary but also about particular entries.
In everyday lexicographic work, it is not enough to assign the ID of one single lexicographer to an entry; one might want to trace who did what and at what time. As neither <revisionDesc> nor <change> may be used as child elements of <entry>, we considered various options to accommodate this information in our TEI structures. The intention was not to store production-related metadata only as a separate field in the database but to preserve this data in a self-contained manner as part of the entries so that this data would be passed on whenever a digital dictionary gets distributed.
Two elements were singled out which appeared to be plausible candidates to handle metadata about revisions of entries: <div> and <note>. These elements both have sufficiently generic semantics and, most importantly, may be used as children of the <entry> element. We first tried to encode metadata on revisions like this: ...

We wanted to stay as close as possible to comparable TEI structures without bending the semantics of particular elements. We decided in favor of a <div> element for revisions, containing a feature structure. This <div> element is inserted as the last element at the end of the entry. Each modification of the entry is registered by means of an <fs> element: ...

The <fs> element corresponds to the TEI <change> element, and the individual features (<f> elements) correspond to the attributes of <change>. Such constructs can also be used to register status information: labels carrying values such as proposal, draft, and approved can be used to control release of selected entries to the public.
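A minimal Python sketch of how an editing tool might append such a revision record is given below. The feature names who, when, and status mirror the attributes of <change> mentioned above, but the @type values revisions and change, the use of <string> for values, and the function name are assumptions made for illustration; the actual encoding used in the ICLTT schema may differ. The TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def record_change(entry, who, when, status):
    """Append one revision record to the entry's trailing revisions <div>."""
    div = entry.find("div[@type='revisions']")
    if div is None:
        # Created on first use; SubElement appends, so it ends up last in the entry.
        div = ET.SubElement(entry, "div", {"type": "revisions"})
    fs = ET.SubElement(div, "fs", {"type": "change"})
    for name, value in (("who", who), ("when", when), ("status", status)):
        f = ET.SubElement(fs, "f", {"name": name})
        ET.SubElement(f, "string").text = value
    return fs


if __name__ == "__main__":
    entry = ET.fromstring('<entry><form type="lemma"><orth>talk</orth></form></entry>')
    record_change(entry, "lexicographer1", "2012-01-15", "draft")
    record_change(entry, "lexicographer2", "2012-02-01", "approved")
    print(ET.tostring(entry, encoding="unicode"))
```

Keeping one <fs> per modification preserves the full history inside the entry, so a release filter only needs to inspect the status feature of the most recent record.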

Tools
So far, work on these digital lexical resources has been accomplished using a software application developed in-house.The program was initially used in collaborative glossary editing projects carried out as part of language courses at the University of Vienna.As it proved to be flexible and adaptable enough, it has been put to use in the ICLTT's dictionary projects.
At the heart of the software application is the dictionary editing client, a standalone application temporarily dubbed the Viennese Lexicographic Editor (VLE).It supports webbased editing and dictionary entries are stored on a web server.All additional software components (PHP and MySQL) are open-source and freely available.Communication between the dictionary client (VLE) and the server has been implemented as a RESTful web service.
While the dictionary editor is geared towards general use with XML data, it is particularly suitable and customized for the use with TEI-encoded data.In addition to fully customizable XSLT stylesheets, the tool includes a number of helpful built-in features described in brief below.
Configurable keyboard layouts are designed to support the input of Unicode characters usually not available in standard key assignments.Recent VLE versions allow the automatic assignment of a keyboard to particular @xml:lang attributes to spare users of manual switching between keyboard layouts.For example, when the user works on contents of an element provided with an @xml:lang="ru" attribute, VLE automatically activates the Russian keyboard layout; on entering an element with the attribute @xml:lang="de", it switches back to the German layout.
Entry-specific metadata can be generated automatically whenever an entry is saved. IDs of both entries and examples are created automatically on the basis of the contents of the respective items.
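How an identifier might be derived from the contents of an item can be sketched as follows. The scheme shown here (a slug built from the text plus a zero-padded counter) is purely hypothetical and is not the one VLE uses; it only illustrates that content-based IDs need normalization and a uniqueness check.

```python
import re
import unicodedata


def make_id(text, existing, suffix_width=3):
    """Derive an xml:id-safe identifier from item text, unique in `existing`."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Collapse every run of characters other than ASCII letters/digits to "_".
    slug = re.sub(r"[^A-Za-z0-9]+", "_", stripped).strip("_").lower() or "item"
    # An xml:id must not begin with a digit.
    if slug[0].isdigit():
        slug = "id_" + slug
    # Append the first free counter value.
    n = 1
    while f"{slug}_{n:0{suffix_width}d}" in existing:
        n += 1
    new_id = f"{slug}_{n:0{suffix_width}d}"
    existing.add(new_id)
    return new_id


ids = set()
print(make_id("safar", ids))         # safar_001
print(make_id("safar", ids))         # safar_002
print(make_id("ṣabāḥ il-xēr", ids))  # sabah_il_xer_001
```

Text in non-Latin scripts would collapse to the fallback slug in this sketch, so a real scheme would need a transliteration step or a different source field.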
Another feature of the dictionary editor is a special module that assists with the integration of corpus examples into dictionaries. The principal idea behind this module was to optimize access to digital corpora: the corpus interface of the dictionary writing application enables lexicographers to launch corpus queries and insert the retrieved examples into existing dictionary entries without using the clipboard to copy and paste, which would inevitably result in a lot of inefficient typing or clicking.
The validation of our dictionary data currently uses XML Schema, but the most recent versions of VLE have been delivered with a newly integrated library that is also capable of validating the data against RelaxNG schemas.

Conclusion
The heterogeneity of linguistic annotation has been and will remain a major obstacle to the interoperability and reusability of language resources. Over the past few years, there has been increased awareness among developers and users of the need to achieve a higher degree of convergence in many parts of their encoding systems. ICLTT staff members' previous experiences with LMF have shaped the TEI customization, and the draft MAF specification is significantly influencing linguistically motivated TEI applications. In creating digital dictionaries, both of these ISO specifications (and others referenced by them) will continue to complement the work with the TEI Guidelines.
All of our lexicographic endeavors have been guided by a vision of an ever more densely knit web of dictionaries and more reusable, standards-based, and ideally publicly available language resources. Such resources and the respective tools for creation and access form an integral part of state-of-the-art ICT infrastructures. The ICLTT's interest in furthering the outreach of the TEI and integrating the Guidelines into the newly evolving digital infrastructures has, among other reasons, been motivated by its strong commitment to the European infrastructure projects CLARIN and DARIAH.
In conclusion, we would like to emphasize that our customization of the TEI P5 dictionary module has proved to be a solid foundation for new lexicographic projects. While there is no doubt that much work remains to be done, we strongly believe that the results of our experiments furnish ample evidence that TEI P5 can be used not only to represent digitized print dictionaries but also for NLP purposes.

Creating Lexical Resources in TEI P5. Journal of the Text Encoding Initiative, Issue 3 | 2012

... identify their function. The traditional TEI way to do this would be to enter the morphosyntactic details of a <form> in a <gramGrp> element:

In addition to the <def> and <cit> elements, our schema only allows <gramGrp> and <usg> inside the <sense> element.
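A constraint of this kind would normally be enforced by the schema itself. As an illustration only, the following sketch checks the same rule programmatically; the function names are hypothetical, and the set of allowed children is taken directly from the sentence above.

```python
import xml.etree.ElementTree as ET

# Children permitted inside <sense> according to the constraint stated above.
ALLOWED_IN_SENSE = {"def", "cit", "gramGrp", "usg"}


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def disallowed_sense_children(root):
    """Return the names of <sense> children that the constraint rules out."""
    problems = []
    for element in root.iter():
        # Skip comments and processing instructions, whose tag is not a string.
        if not isinstance(element.tag, str):
            continue
        if local_name(element.tag) != "sense":
            continue
        for child in element:
            if isinstance(child.tag, str):
                name = local_name(child.tag)
                if name not in ALLOWED_IN_SENSE:
                    problems.append(name)
    return problems


sample = ET.fromstring(
    "<entry>"
    "<sense><def>journey</def><usg>colloquial</usg><note>check</note></sense>"
    "</entry>"
)
print(disallowed_sense_children(sample))  # ['note']
```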

... experimented with the TEI's feature-structure capabilities. Here is an example taken from our Colloquial Cairene Arabic Dictionary (safar is Arabic for 'journey').
However, our current practice is to encode the root of each lemma by means of the <gramGrp> element holding the word-class information. Adding an additional <gram> element to <gramGrp> appears to be both a concise and a conceptually consistent solution to the problem:
In our TEI-encoded dictionaries, examples such as the one above are children of the <body> element. Our dictionary editing program organizes dictionaries into three basic units: one metadata record (a <teiHeader> element) for the whole dictionary, an open number of entries, and dictionary examples (which can be multi-word expressions, phrases, or sentences with their respective translations). Each of these is stored as a separate database entry. Examples can then be linked to particular <sense> elements through a unique identifier which is referenced via the @target attribute of a <ptr> element:
Usually, one example <cit> element contains a single <quote> element. Nevertheless, in some cases multiple <quote> elements might be required, such as to give the example in several orthographic representations (with the @xml:lang attribute differentiating them). The following example is again taken from the Colloquial Cairene Arabic dictionary:
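How a <ptr> in a sense can be resolved to a separately stored example, and how several <quote> elements can be told apart by @xml:lang, can be sketched as follows. The sample data is invented for illustration and is not taken from the dictionary. The identifiers, the language tags, the placement of <ptr> inside a <cit>, and the omission of the TEI namespace are all assumptions.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample: one entry and one example, both children of <body>.
BODY = """
<body>
  <entry xml:id="safar_001">
    <form type="lemma"><orth>safar</orth></form>
    <sense>
      <def>journey</def>
      <cit type="example"><ptr target="#ex_001"/></cit>
    </sense>
  </entry>
  <cit type="example" xml:id="ex_001">
    <quote xml:lang="arz">سفر طويل</quote>
    <quote xml:lang="arz-Latn">safar ṭawīl</quote>
    <cit type="translation"><quote xml:lang="en">a long journey</quote></cit>
  </cit>
</body>
"""


def index_by_id(root):
    """Map every xml:id in the document to its element."""
    return {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}


def linked_examples(sense, by_id):
    """Resolve each <ptr> below a sense to its example's quotes, keyed by language."""
    examples = []
    for ptr in sense.iter("ptr"):
        target = ptr.get("target", "")
        key = target[1:] if target.startswith("#") else target
        cit = by_id.get(key)
        if cit is None:
            continue  # dangling pointer: nothing to resolve
        # Only direct <quote> children, so nested translations are excluded.
        examples.append({qt.get(XML_LANG): qt.text for qt in cit.findall("quote")})
    return examples


body = ET.fromstring(BODY)
sense = body.find("entry/sense")
print(linked_examples(sense, index_by_id(body)))
```

Keeping examples outside the entries, as in this sketch, allows one example to be referenced from several senses without duplicating it.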