Modeling Frequency Data: Methodological Considerations on the Relationship between Dictionaries and Corpora

Academic dictionary writing is making greater and greater use of the TEI Guidelines’ dictionary module. As increasing numbers of TEI dictionaries become available, there is an ever more palpable need to work towards greater interoperability among dictionary writing systems and other language resources that are needed by dictionaries and dictionary tools. In particular, this holds true for the crucial role that statistical data obtained from language resources play in the lexicographic workflow, a role that also has to be reflected in the model of the data produced in these workflows. Presenting a range of current projects, the authors address two main questions in this area: How can the relationship between a dictionary and other language resources be conceptualized, irrespective of whether they are used in the production of the dictionary or to enrich existing lexicographic data? And how can this be documented using the TEI Guidelines? Discussing a variety of options, this paper proposes a customization of the TEI dictionary module that tries to respond to the emerging requirements in an environment of increasingly intertwined language resources.


Introduction
Academic dictionary writing is making greater and greater use of the TEI Guidelines' dictionary module. As increasing numbers of TEI dictionaries become available, there is an ever more palpable need to work towards greater interoperability among dictionary writing systems and other language resources that are needed by dictionaries and dictionary tools. In a world of exponentially increasing information, the borders between different types of digital language resources are assuming a role that requires increased attention. Two particularly important instances of such language resources are digital text corpora and dictionaries, both of which play an important part in the TEI community.

The research described in this paper is based on work accomplished in a bundle of linguistically focused projects that, among other activities, also work on glossaries and dictionaries intended to be usable both by human readers and by particular NLP applications. The main questions that will be addressed are:

• How can we define the relationship between a dictionary and other language resources such as digital corpora, irrespective of whether they are used in the production of the dictionary or to enrich existing lexicographic data?
• How can this best be documented using the TEI Guidelines?
The paper comprises two parts: in the first, the authors give a concise overview of the scholarly background of the projects involved and their goals. The second part touches on encoding issues in the related dictionary production. We will focus particularly on the modeling of an encoding scheme for statistical information on lexicographic data gleaned from digital corpora.

The Dictionaries and Projects Involved
The projects in which the dictionaries and related technologies have been developed are tightly interlinked: they are all joint endeavors of the Austrian Academy of Sciences and the University of Vienna, and all conduct research in the field of variational Arabic linguistics. It is important to note that Arabic is characterized by a complex polyglossic situation, with Modern Standard Arabic (MSA) on one side of the spectrum and spoken vernaculars on the other side. In some regions, linguists and lexicographers may be confronted with three or even four varieties being used by the same speakers in one and the same linguistic biotope.
The first project to be mentioned is the Vienna Corpus of Arabic Varieties (VICAV; see figure 1), which was started two years ago with a low budget and was intended as an attempt at setting up a comprehensive research environment for scholars pursuing comparative interests in the study of Arabic dialects. The evolving VICAV platform aims at pooling linguistic research data, including various language resources such as language profiles, dictionaries, glossaries, corpora, and bibliographies. One of the main objectives of the project is the creation of a number of dictionaries of Arabic varieties that are primarily intended for comparative purposes.
The second project to be mentioned here is Linguistic Dynamics in the Greater Tunis Area: A Corpus-based Approach (TUNICO). This project is financed by a grant from the Austrian Science Fund and aims at the exploration of the hitherto poorly documented contemporary Arabic of the Tunisian capital, which is linguistically and demographically a highly dynamic region. A particular feature of the project is the importance of the dictionary-corpus interface, which will allow the researcher to navigate from the corpus to the dictionary and vice versa. The TUNICO project is producing two digital language resources: a corpus of spoken youth language and a diachronic dictionary of Tunisian Arabic. The project started in August 2013 and will run for three years.

The third project has grown out of a master's thesis and deals with the lexicographic analysis of the Egyptian vernacular Arabic Wikipedia (Siam 2013). Siam extracted the two hundred most frequent words from Wikipedia Masri, 1 which, given the scarcity of available tools, proved to be quite a challenge. The idea of incorporating statistical information gathered in this project was the initial incentive to start thinking about how to encode such information in accordance with the TEI Guidelines.
Four of the dictionaries created in the above-mentioned projects (namely Egyptian, Damascene, Moroccan, and Tunisian) can be correlated with digital corpora that already exist. This does not imply that the existing data have been compiled on the basis of these corpora, none of which are very large. Nonetheless, they are so far the only available digital text collections that can be used to underpin this dictionary-building process with empirical methods.
Egyptian is the most widely used Arabic dialect. There is plenty of material on the Internet: of particular interest is the great amount of data that can be found on social media sites and on personal web pages. However, most of this data is of a very hybrid nature and intermixed with MSA. Therefore, it is difficult to use for dialectological research. The only resource we could easily avail ourselves of so far is the Egyptian Wikipedia, which has been made accessible as a corpus as part of another ACDH project working on the conversion of Wikipedias into TEI. This work is particularly interesting for under-resourced languages without other digital texts suitable for linguistic research.
VICAV contains a small corpus of samples of Damascene Arabic, which was compiled by Carmen Berlinches during her seven-year stay in Damascus. In addition, there exists the Graz Corpus of Moroccan Arabic, compiled as part of a project funded by the Austrian Science Fund, 2 and the above-mentioned TUNICO corpus, which is currently being compiled. 3 Some of these data have already been used to enhance the existing dictionaries, in particular the Egyptian and the Moroccan ones. Many of the words, word forms, and example sentences contained in the corpora have been integrated into the dictionaries. The idea of integrating frequency data grew out of the question as to which lemmas were more important than others.
More dictionaries are under preparation. The VICAV database also contains data on Sudanese Arabic, Maltese, Modern Standard Arabic, and the Shawi dialects. One overarching goal of all these endeavors is the creation of a comparative dictionary with an integrated research environment that allows access to all of these data.

A Trimmed Dictionary Schema
Using the TEI dictionary module to encode digitized print dictionaries has become a fairly common standard procedure in digital humanities. Our paper will not reprise the discussion of TEI vs. LMF vs. LEXml vs. Lift vs. RDF vs. other standards; 4 we assume that the TEI dictionary module is sufficiently well developed to cope with all requirements of our projects. The basic schema used has already been tested in several projects for various languages and will furnish the foundation for the intended customizations.
Created to serve as sources for comparative research, all of the above-mentioned dictionaries have to fulfill a series of requirements. Technically, they have to be processable by various tools, most importantly by several web services on which the dictionary tools build. They have to be compatible with one another and with the tools used in their creation. Therefore, they have to be encoded following one single schema in order to allow electronic tools to work on them in tandem and to allow users to execute meaningful queries across all of the dictionaries. This goal has so far been achieved by applying a narrowly defined schema that imposes a number of specific constraints, which were meant to serve as a mechanism to enhance interoperability. In all design issues we have strived for a high degree of compliance with LMF. The main methods of imposing such constraints are reducing alternate constructs and matching with constructs of LMF (ISO 2008). 5
Since all of these data are "born digital," it is comparatively easy to ensure the structural uniformity of the dictionaries. Basically, our dictionaries are conceptualized as a specific type of text and are therefore encoded with <text> elements. Each dictionary starts with a <teiHeader> which contains the metadata of the dictionary. The <body> of the VICAV dictionaries contains two <div> elements: one typed "entries", holding all entries of the dictionary, and another typed "examples", which is populated with a series of <cit> / <quote> constructs containing example sentences. 6 Treating these examples as independent units allows dictionary writers to reuse the same sentence in various parts of a dictionary. The schema uses the <entry> element, does not make use of the <hom> element, and does not allow <superEntry>, <entryFree>, or <dictScrap>.
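The overall document layout described above can be sketched as follows. This is a minimal illustration, not the projects' actual data: the identifiers, the placeholder content, and in particular the use of <ptr> to link a sense to a reusable example sentence are assumptions made for the sake of the sketch.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Placeholder dictionary title</title></titleStmt>
      <publicationStmt><p>Placeholder publication statement</p></publicationStmt>
      <sourceDesc><p>Born-digital dictionary</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <!-- first division: all entries of the dictionary -->
      <div type="entries">
        <entry xml:id="entry_001">
          <form type="lemma"><orth>placeholder</orth></form>
          <sense>
            <def>placeholder definition</def>
            <!-- assumed linking mechanism: point to a shared example -->
            <ptr type="example" target="#example_001"/>
          </sense>
        </entry>
      </div>
      <!-- second division: example sentences as independent, reusable units -->
      <div type="examples">
        <cit type="example" xml:id="example_001">
          <quote>placeholder example sentence</quote>
        </cit>
      </div>
    </body>
  </text>
</TEI>
```

Because the example sentence carries its own identifier, several entries could point to the same <cit> without duplicating it.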

Frequency Data
Lexicostatistical data and methods are used in many fields of modern linguistics, lexicography being only one of them. Modern dictionary production relies on corpora, and statistics play an important role in lexicographers' decisions, for instance when selecting lemmas to be included in dictionaries or selecting senses to be incorporated into dictionary entries. However, lexicostatistical data are not only of interest for the lexicographer; they might also be useful to the users of lexicographic resources, especially digital lexicographic resources. The question as to how to make such information available takes us to the issue of how to encode it.
Reflecting on the dictionary-corpus interface and on the issue of how to bind corpus-based statistical data into the lexicographic editing workflow, two prototypical approaches are conceivable: (1) either statistical information is statically embedded in the dictionary entries or (2) a dictionary interface provides functionalities to access services capable of providing the required data.
A group of people working on methodologies to implement functionalities of the second type is the Federated Content Search (FCS) working group, an initiative of the CLARIN-ERIC infrastructure which strives to enhance search capabilities in locally distributed data stores (Stehouwer et al. 2012). FCS is intended to work with heterogeneous data, and dictionaries are only one type of language resource to be taken into consideration. In view of the growing prevalence of more dynamic digital environments, the second of the above-mentioned approaches is more appealing.
In practice, the digital workbench will require both options. This is particularly true given that corpora change and grow over time. Resolving polysemy and grouping instances into senses remain tasks that cannot be achieved automatically, yet for the sake of verifiability these should be as accountable as possible. The prevalence of digital workflows in lexicographic editing, in combination with the availability of large-scale data storage at reasonable cost, provides the technological prerequisites to envision systems that keep track of such editorial processes and make the lexicographers' decisions more transparent and verifiable.
Journal of the Text Encoding Initiative, Issue 8, 07/12/2015 Selected Papers from the 2013 TEI Conference

Documenting Consulted Corpora
If traceability is to be one of the fundamental benefits of a digital lexicographic workflow, documenting the provenance of the language data which editorial decisions rely upon becomes a basic requirement. Among the various possible elements to accommodate such metadata in the current TEI Schema, the dictionary's <sourceDesc> element with <bibl> elements (or their finer-grained variants <biblStruct> or <biblFull>) might seem an obvious fit. This solution, however, is far from ideal, at least in cases where the digital dictionary stems from a printed original, as their different relations with respect to the <text> would be indiscernible in the markup. Adding @type attributes to both <bibl> elements would not help either, as that would merely classify the bibliographic records, not the function of the entities they describe. 9 Thus, the distinction between two kinds of "sources" should be made on a more general level. A construct that draws this distinction solely by means of attribute values would be formally valid, but relying on attribute values on parallel constructs to encode such a fundamental difference is not as expressive as one might wish, especially in the case of a much-used attribute like @type.
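A minimal sketch of the kind of markup at issue, assuming the two kinds of bibliographic references are grouped in separate <listBibl> elements that differ only in their @type value. The values "printedSource" and "consultedCorpora" and the placeholder entries are illustrative.

```xml
<sourceDesc xmlns="http://www.tei-c.org/ns/1.0">
  <!-- the printed original from which the digital dictionary stems -->
  <listBibl type="printedSource">
    <bibl>Placeholder reference to the printed dictionary</bibl>
  </listBibl>
  <!-- corpora consulted during editing; structurally identical to the list above -->
  <listBibl type="consultedCorpora">
    <bibl>Placeholder reference to a consulted corpus</bibl>
  </listBibl>
</sourceDesc>
```

The two lists are parallel in structure, so a processor can tell a source of the text from a source of statistical data only by inspecting the attribute value.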
More importantly, does this kind of markup (or any other of the approaches mentioned above) sufficiently denote the specific type of relationship between the digital dictionary and a corpus, which obviously cannot be reduced to one of a "source" and its "derivation"? This conceptual problem pertains to nearly all possible solutions we could think of.
One solution would be embedding this kind of metadata into the <editorialDecl> element, which "provides details of editorial principles and practices applied during the encoding of a text," 10 or even into <samplingDecl>, where "the rationale and methods used in selecting texts, or parts of text, for inclusion in the resource" 11 are documented. Although this seems more accurate with respect to the role of language resources in dictionary writing, the definition of their common ancestor <encodingDesc> ("documents the relationship between an electronic text and the source or sources from which it was derived") 12 explicitly refers to a relation of dependency of one on the other, which seems inappropriate in our case.
Surprisingly, it is the Critical Apparatus module which provides an appropriate solution for our problem. Originating in the tradition of textual criticism, this module defines the phrase-level element <app> to embed various versions of a passage in-line, optionally declaring one as the preferred reading. The resulting TEI <text> is a compound object that does not have one single "non-electronic" counterpart, but documents a multitude of fragments from various resources.
Likewise, a dictionary, which is closely intertwined with data from language resources, can be conceptualized as the abstraction of this instance data. In order to express the specific nature of its "sources," the textcrit module has to define a new child to <sourceDesc>, the <listWit> element.
Following this example, we propose the introduction of a <listResource> element to hold a list of any language resources from which the dictionary in the document's <body> draws its statistical information. This list consists of one or more <resource> elements, which provide relevant metadata about each language resource and include pointers to its content.
• <resource> describes a language resource of any kind (including, but not limited to, text corpora) that has been used as source material in the creation of a dictionary.
• <listResource> (language resource list) contains a list of language resources of any kind that have been used as source material in the creation of a dictionary.
By choosing a deliberately broad term as the new element's name, we try to keep the range of possible language resource types as open as possible without confining their possible function to that of a statistical source. Making <listResource> a child element of <sourceDesc> tries to address both kinds of corpus-dictionary relations we have come across in our projects: it takes into account cases where a "born-digital" dictionary draws most of its material from language resources (making them thus an important, but still intermediate source) while possibly drawing a clear line between a source of a dictionary's <text> and the source of statistical data the <text> has been enriched with.
The purpose of the <resource> element is twofold: first of all, it enables a user to locate and access the language resource personally. Secondly, it provides a basic description of it through a series of appropriate properties. Although this information is likely to be part of the metadata held alongside the corpus data itself, it seems reasonable to keep a summary of it with the dictionary, especially as corpora may become inaccessible or evolve over time. The TEI Guidelines already provide the components for this in the <bibl> element's content model, with the <extent>/<measure> construct being a natural candidate for the representation of corpus size. Modeling <resource> after <bibl> helps us address some specific limitations of the language resources we currently have to make do with: it lets us distinguish between various kinds of data via the @type and @subtype attributes, while the @status attribute indicates whether the data in question are expected to be subject to change or not. Since we consider it important to document the essential properties of a corpus (status, size, date of its content) in a consistent manner, we decided to narrow down the possible components of <bibl> significantly and create a specially tailored version from it.
Many other aspects of language resources may be desirable to record, especially considering questions of mid- and long-term preservation. Since data used in lexicographic production are most likely to be distributed over a wide range of organizations and locations, accessibility cannot be assumed in all cases. In order for a user to be able to assess the relevance of any language resource in the context of the final dictionary, a multitude of parameters has to be taken into consideration. For instance, it would be important to identify inherent biases in the sociologic, geographic, or diachronic sampling of language data or technical limitations in its markup. The German LAUDATIO Project has developed a comprehensive, <teiHeader>-based corpus description specification which is directly aimed at this purpose. 13 In light of such issues it seems advisable to allow the inclusion of the <teiHeader> of a corpus (or possibly any descriptive metadata) inside our <resource> element, to document its state at the time of the dictionary's publication.
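A <listResource> along these lines might look roughly as follows. This is a hedged sketch only: the `acdh:` prefix is assumed to apply to the new elements, the namespace URI is a placeholder, and the identifier, attribute values, child elements chosen, URL, date, and corpus size are all invented for illustration.

```xml
<sourceDesc xmlns="http://www.tei-c.org/ns/1.0"
            xmlns:acdh="http://example.org/ns/acdh">
  <!-- placeholder namespace URI; the customization's real URI is not given here -->
  <acdh:listResource>
    <!-- @type/@subtype classify the data; @status signals whether it may still change -->
    <acdh:resource xml:id="corpus_A" type="corpus" subtype="spoken" status="unfinished">
      <title>Placeholder corpus title</title>
      <!-- pointer allowing a user to locate the resource -->
      <ptr target="http://example.org/corpora/corpus_A"/>
      <!-- date of the corpus content -->
      <date when="2013"/>
      <!-- corpus size, using the extent/measure construct borrowed from bibl -->
      <extent>
        <measure commodity="tokens" quantity="100000" unit="count"/>
      </extent>
    </acdh:resource>
  </acdh:listResource>
</sourceDesc>
```

Giving each <resource> an @xml:id is what would later allow statistical information in the entries to point back to the resource it was drawn from.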

Documenting Corpus Queries
So far, we have dened a way to describe the language resources we want to refer to in the dictionary's <teiHeader>.This leaves the following questions to be addressed: which information is to be encoded when documenting our corpus queries?Which parts of those entries do we need to attach this information to?And, nally, how can we establish the linkage between our description of the corpus instance data, the dictionary, and the corpus metadata held in the <listResource>element?

The Tenets of Frequency Information
What is needed is a denitive system to register quantications of particular items represented in dictionary entries.This of course raises the question as to which parts of a dictionary entry can be considered relevant.First to come to mind, of course, are headwords.But there are many other constituents of dictionary entries that might be furnished with frequency data: inected word forms, collocations, multiword units, and particular senses are relevant items in this respect.
The data model should not only provide elements to encode frequencies within elements describing the above-listed constituents of entries, but also allow indication of the source from which the data were gleaned and how the statistical information was created.The basic constituents of our model should contain these items: • Value (number of occurrences of the particular item) • Rank • Provenance (source from which the data is taken) • Retrieval method (how the statistical information was created) • Query type

• Evaluation mode
Ideally, persistent identiers should be used to identify not only the corpora but also the services involved in creating the statistical data.

Spoilt for Choice
In our attempts to design a viable solution to our encoding problems, we went through three stages: (1) we tried to make use of some TEI elements with very flexible semantics and to provide them with @type attributes; (2) we tried to apply TEI feature structures; and (3) we started to work on a new customization.

Catch-all Elements
As is well known, there are some TEI elements which can be used for almost anything by furnishing them with @type attributes. The most commonly used ones are <note>, <ab>, and <seg>, which readily lend themselves to purposes such as ours through their very general semantics. Early attempts of ours to model frequency data also made use of <list> and <item> elements, resulting in constructs in which the statistical information, indicated by means of <item> elements, refers to the <form> elements. This attempt soon seemed unsatisfactory: although sufficiently versatile in its content, <list> is intended only to be placed in specific "wrapper" elements of the dictionary module (<entry>, <form>, and <sense> amongst others) but is disallowed in a lot of other contexts where frequency information is potentially relevant. In particular, this would have excluded the possibility to embed frequency data in elements containing grammatical information (<gram> and its syntactic-sugar equivalents <case>, <gen>, <number>, etc.) as well as <usg> and <def>. The other constructs mentioned above proved equally problematic: the definition of <seg> ("represents any segmentation of text below the 'chunk' level") 14 hardly permits arbitrary data structures like the ones we needed; <note>, on the other hand, was too likely to semantically interfere with editorial notes (footnotes, marginal notes) in a retro-digitized dictionary; and <ab> would have forced us to use bulky pointing mechanisms to express the relationship between the various parts of an entry and the attached frequency information, since it is only allowed as a sibling of <entry>.
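A <list>-based construct of this kind might be sketched as follows. The nesting of <list> inside <form>, the use of <label>/<item> pairs, the @type values, the @ana target, and the placeholder word forms are all assumptions made for illustration; the figures are reused from the <measure> examples further on and carry no empirical weight.

```xml
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="entry_002">
  <form type="lemma">
    <orth>placeholder</orth>
    <!-- frequency data attached to the lemma form -->
    <list type="frequencies">
      <label>absolute</label>
      <item>6</item>
      <label>rank</label>
      <item>2456</item>
    </list>
  </form>
  <!-- an inflected form; #n_pl stands for an assumed feature-library entry -->
  <form type="inflected" ana="#n_pl">
    <orth>placeholders</orth>
    <list type="frequencies">
      <label>absolute</label>
      <item>2</item>
    </list>
  </form>
</entry>
```

The sketch also makes the limitation visible: the same <list> could not be placed inside a <gram>, <usg>, or <def> element.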

Feature Structures
In a second approach, we used feature structures, a very versatile, sufficiently well-explored tool for formalizing all kinds of linguistic phenomena. One of the advantages of the <fs> element is that it can be placed inside most elements used to encode dictionaries. Feature structures lend themselves to modeling abstract structures and their relations. However, they are not very human-readable, nor are frequencies a "feature" of the entry's components in the strict sense of the word.
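For comparison, frequency data expressed as a feature structure could look roughly like this. The sketch uses only standard TEI feature-structure elements (<fs>, <f>, <string>, <numeric>), but the feature names, the @type value, and the corpus identifier are illustrative assumptions.

```xml
<form xmlns="http://www.tei-c.org/ns/1.0" type="lemma">
  <orth>placeholder</orth>
  <!-- frequency information modeled as a bundle of feature/value pairs -->
  <fs type="frequency">
    <f name="source"><string>corpus_A</string></f>
    <f name="absolute"><numeric value="6"/></f>
    <f name="rank"><numeric value="2456"/></f>
  </fs>
</form>
```

The generic <f name="…"> wrappers show why the result is hard to read: none of the element names says anything about frequencies.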

Attempting Customization
All these "conservative" attempts adopted existing elements and resulted in solutions which appeared to be far from perfect, especially since <item> and <fs> are void of relevant semantics. This, in the end, made us think of alternatives by customizing our dictionary schema and adding a set of objects (attributes and elements) to describe frequencies in context. With a wide range of different application scenarios in mind, we attempted to design something like a statistical crystal that could also be reused beyond our particular projects.

We named the root element to carry frequencies "statistical information." We chose a generic name rather than making use of narrower terms such as "corpusFrequency," as the data we wanted to use might come from other language resources than text corpora, such as word lists, other dictionaries, or databases aggregating statistical data from external sources. The chosen term appeared to be semantically correct and would allow us to keep options open to other scenarios.
• <statInfo> (statistical information) contains statistical information about instances of any component of a dictionary entry in one or more language resources.

The next step was to find a way to indicate where the statistical information came from. Intuitively, one might expect a <source> element. However, <source> already exists in the Manuscript Description module (and can only be used as a child of a <recordHist> element). As we considered it good practice to avoid denominational ambiguities, we eschewed using "source" in the namespace of the customization, but introduced a <dataset> element which, through membership of the att.canonical class, inherits the attributes @key and @ref, with the latter providing the mechanism to point to a <resource> element in the <teiHeader>.
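The pointing mechanism can be sketched as follows. This is an illustration under assumptions: the `acdh:` prefix for the new elements, the placeholder namespace URI, the placement of <dataset> inside <statInfo>, and the identifier `corpus_A` (standing for a <resource> declared in the <teiHeader>) are not taken from the schema itself.

```xml
<entry xmlns="http://www.tei-c.org/ns/1.0"
       xmlns:acdh="http://example.org/ns/acdh"
       xml:id="entry_003">
  <form type="lemma">
    <orth>placeholder</orth>
    <acdh:statInfo>
      <!-- @ref (from att.canonical) points to a resource in the header -->
      <acdh:dataset ref="#corpus_A"/>
    </acdh:statInfo>
  </form>
</entry>
```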

In the eld of our research, we are still far away from reliable reference corpora in the proper sense of the word.At the moment, we are instead in a situation where we have to integrate anything available in default of anything better.For comparative purposes it might therefore be important to have a list with several <dataset> elements to give users a more complete picture of the available data beyond the resource proper.
The statistical information itself would remain in the TEI namespace.This is exactly the same construct which we have already proposed above.
<measure commodity="tokens" quantity="6" unit="count" type="absolute"/>
<measure type="rank" quantity="2456"/>
A key issue here is the access mode. The element <retrievalMethod> has been proposed to accommodate information regarding the query and possible modifications to the result set. This information is dealt with in two child elements: <query> and <evalMode>.
<acdh:retrievalMethod>
  <acdh:query type="CQP">lemma="go"</acdh:query>
  <acdh:evalMode>manual</acdh:evalMode>
</acdh:retrievalMethod>
The <query> element contains the query string. It should also have a @type attribute indicating the applied query language. In the example above, CQP (Corpus Query Processor) refers to the query language of the IMS 15 Corpus Workbench. The element <evalMode> can be filled with either "none", which implies that the data was retrieved automatically, or "manual", which should be applied when some kind of postprocessing has been done.
For purposes of reproducibility it may be desirable to document the various steps that produced the final set of records with finer granularity. In this case, <evalMode> could be replaced by an <evalDesc> element, containing a series of <filter> tags with one child <query> each.
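Putting the pieces together, a complete <statInfo> might be assembled as in the following sketch. The order of the child elements, the placeholder namespace URI, and the identifier `corpus_A` are assumptions; the <measure> and <retrievalMethod> parts repeat the fragments given above.

```xml
<acdh:statInfo xmlns="http://www.tei-c.org/ns/1.0"
               xmlns:acdh="http://example.org/ns/acdh">
  <!-- provenance: which language resource the figures come from -->
  <acdh:dataset ref="#corpus_A"/>
  <!-- value and rank stay in the TEI namespace -->
  <measure commodity="tokens" quantity="6" unit="count" type="absolute"/>
  <measure type="rank" quantity="2456"/>
  <!-- retrieval method: query string, query type, and evaluation mode -->
  <acdh:retrievalMethod>
    <acdh:query type="CQP">lemma="go"</acdh:query>
    <acdh:evalMode>manual</acdh:evalMode>
  </acdh:retrievalMethod>
</acdh:statInfo>
```

Each of the six constituents listed earlier (value, rank, provenance, retrieval method, query type, evaluation mode) has a slot in this structure.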
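A finer-grained record of the retrieval steps might be sketched like this. The exact content model is an assumption (in particular, whether the initial <query> stays alongside <evalDesc>), the namespace URI is a placeholder, and the filter queries are placeholders rather than real query expressions.

```xml
<acdh:retrievalMethod xmlns:acdh="http://example.org/ns/acdh">
  <!-- initial query that produced the raw result set -->
  <acdh:query type="CQP">lemma="go"</acdh:query>
  <!-- evalDesc takes the place of evalMode and lists each refinement step -->
  <acdh:evalDesc>
    <acdh:filter>
      <acdh:query type="CQP">placeholder for a first narrowing query</acdh:query>
    </acdh:filter>
    <acdh:filter>
      <acdh:query type="CQP">placeholder for a second narrowing query</acdh:query>
    </acdh:filter>
  </acdh:evalDesc>
</acdh:retrievalMethod>
```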

Of course, this kind of detail is not attainable without the help of specialized software. Aiming toward tighter integration of language resources with dictionaries, we imagine a next generation of dictionary editors providing facilities to query language resources and keep track of user-driven modifications of the results, and offering functions to embed them in the markup. Until implementations have reached this level of integration, we have to rely on a combination of components to support this functionality. An example of this is the commercial product Sketch Engine, 16 which includes a web interface for querying language corpora. In particular, it offers the ability to download the resulting frequency data (as well as concordances) in its own, vendor-specific, XML format. This format also contains the sequence of query expressions that leads to the final result set. Thus, a simple XSL stylesheet can be used to transform this into the format we have proposed above.
To take all of this further, we will first create more real-world data with the customized schema and test them in applications currently under development. In addition, it will be necessary to keep discussing the customized dictionary schema with the community. A final goal (the realization of which is admittedly not yet near at hand) is the integration of a viable solution into the TEI Guidelines.

Conclusions
A major interest that has accompanied our experiments is the clearly discernible phenomenon of blurring boundaries between digital language resources. Data available in one resource can be integrated into others; creating new resources from pre-existing ones has become much more feasible. We strongly believe that the permeability between language resources will also change the way we look at corpora and dictionaries. In the digital world the two grow closer and closer. Not only do they depend on each other (dictionaries need corpora to be compiled, while corpus tools need dictionaries for annotation); users will increasingly want to use them together, ideally in the same interfaces.
Some of the problems described in this paper have not been dealt with so far because readily and freely accessible language resources are not as abundant as one might assume. However, funding agencies increasingly insist on open access not only to research results but also to research data. It is therefore to be hoped that the situation with respect to openly accessible lexicographic data as well as to electronic corpora will improve in the years to come. Solutions for integrating these data and/or accessing various resources simultaneously will become even more important. Thinking about how particular language resources can interact and working on appropriate interfaces is an indispensable prerequisite for more linked (open) data and service-based architectures. The more such data become available, the more important it will become for the TEI to provide viable solutions for dealing with them.

6 While entries and example sentences were originally located directly inside the <body>, we have now settled for a clearer distinction between the two. Since most of our dictionary data is created and maintained using the Viennese Lexicographic Editor and held in a relational database (described in Budin, Majewski, and Mörth 2012), format changes like this can be applied transparently on all of our dictionaries, making it easy to ensure structural homogeneity.
7 For more detail see Budin, Majewski, and Mörth 2012.
8 The example above exhibits some features which might seem idiosyncratic at first glance, namely the usage of the @ana attribute and the values in the @xml:lang attributes. The fragment identifier in @ana refers to a feature library in the <teiHeader>, providing a concise notation for morphosyntactic annotations. In the example above, it defines the inflected <form> to bear the features noun and plural. The composition of the @xml:lang attributes is an extension to the BCP 47 standard tags, which proved necessary in order to provide a higher degree of locational granularity (see Budin, Majewski, and Mörth 2012).
9 One might wish to have a @role attribute at hand to express this differentiation, yet given its already fairly vague semantics that does not seem advisable either. While @role is predominantly used in the realm of names and named entities, it is also defined in att.tableDecoration, where it indicates whether a cell holds actual data or just a label. Defining it locally on <bibl> would only overload it with another, highly specialized sense and seems a makeshift strategy.
Grouping those two kinds of bibliographic references in separate <listBibl> elements with appropriate @type attributes (for instance "printedSource" and "consultedCorpora") solves the issue of ambiguity at least on the surface. There remain concerns, however. First of all,
In order to avoid ambiguities, we propose an alternative encoding style: instead of specifying the relation between the instance data and the lexicographic description by nesting the former inside the latter, we suggest using the linking module's @corresp attribute to point from the dictionary to a <statInfo> element which can be kept in the dictionary's back matter or in another document instance. Although this might add some processing overhead to an application, it improves maintainability and readability significantly by dividing the two layers of information.
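The stand-off style proposed here can be sketched as follows. The identifiers, the @type value of the back-matter division, the `acdh:` prefix with its placeholder namespace URI, and the decision to put @corresp on <form> are assumptions made for the example.

```xml
<text xmlns="http://www.tei-c.org/ns/1.0"
      xmlns:acdh="http://example.org/ns/acdh">
  <body>
    <div type="entries">
      <entry xml:id="entry_004">
        <!-- @corresp points from the lexicographic description to the statistics -->
        <form type="lemma" corresp="#stat_004">
          <orth>placeholder</orth>
        </form>
      </entry>
    </div>
  </body>
  <back>
    <!-- statistical layer kept apart from the entries -->
    <div type="statistics">
      <acdh:statInfo xml:id="stat_004">
        <acdh:dataset ref="#corpus_A"/>
        <measure commodity="tokens" quantity="6" unit="count" type="absolute"/>
      </acdh:statInfo>
    </div>
  </back>
</text>
```

An application has to resolve the pointer before it can display the figures, which is the processing overhead mentioned above; in return, the entries stay free of statistical markup.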