Analyzing and Visualizing Uncertain Knowledge: The Use of TEI Annotations in the PROVIDEDH Open Science Platform

The underlying uncertainty in digital humanities research data affects decision-making and persists during a project’s lifecycle. This uncertainty is inevitable since most empirical claims cannot be assessed against an absolute truth (Drucker 2011; Binder et al. 2014). This situation has been previously recognized together with the need to report the degrees of uncertainty that accompany such claims (Blau 2011). Although TEI makes it possible to annotate text with notions of certainty or precision, examples of actual projects taking advantage of this are scarce. There are many possible explanations for uncertainty’s lack of visibility in computationally supported humanities research; among them, the need for tools specifically designed to address the goal

solutions tend to provide faithful representations of the underlying statistical models with the aim of enabling better comprehension by end users, ultimately allowing them to make more informed decisions based on the reality of the observed data.

In this regard, the digital humanities (DH) community has approached the problem from various angles, such as the development of data quality metrics for digital cultural collections (Windhager et al. 2019), the quantification and categorization of uncertainty in DH analysis processes (Martín-Rodilla and González-Pérez 2019; Rocha Souza et al. 2019), the improvement of existing strategies for the development of digital research tools and environments for history (Edmond 2019), and from the perspective of confidence (Franke et al. 2019). In the course of this work, manual and automatic annotation have emerged as one of the junctions in DH workflows where uncertainty can be introduced. Despite its importance, the consideration of uncertainty in the annotation task has not been thoroughly addressed in the literature. This situation has attracted the attention of interdisciplinary teams of researchers seeking to understand the implications of bringing uncertainty to the surface of DH projects. This is the case with PROVIDEDH (PROgressive VIsual DEcision-Making in Digital Humanities), an EU-funded four-year project that aims to provide DH scholars with an online space to consider uncertainty related to the completeness and evolution of digital research objects in order to enhance the development and corroboration of hypotheses in a wide range of use cases. One of the project's outcomes is the development of a platform centered on the collaborative, uncertainty-aware annotation of TEI texts.
users to upload already annotated data sets in which @degree attributes are filled in algorithmically (because it is likely impossible for humans to ascertain a difference between, for instance, 0.67 and 0.68 degrees of confidence).

The ultimate goal of this implementation of <certainty> and its attributes is to provide, for teams which collaboratively edit texts, an instrument to explicitly document their assumptions and conclusions regarding uncertain passages in their shared text. Alternatively, these tags can assist in making TEI-encoded texts reusable beyond the FAIR (findable, accessible, interoperable, and reusable) criteria for data. This deployment of <certainty> can instead make the data more epistemically available, with the provenance or prior interpretive effort that has been applied to it integrated into the editing process that shaped it.

Taxonomy
One of the first tasks of the PROVIDEDH project was to describe and define the various sources of uncertainty in text-based DH research objects, with the aim of compiling a taxonomy of uncertainty for visualization and representation of uncertainty in DH. This taxonomy is now stable, though it remains under development as a living document. In its current iteration, it is divided into two main categories: "user-recognized" and "machine-generated" uncertainty. The former describes the manner in which users perceive and categorize uncertainty in digital texts and is related to previous work in visualization and the humanities (Fisher 1999; Simon, Weber, and Sallak 2018). The latter concerns cases when the annotation (including annotation about uncertainty) is done by an algorithm. For a complete explanation of the motivations behind the proposed taxonomy and its categories we refer the reader to Therón Sánchez et al. (2019).
Although the taxonomy initially proposed a closed, fixed set of categories, we are currently departing from this approach because of a series of recent findings resulting from interactions with real users. During these preliminary evaluations, we discovered, for example, that the optimal naming of the categories fluctuates depending on the project's specific humanities content and its end user's academic background (e.g., certain users prefer to employ "imprecision" while others refer to the same category as "vagueness"). Furthermore, recent findings lead us to think that the number of human-assigned categories (high, medium, low, and unknown) could, in some cases, not be exactly four but rather vary slightly around this figure. Finally, in order to augment the expressive capabilities of the taxonomy, we are also supporting a flexible (or fuzzy) implementation that allows users to combine two or more categories in a single statement. In summary, although the capture of user-perceived uncertainty in digital interfaces is a very sensitive matter that necessitates implementing important validation strategies, in this paper we report on preliminary strategies at the data format level that we know will be required for our platform to meet the identified user needs.

Finally, the taxonomy and platform are currently being tested in two separate scenarios: (1) free annotation of texts, and (2) the normalization/unification of entities, which involves the assertion by a human actor, with some degree of certainty, that two or more entities are indeed the same. In our tests, we employed two historical, TEI-encoded data sets that are described in the next section.

We are aware that marking uncertainty levels and categories is a functionality unlikely to be used by many individual scholars working within current paradigms, and indeed we would imagine many potential users might not take advantage of the affordances of such an option set. Most scholars may just want to mark identifications or readings as uncertain but will not want to qualify this uncertainty further. But our approach emerged within a project and platform context where the convergence of a number of different emerging research paradigms demonstrated a strong case for the inclusion of such values. First, they can play a role in the support of collaborative and indeed distributed scholarship. The characterization of something as uncertain always implies a contextual richness that markup is ill-equipped to capture, in particular as uncertainty is ultimately an epistemic state, and therefore resides in the individual consciousness of the scholar.
Where that richer tacit information may not be directly available because, for example, two scholars are working together without regular direct communication, markers of the degree or categories of uncertainty can take on a significant signaling function without needing to capture too much detail. Second, the decision to include these tags emerged within a project that was testing the potential of progressive visualization as a research tool for historians. Considering and assigning degrees and categories of uncertainty may not seem to be a particularly valuable exercise in and of itself, unless or until a tool such as a visualization can allow these attributes of uncertainty to be viewed in a comparative framework and used to weigh alternative explanations, distinguish norms from outliers, and prioritize future efforts.

Historical Recipes
A second case study used to test our annotator was a data set of historical recipes of the Baroque era from the former area of Austria and beyond. The recipe collection was previously established and pertained to the citizen science project Salzburg zu Tisch, carried out and led by our cooperation partners, staff members of the Center for Gastrosophie, the History Department at the University of Salzburg, Austria. The Gastrosophie recipe collection counts around ten thousand historical recipes from different cookbooks and different authors, with the majority being in nonstandard German and a small number in other languages such as English. During this previous project, the recipes were digitized, entered into a WordPress instance, and annotated with the help of interested citizens. In this project, the recipes have been converted to TEI and collectively further annotated (see figure 2).

Collaborative Platform
Given the broad spectrum of named entities that a project may require, based on the data set being used and the variation in what different users consider to be a meaningful certainty-category naming, we identified a need to allow the user to specify the entities and uncertainty categories within the project being created in the platform. Such a specification determines what entities and their properties will be available and how users will assess the correctness of the documents. This must be set up before working with the TEI documents, during project creation.

For this purpose, we provide an interactive interface (see figure 1) that abstracts how the corresponding TEI formatting will be done, and which allows the user to specify both the entities and the taxonomy as well as the colors and icons that will be used across the application to encode this information. Working with existing TEI entities (such as "person," "place," or "event") and user-defined ones (such as "ingredient" or "utensil") is tightly integrated into the proposed system, making the whole process effortless for the user, which in turn makes the platform usable in a broad range of different DH research fields.

The conguration of the project requires handling the denition of new entities and the creation of a taxonomy. The rst is accomplished by using the TEI <object> element with its @type attribute.
Custom certainty categories are handled by creating a project-specic taxonomy which is then referenced in annotations. In addition to the name, optional descriptions can be added for each category to help users understand how uncertainty is being conceived and used in their project.
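As a rough illustration of what such a project-specific taxonomy could look like, the following Python sketch builds a TEI <taxonomy> with one <category> and an optional <catDesc> per certainty category. The function name, the taxonomy identifier, and the category descriptions are assumptions made for this example; the exact markup generated by the platform may differ.

```python
import xml.etree.ElementTree as ET

# The xml:id attribute lives in the standard XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def build_taxonomy(categories):
    """Return a <taxonomy> element with one <category> per (id, description) pair.

    A description of None produces a category without a <catDesc>.
    """
    taxonomy = ET.Element("taxonomy", {XML_ID: "certainty"})  # hypothetical id
    for category_id, description in categories:
        category = ET.SubElement(taxonomy, "category", {XML_ID: category_id})
        if description is not None:
            ET.SubElement(category, "catDesc").text = description
    return taxonomy


# Category names taken from the text; the descriptions are invented placeholders.
taxonomy = build_taxonomy([
    ("credibility", "Doubts about the reliability of the source."),
    ("imprecision", None),
])
print(ET.tostring(taxonomy, encoding="unicode"))
```

An annotation can then point to a category by its identifier, which keeps category names and descriptions editable per project without touching the annotations themselves.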
The platform supports les in UTF8-encoded TEI P5 format; if a user prefers to start with plain text or use another encoding, the les will be converted to UTF8-encoded TEI P5 during the upload.
If TEI les already contain annotated entities, the platform will assign them @xml:id so that they can be referred to in further work. Files changed in this way are stored in the platform as a second version of the uploaded les, so the user can track the history of les. In addition, existing annotations and entities are entered into the database, making them easier to manage.

Uncertainty Annotator
In the aforementioned data sets, we identified a large number of cases where uncertainty can be annotated. This implies that there are a large number of users with some amount of prior knowledge of XML editing and the TEI format, and these use cases provide a good opportunity to make use of interactive interfaces to ease working with data sets and collaborating within the documents.

We therefore designed and integrated into the platform an annotator tool that allows for both reading TEI documents in a user-friendly manner and annotating them using simple text-selection interactions. This interface strongly emphasizes collaborative editing, making use of visual encodings to easily distinguish the authorship of annotations at first glance. This collaboration is possible thanks to the use of the TEI format, enabling precise targeting of the source of uncertainty not only regarding the tag name, attributes, or values for entities but also for other people's annotations.

The Annotator adds all the changes related to uncertainty to the TEI header. Each user can annotate the same pieces of text in their own way; if we put different users' tags in the same places in the text, it would make the document quite difficult to read. We therefore decided to place all the uncertainty annotations in <certainty> elements in the header and only refer to them from the text. We further decided not to use other TEI options for tagging uncertain alternative texts, like <choice>, in order to manage uncertainty annotations in a coherent way within the platform. We also considered storing the uncertainty annotations in a <standOff> element, which we use to store entities occurring in the text, but ultimately, we decided to store them together with the annotators in the <profileDesc> element, as shown in example 2.

[…] managed within the platform. First, we present the modal window that appears after the user selects a text fragment and chooses the "annotate uncertainty" option (shown on the left side of figure 3). For case A, the user selects "Value" from the "Target" combo box and can optionally select the "Certainty level" and "Categories" (from the defined taxonomy of the project). As a result, the selected text is annotated as a segment and the corresponding uncertainty annotation is added in the <profileDesc> element. Simultaneously, the annotator (the author) is added to the list of annotators, if missing (see example 2).
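To make the shape of such a header annotation concrete, the following Python sketch assembles a <certainty> element from the values a user would enter in the modal window. The attribute names follow example 2; the function name and its parameters are assumptions for illustration and do not describe the platform's actual API.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def make_certainty(annotation_id, target_id, annotator_id, cert="unknown",
                   locus="value", category_uri=None, asserted_value=None):
    """Build a <certainty> element that points at a segment in the text.

    Only the attributes that the user actually filled in are written.
    """
    attrs = {XML_ID: annotation_id, "locus": locus, "cert": cert}
    if category_uri is not None:
        attrs["ana"] = category_uri            # category from the project taxonomy
    attrs["target"] = f"#{target_id}"          # the annotated segment
    attrs["resp"] = f"#{annotator_id}"         # who made the annotation
    if asserted_value is not None:
        attrs["assertedValue"] = asserted_value
    return ET.Element("certainty", attrs)


# Case A: the user only flags the segment as uncertain.
case_a = make_certainty("certainty-A", "seg-1", "annotator-1")

# Case B: a second user adds a level, a category, and an asserted value.
case_b = make_certainty(
    "certainty-B", "seg-1", "annotator-2", cert="medium",
    category_uri="https://providedh.ehum.psnc.pl/api/projects/1/taxonomy#credibility",
    asserted_value="many",
)
print(ET.tostring(case_a, encoding="unicode"))
print(ET.tostring(case_b, encoding="unicode"))
```

Because both elements target the same segment but carry different @resp values, several users can disagree about one passage without any of their markup colliding in the text body.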

Case B is a simple extension of A. The user provides an asserted value in the "Value" text input field, which in their opinion should replace the uncertain piece of text (see the right side of figure 3 and the result in example 2).

Example 2. Sample uncertainty annotations made by two users referring to questionable text.

<classCode scheme="http://providedh.eu/uncertainty/ns/1.0">
  <certainty xml:id="certainty-A" locus="value" cert="unknown" target="#seg-1" resp="#annotator-1"/>
  <certainty xml:id="certainty-B" locus="value" cert="medium" ana="https://providedh.ehum.psnc.pl/api/projects/1/taxonomy#credibility" target="#seg-1" resp="#annotator-2" assertedValue="many"/>

Use case C, along with the remaining types of annotations, concerns the entity types chosen during project creation. For these types of annotations the Annotator uses not <seg> but <name>. It creates an entity of the provided type, adds it to the list of annotated entities, and annotates the selected text as <name> with a @ref attribute that points to the created entity. This entity list (or rather these lists, because each entity type has its own list) makes it possible to keep track of named entities in a document and spot possible duplications, and allows opportunities for unification and the easy exploration of the corpus. These lists are added to the <standOff> element and are visible to users directly above the text in the Annotator (see figure 2).
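The entity workflow just described (create the entity, add it to its per-type list, and point to it from a <name>) can be sketched in Python as follows. This is a simplified model: the generic <list> and <item> elements, the function name, and the "type-number" identifier pattern are assumptions made for the example, not the encoding the platform actually writes into <standOff>.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def annotate_entity(stand_off, entity_type, text):
    """Register a new entity in its per-type list and return a <name> that refers to it."""
    # Each entity type has its own list; create it on first use.
    entity_list = stand_off.find(f"list[@type='{entity_type}']")
    if entity_list is None:
        entity_list = ET.SubElement(stand_off, "list", {"type": entity_type})
    entity_id = f"{entity_type}-{len(entity_list) + 1}"  # hypothetical id pattern
    ET.SubElement(entity_list, "item", {XML_ID: entity_id}).text = text
    # The selected text is wrapped in <name>, with @ref pointing to the entity.
    name = ET.Element("name", {"type": entity_type, "ref": f"#{entity_id}"})
    name.text = text
    return name


stand_off = ET.Element("standOff")
first = annotate_entity(stand_off, "person", "Ambrose Bedell")
second = annotate_entity(stand_off, "person", "Ambrose Bedell")
print(ET.tostring(first, encoding="unicode"))
print(len(stand_off.find("list[@type='person']")))
```

In this sketch the same name annotated twice yields two separate entities, which is exactly the kind of possible duplication that the entity lists are meant to expose for later unification.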

This use case focuses on text annotation without uncertainty, illustrated with the following scenario: a user selects "Ambrose Bedell" in the following TEI document and annotates the name as a person.

In the text, we use <name> to annotate the selected text and @ref to refer to the entity.

When annotating with uncertainty a piece of text which becomes a reference to an entity, the corresponding uncertainty annotation points to the added <name>. This allows the user to express their uncertainty regarding the spelling of the entity name. The user does not express any doubts regarding the type or properties of the entity.
Example 6. Sample uncertainty annotations referring to questionable entity name without and with asserted entity name.

Use Case F: Annotating an Erroneous Text for an Existing Entity with an Optional Alternative Value
This use case is similar to previous use cases D and E, but an instance of the entity already exists in the text (it exists in the dedicated list of entities and its name is annotated in the text). Another user can select the annotated text and add an uncertainty annotation with their doubts and an optional alternative (asserted) text. Modal windows are again presented in a wizard form.

Example 7. Another sample uncertainty annotation referring to a questionable entity name.

<textClass> […]
  <classCode scheme="http://providedh.eu/uncertainty/ns/1.0">
    <certainty xml:id="certainty-G" locus="name" cert="high" ana="https://providedh.ehum.psnc.pl/api/projects/1/taxonomy#credibility" target="#name-2 #name-3" match="@ref" resp="#annotator-1" assertedValue="#utensil-1"/>
  </classCode>
</textClass>

As we can see, the "Asserted value" "utensil" is only a request parameter that is resolved in the TEI annotation to the @match attribute with the value "@ref", i.e., the XPath expression that identifies the @ref attributes of two <name> elements. In the TEI annotation, @assertedValue points to "utensil-1" in the list of utensils, but a @locus attribute value of "name" indicates the tag name (i.e., type) of "utensil-1".

Use case H is similar to G. In this case a user wants to directly assign an asserted value to a property of an existing entity about which they have concerns, or simply to express those concerns. In the wizard they have to choose "Attribute" in the "Target" combo box. Then they will be able to provide the property name and an optional asserted value for this property.

Example 11. Sample uncertainty annotation referring to questionable entity property.

<textClass> […]
  <classCode scheme="http://providedh.eu/uncertainty/ns/1.0">
    <certainty xml:id="certainty-H" locus="name" cert="medium" target="#person-2" match="persName/roleName" resp="#annotator-2"/>
  </classCode>
</textClass>

Use Case I: Annotating a Text Fragment as an Occurrence of an Existing Entity
Case I is very similar to case C, where we annotate a piece of text which becomes a reference to the entity added to the list. Here we want to annotate that another piece of text refers to the same entity, as in example 12:

Example 12. Sample of a partially annotated fragment of the historical recipe data set.
In a similar manner as for cases I and J, a user can annotate with some degree of uncertainty that two entities are the same. The difference is that we use @sameAs instead of @ref. For instance, when a user unifies a person with the identifier "person-3" from example 4 with a person with the identifier "person-100" from the file "file-1", then the following uncertainty annotation is created in the TEI file with "person-3":

Example 15. Sample annotation that unifies two entities with some degree of uncertainty.

Here, it is worth mentioning that these and other annotations are not added directly to TEI files, as we present them above, but they are stored in a database in the form of commits (batches of changes in the project state). These commits are assigned to the file versions they are based on.
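One way to picture this storage model is the Python sketch below, in which each commit records the file version it is based on and the changes it contains, and a project state is reconstructed by collecting the changes of all commits up to a chosen point. The class, field, and function names are hypothetical, and the real database schema is not described in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    number: int        # position of the commit in the project history
    file_version: int  # file version the commit is based on
    changes: tuple     # identifiers of the annotations added by this commit


def state_at(commits, file_version, up_to_commit):
    """Collect the changes visible for a file version at a given point in history.

    Commits based on later file versions, or made after `up_to_commit`,
    are left out.
    """
    visible = []
    for commit in sorted(commits, key=lambda c: c.number):
        if commit.file_version <= file_version and commit.number <= up_to_commit:
            visible.extend(commit.changes)
    return visible


history = [
    Commit(1, 1, ("certainty-A",)),
    Commit(2, 1, ("certainty-B",)),
    Commit(3, 2, ("certainty-G",)),
]
print(state_at(history, file_version=1, up_to_commit=3))
print(state_at(history, file_version=2, up_to_commit=3))
```

Keeping annotations as commits rather than as edits to the files means that the XML of a given file version never changes once stored; only the set of commits layered on top of it grows.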
This way, while rendering the project state to the user, the platform can use XML content from a given file version. In addition, the platform renders the changes from commits associated with this and previous versions, effectively showing to the user the state of the project from any given moment in time.

This last case is all about annotating with uncertainty other uncertainty annotations, a big part of how users can interact with each other in the process of working with TEI files. This enables a process of collaborative disambiguation and editing that starts with a user selecting an uncertain fragment from the list of annotations and choosing "Annotate this." After filling in the uncertainty parameters ("Certainty level" and "Categories"), the following annotation is created:

Example 16. Sample uncertainty annotation referring to another uncertainty annotation.

Conclusion
Identifying and tracing uncertainty sources and types is an important part of the task of communicating uncertainty in DH research objects, and a process in which the use of the TEI standard can be of great help. Therefore, we identified the need for promoting the use of TEI encoding and providing the tools (both formal and technological) to describe and manage uncertainty throughout the lifecycle of a project. In this paper, we have also presented our approach to creating a taxonomy, the data sets used for its testing, and a collaborative platform that incorporates tools for working with the TEI standard with an emphasis on making uncertainty more present and facilitating collaborative work within the project. We also presented and described how TEI can be used to approach a broad spectrum of use cases where uncertainty can be specified, and how the specification of uncertainty can be modeled in TEI.
Finally, another branch of the PROVIDEDH project developed by the Austrian Centre for Digital Humanities and Cultural Heritage is based on annotations of such aspects of historical recipes as ingredients, preparation time, and spiciness. Recipe similarities are calculated based on the co-occurrence of ingredients, their quantities, and their associations. Ingredient unifications are made with the use of mapping tables, where simple and canonical forms of ingredients are preferred rather than fancy mentions like "crisp potatoes," or "fresh meat." The project aims to investigate the similarities and interactions between cuisines and, therefore, cultures. This is a good example of using entity annotations and their unifications. Another example of a software system that manages uncertainty (about the location of toponyms in North Africa as they appear in historical sources of medieval and modern times) is described by Martín-Rodilla, Pereira-Fariña,

JENNIFER EDMOND
Jennifer Edmond is associate professor and co-director of the Center for Digital Humanities at Trinity College Dublin. She holds a PhD in Germanic languages and literatures from Yale University, and applies her training as a scholar of language, narrative, and culture to the study and promotion of advanced methods in, and infrastructures for, the arts and humanities. Jennifer serves as President of the Board of Directors of the pan-European research infrastructure DARIAH-EU. Additionally, she represents this body on the Open Science Policy Platform (OSPP), which supports the European Commission in developing and promoting open science policies.

CEZARY MAZUREK
Cezary Mazurek is a director of Poznań Supercomputing and Networking Center. He received his PhD in Computer Science from Poznań University of Technology in 2004. His expertise and experience are focused on broadly understood ICT applications using research infrastructures. For over twenty-five years of professional activity Cezary has led interdisciplinary teams in which computer scientists and researchers from different domains (e.g., biomedicine, humanities, and earth science) have worked together to address scientific challenges using advanced e-infrastructure services. Recently, he has worked on methods and models for big data processing, the Next Generation Internet initiative (https://www.ngi.eu), and digital humanities.

EVELINE WANDL-VOGT
Eveline Wandl-Vogt is foundress and orchestra of "exploration space" (2017-) at the Austrian Academy of Sciences. She is foundress and director of the Ars Electronica Research Institute "knowledge for humanity (k4h+)" (2019-) and affiliated to metaLab (at) Harvard. Eveline is an experimentalist, knowledge designer, and digital strategist, working against a background of Art Driven Innovation. She has a multidisciplinary university background, including arts and expertise in knowledge and innovation management. She serves as an expert in various global initiatives, mainly in the area of technical and social infrastructures, and participatory methodologies, such as ADHO, ALLEA, COST actions, DARIAH, and ECSA, as well as standardization bodies. She is an experienced knowledge transfer officer, bridging the gaps between academic knowledge and (social) applications aligned with the SDGs.