correspSearch – Connecting Scholarly Editions of Letters

The web service correspSearch collects correspondence metadata from diverse scholarly editions by using the TEI XML–based “Correspondence Metadata Interchange Format” developed by the TEI Correspondence SIG. The web service provides the aggregated data not only on a website, but also via an API in different machine-readable formats and under a free license. The API also makes it possible to automatically refer or even link a letter in one digital edition to related letters provided by other editions. Collecting edited letters in correspSearch and linking them via the API helps overcome certain methodological disadvantages of scholarly editions of letters, since it enables the contextualization and connection of single letters within an entire correspondence network.

databases do not offer an open, standards-based, and documented way to provide and update data.
And nally, the data cannot be retrieved via an open Application Programming Interface (API) under a free license for subsequent use.A general problem seems to lie with the conception of a single, centralized database, into which all the letter data have to be imported and where the data must be maintained.Considering the huge numbers of letters and potential research focuses, this concept may not be realistically pursuable.This could be one reason, why the BRIEFkasten project was discontinued after a couple of years.In contrast, an approach where decentralized data is simply aggregated via a standardized interchange format seems more promising (Rapp 2009, 205-6; Herres and Neuhaus 2002, 18)especially if we would like to allow editors to use their own appropriate data model (e.g., in digital scholarly editions, the specic underlying TEI encoding) for their dierent purposes.This consideration also led Romanticism specialist Wolfgang Bunzel to request the creation of a decentralized, preferably open digital platform, based on HTML/XML and operating with minimal TEI standards, which is extensible in dierent directions and allows for existing web portals and websites to contribute at the lowest possible cost.This doesn't require some kind of superstructure which covers the entire amount of letters from the Romantic era (the size of which could not be estimated exactly, anyway) but rather an intelligent linking system, which associates existing documents with one another.The creation of such a nexus will naturally lead to research options reaching from searches for persons and places to specic keyword-based searches … (Bunzel 2013, 123, my translation) The web service correspSearch takes a step in this direction, collecting the letter metadata from various editions by using a standardized interchange format and making the data available through open interfaces, based on TEI XML and under a free license.The initiative for this web service arose from the workshop 
"Editions of Letters from around 1800: Finding and Connecting Interfaces," which was organized in February 2014 by Anne Baillot and Markus Schnöpf.

Concept of correspSearch
The web service correspSearch aims to help scholars by offering a central place to search for edited letters, and by guiding them to the original publications. Only the metadata is aggregated from letter indexes of scholarly editions. Legal problems which could arise while retrieving full texts can thus be avoided; furthermore, scholarly editions found only in print form can be included without expensive digitization of entire publications. CorrespSearch is conceptually open in the sense that there is no focus on a particular research subject, time period, or region. This allows for new kinds of research questions to be explored.
The web service correspSearch is designed to aggregate correspondence metadata from digital indexes of letters which are hosted elsewhere on the web. These digital indexes of letters are created and provided by various scholarly edition projects according to a specific TEI XML-based interchange format (see section 3 below) and are registered with the web service by their URL.
The metadata provided via these letter indexes is then retrieved by the web service at regular intervals. Thus, it is straightforward to add indexes of letters to the web service and to update them if necessary. The data can not only be searched via a website, but can also be queried automatically via an Application Programming Interface (API) that is open, well documented, and supports different formats. As the data gathered is provided under free licenses, the web service offers all aggregated data under a CC-BY 4.0 license as well, thus making the data available for further reuse.
A variety of digital infrastructures for correspondence already exists. The web service correspSearch is therefore designed to fit into the existing infrastructure landscape and complement existing services in a useful way. For information regarding purely archival resources, for example, there are already digital formats and services available, such as the Encoded Archival Description (EAD) format and the German digital union catalogue Kalliope. This is why correspSearch focuses more specifically on edited letters, that is, correspondence which has undergone scholarly editing, meaning that there is a published abstract or transcription, maybe even including a commentary. With further developments we envision that the data aggregated by correspSearch can be combined with data from other resources.
The technologies behind correspSearch are the open source software eXist-db and the programming languages XQuery and XSLT.

Correspondence Metadata Interchange Format
In order for correspSearch to be able to aggregate the metadata of letters from diverse sources, these metadata have to be provided in a standardized and machine-readable format. The TEI XML format seemed appropriate in this context, because it had been used in digital editions for several years and has now become a de facto standard. Furthermore, it is possible to rely on the work of the TEI Correspondence Special Interest Group (SIG), which developed the element <correspDesc> ("correspondence description") for the TEI Guidelines. This extension of TEI provides a tagset for the recording of correspondence-specific metadata within the TEI Header, such as sender, addressee, and place of writing (Stadler, Illetschko, and Seifert 2016; Stadler 2014). After going through a few changes, <correspDesc> was integrated into the TEI Guidelines in April 2015. It therefore seemed straightforward to base the exchange format needed here on TEI XML and to use the new element <correspDesc> in this context (Stadler 2014). In order to allow for the provision of digital indexes of letters, the "Correspondence Metadata Interchange Format" (CMIF) is currently being developed within the TEI Correspondence SIG.
The essence of a CMI document consists of multiple <correspDesc> elements, each describing one letter. However, the <correspDesc> elements are used in a significantly reduced and restricted manner in order to allow for further automatic processing. For instance, the <correspAction> element may only be of the types "sent" and "received" and can only contain the elements <persName>, <placeName>, and <date> as children. In addition, if applicable, each <correspDesc> element may contain a reference to the corresponding digital edition which is available online. Besides the <correspDesc> element, only a few elements from the TEI Header are relevant for the CMIF, for example, a bibliographic reference to the corresponding edition of the letters. Using a very reduced and restricted subset of the TEI for the CMIF is necessary in order to enable interoperability without any human intervention (besides registering the URL of the CMI file with correspSearch). However, this restriction applies only to the CMI file, not to the TEI encoding of the texts and their metadata in the scholarly edition itself.
At this point in the development of the web service, only basic metadata about letters can be recorded: information about sender, addressee, places of writing and receiving, and dates. In the future, it will be possible to record additional metadata such as people or events mentioned in a letter, whether some references are uncertain, or the type of textual basis of a letter edition (e.g., manuscript or draft). It will then also be possible to browse through these categories in the correspSearch web interface. We also plan to allow the encoding of the corresponding archival resource in CMIF with the help of official URIs from digital archival catalogues (in Germany, for example, from Kalliope). These further developments raise the question of whether CMIF should continue to be fully conformant to the TEI Guidelines or whether it might be necessary to stray slightly from the TEI in certain aspects. These and other issues regarding the further development of CMIF are discussed in detail in "Perspectives of the Further Development of the Correspondence Metadata Interchange Format (CMIF)" (Dumont 2015).
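The restricted use of <correspDesc> described above can be illustrated with a minimal sketch in Python. It reads one letter description and rejects anything outside the subset named in the text (the action types "sent" and "received"; the children <persName>, <placeName>, and <date>). The sample record, all URIs, and the attribute names `ref` and `when` are assumptions made for illustration; the CMIF documentation remains the authority on the exact encoding.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Illustrative record only: names, URIs, and the date are placeholders.
SAMPLE = """<correspDesc xmlns="http://www.tei-c.org/ns/1.0"
    ref="https://example.org/letters/1">
  <correspAction type="sent">
    <persName ref="https://example.org/authority/person-a">Sender Name</persName>
    <placeName ref="https://example.org/authority/place-x">Place of Writing</placeName>
    <date when="1793-12-05"/>
  </correspAction>
  <correspAction type="received">
    <persName ref="https://example.org/authority/person-b">Addressee Name</persName>
  </correspAction>
</correspDesc>"""

ALLOWED_TYPES = {"sent", "received"}
ALLOWED_CHILDREN = {"persName", "placeName", "date"}


def read_corresp_desc(xml_text):
    """Return one letter description as a plain dictionary."""
    root = ET.fromstring(xml_text)
    letter = {"ref": root.get("ref"), "actions": {}}
    for action in root.findall("tei:correspAction", NS):
        kind = action.get("type")
        if kind not in ALLOWED_TYPES:
            raise ValueError(f"unexpected correspAction type: {kind!r}")
        fields = {}
        for child in action:
            name = child.tag.split("}", 1)[-1]  # strip the namespace prefix
            if name not in ALLOWED_CHILDREN:
                raise ValueError(f"unexpected child element: {name!r}")
            if name == "date":
                fields["date"] = child.get("when")
            else:
                fields[name] = {
                    "ref": child.get("ref"),
                    "label": (child.text or "").strip(),
                }
        letter["actions"][kind] = fields
    return letter


letter = read_corresp_desc(SAMPLE)
print(letter["actions"]["sent"]["date"])
```

Rejecting unknown types and children, rather than silently ignoring them, mirrors the point made above: a narrow subset is what allows processing without human intervention.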

Authority Files
The CMIF specication strongly recommends the usage of IDs from authority les to identify persons and places across projects and publications. 10Authority les are databases which were usually created and are still maintained by cooperative consortia of libraries, often headed by a major or national library, such as the Library of Congress or the German National Library.They are designed to "establish forms of names" and other metadata "used on bibliographic records … to provide uniform access to materials in library catalogs and to provide clear identication of authors and subject headings." 11For each entity there exists a record with a unique, projectindependent and persistent identier, which can be referenced by everyone and is widely used as a reference in scholarly communities.In the context of correspSearch, the use of authority les avoids the limitations that would be created by referring to specic strings.Names can occur in dierent variants (e.g., without a certain rst name) and vary orthographically.Furthermore, they can be ambiguous (e.g., "John Smith").Thus, correspSearch searches are based on the IDs from authority les as encoded in the CMI les.The benet of using authority-controlled IDs was proven by the German scholarly community via the application of the simple BEACON 12 format in diverse encyclopedias or digital editions (Baillot and Busch 2016).

The web service correspSearch supports several authority files for persons. Currently, these are the Integrated Authority File (GND) of the German National Library, the Library of Congress Authorities, the Autorités de la Bibliothèque Nationale de France (BNF), as well as the Web Authorities of the National Diet Library (NDL) in Japan. The Virtual International Authority File (VIAF) hosted by OCLC is also supported. With the help of VIAF, correspSearch maps IDs from one authority file to the IDs of the others. In the end, it does not matter which one of the five supported authority files is used in a CMI file to identify senders or recipients: the web service automatically looks up all available IDs of a person.
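The effect of this mapping can be sketched as follows. The example assumes a small in-memory table of equivalence clusters in place of a live VIAF lookup; the URIs and function names are hypothetical and do not describe how correspSearch itself is implemented.

```python
# Hypothetical clusters: each set holds URIs that denote the same person
# in different authority files. All URIs are placeholders.
CLUSTERS = [
    {
        "https://example.org/gnd/P1",
        "https://example.org/lc/P1",
        "https://example.org/viaf/P1",
    },
    {
        "https://example.org/gnd/P2",
        "https://example.org/bnf/P2",
    },
]


def build_index(clusters):
    """Map every URI to the full set of URIs for the same person."""
    index = {}
    for cluster in clusters:
        frozen = frozenset(cluster)
        for uri in cluster:
            index[uri] = frozen
    return index


def equivalent_ids(uri, index):
    """Return all known URIs for the person; an unknown URI maps to itself."""
    return index.get(uri, frozenset({uri}))


def same_person(uri_a, uri_b, index):
    """True if both URIs are recorded as denoting the same person."""
    return uri_b in equivalent_ids(uri_a, index)


INDEX = build_index(CLUSTERS)
print(same_person("https://example.org/gnd/P1", "https://example.org/lc/P1", INDEX))
```

Because a search expands its input to the whole cluster first, a letter index that uses one authority file and a query that uses another still meet on the same person.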

Using IDs from authority les is not entirely unproblematic.First, in some cases entities may have been recorded multiple times, which means that there are several IDs available for one person.
This complicates the selection of the appropriate ID, but orientation can be gained by assessing the validity of the dierent options: oftentimes, one of the records available is more enriched than the others and should therefore be used.Second, it also happens that entities, especially persons, that are of interest to research in general do not have any record in an authority le.In such cases it is necessary to add new records to an authority le, which usually has to be done by the sta of the maintaining library or organization.Because of limited human resources in libraries, completing this step can take some time after a new record has been suggested.Despite this lack of optimization, authority les remain the best approach currently available to connect information about persons, places, etc., across projects and institutions. 18Especially in the context of semantic web technologies, the usage of standardized unique identiers (i.e., URIs from authority les) for common entities is crucial in order to ensure the sharing of research data as linked open data.We can expect the maintenance of authority les to be enhanced in the future.

For the identication of places, the support of GeoNames was implemented in correspSearch.
GeoNames is a free licensed database, which consists of over ten million geographic names that are categorized in nine "feature classes" and 645 "feature codes." 19For each place name, additional data such as geographic coordinates are available.Place names that are not already available in GeoNames can be added by any registered user.If necessary, it is possible to note alternate names for a place, for example, names in other languages or historical names.It is also possible to categorize as a "historical site" a place that no longer exists.Despite these features, the handling of historical places in GeoNames is limited because it is a contemporary gazetteer: that is, it is not possible to record the time period during which a place (or place name) existed.It is a major limitation that the data about state boundaries or the state to which a city belongs only reect the current situation and have no historical depth whatsoever.This is challenging, but not impossible to address, when it comes to correspondences since the places mentioned in the context of historical correspondence are often names of cities or villages.They may today be part of a larger municipality but remain addressable because they still exist as a separate data record with a geographic point and not with a certain area, which can change over time.All in all, the usage of GeoNames in CMIF and correspSearch up to now has been benecial and raised only limited issues.

Data
As of the beginning of February 2017, correspSearch has gathered the metadata of more than 26,000 letters from almost 110 publications. Letters from digital editions (e.g., Carl Maria von Weber – Collected Works, Letters and Texts: Intellectual Berlin around 1800, or the project Alfred Escher's Correspondence) are not only registered with the web service but also directly available to the user via a link. Additionally, many letters from printed scholarly editions have been recorded in CMIF and are thus available through a digital catalogue for the first time: for example, the letters of the natural scientist and anatomist Samuel Thomas Soemmerring or the correspondence between Karl August Varnhagen von Ense and Friedrich de la Motte Fouqué. Besides scholarly editions, the web service records edited letters that are published in appendixes of monographs or in journal articles. Examples are the open access journal HiN: Alexander von Humboldt im Netz. International Review for Humboldtian Studies, where edited letters of the famous scientist are published periodically for the first time, or the monograph Studien zum geistlichen Werk Otto Nicolais, where Klaus Rettinghaus published a number of edited letters in appendix B.
The aggregated data is growing and will continue to grow in the future; multiple scholarly editions have announced their upcoming contributions. Such participation is crucial: it is only through the willingness of relevant projects to provide correspondence metadata that a web service like correspSearch can be valuable to the scholarly community. Every scholarly project is welcome to provide its metadata in the CMI format. Interested scholars and projects can find step-by-step instructions, documentation of the CMI format, and a FAQ section on the website, as well as templates and examples in the GitHub repository of the TEI Correspondence SIG. Providing a CMI file has multiple advantages for the participating research projects. First, it enables them to make their edited letters more easily available to the community and to connect their editions of letters (automatically; see below) with other scholarly editions. Second, they help to facilitate research that involves a broader examination of correspondences, as mentioned above. Furthermore, by publishing their metadata, they can get useful feedback to correct and refine their metadata, as the experience with CMIF and correspSearch has shown so far: more users read and use the metadata, and therefore errors can be more easily discovered.

APIs
Researchers, and of course editors working on their own scholarly editions, can search the correspSearch data pool easily via the web interface. The search functionalities are currently restricted to the essential fields, but they will be extended in the future. The aim from the very beginning, however, was to provide the data not only in a human-readable but also in a machine-readable way in order to allow for automated queries and subsequent use. Therefore correspSearch offers multiple application programming interfaces. The APIs support all search parameters available in the web interface. The results can be retrieved in several formats. First, correspSearch offers all data as TEI XML in the CMI format. Second, to enable as much reuse as possible, the output is also provided in a JSON serialization. Mixed-content XML elements cannot be converted well into JSON, but such markup almost never occurs in the CMIF. Third, besides TEI XML and TEI JSON, the web service offers an experimental CSV output, which is intended for individual use of search results (which is why this API is also implemented in the graphical web interface). Finally, a BEACON API enables the cross-linking of search results to a specific person in correspSearch.
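How one set of search parameters can be served in several output formats may be sketched as a small URL builder. The base URL, the format names used as path segments, and the parameter names `correspondent`, `startdate`, and `enddate` are placeholders chosen for this illustration; the actual endpoints and parameters should be taken from the API documentation on the correspSearch website.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Placeholder base URL and format names; not the real service endpoints.
BASE_URL = "https://example.org/api"
FORMATS = {"tei-xml", "tei-json", "csv"}


def build_query_url(fmt, **params):
    """Build a query URL for one output format from search parameters."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    # Drop unset parameters and sort the rest for a stable URL.
    given = {key: value for key, value in sorted(params.items()) if value is not None}
    query = urlencode(given)
    return f"{BASE_URL}/{fmt}?{query}" if query else f"{BASE_URL}/{fmt}"


url = build_query_url(
    "tei-json",
    correspondent="https://example.org/authority/person-a",  # hypothetical ID
    startdate="1793-01-01",
    enddate=None,
)
print(url)
print(parse_qs(urlparse(url).query))
```

Keeping the format separate from the search parameters reflects the description above: the same query can be repeated against another serialization simply by changing the first argument.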

Via the APIs, scholars can exploit the aggregated data using the technologies and software they favor. Therefore, with sufficiently extended data and suitable software it will be possible to perform research on, for instance, social or correspondence networks based on correspSearch. The developers from LAB1100, for example, have imported correspSearch data into their visualization tool "nodegoat" to visualize a correspondence network (figure 3).
Figure 3. A visualization in "nodegoat" of correspondence metadata gathered by correspSearch.

Thanks to the API, it is also possible to automatically refer or even link from one digital letter edition to related letters provided by other editions. This function has already been implemented in the edition humboldt digital (figure 4). Whenever a user visits a certain letter on the website, the digital edition automatically performs a query via the correspSearch API to find out if there are any other letters from Humboldt's correspondent (in the pictured case, Samuel Thomas Soemmerring) in other scholarly editions. In the pictured case the results from the correspSearch API offer, among others, a hint to a letter from Georg Christoph Lichtenberg, which was sent to Samuel Thomas Soemmerring two days before. This feature helps researchers avoid methodological problems when interpreting a piece of correspondence: when analyzing a letter they usually consider the preceding and following letters in the correspondence between the sender and addressee as well. Their interpretation often does not include the letters which the correspondents sent to or received from other persons within the same context, although the content of those letters may be highly relevant to the letter they are interested in. With this feature, the broader correspondence context of historical letters can be explored in greater depth than ever before.
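The selection logic behind such a link list can be sketched as follows. The example works on an in-memory list of records rather than on a live API response; all identifiers, editions, and dates are invented for illustration, and the size of the time window is an assumption, not a documented setting of edition humboldt digital.

```python
from datetime import date, timedelta

# Invented records standing in for aggregated letter metadata.
POOL = [
    {"edition": "edition-a", "sender": "person-h", "addressee": "person-s",
     "date": date(1793, 12, 7)},
    {"edition": "edition-b", "sender": "person-l", "addressee": "person-s",
     "date": date(1793, 12, 5)},
    {"edition": "edition-b", "sender": "person-l", "addressee": "person-k",
     "date": date(1793, 12, 6)},
    {"edition": "edition-c", "sender": "person-s", "addressee": "person-g",
     "date": date(1795, 1, 1)},
]


def related_letters(pool, correspondent, around, own_edition, window_days=7):
    """Letters from other editions that involve the correspondent near a date."""
    window = timedelta(days=window_days)
    hits = [
        item for item in pool
        if item["edition"] != own_edition
        and correspondent in (item["sender"], item["addressee"])
        and abs(item["date"] - around) <= window
    ]
    return sorted(hits, key=lambda item: item["date"])


# A reader is viewing the letter in edition-a addressed to person-s.
hits = related_letters(POOL, "person-s", date(1793, 12, 7), "edition-a")
for hit in hits:
    print(hit["edition"], hit["sender"], hit["date"].isoformat())
```

Excluding the reader's own edition and restricting the results to a time window keeps the list to letters that add context from elsewhere, which is the case described above.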

Besides TEI XML, TEI JSON, and CSV, the correspSearch API also offers a preliminary RDF version of the data by using the web service XTriples developed by the Academy of Sciences, Humanities and Literature in Mainz. With the help of XTriples, the CMIF is converted into RDF statements. Providing the aggregated correspondence metadata as RDF enables users to analyze the data with the help of semantic web technologies. Furthermore, the RDF data can be used as Linked Open Data in new contexts and in combination with data from other sources in order to answer questions beyond the correspondence-specific issues (Grüntgens and Schrade 2016). For instance, with data from multiple universities' catalogues of professors, lecturers, and students, scholars could research whether or not a correspondence network (or part of it) reflects a certain "school."
The number of letters addressed to Humboldt probably exceeded 100,000, even though only 3,400 of those letters are known today (Schwarz 2002, 194-95; Suckow and Schwarz 1998, 119). From the beginning, this extraordinarily large number of documents has deterred Humboldt researchers from planning an edition of Humboldt's complete letters. They have concentrated instead on providing editions for certain parts of his correspondence. The focus on specific correspondents of Humboldt is legitimate from an economical and methodological point of view. However, these selective editions also entail problems (Suckow and Schwarz 1998). First, the choice of criteria applied to guide the text selections is of decisive importance, as is the choice of whose correspondence to prioritize, since, after all, the aim of completeness will never be achieved, though it still exists in principle. Second, users of selective editions may lose sight of the fact that individual letters not only form part of a correspondence between two people, but are also embedded in a much larger network of letters. This is the case in particular when it comes to Humboldt, who wrote various letters to different people at the same time, and whose statements on similar issues or people often vary depending on the addressee. Humboldt possessed great diplomatic skills and knew very well that he might risk making enemies had he expressed critical opinions too openly towards the person concerned (Werner 2000, 2).

Collecting the edited letters of Humboldt in correspSearch (a task that has not yet been completed) helps overcome a methodological disadvantage of selective editions, since it enables the contextualization of individual letters within the entire correspondence of Humboldt: on the one hand for the editor during the creation of a scholarly edition, on the other hand for the user while exploring and analyzing edited letters. By proceeding like this, the practice of selective editions, which is straightforward with regard to other aspects, may be continued. Therefore, the possibility of automatically connecting letters in scholarly editions with the help of correspSearch, as shown above, was implemented within the edition humboldt digital. The connection is especially useful for this edition project because the selection of letters and texts considered for the edition is partially based on thematic categories. In addition, further development of correspSearch will allow for improved accessibility to the various topics of Humboldt's correspondence.

However, the web service correspSearch is not only meant to provide renewed access and contextualization for historical persons who are already known. It is also intended to draw attention to less well-known correspondence partners. The aggregation of correspondence metadata from diverse editions enhances the research options and makes it possible to evaluate networks and their presumably central figures. This method not only draws attention to persons who have not previously been taken into consideration by a scholarly edition, but it also facilitates the evaluation of their importance, because it focuses not only on one or two persons but makes it possible to display a whole network of, for example, academics.

Conclusion
Especially for editions of letters, the digital age has brought significant progress that allows scholars to address a series of known challenges like incompleteness and contextualization of letters, or the discovery and disclosure of correspondence networks. The web service correspSearch is a building block for improved solutions in this context. We plan to develop correspSearch further in order to enhance its search functionalities (e.g., for persons and publications mentioned in the text) and to make use of the extensions planned for the CMI format.
In the course of this development, support for further authority files will be implemented. Besides the search functionalities, we plan to simplify the creation of CMI files through diverse tools. The aggregated data should grow further. Recently, new metadata records have been added: among others, records from the online platform of the critical edition of the works of Richard Strauss (Richard Strauss Werke) and from the digital edition Briefe der Fruchtbringenden Gesellschaft und Beilagen. The more correspondence metadata is published and can be included in correspSearch, the more useful the web service will become to creators and users of scholarly editions. We welcome additional contributions!

Figure 2. Schematic rendering of the different parts of a <correspDesc> element in the CMI format. The example is taken from one provided by the TEI Correspondence SIG.

Journal of the Text Encoding Initiative, Issue 10, 14/02/2018. Selected Papers from the 2015 TEI Conference. correspSearch – Connecting Scholarly Editions of Letters

Figure 4. Screenshot of the digital scholarly edition edition humboldt digital, which presents, among other manuscripts, some letters to and from Alexander von Humboldt. In the top right corner, links to letters in other scholarly editions are presented with the help of the correspSearch API. Source: http://edition-humboldt.de/H0002729.

Figure 5. This diagram, created by Torsten Schrade, illustrates how the correspondence metadata aggregated by correspSearch could be used as Linked Open Data (LOD) with information from other resources like authority files (here GeoNames and the Integrated Authority File (GND)). This simple example illustrates the idea of using data from correspSearch with further LOD resources.