Digital Texts in Practice

As a student of intellectual, religious, and cultural developments in areas of the Chinese cultural sphere, my initial motivation for engaging with digital texts thirty years ago was to open up the new possibilities that the digital medium offered to researchers, without losing any of the affordances of a traditional printed edition. This requirement includes use of texts for reading, translating, annotating, quoting, and publishing, thus integrating with the whole of the scholarly workflow. At that time theories of electronic texts started to appear and the Text Encoding Initiative had already begun to create a common text model and interchange specification, based mainly on European languages. For East Asian texts, things were much more complicated because of different and quickly evolving character encoding standards, different textual traditions and approaches to text editing, as well as different institutional embedding. In this paper, I will look back at these developments, first to recount some of the history, albeit from a strictly personal perspective, but also to take stock of the situation and consider where we are now, how we got there, and what remains to be done to realize the dream of the universal digital text, easily shared and annotated, but still tractable, verifiable, and authoritative.


Introduction
In this paper, I will look back at different stages of my work with digital texts. I do this not so much because I consider what I did particularly important, but rather because I think it represents some important milestones in the development of digital texts, the theoretical basis for them, and ways of working with them. And of course I do it because I know a bit more about my own work than about other things, even if these other things might be more important.
I am a Sinologist by training; my interest lies mostly in Buddhist and Daoist texts from China of the eighth to twelfth centuries, but also includes intellectual history and poetry of many periods.
However, right here I encountered my first difficulty. The laptop I had bought did not have the standard VGA resolution of 640 x 480, but had the slightly different resolution of the Hercules graphics card, which should be 720 x 350. However, the display of this laptop had only an emulation of the Hercules graphics mode, which displayed 720 x 480 dots on the screen. The software I was using, ETen Chinese System (倚天中文系統; see figure 1), was designed for either the standard VGA or the Hercules mode, but to start it for this specific hardware, it needed a special command line switch that was not documented in the manual. In the previous paragraph, you may have noticed many highly specialized terms that even then were hardly known among technically minded people, let alone aspiring Sinologists. Since I had absolutely no knowledge of the workings of computers or how to operate them, it took me six weeks to get far enough in the manual, which was written in Chinese, to figure out how to start the Chinese mode. I had to find the batch file which contained the appropriate command-line switch before I could type Chinese characters. I then immediately started to type in the poems I was working with for my master's thesis in order to create a critical edition, analyze them, and make a concordance. These were about three hundred poems from manuscripts found in the Central Asian oasis city of Dunhuang at the beginning of the twentieth century, with some fragments from collections in Japan (see figure 3 for an example of such a manuscript fragment). Fortunately for me, there was already a modern edition, but it was not very reliable, and many of the characters in the manuscripts were not in the modern standard form used to create the fonts in the Chinese system I was using. Luckily, since I had studied the manual so thoroughly, I knew that there was a little utility program that allowed users to create additional characters if the standard ones did not suffice.
Apparently, this problem was common enough to already have a ready-made solution. I learned that a separate region in the codespace had been allocated for the definition of such character shapes and they could even be added to the input system, thus making them first-class citizens indistinguishable from the other characters.

Codes and Coding Systems
At this point, it might be useful to very briefly introduce character encoding. Computers were invented in England, but the development into the kind of machines we use today was done in the United States. At first every system had its own way of assigning the internal representation, that is, numbers to visible shapes (i.e., characters). Since this was not sufficient when computers started talking to each other, a standard for encoding was needed. This came to be called the American Standard Code for Information Interchange (ASCII), which contained only the most urgently needed characters (in the version of 1962 only the uppercase letters of the English alphabet; the lowercase rows were added in 1963), since memory allocation was an important consideration. ASCII assigns 128 slots (see figure 4), most to visual characters, but also a significant number to control characters. These slots are represented internally as a fixed sequence of "on" and "off" states, usually represented in writing as 1 and 0. The numbers are thus represented as integers with base two, and every individual binary digit is a "bit." A binary number with 7 digits can encode up to 128 slots. Internally the bits are represented in memory in fixed spaces of length 8, which came to be called a "byte," but one of the bits was used for control purposes, so only 7 bits were available for encoding in ASCII. All coding systems that have been developed since then bear some kind of relationship to ASCII: most involve an extension of the codespace; some also a slight change in the assignment of non-alphabetical character shapes. The system I happened to start my computing life with was developed in Taiwan and thus used the Big5 encoding, a coding system for traditional Chinese, which enlarges the ASCII codespace by using two ASCII characters to represent one Chinese character.
Big5 is an industry-dened character encoding system (I avoid the word standard here, since there are actually a number of competing, slightly dierent versions and there is no denite standard document 1 ), while most of the other character encoding systems used in East Asia are dened by the government bodies charged with dening standards. Big5 also allows text in plain ASCII to mingle with Chinese, making it much easier to use in practice than the rst batch of ocial encodings, which assume a completely localized system with no room for plain ASCII. Since characters can occupy either one position (English) or two positions (Chinese), the processing becomes more complex and requires specialized software.

The Missing Characters
Back to the poems for my master's thesis. I could type and display all the characters easily, and even print them with my dot-matrix printer. But later, when I started to use the much nicer outline fonts that allow for much more detailed renderings of the characters, which is especially necessary for kanji, I discovered that these customized characters were missing, since there was no way to add them to an existing font. That was the first time I encountered a problem I would spend a lot of time with in later years: attempting to make these privately defined characters in some ways interchangeable across boundaries of character encodings, operating systems, users, and language environments.

ZenBase
After receiving my master's degree, I set out for graduate studies in Kyoto and soon found myself busy helping Urs App and his Zen Knowledgebase project at the International Research Institute for Zen Buddhism of Hanazono University. He was trying to create a text database of all the texts … Urs was also a founding member of the Electronic Buddhist Text Initiative (EBTI), which was started by Lew Lancaster at UC Berkeley in 1993. Lew travels a lot and picks up on new trends early. One souvenir of a visit he made to our Kyoto Institute that year was a thick tome titled The SGML Handbook (Goldfarb and Rubinsky 1990), which landed on my desk and claimed to be important. I started to dig into it, but it proved to be very hard to read. Looking for easier access points and other people struggling to understand SGML, I found references to something called the "Text Encoding Initiative" and discussions of "P2," which appeared to be a set of draft documents on an FTP server in Oslo, Norway. This was a real revelation and I immediately saw that TEI would be the future of digital texts: "These Guidelines … are addressed to anyone who works with any text in electronic form. They provide a means of encoding those features of a text which need to be identified in some way in order to aid the processing of that text by computer programs" (Sperberg-McQueen and Burnard 1993). And this information about the documents could also be defined in a machine-readable and machine-exchangeable form! Eureka! Another result of this discovery was that we decided we needed an introduction to TEI and invited the editors to the next EBTI conference, which was being held at Haienji, a remote mountain monastery in Korea, in October 1994, to discuss a "Buddhist DTD" (as we then called it, without really understanding what a DTD was or why it was useful).

As you might expect, there were also many setbacks, especially since applying TEI to East Asian texts with their various encodings and different writing conventions was still quite a tall order, but the connection was made, and for the next twenty years, applying TEI to East Asian texts formed a considerable part of my professional life. Urs understood the necessity of all this markup and SGML very well (and wrote enthusiastically about it in the newsletter: see App 1993; 1995a, b, c), but he was still reluctant to burden our users with it right away, since a toolchain for working with these SGML-encoded texts was still largely missing and what software was available would not work with our Chinese texts. So the texts on the ZenBase CD 1 (App, Wittern, and Fujimoto 1995), which was published in June 1995, were still plain text without TEI markup, but each of them had a TEI header attached, because we realized that having the metadata of the texts in machine-readable form was itself a big win. I did manage to sneak in one text I had encoded in TEI P3 (see figure 5), just to demonstrate the usefulness of this new technology, but we needed to provide a program that would strip out the markup to make it usable in the same way as the other texts. Use of the CD depended upon a grep-like tool, which extracts matching lines from a text file, to locate a search term in the text. Although a primitive method limited in many ways, this tool had made digital texts usable in practice for many scholars as far back as the early 1990s, and even today the regular expression tool grep is loved by colleagues of my generation in Chinese Buddhist Studies. And the TEI SGML-encoded version of the Wudeng Huiyuan (WDHY), despite the beauty of all these angled brackets, did not see much use, at least not any use that would tap into its true potential. The screenshot in figure 6 shows the beginning of the body of the text.
What I was really proud of and always used as my sales pitch for TEI was the use of the <RS> tag and its @KEY attribute.
This text recounts the sayings and doings of about two thousand Zen masters, all of them referred to in the text as "師," which means "the master." Using the <RS> tag allowed me to assign a key to each of these masters and thus distinguish the utterances, find the sayings of specific masters, and much more. Of course, the software to actually do this was not available then, but the concept was still appealing.
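
The kind of processing that the key attribute makes possible can be sketched today in a few lines of Python. The sample below is an assumption throughout: it uses XML-style lowercase `rs` and `key` rather than the SGML forms of TEI P3, the keys `M0001` and `M0002` are made up, and the passages are invented for illustration rather than quoted from the WDHY.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample: every master is written as the same character 師,
# and only the (hypothetical) key tells them apart.
SAMPLE = """
<div>
  <p><rs key="M0001">師</rs>曰：「喫茶去。」</p>
  <p>僧問<rs key="M0002">師</rs>：「如何是道？」</p>
  <p><rs key="M0001">師</rs>便打。</p>
</div>
"""


def passages_by_master(xml_text):
    """Group the text of each paragraph under the keys of the masters it mentions."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)
    for p in root.iter("p"):
        text = "".join(p.itertext())      # paragraph text with markup removed
        for rs in p.iter("rs"):
            index[rs.get("key")].append(text)
    return dict(index)


index = passages_by_master(SAMPLE)
for key, passages in index.items():
    print(key, passages)
```

A plain-text search for 師 would return all three paragraphs without distinction; the keyed index separates the two speakers.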

CBETA
PhD in hand, I went to Taiwan in February 1998 to attend the founding meeting of the Chinese Buddhist Electronic Text Association (CBETA), and then immediately decided to move there with my family. This provided me with an opportunity to work with a team of very dedicated people on a new digital version of the Chinese Buddhist canon. At that time some lay Buddhists and some researchers in Buddhist studies had started to type texts into their computers and made them available on the internet, then in its infancy, thereby allowing free sharing of these texts. As can be expected, there was a lack of authoritative editions and scholarly rigor; most texts did not indicate the source of a digital transcription and there was no way to easily compare one to a printed version to make sure it was accurate. CBETA set a goal of providing a digital version of high quality, one that could be useful to both scholars and believers, based for the first batch of common texts on the most authoritative and widely used edition of the Chinese Buddhist canon, the Taisho Shinshu Daizokyo compiled in Japan during the 1920s and 1930s.

Such a project could not be undertaken without a carefully designed workflow and a well-defined text format. In my view, the only possible candidate for this was TEI. Nobody else on the team had ever heard of it, but fortunately they accepted my suggestion and since then the master text format (figure 7 shows an excerpt from such a text) and a number of published versions derived from it have all been encoded in TEI, with the most recent version in TEI P5 and Unicode. … was held in Pisa, Italy, in November 2001, and I was elected a member of the newly formed Technical Council.

A long list of things urgently needed to be addressed, including the XML-compatible version of the Guidelines, which was eventually published as P4 in 2002, and a complete overhaul of the way character encoding was addressed within the TEI. For the latter undertaking, I was tasked to form a working group. The TEI Character Encoding Workgroup's efforts resulted, among other things, in the so-called Gaiji module of P5, which allows users of the TEI Guidelines to encode characters within their texts which have not yet been added to Unicode. As the name suggests, the creation of this module was initially driven by the desire to address the problem of widely varying orthography in texts printed in the premodern era across East Asia, but it proved applicable to many other cases as well.
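
How a declared character can be resolved during processing is sketched below in Python. The sample is a simplified assumption modeled on the `<charDecl>`, `<glyph>`, `<mapping>`, and `<g>` elements of the Gaiji module: the TEI namespace and header structure are omitted, the identifier `g0001` and the choice of character are invented, and the function name `resolve_gaiji` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined, so ElementTree reports xml:id under this name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, namespace-free sample; the glyph declaration is invented.
SAMPLE = """
<TEI>
  <charDecl>
    <glyph xml:id="g0001">
      <mapping type="normalized">爲</mapping>
    </glyph>
  </charDecl>
  <p>無<g ref="#g0001"/>法</p>
</TEI>
"""


def resolve_gaiji(xml_text, placeholder="〓"):
    """Return the text of each <p>, replacing <g> references by their normalized mapping."""
    root = ET.fromstring(xml_text)
    table = {}
    for glyph in root.iter("glyph"):
        mapping = glyph.find("mapping[@type='normalized']")
        if mapping is not None:
            table["#" + glyph.get(XML_ID)] = mapping.text
    paragraphs = []
    for p in root.iter("p"):
        parts = [p.text or ""]
        for child in p:
            if child.tag == "g":
                # Undeclared references fall back to a visible placeholder.
                parts.append(table.get(child.get("ref"), placeholder))
            else:
                parts.append("".join(child.itertext()))
            parts.append(child.tail or "")
        paragraphs.append("".join(parts))
    return paragraphs


resolved = resolve_gaiji(SAMPLE)
print(resolved)
```

The point of the indirection is that the text itself only carries a pointer; what is displayed or searched can be decided later, per application, from the declaration.
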
… contain markup had to be reformulated as elements, where the value could be textual content, so that the same feature could be expressed without using attributes, and this in turn paved the way to creating elements such as <choice>, which describe different logical paths through a document.

The old assumption underlying the early work of the TEI (at least in some people's minds), that markup is what is in angled brackets and if one takes that away, the pure, plain text would be left over for the reader to inspect (which had not been entirely true anyway), completely fell apart.
Even the generation of such plain text now involves proper parsing and processing of the text with the XML toolchain, not just a few regular expressions that a Perl programmer might come up with.
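
A small Python sketch may illustrate why regular expressions are no longer enough. The sample sentence, the helper name `plain_text`, and the decision to skip `<note>` are all assumptions made for this illustration; a real TEI document would also carry a namespace, which is omitted here.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample with two alternative readings and an editorial note.
SAMPLE = ("<p>The <choice><orig>olde</orig><reg>old</reg></choice>"
          " text<note>editor's remark</note> ends.</p>")


def plain_text(elem, prefer="reg", skip=("note",)):
    """Flatten an element to text, following one path through each <choice>."""
    if elem.tag in skip:
        return ""
    if elem.tag == "choice":
        chosen = elem.find(prefer)
        if chosen is None and len(elem):
            chosen = elem[0]          # fall back to the first alternative
        return plain_text(chosen, prefer, skip) if chosen is not None else ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(plain_text(child, prefer, skip))
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
naive = re.sub(r"<[^>]+>", "", SAMPLE)        # just delete the angled brackets
regularized = plain_text(root, prefer="reg")
original = plain_text(root, prefer="orig")

print(naive)
print(regularized)
print(original)
```

Deleting the tags yields a string that contains both readings and the note run together, while walking the tree yields one coherent text per chosen path.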

Council: P5
After sitting on the TEI Technical Council since 2001, to my surprise I was appointed chair in 2003.
Our task was to invent and implement the major architectural changes that were required for P5, including a whole new templating infrastructure based on what came to be known as ODD. Fortunately, the never-tiring Sebastian Rahtz (1955-2016) was a core member who did the lion's share of the work, along with Lou Burnard, James Cummings, and Laurent Romary, so my job was simply to keep the ball rolling, schedule meetings, and the like. Nevertheless, when we actually managed to publish P5 on schedule at the TEI meeting in Maryland in 2007, it was a great relief and allowed me to shift focus to other work.

Daozang jiyao
Sometime around 2005, we started a project at our institute in Kyoto, led by the late Monica Esposito (1962-2011), which aimed at research in Daoism during the Qing period, focusing on one of the major text collections of that period, the Daozang jiyao (DZJY). As happens frequently with premodern texts and text collections in China, the content is somewhat fluid: after the initial printing, there have been additions, reorderings, and other changes to the overall appearance, as well as a reprint at the beginning of the twentieth century, which again was fluid in its composition.
In this project, a digital edition was to re-create this complex history and allow researchers to inspect the different versions, compare the content, and trace the development of texts. That was the plan.
After more than ten years of working with markup and digital texts, it was obvious to me that this had to be done in TEI. I started to train our small team of researchers with a pilot project while designing the workflow. Some of the digital transcriptions were acquired from a company specializing in the production of text databases, others were keyed in by our partners in Taiwan or China, and some were produced in-house. In any case, the texts came in a plain-text format, where every line in the text file represented a line of the original woodblock print. We converted these text files to structured XML files, which were then proofread and enhanced with information about textual variants. Or such was the plan. Then reality intervened.

As it turned out, mixing the two very distinct tasks of proofreading the digital text against its original source and enhancing this digital text with readings from other sources was not making the work easier, but rather required a constant switching of contexts and became a burden that slowed down progress. On top of that, translating the observations our researchers made on the texts into the correct markup constructs also required much more training than expected and was undertaken less than enthusiastically by those involved.

This caused intense discussions among members of the team and we started to look in different directions for solutions. At that time, I also started to dig further into editorial theory, scholarly editing, and the way these fields were changed by the introduction of digital tools. In the end, we decided to separate these concerns and created separate files for each of the versions we edited (both a digital facsimile and a documentary transcription) and separately a new edition, which also included punctuation and standardized orthography that reflected the editorial views of the team.

The work on the Daozang jiyao thus proved to be a good opportunity to further refine both the theoretical approach and the practical implementation of working with premodern Chinese digital texts. Interestingly, at the same time, the TEI community was updating and refining the TEI textual model, which used to consist of "the one and true" digital representation of the text in the form of a transcribed text with markup applied to it. To this text model they added the notion of a digital facsimile as a separate representation of the text, which shortly afterward was followed by a way to document the physical representation of the text, via <sourceDoc>. This development shows how the practice of working with digital text and markup evolves over time and improves our understanding of both the textual features and how markup should be done, a fact that is also documented on an almost daily basis in the discussions on the TEI mailing list.

Kanseki Repository
By the year 2010, the practice of using separate text files for different witnesses of a text had become well established in our workflow. For tracking changes to these files, we had used version control tools from the start. At some point, we realized that the modern distributed variety of these tools, Git and GitHub, not only had the potential to solve the problem of keeping track of changes made to a file, but could also be used to hold all witnesses of a text in one repository, each of them represented as a "branch." (In the terminology of version control software, a branch is one current state in the editing history of the file, which has been given a name to make it easy to address it and to track changes along a specific trajectory.)

The distributed nature of this toolchain, which unlike earlier version control systems does not require a central authority, also seemed to have the potential to solve another problem I had been trying to solve almost from the beginning of my work with digital texts. As stated already, one of the aims of my work from the outset was to make a digital version of a text at least as versatile as a printed scholarly edition. For me, this also included taking ownership of one specific copy of such an edition and tracking the work by adding marginal notes, comments, and references directly into the book. With GitHub as a repository for texts and Git as a means to control the various maintenance tasks, researchers interested in a text could clone the text, add their own marginal notes, then make their version of the text available to us or any other researcher to integrate, if we so chose.

A Git workflow can use any kind of digital material, but it works better with textual material as opposed to images or videos, and even better for texts that use lines as a structural element. This again is where the plain text we used in the Daozang jiyao project worked better than did the XML tree structure, which is at the core of every TEI file.

When I first presented this idea at the TEI conference in Würzburg in October 2011, I got this comment via a tweet from one of the most respected members of the TEI community (figure 8: @rahtz: interesting that @cwittern thinks <> is hard, Git is easy. #tei2011).
Figure 8. Aurélien Berra retweeting Sebastian Rahtz's tweet.
As described in that talk (published as Wittern 2013), the text format used here is not simply plain text, but rather an extended form of the text format used in the Emacs Org mode, in spirit comparable to the much more frequently seen Markdown, but better. The defining difference here is the more elegant and functional choice of markup elements, and the fact that the format was originally conceived as the base for a note-taking and scheduling application, so the markup itself and the software that operates on it are essentially one unit, and the development of the software (which is itself community driven) informs the choices and considerations for markup constructs.
For the DZJY project, we added a few more conventions, to accommodate our specific needs, but without changing any of the essential features. Org mode uses what I called an "implicit markup," which is exactly the opposite of XML. Org mode's markup is as short as possible and in many cases derived from context. An asterisk * followed by a space at the start of a line indicates a heading of level one, instead of TEI's <div> followed by a <head> (and the corresponding closing tags to convey this information).
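
The correspondence between the implicit heading markup and its explicit TEI counterpart can be sketched as a small converter in Python. This is only an illustration under stated assumptions: it handles headings and plain paragraphs, ignores the additional DZJY conventions, and the function name `org_to_tei` and the sample lines are invented.

```python
import re
from xml.sax.saxutils import escape

# One or more asterisks followed by a space mark a heading; the number of
# asterisks gives the nesting level.
HEADING = re.compile(r"^(\*+) (.*)$")


def org_to_tei(lines):
    """Turn Org-style headings into nested <div>/<head> elements."""
    out = []
    depth = 0
    for line in lines:
        m = HEADING.match(line)
        if m:
            level = len(m.group(1))
            while depth >= level:         # close sections at the same or deeper level
                out.append("</div>")
                depth -= 1
            while depth < level:          # open sections down to the new level
                out.append("<div>")
                depth += 1
            out.append(f"<head>{escape(m.group(2))}</head>")
        elif line.strip():
            out.append(f"<p>{escape(line)}</p>")
    out.extend("</div>" for _ in range(depth))
    return "\n".join(out)


tei = org_to_tei(["* 卷一", "first line of text", "** 序", "second line of text"])
print(tei)
```

Four short lines of input become eight lines of explicit markup; the closing tags in particular have to be inferred from context, which is exactly the information the implicit notation leaves unstated.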

From the beginning, the DZJY was in my view itself a pilot project for a much larger project, on which preparatory work started in earnest in 2012: the Kanseki Repository (GitHub username @kanripo).
Building on the experience of the DZJY, the Kanripo project sought a firm theoretical foundation for the creation of digital textual artifacts, based mostly on the German tradition of scholarly editing and its distinction between "documentary edition" and "interpretative edition." These two types are distinguished through naming conventions for the Git branches. Documentary editions are also represented through digital facsimiles, which can be called up to be displayed side by side with the transcribed text. Interpretative editions may normalize the characters used to modern forms, add punctuation, and also make it possible to add translations and semantic annotations.

From earlier textual projects, such as ZenBase, CBETA, and DZJY, but also from other sources available on the Internet, we have compiled an initial catalog of about 10,000 titles to be included in a first phase of the project; this catalog is also being supplemented by users who deposit whatever texts they are interested in into the repository. Since the initial publication on GitHub in September 2015, and the launch of a dedicated website in March 2016, usage has been increasing slowly but steadily.

Kanripo Project Details
All the texts are freely available on GitHub in their source form. This repository of texts can be accessed through the kanripo.org website, but also through a module of the Emacs editor called Mandoku. This allows users to query, access, clone, edit, and push the texts directly from their own computer. Reading, commenting, and editing do not require internet access.

While not yet a full realization of the original vision, this project is currently the best compromise I know of between allowing the researcher (user) to take complete ownership of a text (not just in the technical sense, but also in a practical sense of being in a position to actually be able to edit the text in a way that is meaningful in the context of their aims) and authoritative vetting and editorial quality assurance.

Figures 9, 10, 11, and 12 demonstrate the concept and functions of the Kanseki Repository. On the website, users can search for texts or browse the catalog. Once a text is found, the webserver reads it from the GitHub repository and serves it to the user. For most texts, there are different editions to choose from; usually both documentary and interpretative versions exist. For many texts, there is … database, with no direct access to it for the user. KR does it in this way (1) to allow the user control over their data and (2) so that the user's preferences and settings can be applied to different applications with which the user might access the KR.) When the user selects a text for display that they had previously cloned to their own account, the text shown will be their own private version, with all changes and annotations, not the public one from @kanripo. See figure 12 for an example.
Other customizations and options become available once logged in.

KR-Shadow
What I have shown so far is for users interested in exploration or close reading. For distant reading, text analysis, and similar purposes, a separate account @kr-shadow has been created on GitHub.
You will find here the texts of the "master" branch, which is usually the normalized and edited version of the text in a form that makes it easy to download the whole archive at once.

Mandoku
As mentioned, the texts can also be accessed from the text editor Emacs, which is available on all major platforms. This is intended for people who work intensely with a text, for example as the topic for a PhD thesis. The Emacs module Mandoku provides ways to search the KR, clone texts, create new branches, and many other functions. All other Emacs extensions and modules can also be used. Figure 10 shows an example of a text with its digital facsimile, and figure 11 shows the same poems, rearranged by line, with a translation added. In the middle there is an example of an inline note. And finally, figure 12 shows the same text, pushed to the user's account and displayed from there on the Kanripo website.

Relationship to TEI
I see the approach described here not as an alternative to TEI but as a useful extension. The original concept for the Kanseki Repository did include a transformation tool that would take the different versions of one text and wrap them together into one XML file. This is a "simple" transformation, but there has so far been no opportunity to actually implement it. The other direction, from TEI P5 to a repository of a text with witnesses as branches, does exist, but only for internal use, as it requires a lot of assumptions about source and target. Seen from this angle, the KR can also be envisioned as a postproduction tool, which opens up a new publication avenue for the texts and for users' engagement with them.

In Parting
As shown in the above examples of projects I have been involved with over almost thirty years, we have come a long way in sophistication of textual awareness and technical means to model it. The TEI conferences, as well as job advertisements for early career academic positions, for example, show that there is a thriving community of experts who engage in a high-level discourse that further advances our understanding of complex features of texts and ways to represent such features in machine-readable form, readily available for display and analysis. This community is building a bridge right over the deep rift between researchers in humanities on the one side and computer and technical science on the other.

For me, the question still remains: how can we make sure that the fruits of our efforts, the sophisticated digital editions of texts, published online in most cases, actually reach the researchers that want to make use of them? And do they reach them in a way that is most beneficial to them? Consider, for example, a PhD student working on the development of medical terminology in premodern China, or a researcher interested in changing concepts of relationships as reflected in British novels of the eighteenth and nineteenth centuries: for this kind of research, direct access to the source files of texts, and ideally also ways to select, group, and annotate such files, would be necessary, as well as knowledge on how to filter out exactly what is needed for the question at hand and discard the rest.

Do we envision such work to be done by the researcher themselves? Or does this require a team of specialists from different backgrounds? To what degree should we open the black box which a web-based critical edition is to many of its users, and allow them (in fact force them) to look beyond the surface?

As might be obvious by now, to me the answer to this question is: yes, we need to train young researchers in humanities fields to have a minimum understanding of text encoding, text processing, and the underlying technologies. This should be a general requirement, not just an optional minor or a postgraduate-level course. And in an ideal world, I think I would like to introduce them to Emacs, or to a similar tool that allows them to take control of their digital life, rather than being limited to pressing a few colorful icons.

App, Urs. 1993. "Guidelines for the Creation of Large Chinese Text Databases." Electronic Bodhidharma 3.