Encoding Disappearing Characters: The Case of Twentieth-Century Japanese-Canadian Names

The Landscapes of Injustice project seeks to encode mid-twentieth-century documents by and about the Japanese-Canadian community so they are accessible to modern audiences. The fundamental problem is that some of the kanji used at that time have since been replaced by different kanji, and others have been removed from lists of formally acceptable characters. This report documents our efforts with two technologies designed to address this situation. The first is the Standardized Variation Sequence (SVS) feature of Unicode. Our work revealed that this set of variation sequences does not completely cover the old and new glyph pairs identified by the Japanese authorities, and that the pairs formally identified by the Japanese authorities do not completely cover all the new glyph forms in general use. We turned to TEI's <charDecl>, <glyph>, and <mapping> elements as a second technology to augment the support provided by Unicode. Lastly, we dealt with the issue of training encoders to recognize and accurately encode these non-conventional characters.


The Problem of Disappearing Japanese Characters
The Landscapes of Injustice project seeks to integrate data from various sources (such as oral histories, court records, government minutes, land title documents, maps, community directories, and personal letters) to capture multiple perspectives on events affecting Canadians of Japanese descent in the 1940s and to create products based on that research for modern academic and public audiences. The Japanese-language documents (for example, community directories) used kanji (Chinese characters used in Japanese script) which were perfectly acceptable at the time, but which have since been superseded (either officially or practically) by other kanji glyphs. The project's concern with the changing forms of kanji over the twentieth century is primarily practical, rather than a scholarly focus. In 1946 and 1981 the Japanese government specified simpler forms (known as shinjitai kanji) for certain characters and deprecated their older, traditional forms (known as kyūjitai kanji) for many purposes such as education and government publication (Agency for Cultural Affairs 2010).
Although the kyūjitai kanji were not banned, they have become unreadable to more and more readers over time, making texts that include them difficult for modern readers; at least, however, there is a recognized mapping from new form to old form. In addition to the officially recognized shinjitai-kyūjitai pairs of kanji, there are other forms which fall outside the lists of current kanji identified by the Japanese government (Agency for Cultural Affairs 2010). These hyōgaiji kanji may still appear (particularly in names), and in some cases have counterpart modern forms. We have so far found just over 1,000 instances of what I call non-conventional kanji, consisting of just over 110 different shinjitai-kyūjitai pairs (some of which appear more than once in our documents) and about 5 hyōgaiji (all of which are single instances).
Our pre-1945 source documents include both classes of non-conventional kanji forms, particularly in personal names. Personal names are especially problematic because they are proper nouns: the correct reading depends almost entirely on the characters in the name rather than on grammatical or other context clues. The project is particularly sensitive to the representation of names because the community involved was largely erased as a community from Canadian society in the 1940s. Changes to the kanji thus risk the names of the individuals affected being "disappeared" from the historical record we are creating, in a way which echoes the disappearance from history suffered by the actual community. More practically, people searching for specific names may not find the records they seek due to a mismatch of kanji, and people reading search results may fail to recognize a name rendered in unfamiliar kanji.

Representing Disappeared Characters in Unicode
The project's focus is on the historical treatment of the Japanese Canadian community, and not the evolution of the Japanese language, so we sought the simplest solution that would meet our needs. Initial research suggested exploiting features in the Unicode character encoding standard.

Unicode has a remarkably complex treatment for mapping certain non-conventional to conventional kanji (Unicode Consortium 2018a, 23.4, 872-74), the full details of which are beyond the scope of this paper. It uses what are known as Standardized Variation Sequences (Unicode Consortium 2018b). Even the following simplified consideration raises problems with this approach for our situation. We want to preserve the forms as found yet maintain an association with a conventional form where one exists. A Standardized Variation Sequence consists of one entity for the conventional form of the kanji (e.g., &#x793E;) followed immediately by one of several other entities (&#xFE00;, &#xFE01;, and so on), yielding, for example, &#x793E;&#xFE00;. Unicode also specifies lookup tables to map from the conventional form to the non-conventional form. Note that the non-conventional form is not explicitly encoded in the document, so this approach precludes an application normalizing a non-conventional form to a conventional one in inconsistent or unpredictable ways, which of course is helpful to us. However, we are still at the mercy of (1) font developers and the degree of support they have built in to their fonts for variation sequences, and (2) application developers and the extent to which the application tries to locate a font that supports the sequence. The Firefox implementation is, at the time of writing, more sophisticated than the other browsers in that it can search for a font supporting the SVS and display the correct form; the other browsers require that a font supporting the SVS be specified.
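For illustration, two of the encodings discussed below can be sketched as follows (a minimal fragment, not the project's actual test file):

    <!-- Conventional (shinjitai) form alone: U+793E -->
    <p>&#x793E;</p>
    <!-- Standardized Variation Sequence: the same base character U+793E
         followed immediately by variation selector U+FE00. A font that
         supports this sequence renders the traditional (kyūjitai) glyph. -->
    <p>&#x793E;&#xFE00;</p>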

Dierences in support are apparent in searching, too. We searched the ve encodings listed above for each of the two kanji. Chrome and Safari ignore the variant sequence and thus treat the two glyphs as interchangeable (whether searching for "社" or "社," all ve instances of either character are found). That is generally the desired behavior for all but scholars of historical Japanese. Firefox pays attention to the variant sequence, but it also fails to normalize as it should, so when we searched for "社" we got no hits, but when we searched for "社" we got three hits, one of which was the Standardized Variant, which as just noted is displayed to the user as "社." These ndings are summarized in table 1:

Representing Disappeared Characters in TEI
We were already using TEI to encode the documents, so we needed to find and implement TEI markup to capture the three classes of problematic kanji. Specifically, we employed the gaiji module's <charDecl>, <g>, <glyph>, and <mapping> elements to represent each non-conventional kanji, the conventional kanji associated with that non-conventional kanji (if one exists), and whether the mapping appears in the kyūjitai-shinjitai list and/or the Standardized Variant list (TEI Consortium 2017, sec. 5.2).[2]

We created a TEI le named chars.xml consisting of a character declaration (<charDecl>) element which contains a <glyph> element for each non-conventional form (kyūjitai or hyōgaiji) to describe it and its conventional equivalent. Within each <glyph> element, we use a <mapping> element with a specic value for the @type attribute for each variant of the glyph. In the body of the data le, we use a <g> element to encode the kanji with an @xml:id attribute which points to the appropriate <glyph> element in the chars.xml le. This approach allows us to capture the three classes of pairs of non-conventional and conventional forms consistently, as shown in the following three examples (note that some characters may not display properly on some user agents).

Example of a kyūjitai with a shinjitai counterpart that also appears in the Unicode Standardized Variation Sequences list (a sketch is given below). The values we used for the @type attribute ("kyūjitai", "shinjitai", and "hyōgaiji") reflect our circumstances; for anyone not already familiar with the twentieth-century history of kanji, their meanings would be explained by a simple search for those terms in Wikipedia. The specific values we have used for the @type attribute may not be semantically accurate for other languages or other eras of Japanese. However, the utility of the approach does not depend on those specific values, so it could easily be implemented using more appropriate values for the @type attribute tailored to the specific circumstances.
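A minimal sketch of such a <glyph> declaration in chars.xml follows; the @xml:id, the <glyphName> content, the "standardizedVariant" @type value, and the choice of U+FA4C for the kyūjitai form are illustrative assumptions rather than the project's actual values:

    <charDecl>
      <!-- One <glyph> per non-conventional form. This one records a
           kyūjitai form that has both a shinjitai counterpart and a
           Unicode Standardized Variation Sequence. -->
      <glyph xml:id="kanji-sha-kyujitai">
        <glyphName>KYUJITAI FORM OF U+793E</glyphName>
        <!-- The form as found in the source document (assumed U+FA4C) -->
        <mapping type="kyūjitai">&#xFA4C;</mapping>
        <!-- The conventional modern counterpart -->
        <mapping type="shinjitai">&#x793E;</mapping>
        <!-- The Standardized Variation Sequence: base character plus
             variation selector U+FE00 -->
        <mapping type="standardizedVariant">&#x793E;&#xFE00;</mapping>
      </glyph>
    </charDecl>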

Training Encoders of Texts Containing Disappeared Characters

Having established a data model, we then turned to the job of applying that model to the documents. Doing so requires competence in three areas: (1) what is to some degree arcane Japanese, especially for second-language users and those outside Japan; (2) the Unicode standard, especially Standardized Variants; and (3) TEI XML, specifically the elements described above.

An important aspect of the project is engaging with the Japanese-Canadian community and providing that community with a sense of editorial input, if not authorship, of the material.
Clearly the most critical skill set is facility with the non-conventional kanji forms. It is usually better to start with someone who has subject-matter expertise and train them in the technical and workflow skills. In our circumstance, and after substantial consultations with colleagues at our partner Japanese-Canadian museum, we concluded that the most suitable candidate to do the volume of work we required to an adequate level of competence would be a student who is reasonably fluent in Japanese, knowledgeable about the history, and technically competent.
That person would focus on improving their facility with the various forms of kanji within the documents. This approach has proven workable given that our project's primary scholarly focus is not on the evolution of kanji, though it has approximately doubled the amount of time required to encode the documents.

Conclusions
Our goal is to encode documents containing non-conventional forms of kanji so that all forms are available for processing and for use by human users. A potential solution based on Unicode Standardized Variation Sequences did not cover enough of the instances we encountered. Of the problematic forms in our data, the proportion of kyūjitai-shinjitai pairs was much lower than we expected, and the proportion of hyōgaiji much higher. We therefore decided to encode the variant glyphs explicitly, using the features provided in the gaiji module in TEI. This allowed us to specify @type values describing the different classes of kanji forms, and to record the Unicode Standardized Variant, in our encoding of the documents. It was difficult to find people with all the necessary skills to do this encoding. The best solution for us was to train an otherwise competent encoder of Japanese to recognize and accurately encode the non-conventional kanji.

We now have a robust and consistent encoding which covers all the instances in our data. The next phase of the project will focus on processing the TEI to represent the characters in output products for use by researchers and by the public. The project will produce not only web-based outputs but also print-based outputs and museum installations, and for these we will need to make careful editorial decisions about which kanji to use, balancing our wish to honor the names (as they were at the time) of the people who suffered the injustices presented by the project against our wish to ensure that those names (and the people they represent) do not disappear for modern readers.

BIBLIOGRAPHY
Agency for Cultural Aairs, Government of Japan. 2010. "Academic Index of Kanji
[2] Ken Lunde has pointed out that while it is straightforward to provide this kind of mapping in TEI, in fact the Unicode Consortium, through its Unihan Database, already has a mechanism for mapping equivalences such as these, and it would be worthwhile to propose updates to the Unihan Database for any mappings it does not yet handle. Coincidentally, at the TEI 2018 conference in Tokyo, Duncan Paterson proposed a new <uniHan> element for TEI, which would be a child of <charProp> and whose content would be one of the Unihan Database properties (Jenkins, Cook, and Lunde 2018). So by using existing properties and proposing new ones where necessary, then capturing those properties through the <uniHan> element, these relationships could be efficiently encoded.
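Purely as a hypothetical sketch of that proposal (the <uniHan> element is not part of TEI P5 as of this writing, and the use of @type to carry the property name is an assumption; kTraditionalVariant is an existing Unihan property, used here only for illustration):

    <charProp>
      <!-- Hypothetical: a Unihan Database property captured via the
           proposed <uniHan> element -->
      <uniHan type="kTraditionalVariant">U+FA4C</uniHan>
    </charProp>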

STEWART ARNEIL
Stewart Arneil is a programmer/consultant at the Humanities Computing and Media Centre at the University of Victoria, Canada. He holds an MA in computational theory and certifications in instructional design and in project management. He has thirty years of experience in the private and public sectors managing academic projects and developing software, databases, and websites for research and educational purposes in collaboration with language and subject matter experts.