Unicode and Japanese Kanji

White Paper

2009-10-10

Revision 1 (2009-10-10)
Revision 2 (2009-10-10)

Tony Pottier http://www.tonypottier.info

Appendix 1: List of Japanese Kanji by Unicode Order
Appendix 2: List of Japanese Kanji by Learning Order

 

Introduction

In this white paper, I discuss Unicode and the Japanese language. I locate and classify all the Japanese kanji inside the Unicode CJK range.

Thanks

Thanks to Jim Breen (http://www.csse.monash.edu.au/~jwb/ ) for pointing out errors and inaccuracies in the first version of this paper.

Problem

In the early days of computing, text was stored using 8 bits per character (256 values), which is more than enough for our alphabet plus some extras. However, as information technology became global, a standard capable of writing any language was needed. That standard is Unicode. In this paper I refer to the UTF-16 encoding, because it is how Windows systems natively store wide character strings (8-bit ANSI strings are in fact still at the heart of the Windows application programming interface, but most functions also have a UTF-16 counterpart).

With Japanese, however, things are not so easy. Unicode has not become the de facto standard for Japanese text. Japan has been using JIS X 0208 (created in 1978) and its relatives (on the Internet, Shift JIS may ring a bell) since long before Unicode was created. Today these Japanese-specific character encodings are still used alongside Unicode. For Western users this is a source of problems, because these encodings are not compatible with the Latin-1 characters in Unicode. There are, however, conversion tables between the Japanese encodings and Unicode.
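To make the incompatibility concrete, the sketch below encodes one kanji with Python's built-in "utf-16-le" and "shift_jis" codecs. It is only an illustration: the helper name encode_both is hypothetical, and the sample character is simply the first one of the CJK range.

```python
# Illustration only: the same kanji has unrelated byte values in UTF-16 and Shift JIS.

def encode_both(ch: str) -> dict:
    """Return the code point and the bytes of one character in both encodings."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf16_le": ch.encode("utf-16-le"),   # byte layout of a Windows wide character
        "shift_jis": ch.encode("shift_jis"),  # legacy Japanese encoding
    }

info = encode_both("一")  # U+4E00, the first character of the CJK range
print(info["code_point"], info["utf16_le"].hex(), info["shift_jis"].hex())

# The codec acts as a conversion table: decoding Shift JIS gives back the Unicode character.
round_trip = info["shift_jis"].decode("shift_jis")
print(round_trip == "一")
```

Both encodings use two bytes for this character, but the byte values differ, which is why a conversion table is needed.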

I realized that Unicode was not a good fit for Japanese while writing a small OpenGL game that was to have a Japanese version. For those who are not familiar with game programming, a common technique for displaying text on screen is called a “bitmap font”. This is in fact just a texture containing all the characters that will be printed on screen.

(Figure 1: a bitmap font texture for the latin1 character set)

This is where the problem really starts. The CJK Unicode range contains over 20,000 characters, and storing all of them wastes memory because most will never be used. Japanese is written with two syllabaries (hiragana and katakana), but it also uses Chinese characters (kanji). Not all Chinese characters, though: just a small subset named “jōyō kanji (常用漢字)”, which is maintained by the Japanese Ministry of Education. As of 2009, there are 2,131 of them. More accurately, more than ten thousand Chinese characters are in use in Japanese, but in this paper we focus on the official list that every Japanese-educated person should know.

Problems:

·         The 2,131 Japanese kanji are scattered inside the CJK code pages, which range from U+4E00 to U+9FA5 (20,902 characters, so the Japanese kanji are only about 10% of them!)

·         Of those 2,131 Japanese kanji, only about half are truly essential (the most complex ones can be written in hiragana in modern Japanese)

·         The Unicode Consortium provides a list of Japanese kanji (http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/) but it is hard to use for this purpose, because it lists every character that appears in the Japanese encodings. The “Unihan” Unicode documents about CJK characters are generic documentation that does not focus on Japanese.
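The figures above can be checked with a few lines of arithmetic. The sketch below also estimates the size of a bitmap font texture; the 32x32 glyph cell at 1 byte per pixel is purely an assumption for illustration, not a value taken from the game mentioned earlier.

```python
# Back-of-the-envelope figures for the bitmap font problem.
CJK_FIRST, CJK_LAST = 0x4E00, 0x9FA5   # range quoted in the text
JOYO_COUNT = 2131                      # count quoted in the text

# Assumption for illustration: one 32x32 glyph cell, 1 byte per pixel.
GLYPH_BYTES = 32 * 32

cjk_count = CJK_LAST - CJK_FIRST + 1
ratio = JOYO_COUNT / cjk_count

def texture_mib(glyphs: int) -> float:
    """Size in MiB of an uncompressed texture holding `glyphs` cells."""
    return glyphs * GLYPH_BYTES / (1024 * 1024)

print(f"CJK characters: {cjk_count}")
print(f"Share that is jōyō kanji: {ratio:.1%}")
print(f"Full range texture: {texture_mib(cjk_count):.1f} MiB")
print(f"Jōyō-only texture: {texture_mib(JOYO_COUNT):.1f} MiB")
```

Under that assumption, restricting the texture to the jōyō kanji shrinks it by roughly a factor of ten.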

Mathematical Background

To extract the Japanese kanji from the CJK characters, we perform a simple set intersection. Given that:

CJK: Set of all CJK characters
J: Set of all Japanese kanji
Sn: Set of all Japanese kanji learned in the nth grade, n ∈ [1, 6]

And knowing that:

J ⊂ CJK
∀ n ∈ [1, 6], Sn ⊂ J
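These relations can be expressed directly with Python sets. The sketch below uses tiny hand-picked samples (the first 16 code points of the range and a few characters whose grades appear in this paper's result tables), not the real lists; the variable names are illustrative.

```python
# Tiny samples only, not the real lists.
cjk = {chr(cp) for cp in range(0x4E00, 0x4E10)}   # CJK (sample): U+4E00..U+4E0F
grade_sets = {                                     # Sn (sample), keyed by grade n
    1: {"一", "七", "三", "上", "下"},
    2: {"万"},
    3: {"丁"},
    4: {"不"},
}
# J (sample): every graded kanji plus two that are jōyō but not kyōiku.
joyo = set().union(*grade_sets.values()) | {"丈", "与"}

subset_of_cjk = joyo <= cjk                                   # J is contained in CJK
grades_in_joyo = all(s <= joyo for s in grade_sets.values())  # every Sn is contained in J

# Whatever remains in CJK after removing J is "not Japanese" for our purposes.
not_japanese = cjk - joyo
print(subset_of_cjk, grades_in_joyo, len(not_japanese))
```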

Experiment

In order to find all the Japanese kanji inside the CJK mess that the Unicode Consortium created, we first need a complete list of this range. To get one quickly, I wrote a small C# program that outputs the whole range:

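// Emit one "char;U+XXXX;decimal" line per code point of the CJK range.
for (uint i = 0x4E00; i <= 0x9FFF; i++)
{
    string line = String.Format("{0};U+{1:X4};{2}\r\n", (char)i, i, i);
    Console.Write(line); // redirect the output to a file to import it later
}

Figure 2: C# code excerpt used to print out the whole CJK range.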

Each line contains the character, its standard Unicode notation U+xxxx and the corresponding decimal value. The fields are deliberately separated by semicolons so that they can be quickly imported into a spreadsheet. Microsoft Excel was used here, but any similar tool would do.
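For readers without a C# toolchain, the same listing can be sketched in Python. The function name cjk_rows and the output file name in the comment are hypothetical; the bounds are the ones used in the C# excerpt.

```python
def cjk_rows(first: int = 0x4E00, last: int = 0x9FFF):
    """Yield 'char;U+XXXX;decimal' rows for every code point in [first, last]."""
    for cp in range(first, last + 1):
        yield f"{chr(cp)};U+{cp:04X};{cp}"

rows = list(cjk_rows())
print(rows[0])
print(len(rows))

# To import into a spreadsheet, the rows could be written to a UTF-8 file, e.g.:
# with open("cjk.csv", "w", encoding="utf-8", newline="") as f:  # hypothetical name
#     f.write("\r\n".join(rows))
```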

Char  Unicode (hex)  Decimal
一    U+4E00         19968
丁    U+4E01         19969
丂    U+4E02         19970
七    U+4E03         19971
丄    U+4E04         19972
丅    U+4E05         19973
丆    U+4E06         19974
万    U+4E07         19975

Figure 3: An excerpt of the imported result in a spreadsheet.

The next step is to get a list of all the official Japanese kanji (“jōyō kanji”) and to cross-reference the two lists (an intersection). The cross-reference will do more than that, however: it will also tell whether a kanji is essential or not. To achieve this, we use the official recommendation of the Japanese Ministry of Education, which publishes the order in which Japanese students should learn the Chinese characters. This order goes from the easiest and most commonly used characters to the difficult and almost unused ones. More precisely, during the six years of elementary school, Japanese students are taught a fixed set of kanji each year. By the end of those six years they should know the whole “kyōiku kanji” subset (1,006 kanji). The remaining jōyō kanji are learned later, in junior high and high school, and they are mostly used in literature and newspapers.

We can therefore determine a “Learning Order”, ranging from 1 (first learned, most important characters) to 7 (last learned, high school level).

The cross-reference with the roughly 21,000 CJK characters is done in several steps.

1.       Establish a list of the kyōiku kanji subset

2.       Establish a list of the jōyō kanji

3.       Establish the final “Learning Order” list

We first establish the list of the kyōiku kanji subset and add, for each character, the grade in which it is supposed to be learned:

(Table: two columns, Kanji and Grade. The first six rows shown are grade 1 and the last four rows are grade 6.)

Figure 4: The kyōiku kanji subset (first and last characters)

Then we try to find each CJK character in this list. If it is found, we display its grade; otherwise the cell shows #N/A. In Excel, this is done with VLOOKUP.

Char  Unicode (hex)  Decimal  Kyōiku kanji
一    U+4E00         19968    1
丁    U+4E01         19969    3
丂    U+4E02         19970    #N/A
七    U+4E03         19971    1
丄    U+4E04         19972    #N/A
丅    U+4E05         19973    #N/A
丆    U+4E06         19974    #N/A
万    U+4E07         19975    2

Figure 5: the first CJK characters, matched with the Kyōiku kanji subset.
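Outside a spreadsheet, the VLOOKUP step amounts to a dictionary lookup. The sketch below reproduces the rows of Figure 5 from a four-entry sample of the kyōiku list; the names kyoiku_grade and lookup_grade are illustrative, and a real run would load the full 1,006-entry list instead.

```python
# Sample of the kyōiku kanji list (character -> grade), limited to the rows of Figure 5.
kyoiku_grade = {"一": 1, "丁": 3, "七": 1, "万": 2}

def lookup_grade(ch: str):
    """Return the grade, or None where the spreadsheet would show #N/A."""
    return kyoiku_grade.get(ch)

for cp in range(0x4E00, 0x4E08):
    ch = chr(cp)
    grade = lookup_grade(ch)
    shown = grade if grade is not None else "#N/A"
    print(f"{ch};U+{cp:04X};{cp};{shown}")
```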

The second operation is simpler: we determine whether the kanji is an official Japanese character by looking it up in the official list (jōyō kanji). Once again this is a VLOOKUP in Excel combined with some Boolean logic, i.e.: =NOT(ISNA(VLOOKUP(A136;'Jōyō kanji'!A:A;1;FALSE)))

Char  Unicode (hex)  Decimal  Kyōiku kanji  Jōyō kanji
了    U+4E86         20102    #N/A          TRUE
亇    U+4E87         20103    #N/A          FALSE
予    U+4E88         20104    3             TRUE
争    U+4E89         20105    4             TRUE
亊    U+4E8A         20106    #N/A          FALSE
事    U+4E8B         20107    3             TRUE
二    U+4E8C         20108    1             TRUE
亍    U+4E8D         20109    #N/A          FALSE
于    U+4E8E         20110    #N/A          FALSE

Figure 6: An excerpt of the resulting table.
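The Boolean column is a plain membership test. A minimal sketch, assuming the jōyō list has been loaded into a set (here only the five characters that are TRUE in Figure 6; joyo_sample and is_joyo are illustrative names):

```python
# Sample of the jōyō list restricted to the rows of Figure 6.
joyo_sample = {"了", "予", "争", "事", "二"}

def is_joyo(ch: str) -> bool:
    """Equivalent of NOT(ISNA(VLOOKUP(...))): True when the character is listed."""
    return ch in joyo_sample

# One flag per code point from U+4E86 to U+4E8E, as in the figure.
flags = {f"U+{cp:04X}": is_joyo(chr(cp)) for cp in range(0x4E86, 0x4E8F)}
for code, flag in flags.items():
    print(code, "TRUE" if flag else "FALSE")
```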

By combining the two, we obtain the final list. The logic is as follows:

If there is a value in the Kyōiku kanji
    print the value
Else
    If the kanji is a legal Japanese character
        print 7 (high school level kanji)
    Else
        print “Not Japanese”

In Excel, for the first character, this was written as follows:

=IF(ISNA(D2);IF(E2;7;"Not Japanese");D2)
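The same decision can be written as a small function. This is a sketch that mirrors the formula above, with None standing in for #N/A; the function and constant names are illustrative.

```python
NOT_JAPANESE = "Not Japanese"

def learning_order(grade, is_joyo: bool):
    """Mirror of =IF(ISNA(D2);IF(E2;7;"Not Japanese");D2)."""
    if grade is not None:   # found in the kyōiku kanji list: grades 1 to 6
        return grade
    if is_joyo:             # jōyō but not kyōiku: high school level
        return 7
    return NOT_JAPANESE

# Sample rows: (code point, kyōiku grade or None, jōyō flag)
samples = [(0x4E00, 1, True), (0x4E02, None, False), (0x4E08, None, True)]
for cp, grade, joyo in samples:
    print(f"U+{cp:04X}", learning_order(grade, joyo))
```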

The experiment with Unicode and Japanese is now complete.

Char  Unicode (hex)  Decimal  Kyōiku kanji  Jōyō kanji  Japanese Learning Order
一    U+4E00         19968    1             TRUE        1
丁    U+4E01         19969    3             TRUE        3
丂    U+4E02         19970    #N/A          FALSE       Not Japanese
七    U+4E03         19971    1             TRUE        1
丄    U+4E04         19972    #N/A          FALSE       Not Japanese
丅    U+4E05         19973    #N/A          FALSE       Not Japanese
丆    U+4E06         19974    #N/A          FALSE       Not Japanese
万    U+4E07         19975    2             TRUE        2
丈    U+4E08         19976    #N/A          TRUE        7
三    U+4E09         19977    1             TRUE        1
上    U+4E0A         19978    1             TRUE        1
下    U+4E0B         19979    1             TRUE        1
丌    U+4E0C         19980    #N/A          FALSE       Not Japanese
不    U+4E0D         19981    4             TRUE        4
与    U+4E0E         19982    #N/A          TRUE        7

Figure 7: The first results

Findings

·         The complete list of Japanese kanji with their Unicode values can be found in the appendices to this paper.

·         Although most of the characters tend to be in the first pages of the Unicode CJK range, their location follows no logical order with respect to learning level. For instance, a “level 3” character can be found at the very end of the range (U+9F3B).
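The two appendices correspond to two sort orders over the same classified rows. The sketch below uses a handful of (code point, learning order) pairs taken from the results above; breaking ties by code point in the second ordering is an assumption made for the example.

```python
# (code point, learning order) pairs taken from Figure 7 and the U+9F3B example.
classified = [
    (0x4E00, 1), (0x4E01, 3), (0x4E07, 2),
    (0x4E08, 7), (0x4E0D, 4), (0x9F3B, 3),
]

by_unicode = sorted(classified)                               # Unicode order
by_learning = sorted(classified, key=lambda r: (r[1], r[0]))  # learning order, then code point

print([f"U+{cp:04X}" for cp, _ in by_unicode])
print([f"U+{cp:04X}" for cp, _ in by_learning])
```

In the second ordering U+9F3B sits in the middle of the list even though it is the last code point, which illustrates the lack of correlation between the two orders.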

Going Further

This paper does not include “Jinmeiyō kanji”, which is another Japanese subset of Chinese characters that are used to write some Japanese personal names.

Because the underlying mathematical operations of the whole experiment are just a couple of set intersections, it can also be done using a relational database and JOIN operations.
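As a rough sketch of that idea, the following uses Python's built-in sqlite3 module with an in-memory database. The table and column names are hypothetical, and the three tables hold only a few sample rows rather than the real lists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cjk    (ch TEXT PRIMARY KEY, code_point INTEGER);
    CREATE TABLE kyoiku (ch TEXT PRIMARY KEY, grade INTEGER);
    CREATE TABLE joyo   (ch TEXT PRIMARY KEY);
""")
conn.executemany("INSERT INTO cjk VALUES (?, ?)",
                 [(chr(cp), cp) for cp in range(0x4E00, 0x4E10)])
conn.executemany("INSERT INTO kyoiku VALUES (?, ?)",
                 [("一", 1), ("丁", 3), ("七", 1), ("万", 2)])
conn.executemany("INSERT INTO joyo VALUES (?)",
                 [("一",), ("丁",), ("七",), ("万",), ("丈",)])

# LEFT JOINs keep every CJK row; COALESCE and CASE reproduce the spreadsheet logic.
# A NULL learning_order means "Not Japanese".
rows = conn.execute("""
    SELECT c.code_point,
           COALESCE(k.grade,
                    CASE WHEN j.ch IS NOT NULL THEN 7 END) AS learning_order
    FROM cjk AS c
    LEFT JOIN kyoiku AS k ON k.ch = c.ch
    LEFT JOIN joyo   AS j ON j.ch = c.ch
    ORDER BY c.code_point
""").fetchall()

for code_point, order in rows[:9]:
    print(f"U+{code_point:04X}", order if order is not None else "Not Japanese")
```

An INNER JOIN between cjk and joyo would give the plain intersection; the LEFT JOIN form is used here so that non-Japanese characters remain visible, as in the spreadsheet.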

References

Wikipedia
The Unicode Consortium

Background

Tony Pottier (1986). French graduate from ESIEA (“Grande Ecole d’Ingénieurs”); specialized in Information Technology. E-mail: contact ~at~ tonypottier ~dot~ info.