Unicode and Japanese Kanji
White Paper
2009-10-10
Revision 1 (2009-10-10)
Revision 2 (2009-10-10)
Tony Pottier http://www.tonypottier.info
Appendix 1: List of Japanese Kanji by Unicode Order
Appendix 2: List of Japanese Kanji by Learning Order
In this white paper, I discuss Unicode and the Japanese language. I will locate and classify all the Japanese kanji inside the Unicode CJK range.
Thanks to Jim Breen (http://www.csse.monash.edu.au/~jwb/ ) for pointing out errors and inaccuracies in the first version of this paper.
In the early days of computing, text was stored using 8 bits per character (256 values), which is more than enough for the Latin alphabet plus some extra symbols. However, as information technology became global, a standard capable of representing any language was needed. This is Unicode. In this paper, I will refer to the UTF-16 variant, as this is how Windows systems currently store wide-character strings natively (in fact, 8-bit ANSI strings are still at the heart of the Windows application programming interface, but most functions also have a UTF-16 counterpart).
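As a small illustration of what UTF-16 storage looks like, the sketch below encodes one kanji and one Latin letter. It uses Python rather than the Windows API, and the little-endian, no-BOM variant is an assumption chosen because it matches the byte order of Windows wide strings.

```python
# Encode one kanji and one Latin letter in UTF-16 (little-endian, no BOM).
kanji = "一"  # U+4E00, the first code point of the CJK range discussed in this paper
latin = "A"   # U+0041

kanji_utf16 = kanji.encode("utf-16-le")
latin_utf16 = latin.encode("utf-16-le")

print(kanji_utf16.hex(), len(kanji_utf16))  # 004e 2
print(latin_utf16.hex(), len(latin_utf16))  # 4100 2

# For comparison, the same kanji needs three bytes in UTF-8.
print(len(kanji.encode("utf-8")))           # 3
```

Both characters take one 16-bit code unit, which is why a fixed-width 8-bit encoding cannot hold them side by side.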
The problem is that with Japanese, things are not so simple. Unicode has not become the de facto standard for Japanese. JIS X 0208 (created in 1978) and its derivatives (on the Internet, Shift JIS may ring a bell) were in use long before Unicode was even created. Nowadays, these Japanese-specific character encodings are still used alongside Unicode. For Western users, this is a source of problems, as these encodings are not compatible with the Latin-1 characters in Unicode. There are, however, conversion tables between the Japanese encodings and Unicode.
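To make the idea of a conversion table concrete, here is a minimal sketch that relies on the Shift JIS codec shipped with Python's standard library. The sample word is an arbitrary choice, and the codec stands in for whatever table a given platform provides.

```python
# Convert the same text between Shift JIS and Unicode (UTF-16).
text = "漢字"  # arbitrary sample word ("kanji")

sjis_bytes = text.encode("shift_jis")    # Japanese-specific encoding
utf16_bytes = text.encode("utf-16-le")   # Unicode, as stored in a Windows wide string

# Both encodings use two bytes per kanji here, but the byte values differ,
# so one cannot be read as the other without a conversion table.
print(sjis_bytes.hex(), utf16_bytes.hex())

# Decoding applies the table in the other direction.
round_trip = sjis_bytes.decode("shift_jis")
print(round_trip == text)
```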
I realized that Unicode was not a good solution for Japanese when writing a small OpenGL game that would have a Japanese version. For those who are not familiar with game programming, text is displayed on screen using a technique called “bitmap font”. This is in fact just a texture containing all the characters that we will print on screen.

(Figure 1: a bitmap font texture for the latin1 character set)
This is where the problem really starts. The Unicode CJK range contains over 20,000 characters, and most of that is wasted memory because they will never be used. The Japanese language is written with two syllabaries (Hiragana and Katakana), but it also uses Chinese characters (kanji). Not all Chinese characters, though: just a small subset named “jōyō kanji (常用漢字)” that is maintained by the Japanese Ministry of Education. As of 2009, there are 2,131 of them. More accurately, more than ten thousand Chinese characters are in use in the Japanese language, but in this paper we will focus on the official list that every Japanese-educated person should know.
Problems:
· The 2,131 Japanese kanji are inside the CJK code pages, which range from U+4E00 to U+9FA5 (20,902 characters; the Japanese kanji are only about 10% of them!)
· Of those 2,131 Japanese kanji, only about half are truly essential (most of the complex ones can be written in Hiragana in modern Japanese)
· The Unicode Consortium provides a list of Japanese kanji (http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/) but it is hard to use, as it lists every character that appears in the Japanese encodings. The “Unihan” Unicode documents about CJK characters are generic documentation that does not focus on Japanese.
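A back-of-the-envelope estimate shows what the first point means for a bitmap font texture. The glyph size (32×32 pixels) and the single-channel 8-bit format are assumptions made for illustration only; the character counts are the ones quoted above.

```python
# Rough texture memory needed for a bitmap font, under assumed glyph dimensions.
GLYPH_SIZE_PX = 32    # assumption: each glyph occupies a 32x32 pixel cell
BYTES_PER_PIXEL = 1   # assumption: single-channel, 8 bits per pixel

def atlas_bytes(glyph_count):
    """Uncompressed size in bytes of a texture holding glyph_count glyphs."""
    return glyph_count * GLYPH_SIZE_PX * GLYPH_SIZE_PX * BYTES_PER_PIXEL

cjk_count = 0x9FA5 - 0x4E00 + 1   # 20,902 characters from U+4E00 to U+9FA5
joyo_count = 2131                 # jōyō kanji count quoted in this paper

for label, count in (("full CJK range", cjk_count), ("jōyō kanji only", joyo_count)):
    print(f"{label}: {count} glyphs, {atlas_bytes(count) / 2**20:.1f} MiB")

print(f"share of the range actually needed: {joyo_count / cjk_count:.1%}")
```

Under these assumptions the full range costs roughly ten times the memory of the jōyō kanji alone.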
To extract the Japanese kanji from the CJK characters, we will perform a simple set intersection. Given that:
CJK: Set of all CJK characters
J: Set of all Japanese kanji
Sn: Set of all Japanese kanji learned in nth grade, n ∈ [1, 6]
And knowing that:
J ⊂ CJK
∀n ∈ [1, 6], Sn ⊂ J
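These relations can be checked on a toy example with Python sets. The kanji below are a hand-picked sample taken from the first rows of the result tables in this paper, not the full lists, so only grades 1 to 4 appear; the variable names are illustrative.

```python
# CJK: every character from U+4E00 to U+9FA5.
cjk = {chr(cp) for cp in range(0x4E00, 0x9FA5 + 1)}

# J: a small sample of Japanese kanji (NOT the full jōyō list).
japanese_sample = set("一丁七万丈三上下不与")

# Sn: the same sample split by school grade (only grades 1-4 occur in it).
by_grade = {
    1: set("一七三上下"),
    2: set("万"),
    3: set("丁"),
    4: set("不"),
}

# J ⊂ CJK and, for every n, Sn ⊂ J ("<=" is the subset test on Python sets).
print(japanese_sample <= cjk)
print(all(s <= japanese_sample for s in by_grade.values()))

# The extraction itself is the intersection of the two sets.
japanese_in_cjk = cjk & japanese_sample
print(sorted(japanese_in_cjk, key=ord))
```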
In order to find all the Japanese kanji inside the CJK mess that the Unicode Consortium created, we first need a complete list of this code page. To get one quickly, I wrote a small C# program that outputs the whole range:
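// Emits one "char;U+XXXX;decimal" line per code point from U+4E00 to U+9FFF.
for (uint i = 0x4E00; i <= 0x9FFF; i++)
{
    string line = String.Format("{0};U+{1:X4};{2}\r\n", (char)i, i, i);
    Console.Write(line); // assumption: the excerpt does not show how the line was written out
}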
Figure 2: C# code excerpt used to print out the whole CJK range.
Each generated line contains the character, the Unicode standard value U+xxxx and the corresponding decimal value. The fields are deliberately separated by semicolons so that they can be imported quickly into a spreadsheet. Microsoft Excel was used for this, but any similar tool would have worked.
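For readers who do not use C#, the same listing can be sketched in Python. The function name and its default bounds are illustrative; the line format is the one described above.

```python
def cjk_rows(first=0x4E00, last=0x9FFF):
    """Yield one 'char;U+XXXX;decimal' line per code point, bounds included."""
    for cp in range(first, last + 1):
        yield f"{chr(cp)};U+{cp:04X};{cp}"

rows = list(cjk_rows())
print(rows[0])    # 一;U+4E00;19968
print(len(rows))  # 20992
```

The loop covers 20,992 code points because it runs up to U+9FFF, slightly past the U+9FA5 limit quoted earlier.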
| Char | Unicode (hex) | Decimal |
| 一 | U+4E00 | 19968 |
| 丁 | U+4E01 | 19969 |
| 丂 | U+4E02 | 19970 |
| 七 | U+4E03 | 19971 |
| 丄 | U+4E04 | 19972 |
| 丅 | U+4E05 | 19973 |
| 丆 | U+4E06 | 19974 |
| 万 | U+4E07 | 19975 |
Figure 3: An excerpt of the imported result in a spreadsheet.
The next step is to get a list of all the official Japanese kanji (“jōyō kanji”) and to create a cross-reference between the two lists (an intersection). The cross-reference will do more than that, though: it will also tell whether a kanji is essential or not. To achieve this, we will use the official recommendation of the Japanese Ministry of Education, which publishes the order in which Japanese students should learn the Chinese characters. This order runs from the easiest and most commonly used characters to the most difficult and almost unused ones. More precisely, during the six years of elementary school, Japanese students are taught a specific set of kanji each year. By the end of those six years, they should know the whole “kyōiku kanji” subset (1,006 kanji). The other kanji are learned afterwards, in secondary school (called “high school level” in this paper), and they consist of characters that are mostly used in literature and newspapers.
We can therefore determine a “Learning Order”, ranging from 1 (first learned, most important characters) to 7 (last learned, high school level).
The cross-reference with the roughly 21,000 CJK characters will be done in several steps.
1. Establish a list of the kyōiku kanji subset
2. Establish a list of the Jōyō kanji
3. Establish the final list of “Learning Order”
We first establish the list of the kyōiku kanji subset and add the grade in which each character is supposed to be learned:
| Kanji | Grade |
| 一 | 1 |
| 二 | 1 |
| 三 | 1 |
| 四 | 1 |
| 五 | 1 |
| 六 | 1 |
| … | … |
| 難 | 6 |
| 革 | 6 |
| 頂 | 6 |
| 骨 | 6 |
Figure 4: The kyōiku kanji subset (first and last characters)
Then we try to find each CJK character in this list. If it’s found we display the grade. In Excel, this is done with VLOOKUP.
| Char | Unicode (hex) | Decimal | Kyōiku kanji |
| 一 | U+4E00 | 19968 | 1 |
| 丁 | U+4E01 | 19969 | 3 |
| 丂 | U+4E02 | 19970 | #N/A |
| 七 | U+4E03 | 19971 | 1 |
| 丄 | U+4E04 | 19972 | #N/A |
| 丅 | U+4E05 | 19973 | #N/A |
| 丆 | U+4E06 | 19974 | #N/A |
| 万 | U+4E07 | 19975 | 2 |
Figure 5: The first CJK characters, matched with the kyōiku kanji subset.
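Outside a spreadsheet, the VLOOKUP step amounts to a dictionary lookup. The sketch below is an illustration in Python, not the method used for this paper; the grade table holds only the four kyōiku kanji that occur among the first eight code points, and `lookup_grade` is an illustrative name.

```python
# Sample of the kyōiku kanji list: kanji -> grade (NOT the full 1,006 entries).
kyoiku_grade = {"一": 1, "丁": 3, "七": 1, "万": 2}

def lookup_grade(ch):
    """Return the grade of a kyōiku kanji, or None when the character is not listed."""
    return kyoiku_grade.get(ch)

# Match the first eight CJK code points against the sample list.
for cp in range(0x4E00, 0x4E08):
    ch = chr(cp)
    grade = lookup_grade(ch)
    # Print "#N/A" for misses so the output reads like the spreadsheet column.
    print(ch, f"U+{cp:04X}", cp, grade if grade is not None else "#N/A")
```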
The second operation is simpler: we determine whether the kanji is an official Japanese character by trying to find it in the official list (jōyō kanji). Once again this is a VLOOKUP in Excel plus some Boolean logic, i.e.: =NOT(ISNA(VLOOKUP(A136;'Jōyō kanji'!A:A;1;FALSE)))
| Char | Unicode (hex) | Decimal | Kyōiku kanji | Jōyō kanji |
| 了 | U+4E86 | 20102 | #N/A | TRUE |
| 亇 | U+4E87 | 20103 | #N/A | FALSE |
| 予 | U+4E88 | 20104 | 3 | TRUE |
| 争 | U+4E89 | 20105 | 4 | TRUE |
| 亊 | U+4E8A | 20106 | #N/A | FALSE |
| 事 | U+4E8B | 20107 | 3 | TRUE |
| 二 | U+4E8C | 20108 | 1 | TRUE |
| 亍 | U+4E8D | 20109 | #N/A | FALSE |
| 于 | U+4E8E | 20110 | #N/A | FALSE |
Figure 6: An excerpt of the resulting table.
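The membership test behind the TRUE/FALSE column can be sketched as a set lookup. The set below contains only the five jōyō kanji that appear in the excerpt above, and `is_joyo` is an illustrative name.

```python
# Sample of the jōyō kanji list (only the entries falling in U+4E86..U+4E8E).
joyo_sample = set("了予争事二")

def is_joyo(ch):
    """True when the character is in the (sample) jōyō kanji list."""
    return ch in joyo_sample

# Code points U+4E86 to U+4E8E, the rows shown in the excerpt.
flags = [is_joyo(chr(cp)) for cp in range(0x4E86, 0x4E8F)]
for cp, flag in zip(range(0x4E86, 0x4E8F), flags):
    print(chr(cp), f"U+{cp:04X}", flag)
```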
By combining the two, we can obtain a final list. The logic is as follows:
If there is a value in the Kyōiku kanji column
    print the value
Else
    If the kanji is an official Japanese character (jōyō kanji)
        print 7 (high school level kanji)
    Else
        print “Not Japanese”
This, in Excel and for the first character, was written as follows for this experiment:
=IF(ISNA(D2);IF(E2;7;"Not Japanese");D2)
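The same decision can be written as a small function. This is a sketch of the logic above in Python, using sample data drawn from the tables in this paper; the function and variable names are illustrative.

```python
def learning_order(ch, kyoiku_grade, joyo):
    """Return 1-6 for kyōiku kanji, 7 for other jōyō kanji, else 'Not Japanese'."""
    grade = kyoiku_grade.get(ch)
    if grade is not None:
        return grade              # learned in grades 1 to 6
    if ch in joyo:
        return 7                  # official kanji learned after grade 6
    return "Not Japanese"

# Sample data only (NOT the full lists).
kyoiku_grade = {"一": 1, "丁": 3, "七": 1, "万": 2}
joyo = set(kyoiku_grade) | {"丈"}

for ch in "一丁丂万丈":
    print(ch, learning_order(ch, kyoiku_grade, joyo))
```

Checking the kyōiku list first mirrors the nesting of the IF formula: the jōyō test only matters for characters without a grade.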
The experiment with Unicode and Japanese is now complete.
| Char | Unicode (hex) | Decimal | Kyōiku kanji | Jōyō kanji | Japanese Learning Order |
| 一 | U+4E00 | 19968 | 1 | TRUE | 1 |
| 丁 | U+4E01 | 19969 | 3 | TRUE | 3 |
| 丂 | U+4E02 | 19970 | #N/A | FALSE | Not Japanese |
| 七 | U+4E03 | 19971 | 1 | TRUE | 1 |
| 丄 | U+4E04 | 19972 | #N/A | FALSE | Not Japanese |
| 丅 | U+4E05 | 19973 | #N/A | FALSE | Not Japanese |
| 丆 | U+4E06 | 19974 | #N/A | FALSE | Not Japanese |
| 万 | U+4E07 | 19975 | 2 | TRUE | 2 |
| 丈 | U+4E08 | 19976 | #N/A | TRUE | 7 |
| 三 | U+4E09 | 19977 | 1 | TRUE | 1 |
| 上 | U+4E0A | 19978 | 1 | TRUE | 1 |
| 下 | U+4E0B | 19979 | 1 | TRUE | 1 |
| 丌 | U+4E0C | 19980 | #N/A | FALSE | Not Japanese |
| 不 | U+4E0D | 19981 | 4 | TRUE | 4 |
| 与 | U+4E0E | 19982 | #N/A | TRUE | 7 |
Figure 7: The first results
· The complete Unicode Japanese kanji list can be found as an appendix to this paper.
· Although most characters tend to be in the first pages of the Unicode CJK range, their location follows no order that relates to how they are learned (the range is sorted by radical and stroke count, not by frequency or grade). A “level 3” character can be found at the end of the range, for instance (U+9F3B).
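The mismatch between code point order and learning order is easy to see on a handful of characters. The sample below reuses grades quoted in this paper (鼻 is the U+9F3B character mentioned above); it is a toy illustration, not a statistical argument.

```python
# A few kanji with their learning order, as quoted in this paper.
sample = {"一": 1, "丁": 3, "万": 2, "不": 4, "鼻": 3}

by_code_point = sorted(sample, key=ord)
by_learning_order = sorted(sample, key=lambda ch: (sample[ch], ord(ch)))

print([f"U+{ord(ch):04X}" for ch in by_code_point])
print(by_code_point)       # order in which Unicode stores them
print(by_learning_order)   # order in which they are learned
```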
This paper does not include “Jinmeiyō kanji”, which is another Japanese subset of Chinese characters that are used to write some Japanese personal names.
Because the underlying mathematical operations of the whole experiment are just a couple of set intersections, this can also be done using a relational database and JOIN operations.
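A minimal sketch of the database variant, using an in-memory SQLite database from Python's standard library. The table and column names are hypothetical, and the data is limited to the first fifteen code points of the range and the sample lists shown in this paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cjk (ch TEXT PRIMARY KEY, code_point INTEGER);
    CREATE TABLE joyo (ch TEXT PRIMARY KEY);
    CREATE TABLE kyoiku (ch TEXT PRIMARY KEY, grade INTEGER);
""")

# U+4E00..U+4E0E only; the full experiment would load the whole range.
conn.executemany("INSERT INTO cjk VALUES (?, ?)",
                 [(chr(cp), cp) for cp in range(0x4E00, 0x4E0F)])
conn.executemany("INSERT INTO joyo VALUES (?)",
                 [(ch,) for ch in "一丁七万丈三上下不与"])
conn.executemany("INSERT INTO kyoiku VALUES (?, ?)",
                 [("一", 1), ("丁", 3), ("七", 1), ("万", 2),
                  ("三", 1), ("上", 1), ("下", 1), ("不", 4)])

# The inner JOIN keeps only Japanese kanji (the intersection); the LEFT JOIN
# adds the grade when there is one, and COALESCE assigns 7 otherwise.
rows = conn.execute("""
    SELECT cjk.ch, cjk.code_point, COALESCE(kyoiku.grade, 7) AS learning_order
    FROM cjk
    JOIN joyo ON joyo.ch = cjk.ch
    LEFT JOIN kyoiku ON kyoiku.ch = cjk.ch
    ORDER BY cjk.code_point
""").fetchall()

for row in rows:
    print(row)
```

Characters that are not in the jōyō table simply drop out of the result, which replaces the “Not Japanese” label used in the spreadsheet.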
Wikipedia
The Unicode Consortium
Tony Pottier (1986). French graduate from ESIEA (“Grande Ecole d’Ingénieurs”); specialized in Information Technology. E-mail: contact ~at~ tonypottier ~dot~ info.