Tuesday, November 04, 2008

Handle Chinese in Java

(This is one of my old Google Notes.)

In Java, Reader and Writer are used to handle character (char) streams, and InputStream and OutputStream are used to handle byte streams. InputStreamReader and OutputStreamWriter are the bridges between byte streams and character streams.

Characters can be encoded as bytes under many different schemes, such as ASCII, GB2312, and UTF-8 (Unicode Transformation Format). Inside Java, however, characters (char or String) are always Unicode.
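
To make the difference between characters and their byte representations concrete, here is a minimal sketch that counts the bytes the same string occupies under several encodings. The class and method names (EncodingSizes, encodedLength) are my own, and it assumes the runtime ships the GB2312 charset, which standard JDK builds normally do.

```java
import java.io.UnsupportedEncodingException;

class EncodingSizes {
    // Returns how many bytes the text occupies in the named encoding.
    static int encodedLength(String text, String charsetName)
            throws UnsupportedEncodingException {
        return text.getBytes(charsetName).length;
    }

    public static void main(String[] args) throws Exception {
        String chinese = "\u4eca\u65e5\u83dc\u6839\u8c2d";
        System.out.println("chars:  " + chinese.length());                  // 5
        System.out.println("GB2312: " + encodedLength(chinese, "GB2312"));  // 10
        System.out.println("UTF-8:  " + encodedLength(chinese, "UTF-8"));   // 15
    }
}
```

The string is always five chars inside Java; only the byte form changes with the encoding (two bytes per character in GB2312, three in UTF-8 for these characters).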

A Charset is a named mapping between sequences of sixteen-bit Unicode code units, which is how Java represents characters, and sequences of bytes. A Charset knows how to convert a byte sequence to a (Unicode) char sequence, and vice versa, following the standard it implements.
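
The two directions of that mapping can be sketched with the encode and decode methods of java.nio.charset.Charset. This is only an illustration: the class CharsetRoundTrip and its helper methods are hypothetical names, and GB2312 is assumed to be available in the runtime.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class CharsetRoundTrip {
    // chars -> bytes, following the rules of the named Charset
    static byte[] encode(String text, String charsetName) {
        ByteBuffer buf = Charset.forName(charsetName).encode(text);
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return bytes;
    }

    // bytes -> chars, following the rules of the named Charset
    static String decode(byte[] bytes, String charsetName) {
        CharBuffer chars = Charset.forName(charsetName).decode(ByteBuffer.wrap(bytes));
        return chars.toString();
    }

    public static void main(String[] args) {
        String chinese = "\u4eca\u65e5\u83dc\u6839\u8c2d";
        byte[] gb = encode(chinese, "GB2312");
        // Decoding with the same Charset restores the original string.
        System.out.println(chinese.equals(decode(gb, "GB2312")));
        // Decoding with a different Charset does not.
        System.out.println(chinese.equals(decode(gb, "UTF-8")));
    }
}
```

The bytes carry no record of which Charset produced them, so the reader has to know the encoding in advance.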

Not surprisingly, both InputStreamReader and OutputStreamWriter can be configured to use a specified Charset, but there is no concept of Charset in Reader and Writer themselves.
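
As a rough sketch of the two bridges working together, the following writes a string to an in-memory byte stream through an OutputStreamWriter and reads it back through an InputStreamReader, both configured with the same Charset name. The class StreamBridge and its methods are names I made up for illustration, and the try-with-resources syntax assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

class StreamBridge {
    // Writer side: chars go in, bytes in the given encoding come out.
    static byte[] writeWith(String text, String charsetName) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, charsetName)) {
            w.write(text);
        } // closing the writer flushes any buffered bytes
        return bytes.toByteArray();
    }

    // Reader side: bytes in the given encoding go in, chars come out.
    static String readWith(byte[] data, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), charsetName)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String chinese = "\u4eca\u65e5\u83dc\u6839\u8c2d";
        byte[] data = writeWith(chinese, "GB2312");
        System.out.println(chinese.equals(readWith(data, "GB2312")));
    }
}
```

Only the two bridge constructors mention the Charset; the code that calls write and read deals purely in characters.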

The following Java program demonstrates how to handle Chinese in Java.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;

public class ChineseTest {
    public static void main(String[] args) throws Exception {
        String chinese = "\u4eca\u65e5\u83dc\u6839\u8c2d"; // 1
        System.out.println(chinese); // 2
        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(System.out, "GB2312")); // 3
        bw.write(chinese); // 4
        bw.close(); // 5
    }
}

Code comments:
  1. These are the Unicode escapes for 今日菜根谭, generated by "native2ascii -encoding GB2312 c.txt", where the content of c.txt is 今日菜根谭 encoded in GB2312. The utility native2ascii converts a file with native-encoded characters (characters which are non-Latin 1 and non-Unicode) to one in which those characters are written as Unicode escapes (\uXXXX).
  2. System.out uses the platform's default Charset, so this cannot print 今日菜根谭 correctly when that Charset does not cover Chinese.
  3. Create a Writer using the GB2312 Charset.
  4. Now we can print out 今日菜根谭.
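  5. Close the writer, which flushes the buffered output. Required, because otherwise the characters may stay in the buffer and never be printed.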
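
One side effect of step 5 is that closing the writer also closes System.out. A possible alternative, sketched here under the same assumption that the console expects GB2312, is to wrap the stream in a PrintStream constructed with an encoding name and autoflush turned on. The class name Gb2312Printer and the wrap helper are hypothetical.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Gb2312Printer {
    // Wraps any byte stream in a PrintStream that encodes with the named Charset.
    static PrintStream wrap(OutputStream out, String charsetName)
            throws UnsupportedEncodingException {
        return new PrintStream(out, true, charsetName); // true = autoflush
    }

    public static void main(String[] args) throws Exception {
        String chinese = "\u4eca\u65e5\u83dc\u6839\u8c2d";
        PrintStream out = wrap(System.out, "GB2312");
        out.println(chinese);
        out.flush(); // flush rather than close, so System.out stays usable
    }
}
```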
