The Java Specialists' Newsletter
Issue 036 2001-11-23
Category: Language
Java version:
Using Unicode Variable Names
by Dr. Heinz M. Kabutz
Welcome to the 36th edition of "The Java(tm) Specialists'
Newsletter". This week, we will look at the strange things that
happen when we try to use unicode characters in our code.
I am sitting outside in my garden, with beautiful sunshine and a
pitbull terrier at my command ;-) Approximately a month ago, the
biggest software vendor in South Africa went bankrupt, severely
affecting the availability of software in this country.
Fortunately for me, I have friends in convenient places: I
purchased the software that I needed (Dragon NaturallySpeaking)
from Amazon in Germany and had it shipped to infor AG, who I have
spoken about in other newsletters - they very kindly shipped it
down to the end of the earth.
As a result of using Dragon
NaturallySpeaking, you will probably notice that my newsletters
will have an even more conversational style than before. I am
always looking at ways in which I can improve my newsletters and
serve you better. Please remember to forward this newsletter to
friends and colleagues who are interested in Java.
A special welcome to country No 56, Malta! My wife's previous
boss at a hotel was the Maltese ambassador for Cape Town, which
was really cool, as he had diplomatic immunity from parking fines
and speeding fines. Mind you, traffic laws are rather lax in
this country; I have only had one speeding fine in my life, and I
drive an Alfa Romeo!
South Africa has just become the cheapest country in the world!
We are the first country where a Big Mac costs less than US$ 1.
It is cheaper here even than in the Philippines and China. I had
a good response to my advert for my Java Course (thank you for
your patience in this regard) and so I definitely want to develop
the idea of running courses in South Africa, combined with a
holiday :-)
1707 members are currently subscribed from 56 countries
Using Unicode Variable Names
A few months ago, I was reading a book written by the authors of
Java, when I stumbled across a piece of code that was using
Unicode characters as variable names. Being the curious type, I
immediately tried writing a piece of code that used funny
characters. Easier said than done! I don't know of any Java IDE
that supports Unicode. The common e-mail systems in this world
would also choke like a dog on a chicken bone if I sent you a
newsletter containing Unicode characters ;-)
Before I get into how we could use Unicode characters
in our variables, let's just take a step back and think about it:
Imagine being called in by a Japanese company that has a
memory leak in its program, which it wants you to fix (one of
the most common tasks I have been asked to perform), and imagine
if in their company they used Japanese characters for their
variables. Yes, it would compile if you follow the ideas in this
newsletter, but what would the result be for me? I would
probably pack my bags and head back home! It's bad enough having
to read code where the variable names are in German or in
Afrikaans; I cannot imagine trying to understand code where I
don't even know the characters used in variable names!
Since I could not find an IDE that supported Unicode, my first
job was to write a Unicode editor. Also easier said than done.
I had learned many years ago that Writers
and Readers are used for Unicode characters, but I had never
really used Unicode before. My first approach at reading and
writing Unicode files looked something like this:
  public void load() throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(filename));
    String s;
    while ((s = in.readLine()) != null) {
      // ...
    }
  }
Did you know that FileReader extends InputStreamReader? In its
constructor it constructs a FileInputStream that it passes to
its parent. The InputStreamReader has a constructor that takes
as argument the encoding used for reading files. FileReader
unfortunately does not expose a constructor that takes the
encoding as an argument; it simply uses the platform's default
encoding. One cannot but wonder what the author of the
FileReader had been smoking the day he/she wrote that code ...
(Actually, when I wrote the Sun Microsystems Java programmer
examination a few years ago, the only non-GUI question that I
got wrong was a question relating to reading ISO-8859-1 data.
Perhaps there has always been a hole in my knowledge regarding
this topic.)
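To see which encoding your own platform silently picks, a small sketch
like the following might help. The class name DefaultEncoding and its
name() method are made up for this illustration; the code only asks an
OutputStreamWriter, constructed without an encoding, what it decided
to use.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;

// Hypothetical helper, not part of the editor in this newsletter.
public class DefaultEncoding {
  // Returns the name of the encoding a writer picks when none is specified.
  public static String name() {
    return new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding();
  }

  public static void main(String[] args) {
    // The printed values depend on the operating system and JVM settings.
    System.out.println("Default encoding: " + name());
    System.out.println("file.encoding:    " + System.getProperty("file.encoding"));
  }
}
```

Whatever name this prints on your machine is the encoding that
FileReader and FileWriter would apply to your files.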
Should you want to read a file in an encoding other than the
default one, you cannot use FileReader; you would have to do the
following instead:
  public void load() throws IOException {
    BufferedReader in = new BufferedReader(
      new InputStreamReader(
        new FileInputStream(filename), "UTF-16BE"));
    String s;
    while ((s = in.readLine()) != null) {
      // ...
    }
  }
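As a rough illustration of why the explicit encoding matters, here is
a minimal sketch that writes a short string as UTF-16BE and reads it
back twice, once with the matching encoding and once with ISO-8859-1.
The class EncodingRoundTrip and its write() and read() helpers are my
own names for this example, and the temporary file is just an
assumption to keep it self-contained.

```java
import java.io.*;

// Hypothetical demo class: round-trips text through a file.
public class EncodingRoundTrip {
  // Writes text to the file using the given encoding.
  public static void write(File file, String text, String encoding)
      throws IOException {
    Writer out = new OutputStreamWriter(new FileOutputStream(file), encoding);
    try {
      out.write(text);
    } finally {
      out.close();
    }
  }

  // Reads the whole file, decoding its bytes with the given encoding.
  public static String read(File file, String encoding) throws IOException {
    Reader in = new BufferedReader(
      new InputStreamReader(new FileInputStream(file), encoding));
    try {
      StringBuffer buf = new StringBuffer();
      int c;
      while ((c = in.read()) != -1) buf.append((char) c);
      return buf.toString();
    } finally {
      in.close();
    }
  }

  public static void main(String[] args) throws IOException {
    File file = File.createTempFile("roundtrip", ".txt");
    file.deleteOnExit();
    String original = "double \u03C0";
    write(file, original, "UTF-16BE");
    // Matching encoding: the text survives the round trip.
    System.out.println(original.equals(read(file, "UTF-16BE")));   // true
    // Wrong encoding: every byte is decoded as a separate character.
    System.out.println(original.equals(read(file, "ISO-8859-1"))); // false
  }
}
```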
Without further ado, here is the code for a Unicode text editor.
It allows you to insert Unicode characters by entering their
decimal values and pressing the appropriate button. For the
design, I have followed an approach I saw a few years ago on
jGuru, where all the GUI elements are created lazily. It makes
the GUI code very nicely maintainable, as you never have to worry
in what order elements are constructed.
  import java.awt.*;
  import java.awt.event.*;
  import javax.swing.*;
  import java.io.*;

  public class UnicodeEditor extends JFrame {
    // GUI elements are created lazily by their getters
    private JPanel buttonPanel;
    private JScrollPane editorPanel;
    private JTextArea editor;
    private final String filename;
    private final String encoding;

    public UnicodeEditor(String filename, String encoding)
        throws IOException {
      this.filename = filename;
      this.encoding = encoding;
      getContentPane().add(getButtonPanel(), BorderLayout.NORTH);
      getContentPane().add(getEditorPanel(), BorderLayout.CENTER);
      load();
    }

    protected JPanel getButtonPanel() {
      if (buttonPanel == null) {
        buttonPanel = new JPanel();
        JButton unicodeInsert = new JButton("Insert Unicode:");
        final JTextField unicodeField = new JTextField(8);
        JButton saveExit = new JButton("Save & Exit");
        // insert the character with the given decimal code at the caret
        unicodeInsert.addActionListener(new ActionListener() {
          public void actionPerformed(ActionEvent e) {
            getEditor().insert(
              "" + (char)Integer.parseInt(unicodeField.getText()),
              getEditor().getCaretPosition());
          }
        });
        saveExit.addActionListener(new ActionListener() {
          public void actionPerformed(ActionEvent e) {
            try {
              save();
              System.exit(0);
            } catch(IOException ex) { ex.printStackTrace(); }
          }
        });
        buttonPanel.add(unicodeInsert);
        buttonPanel.add(unicodeField);
        buttonPanel.add(saveExit);
      }
      return buttonPanel;
    }

    protected JTextArea getEditor() {
      if (editor == null) {
        editor = new JTextArea();
      }
      return editor;
    }

    protected JScrollPane getEditorPanel() {
      if (editorPanel == null) {
        editorPanel = new JScrollPane(getEditor());
      }
      return editorPanel;
    }

    // read the whole file into the editor, using the chosen encoding
    protected void load() throws IOException {
      BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream(filename), encoding));
      StringBuffer buf = new StringBuffer();
      int i;
      while ((i = in.read()) != -1) buf.append((char)i);
      in.close();
      getEditor().setText(buf.toString());
    }

    // write the editor contents back, using the chosen encoding
    protected void save() throws IOException {
      BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(filename), encoding));
      char[] text = getEditor().getText().toCharArray();
      for (int i=0; i<text.length; i++) out.write(text[i]);
      out.close();
    }

    public static void main(String[] args) throws IOException {
      if (args.length < 1)
        throw new IllegalArgumentException(
          "usage: UnicodeEditor filename [encoding]");
      String encoding = (args.length == 2)?args[1]:"UTF-16BE";
      UnicodeEditor editor = new UnicodeEditor(args[0], encoding);
      editor.setSize(500,500);
      editor.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
      editor.setVisible(true);
    }
  }
By default this uses the UTF-16BE format, standing for
Sixteen-bit Unicode Transformation Format, big-endian byte
order. You can specify any encoding when you start
the editor, such as UTF-8, ISO-8859-1, etc. But, before we use
this editor, we first need to have a file containing Unicode
characters. I've written a code generator that generates two
files, MathsSymbols.java and MathsSymbolsTest.java:
  import java.io.*;

  public class UnicodeVariableGenerator {
    public static void generateMathsSymbols() throws IOException {
      PrintWriter out = new PrintWriter(new OutputStreamWriter(
        new FileOutputStream("MathsSymbols.java"), "UTF-16BE"));
      out.println("public interface MathsSymbols {");
      out.print("  public static final double ");
      out.print((char)960);  // Greek small letter pi
      out.println(" = 3.14159265358979323846;");
      out.print("  public static final double ");
      out.print((char)949);  // Greek small letter epsilon
      out.println(" = 2.7182818284590452354;");
      out.println("}");
      out.close();
    }

    public static void generateMathsSymbolsTest() throws IOException {
      PrintWriter out = new PrintWriter(new OutputStreamWriter(
        new FileOutputStream("MathsSymbolsTest.java"), "UTF-16BE"));
      out.println("public class MathsSymbolsTest implements MathsSymbols {");
      out.println("  public static void main(String args[]) {");
      out.println("    System.out.println(\"The value of PI is: \" + \u03C0);");
      out.println("    System.out.println(\"The value of E is: \" + \u03B5);");
      out.println("  }");
      out.println("}");
      out.close();
    }

    public static void main(String[] args) throws IOException {
      generateMathsSymbols();
      generateMathsSymbolsTest();
    }
  }
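The generator mixes decimal casts such as (char)960 with escapes such
as \u03C0, and the editor expects decimal values, so it may help to
see that both notations name the same character. The following sketch
converts between them; UnicodeCodes and toEscape() are hypothetical
names chosen for this illustration only.

```java
// Hypothetical helper for translating character codes into escapes.
public class UnicodeCodes {
  // Formats a character as a four-digit Java Unicode escape.
  public static String toEscape(char c) {
    String hex = Integer.toHexString(c).toUpperCase();
    while (hex.length() < 4) hex = "0" + hex;
    return "\\u" + hex;
  }

  public static void main(String[] args) {
    // 960 decimal is 3C0 hexadecimal, the Greek small letter pi.
    System.out.println(toEscape((char) 960)); // prints \u03C0
    // 949 decimal is 3B5 hexadecimal, the Greek small letter epsilon.
    System.out.println(toEscape((char) 949)); // prints \u03B5
    // The decimal cast and the escape denote the same char value.
    System.out.println((char) 960 == '\u03C0'); // prints true
  }
}
```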
I won't include the code for MathsSymbols.java and
MathsSymbolsTest.java; please run the UnicodeVariableGenerator
class to generate that code. I already bomb out enough mailing
systems by sending my newsletters in HTML (*evil grin*), no use
in causing more trouble by using Unicode. Once you've run the
UnicodeVariableGenerator, please load the MathsSymbols.java
file with the UnicodeEditor, using UTF-16BE and have a look at
it: you should see the Greek symbol for PI.
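If you only want to convince yourself that the compiler accepts such
a name, a possible shortcut is to write the identifier as a Unicode
escape in a plain ASCII source file, since escapes are translated
before the source is parsed. This is a sketch of that idea and not
the generated MathsSymbols.java; the class UnicodeEscapeIdentifier
and the circleArea() method are invented for the example.

```java
// Hypothetical example: an ASCII-only source file with a Greek identifier.
public class UnicodeEscapeIdentifier {
  // The escape below is the name of the constant: the Greek letter pi.
  public static final double \u03C0 = 3.14159265358979323846;

  // Uses the Greek-named constant like any other identifier.
  public static double circleArea(double radius) {
    return \u03C0 * radius * radius;
  }

  public static void main(String[] args) {
    // Greek letters are legal at the start of a Java identifier.
    System.out.println(Character.isJavaIdentifierStart('\u03C0')); // true
    System.out.println(circleArea(2.0));
  }
}
```

This compiles without the -encoding switch, but it is of course even
less readable than the real character.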
The last "trick" you need to know about is how to compile the
MathsSymbols.java and MathsSymbolsTest.java. If you open the
files with notepad or vi, you will probably see a rather strangely
formatted file, with two bytes being used per character. When
you compile these files, you therefore have to specify the
character encoding used:
  javac -encoding UTF-16BE MathsSymbols*.java
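To make the "two bytes per character" remark concrete, here is a
small sketch that prints the bytes of a few strings in hexadecimal.
Utf16Bytes and hex() are names I made up for this illustration; the
encoding names are the standard ones the JDK understands.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class: shows the raw bytes behind an encoding.
public class Utf16Bytes {
  // Returns the bytes of text in the given encoding as space-separated hex.
  public static String hex(String text, String encoding) {
    byte[] bytes;
    try {
      bytes = text.getBytes(encoding);
    } catch (UnsupportedEncodingException e) {
      throw new IllegalArgumentException("Unknown encoding: " + encoding);
    }
    StringBuffer buf = new StringBuffer();
    for (int i = 0; i < bytes.length; i++) {
      if (i > 0) buf.append(' ');
      int b = bytes[i] & 0xFF;
      if (b < 0x10) buf.append('0');
      buf.append(Integer.toHexString(b));
    }
    return buf.toString();
  }

  public static void main(String[] args) {
    // Even a plain ASCII letter takes two bytes, high byte first.
    System.out.println(hex("A", "UTF-16BE"));      // 00 41
    // The Greek letter pi, character code 960.
    System.out.println(hex("\u03C0", "UTF-16BE")); // 03 c0
    // The same character in UTF-8, for comparison.
    System.out.println(hex("\u03C0", "UTF-8"));    // cf 80
  }
}
```

The zero byte in front of every ASCII character is what makes the
generated files look so odd in an editor that expects one byte per
character.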
That's it! And it has kept me busy longer than just about all
the other newsletters to try and get it right. Another
interesting variation of this is where David Treves (who I met
through a really cool advanced Java chat list - JavaDesk on
YahooGroups - where you get shouted at if you ask beginner
questions) tried to write Hebrew to and read it from a database. He
doggedly tried to get it working until eventually he succeeded -
after I had given up hope of ever figuring it out. Stay tuned
for the next few weeks to see how he did it.
Until next week, when we celebrate our first anniversary as the
most interesting Java newsletter on the Internet ;-)
Kind regards
Heinz