Abstract: Unicode is the most important computing industry standard for representation and handling of text, no matter which of the world's writing systems is used. This newsletter discusses some selected features of Unicode, and how they might be dealt with in Java.
Welcome to the 209th issue of The Java(tm) Specialists' Newsletter, written in Vienna by our friend Dr. Wolfgang Laun, who works for Thales Austria GmbH. I went to the CONFESS Java conference in Vienna last week together with my 11-year-old daughter as a travel companion. Imagine our surprise when we arrived on the 3rd of April to a snow-covered Vienna! Wolfgang had warned me that it was c-o-l-d, but I thought he was exaggerating. Our first port of call was to buy some trekking boots for my girl to keep her feet warm. We are not used to sub-zero temperatures in Crete. On Thursday morning, Wolfgang took us on an interesting walk around Vienna, showing us the original area of the Roman fort, then the architecture of the various eras - Romanesque, Gothic, Baroque, etc. We learned a lot about Vienna and saw things that most tourists would miss, such as stones jutting out in narrow passages to keep the wagon wheels away from the walls.
Internationalization is tricky, due to the inexact nature of human communication. I'll never forget the time I tried to transfer £502.46 to my printers. Unfortunately, my European banking system discarded the "." and initiated a transfer of £50246! Here in Europe they use the comma as a decimal point and the dot as a thousands separator. Fortunately I was able to cancel the transfer before it had a chance to go very far. Similarly, exchanging text can be surprisingly tricky. Thanks, Wolfgang, for taking the time to write this article on Unicode for us. I certainly learned a lot from it.
Administrative: We have moved over to a new mailing list, powered by Infusionsoft. Most links in my newsletters will from now on start with "https://iw127.infusionsoft.com/". Don't be alarmed.
javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.
   The first Unicode standard was published in 1991, shortly
   after the Java project was started. A 16-bit design was
   considered sufficient to encompass the characters of all the
   world's living languages.  Unicode 2.0, which no longer
   restricted codepoints to 16 bits, appeared in 1996, but
   Java's first release had emerged the year before. Java had to
   follow suit, but char remained a 16-bit type.
   This article reviews several topics related to character and
   string handling in Java.
   
System.out.println("To be or not to be\u000Athat is
         here the question");\uHHHH)?Character.MIN_VALUE is 0 and
      Character.MAX_VALUE is 65535, how many different
      Unicode characters can be represented by a char
      variable?String s has length 1, is the
      result of s.toUpperCase() always the same as
      String.valueOf(Character.toUpperCase(s.charAt(0)))?
      
   The use of "characters" in Java isn't quite as simple as the
   type char might suggest; several misconceptions
   prevail. Notice that the word "character" goes back to the
   Greek word "χαράζω" (i.e., to scratch, engrave) which may be
   the reason why so many scratch their head over the resulting
   intricacies.
   
   Several issues need to be covered, ranging from the
   representation of Java programs to the implementation of the
   data types char and
   java.lang.String, and the handling of character
   data during input and output.
   
"[Java] programs are written using the Unicode character set." (Language specification, § 3.1) This simple statement is followed by some small print, explaining that each Java SE platform relates to one of the evolving Unicode specifications, with SE 5.0 being based on Unicode 4.0. In contrast to earlier character set definitions, Unicode distinguishes between the association of characters as abstract concepts (e.g., "Greek capital letter omega Ω") to a subset of the natural numbers, called code point on the one hand, and the representation of code points by values stored in units of a computer's memory. The Unicode standard defines seven of these character encoding schemes.
It would all be (relatively) simple if Unicode were the only standard in effect. Other character sets are in use, typically infested with vendor specific technicalities, and character data is bandied about without much consideration about what a sequence of storage units is intended to represent.
Another source of confusion arises from the limitations of our hardware. While high-resolution monitors let you represent any character in a wide range of glyphs with variations in font, style, size and colour, our keyboards are limited to a relatively small set of characters. This has given rise to the workaround of escape sequences, i.e., a convention by which a character can be represented by a sequence of keys.
   A Java program needs to be stored as a "text file" on your
   computer's file system, but this doesn't mean much except
   that there is a convention for representing line ends, and
   even this is cursed by the famous differences between all
   major OS families. The Java Language Specification is not
   concerned with the way this text is encoded, even though it
   says that lexical processing expects this text to contain
   Unicode characters. That's why a Java compiler features the
   standard option -encoding encoding. As
   long as your program doesn't contain anything but the 26
   letters, the 10 digits, white space and the special
   characters for separators and operators, you may not have to
   worry much about encoding, provided that the Java compiler is
   set to accept your system's default encoding and the IDE or
   editor plays along. Check https://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
   for a list of supported encodings.
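   If you are unsure what your system's default encoding is, or which
   charsets your JVM actually supports, a small sketch like the
   following (using java.nio.charset.Charset) will print both:

import java.nio.charset.Charset;

public class ListEncodings {
    public static void main(String... args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        // Canonical names of all charsets supported by this JVM
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
    }
}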
   
Several encodings map the aforementioned set of essential characters uniformly to the same set of code units of some 8-bit code. The character 'A', for instance, is encoded as 0x41 in US-ASCII, UTF-8 and in any of the codes ISO-8859-1 through ISO-8859-15, or windows-1250 through windows-1258. If you need to represent a Unicode code point beyond 0x7F you can evade all possible misinterpretations by supplying the character in the Unicode escape form defined by the Java language specification: characters '\' and 'u' must be followed by exactly four hexadecimal digits. Using this, the French version of "Hello world!" can be written as
package eu.javaspecialists.tjsn.examples.issue209;

public class AlloMonde {
    public static void main(String... args) {
        System.out.println("All\u00F4 monde!");
    }
}
Since absolutely any character can be represented by a Unicode escape, you might write this very same program using nothing but Unicode escapes, as shown below, with line breaks added for readability:
\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0063\u006c\u0061\u0073 \u0073\u0020\u0041\u006c\u006c\u006f\u004d\u006f\u006e\u0064\u0065 \u0020\u007b\u000a\u0020\u0020\u0020\u0020\u0070\u0075\u0062\u006c \u0069\u0063\u0020\u0073\u0074\u0061\u0074\u0069\u0063\u0020\u0076 \u006f\u0069\u0064\u0020\u006d\u0061\u0069\u006e\u0028\u0020\u0053 \u0074\u0072\u0069\u006e\u0067\u005b\u005d\u0020\u0061\u0072\u0067 \u0073\u0020\u0029\u007b\u000a\u0009\u0053\u0079\u0073\u0074\u0065 \u006d\u002e\u006f\u0075\u0074\u002e\u0070\u0072\u0069\u006e\u0074 \u006c\u006e\u0028\u0020\u0022\u0041\u006c\u006c\u00f4\u0020\u006d \u006f\u006e\u0064\u0065\u0021\u0022\u0020\u0029\u003b\u000a\u0020 \u0020\u0020\u0020\u007d\u000a\u007d\u000a
So, the minimum number of keys you need for such an exercise is 18: the 16 hexadecimal digits plus '\' and 'u'. (On some keyboards you may need the shift key for '\'.)
   The preceding tour de force contains several instances of the
   escape \u000a, which represents the line feed
   control character - the line separator for Unices. By
   definition, the Java compiler converts all escapes to Unicode
   characters before it combines them into a sequence of
   tokens to be parsed according to the grammar. Most of the
   time you don't have to worry much about this, but there's a
   notable exception: using \u000A or
   \u000D in a character literal or a string
   literal is not going to create one of these characters
   as a character value - it indicates a line end to the lexical
   parser, which is a violation of the rule that neither
   carriage return nor line feed may occur as themselves within
   a literal. These are the places where you have to use one of
   the escape sequences \n and \r.
   Heinz wrote about this almost 11 years ago in
   newsletter 50.
   
   Attentive readers might now want to challenge my claim that
   all Java programs can be written using only 18 keys, which
   did not include 'n' and 'r'. But there are two ways to make
   do with these 18 characters.  The first one uses an
   octal escape, i.e., \12 or
   \15. The other one is the long-winded
   representation of the two characters of the escape sequence
   by their Unicode escapes: \u005C\u006E and
   \u005C\u0072.
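   Here is a minimal sketch of both workarounds; the class and variable
   names are made up purely for illustration:

public class EighteenKeys {
    public static void main(String... args) {
        String octal   = "one\12two";            // \12 is the octal escape for LF (0x0A)
        String escaped = "one\u005C\u006Etwo";   // \u005C\u006E turns into the escape sequence \n
        System.out.println(octal.equals(escaped)); // prints true
    }
}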
   
   Another fancy feature of Java is based on the rule that
   identifiers may contain any character that is a "Java letter"
   or a "Java letter-or-digit". The language specification (cf.
   § 3.8) enumerates neither set explicitly, it delegates the
   decision to the java.lang.Character methods
   isJavaIdentifierStart and
   isJavaIdentifierPart, respectively. This lets
   you create an unbelievable number of identifiers even as
   short as only two characters. Investigating all
   char values yields 45951 and 46908 qualifying
   values respectively, and this would produce 2,155,469,506
   identifiers of length two! (We have to subtract two for the
   two keywords of length two, of course: do
   and if.)
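   If you want to reproduce these counts, a simple sketch like the one
   below will do; note that the exact numbers depend on the Unicode
   version implemented by your JDK:

public class IdentifierCharCount {
    public static void main(String... args) {
        int starts = 0, parts = 0;
        // Investigate every char value
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (Character.isJavaIdentifierStart((char) c)) starts++;
            if (Character.isJavaIdentifierPart((char) c)) parts++;
        }
        System.out.println("start: " + starts + ", part: " + parts);
    }
}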
   
   The decisions about which characters may start or be part of a Java
   identifier exhibit a good measure of laissez-faire. Along
   with the dollar sign you can use any other currency sign
   there is. (Isn't ¢lass a nice alternative
   to the ugly clazz?) More remarkable is the
   possibility of starting an identifier with characters that
   are classified as numeric, e.g., Ⅸ, the Roman numeral
   nine, a single character, is a valid identifier. Most
   astonishing is the option to use most control characters as
   part of an identifier, all the more so because they don't
   have printable representations at all.  Here is one example,
   with a backspace following the initial letter 'A':
   A\u0008. Given a suitable editor, you can create
   a source file where the backspace is represented as a single
   byte, with the expected effect when the file is displayed on
   standard output:
   
public class FancyName {
    public static void main( String[] args ){
        // the identifier is the letter 'A' followed by a backspace (U+0008)
        String A\u0008 = "backspace";
        System.out.println( A\u0008 );
    }
}
   We may now try to answer the question how many character
   values can be stored in a variable of type char,
   which actually is an integral type. The extreme values
   Character.MIN_VALUE and
   Character.MAX_VALUE are 0 and 65535,
   respectively.  These 65536 numeric values would be open to
   any interpretation, but the Java language specification says
   that these values are UTF-16 code units, values that
   are used in the UTF-16 encoding of Unicode texts. Any
   representation of Unicode must be capable of representing the
   full range of code points, its upper bound being 0x10FFFF.
   Thus, code points beyond 0xFFFF need to be represented by
   pairs of UTF-16 code units, and the values used with these
   so-called surrogate pairs are exempt from being used
   as code points themselves. In
   java.lang.Character we find the static methods
   isHighSurrogate and isLowSurrogate,
   simple tests that return true for the ranges
   '\uD800' through '\uDBFF' and
   '\uDC00' through '\uDFFF',
   respectively. Also, by definition, code units 0xFFFF and
   0xFFFE do not represent Unicode characters. From this we can
   deduce that at most 65536 - (0xE000 - 0xD800) - 2 or 63486
   Unicode code points can be represented as a char
   value.
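   The same figure can be obtained by brute force, using the surrogate
   tests mentioned above; this little sketch should print 63486:

public class CountCharCodePoints {
    public static void main(String... args) {
        int count = 0;
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            boolean surrogate = Character.isHighSurrogate((char) c)
                             || Character.isLowSurrogate((char) c);
            boolean nonCharacter = c == 0xFFFE || c == 0xFFFF;
            if (!surrogate && !nonCharacter) count++;
        }
        System.out.println(count); // 65536 - 2048 - 2 = 63486
    }
}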
   
   The actual number of Unicode characters that can be
   represented in a char variable is certainly
   lower, simply because there are gaps in and between the
   blocks set aside for the various alphabets and symbol sets.
   
   It is evident that the full range of Unicode code points can
   only be stored in a variable of type int. This
   has not always been so: originally, Java was meant to
   implement Unicode characters where all code points could be
   represented by a 16-bit unsigned integer. Since that time,
   Unicode has outgrown this Basic Multilingual Plane (BMP), so
   that Java SE 5.0 had to make amends, adding character
   property methods to java.lang.Character, in
   parallel to existing ones with a char parameter,
   where the parameter is an int identifying a code
   point.
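   A small sketch of these int-based methods, using the code point
   U+1D11E (MUSICAL SYMBOL G CLEF) merely as an example of a character
   outside the BMP:

public class CodePointMethods {
    public static void main(String... args) {
        int gClef = 0x1D11E;
        System.out.println(Character.isValidCodePoint(gClef)); // true
        System.out.println(Character.isDefined(gClef));        // true, an assigned character
        System.out.println(Character.charCount(gClef));        // 2: needs a surrogate pair in UTF-16
        char[] units = Character.toChars(gClef);                // the two UTF-16 code units
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D834 DD1E
    }
}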
   
   When a character can be encoded with a single 16-bit value, a
   character string can be simply encoded as an array of
   characters.  But the failure of char to cover
   all Unicode code points breaks the simplicity of this design.
   Accessing a string based on the progressive count of code
   points or Unicode characters isn't possible by mere index
   calculation any more, because code points are represented by
   one or two successive code units.
   
Given that we have a String value where surrogate pairs occur intermingled with individual code units identifying a code point, how do you obtain the number of Unicode characters in this string? How can you obtain the n-th Unicode character from the start of the string?
The answers to both questions are simple because there are String methods providing an out-of-the-box solution. First, the number of Unicode characters in a String is obtained like this:
public static int ucLength(String s) {
  return s.codePointCount(0, s.length());
}
   Two method calls are sufficient for implementing the
   equivalent of method charAt, the first one for
   obtaining the offset of the n-th Unicode character in terms
   of code unit offsets, whereupon the second one extracts one
   or two code units for obtaining the integer code point.
   
public static int ucCharAt(String s, int index) {
  int iPos = s.offsetByCodePoints(0, index);
  return s.codePointAt(iPos);
}
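As a usage example, assuming the two methods above are in scope, take a hypothetical string whose last character lies outside the BMP:

String s = "G clef: \uD834\uDD1E";   // ends with the surrogate pair for U+1D11E
System.out.println(s.length());      // 10 code units
System.out.println(ucLength(s));     // 9 Unicode characters
System.out.println(Integer.toHexString(ucCharAt(s, 8)));  // 1d11e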
When the world was young, the Romans used to chisel their inscriptions using just 21 letters in a form called Roman square capitals. This very formal form of lettering was not convenient for everyday writing, where a form called cursiva antigua was used, as difficult to read for us now as it must have been then. Plautus, a Roman comedian, wrote about them: "a hen wrote these letters", which may very well be the origin of the term chicken scratch.
Additional letters, diacritics and ligatures morphing into proper letters are nowadays the constituents of the various alphabets used in western languages, and they come in upper case and lower case forms. Capitalization, i.e., the question when to write the initial letter of a word in upper case, is quite an issue in some languages, with German being a hot contender for the first place, with its baffling set of rules. Moreover, writing headings or emphasized words in all upper case is in widespread use.
As an aside, note that the custom of capitalizing words (as used in English texts) may have subtle pitfalls. (Compare, for instance, "March with a Pole" to "march with a pole", with two more possible forms.)
   Java comes with the String methods
   toUpperCase and toLowerCase.
   Programmers might expect these methods to produce strings of
   equal length, and one to be the inverse of the other when
   initially applied to an all upper or lower case word. But
   this is not true. One famous case is the German lower case
   letter 'ß' ("sharp s"), which (officially) doesn't have an
   upper case form (yet). Executing these statements
   
Locale de_DE = new Locale( "de", "DE" );
String wort = "Straße";
System.out.println( "wort = " + wort );
String WORT = wort.toUpperCase( de_DE );
System.out.println( "WORT = " + WORT );
produces
wort = Straße
WORT = STRASSE
   which is correct. Clearly,
   Character.toUpperCase(char) cannot work this
   small miracle.  (The ugly combination STRAßE
   should be avoided.) More fun is to be expected in the near
   (?) future, when the LATIN CAPITAL LETTER SHARP S (U+1E9E),
   which was added to Unicode in 2008, is adopted by trendy
   typesetters (or typing trendsetters), like this:
   STRAẞE.
   
   Care must be taken in other languages, too. There is, for
   instance, the bothersome Dutch digraph IJ and ij.
   There is no such letter in any of the ISO 8859 character
   encodings and keyboards come without it, and so you'll have
   to type "IJSSELMEER". Let's apply the Java
   standard sequence of statements for capitalizing a word to a
   string containing these letters:
   
Locale nl_NL = new Locale( "nl", "NL" );
String IJSSELMEER = "IJSSELMEER";
System.out.println( "IJSSELMEER = " + IJSSELMEER );
String IJsselmeer = IJSSELMEER.substring( 0, 1 ).toUpperCase( nl_NL )
                  + IJSSELMEER.substring( 1 ).toLowerCase( nl_NL );
System.out.println( "IJsselmeer = " + IJsselmeer );
This snippet prints
IJSSELMEER = IJSSELMEER
IJsselmeer = Ijsselmeer
which is considered wrong; "IJsselmeer" would be the correct form. It should be obvious that a very special case like this is beyond any basic character translation you can expect from a Java API.
Kind regards
Wolfgang
In our next part to be published in May 2013, we will look at combining diacritical marks, collating or sorting strings, supplementary characters, property files and show how to write text files with the correct encoding. This whole subject is surprisingly tricky to get right, considering how long humans have been engraving their initials on whatever surface they could.
We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)
We deliver relevant courses, by top Java developers to produce more resourceful and efficient programmers within their organisations.